“The increase of disorder or entropy with time is one example of what is called an arrow of time, something that distinguishes the past from the future, giving a direction to time.”
– A Brief History of Time, Stephen Hawking
In this appendix we provide proofs for several facts, mentioned in Chapter 3, which are related to differential entropy, how it evolves under diffusion processes, and its connections to lossy coding. We will make the following mild assumption about the random variable representing the data source, denoted $x_0$.
$x_0$ is supported on a compact set of radius at most $R$, i.e., $\operatorname{supp}(x_0) \subseteq B(\mathbf{0}, R)$.
In particular, since compact sets in Euclidean space are bounded, it holds that $\|x_0\|_2 \le R$ almost surely. We will consistently use the notation $B(c, r)$ to denote the Euclidean ball of radius $r$ centered at $c$. In this sense, Assumption B.1 reads $\operatorname{supp}(x_0) \subseteq B(\mathbf{0}, R)$.
Notice that this assumption holds for (almost) all variables we care about in practice, as it is (often) imposed by a normalization step during data pre-processing.
In this section we discuss the differential entropy of low-dimensional distributions. By definition, the differential entropy of a random variable which does not have a density is $-\infty$; this includes all random variables supported on low-dimensional sets. The objective of this section is to discuss why this is a "morally correct" value.
In fact, let $x$ be any random variable such that Assumption B.1 holds and whose support $S$ has zero volume. (Formally, this means that $S$ is Borel measurable with measure zero.) We will consider the case that $x$ is uniform on $S$, say with respect to the Hausdorff measure on $S$. Our goal is to compute $h(x)$.
In this case, $x$ does not have a density; if we did not already adopt the convention above, we could not directly define its entropy using the standard definition of differential entropy. Instead, as is common in analysis and information theory, it would be reasonable to consider the limit of the entropies of successively better approximations of $x$ which do have densities, i.e.,
\[ h(x) := \lim_{\varepsilon \to 0^{+}} h(x_{\varepsilon}). \tag{B.1.1} \]
To this end, the basic idea is to take an $\varepsilon$-thickening of $S$, say $S_\varepsilon$, defined as
\[ S_{\varepsilon} := \left\{ u \in \mathbb{R}^{D} : \operatorname{dist}(u, S) \le \varepsilon \right\}, \tag{B.1.2} \]
and visualized in Figure B.1.
We will work with random variables $x_\varepsilon$ whose support is $S_\varepsilon$, which is full-dimensional, and take the limit as $\varepsilon \to 0^{+}$. Indeed, define $x_\varepsilon \sim \operatorname{Uniform}(S_\varepsilon)$. Since $S_\varepsilon$ has positive volume, $x_\varepsilon$ has a density equal to
\[ p_{\varepsilon}(u) = \frac{1}{\operatorname{vol}(S_{\varepsilon})}\, \mathbb{1}\{u \in S_{\varepsilon}\}. \tag{B.1.3} \]
Computing the entropy of $x_\varepsilon$ using the convention that $0 \log 0 = 0$, it holds
\begin{align*}
h(x_\varepsilon) &= -\int_{\mathbb{R}^D} p_\varepsilon(u) \log p_\varepsilon(u)\, \mathrm{d}u \tag{B.1.4} \\
&= -\int_{S_\varepsilon} \frac{1}{\operatorname{vol}(S_\varepsilon)} \log \frac{1}{\operatorname{vol}(S_\varepsilon)}\, \mathrm{d}u \tag{B.1.5} \\
&= \log \operatorname{vol}(S_\varepsilon) \cdot \frac{1}{\operatorname{vol}(S_\varepsilon)} \int_{S_\varepsilon} \mathrm{d}u \tag{B.1.6} \\
&= \log \operatorname{vol}(S_\varepsilon). \tag{B.1.7}
\end{align*}
Since $S$ is compact, $\operatorname{vol}(S_\varepsilon)$ is finite and tends to $\operatorname{vol}(S) = 0$ as $\varepsilon \to 0^{+}$. Thus
\[ \lim_{\varepsilon \to 0^{+}} h(x_\varepsilon) = \lim_{\varepsilon \to 0^{+}} \log \operatorname{vol}(S_\varepsilon) = -\infty, \tag{B.1.8} \]
as desired.
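To make the limiting behavior in (B.1.8) concrete, here is a minimal numerical sketch (our own illustration, not from the text) for the simplest low-dimensional example: $S$ is the unit circle in $\mathbb{R}^2$, whose $\varepsilon$-thickening is an annulus of area $4\pi\varepsilon$.

```python
import numpy as np

# Entropy of the uniform distribution on the eps-thickening of the unit circle:
# h(x_eps) = log vol(S_eps) = log(4*pi*eps), which tends to -infinity as eps -> 0.
for eps in [1e-1, 1e-2, 1e-4, 1e-8]:
    vol_eps = np.pi * ((1 + eps) ** 2 - (1 - eps) ** 2)  # area of the annulus = 4*pi*eps
    print(f"eps = {eps:.0e}:  h(x_eps) = {np.log(vol_eps):+.3f} nats")
```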
The above calculation is actually a corollary of a much more famous and celebrated set of results about the maximum possible entropy of a random variable subject to certain constraints on its distribution. We would be remiss not to provide these results here; the proofs are given, for example, in Chapter 2 of [PW22].
Let $x$ be a random variable on $\mathbb{R}^D$.
If $x$ is supported on a compact set $K$ (as in Assumption B.1), then
\[ h(x) \le \log \operatorname{vol}(K). \tag{B.1.9} \]
If $x$ has finite covariance such that, for a PSD matrix $\Sigma$, it holds $\operatorname{Cov}(x) \preceq \Sigma$ (w.r.t. the PSD ordering, i.e., $\Sigma - \operatorname{Cov}(x)$ is PSD), then
\[ h(x) \le \frac{1}{2} \log \det\!\left(2\pi e\, \Sigma\right). \tag{B.1.10} \]
If $x$ has finite second moment such that, for a constant $\sigma > 0$, it holds $\mathbb{E}[\|x\|_2^2] \le D\sigma^2$, then
\[ h(x) \le \frac{D}{2} \log\!\left(2\pi e\, \sigma^2\right). \tag{B.1.11} \]
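As a quick sanity check of the Gaussian maximum-entropy bounds (B.1.10)-(B.1.11), here is a one-dimensional numerical comparison (our own example, not from the text): a uniform distribution on $[-a, a]$ has variance $a^2/3$, and its entropy $\log(2a)$ never exceeds that of the Gaussian with the same variance.

```python
import numpy as np

# Compare h(Uniform[-a, a]) = log(2a) with the max-entropy value for the same variance,
# h(N(0, a^2/3)) = 0.5 * log(2*pi*e*a^2/3); the Gaussian is always at least as large.
for a in [0.5, 1.0, 3.0]:
    h_uniform = np.log(2 * a)
    h_gaussian = 0.5 * np.log(2 * np.pi * np.e * a ** 2 / 3)
    print(f"a = {a}:  h(uniform) = {h_uniform:+.4f}  <=  h(Gaussian) = {h_gaussian:+.4f}")
```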
In the main body (Chapter 3), we considered a random variable $x_0$ and a stochastic process $(x_t)_{t \ge 0}$ defined by (3.2.1), i.e.,
\[ x_t = x_0 + \sigma_t\, g_t, \tag{B.2.1} \]
where $g_t \sim \mathcal{N}(\mathbf{0}, I_D)$ independently of $x_0$.
The structure of this section is as follows. In Section B.2.1 we provide a formal theorem and crisp proof showing that, under Equation B.2.1, the entropy increases, i.e., $h(x_s) < h(x_t)$ whenever $0 \le s < t$. In Section B.2.2 we provide a formal theorem and crisp proof showing that, under Equation B.2.1, the entropy decreases during denoising, i.e., each denoising step strictly decreases the entropy. In Section B.2.3 we provide proofs for the technical lemmas that are needed to establish the claims in the previous subsections.
Before we start, we introduce some key notation. First, let $\gamma_\sigma$ be the density of $\sigma g$ for $g \sim \mathcal{N}(\mathbf{0}, I_D)$, i.e.,
\[ \gamma_{\sigma}(u) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\!\left(-\frac{\|u\|_2^2}{2\sigma^2}\right). \tag{B.2.2} \]
Next, $x_t$ is supported on all of $\mathbb{R}^D$, so it has a density, which we denote $p_t$ (as in the main body). A quick calculation shows that
\[ p_t(u) = \mathbb{E}_{x_0}\!\left[\gamma_{\sigma_t}(u - x_0)\right], \tag{B.2.3} \]
and from this representation it is possible to deduce (using Proposition B.4) that $p_t$ is smooth (i.e., infinitely differentiable) in $u$, also smooth in $t$, and positive everywhere. This fact is somewhat remarkable at first sight: even for a completely irregular random variable (say, a Bernoulli random variable, which does not have a density), its Gaussian smoothing admits a density for every (arbitrarily small) noise level. The proof is left as an exercise for readers well-versed in mathematical analysis.
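As a toy illustration of this remark (our own example; the additive-noise model $x_t = x_0 + \sigma g$ and the symbol $p_t$ are assumptions consistent with the notation above), smoothing a Bernoulli variable with a small amount of Gaussian noise already yields an everywhere-positive, infinitely differentiable density:

```python
import numpy as np

# x_0 ~ Bernoulli(1/2) on {0, 1} has no density, but x_t = x_0 + sigma * g has the
# smooth mixture density p_t(u) = 0.5 * N(u; 0, sigma^2) + 0.5 * N(u; 1, sigma^2).
def p_t(u, sigma):
    gauss = lambda m: np.exp(-(u - m) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return 0.5 * gauss(0.0) + 0.5 * gauss(1.0)

u = np.linspace(-1.0, 2.0, 7)
print(p_t(u, sigma=0.25))  # strictly positive everywhere, and infinitely differentiable in u
```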
However, we also need to add an assumption about the smoothness of the distribution of $x_0$, which will eliminate some technical problems that occur around $t = 0$ with low-dimensional distributions. (As $t \to 0$, various quantities become highly irregular, and dealing with them would require significant additional analysis. Despite this, we expect that our results hold under milder assumptions with additional work.) For now, let us assume:
$x_0$ has a twice continuously differentiable density, denoted $p_0$.
In this subsection we provide a proof of Theorem B.2. For convenience, we restate it as follows.
Let $x_0$ be any random variable such that Assumptions B.1 and B.2 hold, and let $(x_t)_{t \ge 0}$ be the stochastic process (B.2.1). Then
\[ h(x_s) < h(x_t) \quad \text{for all } 0 \le s < t. \tag{B.2.4} \]
Before we start, we must ask: when does the inequality in (B.2.4) make sense? We will show in Lemma B.1 that under our assumptions the differential entropy $h(x_t)$ is well-defined, is never $+\infty$, and for $t > 0$ is finite, so the (strict) inequality in (B.2.4) makes sense.
The question of well-definedness aside, the crux of this proof is to show that the density of $x_t$ satisfies a particular partial differential equation, which is very similar to the heat equation. The heat equation is a famous PDE which describes the diffusion of heat through space. This intuitively should make sense, and paints a mental picture: as the time $t$ increases, the probability mass of the original (perhaps tightly concentrated) distribution disperses across all of $\mathbb{R}^D$ like heat radiating from a source in a vacuum.
Such PDEs for $p_t$, known as Fokker-Planck equations for more general stochastic processes, are very powerful tools, as they allow us to describe the instantaneous temporal derivatives of $p_t$ in terms of its instantaneous spatial derivatives, and vice versa, providing a concise description of the regularity and dynamics of $p_t$. Once we obtain dynamics for $p_t$, we can then use them to obtain another system which describes the dynamics of $h(x_t)$, which after all is just a functional of $p_t$.
The description of the PDE involves a mathematical object called the Laplacian $\Delta$. Recall from multivariable calculus that the Laplacian operating on a differentiable-in-time and twice-differentiable-in-space function $f(u, t)$ is given by
\[ \Delta f(u, t) := \sum_{i=1}^{D} \frac{\partial^2 f}{\partial u_i^2}(u, t). \tag{B.2.5} \]
Namely, using the integral representation of $p_t$ and differentiating under the integral, we can compute the derivatives of $p_t$ (which we do in Proposition B.1) and observe that $p_t$ satisfies the heat-like PDE
\[ \frac{\partial p_t}{\partial t}(u) = \sigma_t\, \dot{\sigma}_t\, \Delta p_t(u). \tag{B.2.6} \]
Then, to find the dynamics of $h(x_t)$, we can use Proposition B.3 again, as well as the heat-like PDE, to get
(B.2.7) | ||||
(B.2.8) | ||||
(B.2.9) | ||||
(B.2.10) |
By using a slightly involved integration by parts argument (Lemma B.2), we obtain
(B.2.11) | ||||
(B.2.12) | ||||
(B.2.13) |
where strict inequality holds in the last line because, for it to fail, $\nabla p_t$ would need to vanish almost everywhere (i.e., everywhere except possibly on a set of zero volume), but this would imply that $p_t$ is constant almost everywhere, a contradiction with the fact that $p_t$ is a density.
To complete the proof we just use the fundamental theorem of calculus
\[ h(x_t) - h(x_s) = \int_s^t \frac{\mathrm{d}}{\mathrm{d}\tau} h(x_\tau)\, \mathrm{d}\tau > 0, \tag{B.2.14} \]
which proves the claim. (Note that this does not make sense when $h(x_s) = -\infty$, which can only happen when $s = 0$ and $h(x_0) = -\infty$, but in this case $h(x_t) > -\infty = h(x_0)$, so the claim is vacuously true anyway.) ∎
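A closed-form sanity check of Theorem B.2 (our own example; it assumes the additive Gaussian noise model used above): in one dimension with $x_0 \sim \mathcal{N}(0, 1)$, the noised variable is $\mathcal{N}(0, 1 + \sigma_t^2)$, so its entropy is strictly increasing in the noise level.

```python
import numpy as np

# h(x_t) = 0.5 * log(2*pi*e*(1 + sigma_t^2)) is strictly increasing in sigma_t,
# in agreement with the entropy-increase statement (B.2.4).
for sigma_t in [0.0, 0.1, 0.5, 1.0, 2.0]:
    h_t = 0.5 * np.log(2 * np.pi * np.e * (1 + sigma_t ** 2))
    print(f"sigma_t = {sigma_t}:  h(x_t) = {h_t:.4f} nats")
```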
Recall that in Section 3.2.1 we start with the noisy random variable $x_T$ and iteratively denoise it using iterations of the form
(B.2.15) |
for $t = T, T-1, \dots, 1$, with noise levels $\sigma_T > \sigma_{T-1} > \cdots > \sigma_0 \ge 0$. We wish to prove that each such iteration strictly decreases the entropy, showing that the denoising process actually reduces the entropy.
Before we go about doing this, we make several remarks about the problem statement. First, Tweedie’s formula (3.2.20) says that
\[ \mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^2\, \nabla \log p_t(x_t), \tag{B.2.16} \]
which likens a full denoising step from time $t$ to time $0$ to a gradient step on the log-density of $x_t$. Can we get a similar result for the denoising step from time $t$ to time $t-1$ in (B.2.15)? It turns out that indeed we can, and it is pretty simple. By using (B.2.15) and Tweedie's formula (3.2.20), we obtain
(B.2.17) |
So this iterative denoising step is again a gradient step on the perturbed log-density, with a shrunken step size. In particular, this step can be seen as a perturbation of the distribution of the random variable by the score function vector field, suggesting a connection to stochastic differential equations (SDEs) and the theory of diffusion models [SSK+21]. Indeed, a proof of the following result, Theorem B.3, can be developed using this powerful machinery and a limiting argument (e.g., following the technical approach in the exposition of [CCL+23]). We will give a simpler proof here, which uses only elementary tools and thereby illuminates some of the key quantities behind the process of entropy reduction via denoising. On the other hand, we will need to deal with some slightly technical calculations, because the denoising process in Theorem B.3 does not correspond exactly to the reverse process associated to the noise addition process that generates the observation. (For those familiar with diffusion models, we refer here to the time-reversed forward process not coinciding with the sequence of iterates generated by the process defined in Theorem B.3. These processes coincide in a certain limit where infinitely many steps of Theorem B.3 are taken with infinitely small levels of noise added at each step; for general, finite steps, we must introduce some approximations regardless of the level of sophistication of our tools.)
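To make the discussion concrete, here is a small simulation of iterative denoising in a case where the score is available in closed form (a hedged sketch: the update rule below assumes a DDIM-style step $x_{t-1} = x_t + (\sigma_t^2 - \sigma_{t-1}^2)\nabla \log p_t(x_t)$ as a stand-in for (B.2.15), and $x_0 \sim \mathcal{N}(0,1)$ is our own toy data distribution).

```python
import numpy as np

# For x_0 ~ N(0, 1) and x_t = x_0 + sigma_t * g, the marginal is p_t = N(0, 1 + sigma_t^2),
# whose score is grad log p_t(x) = -x / (1 + sigma_t^2).  The iterates below stay Gaussian,
# so their entropy 0.5 * log(2*pi*e*var) decreases whenever the sample variance does.
rng = np.random.default_rng(0)
sigmas = [2.0, 1.0, 0.5, 0.1, 0.0]                          # sigma_T > ... > sigma_0
x = rng.normal(size=100_000) * np.sqrt(1 + sigmas[0] ** 2)  # samples of x_T
for s_t, s_prev in zip(sigmas[:-1], sigmas[1:]):
    score = -x / (1 + s_t ** 2)                             # closed-form score of p_t
    x = x + (s_t ** 2 - s_prev ** 2) * score                # one (assumed) denoising step
    print(f"sigma {s_t} -> {s_prev}:  sample variance = {x.var():.3f}")
# With such large steps the variance falls below the marginal value 1 + s_prev**2,
# reflecting the finite-step mismatch with the exact reverse process discussed above.
```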
We want to prove that each denoising step strictly decreases the entropy, i.e., formally:
Let $x_0$ be any random variable such that Assumptions B.1 and B.2 hold, and let $(x_t)$ be the stochastic process (B.2.1). Then
(B.2.18) |
This proof uses two main ideas:
First, write down a density for using a change-of-variables formula.
Second, bound this density to control the entropy.
The change of variables is justified by Corollary B.1, which was originally derived in [Gri11].
We execute these ideas now. From Corollary B.1, we obtain that the function $f$ implementing the denoising step is differentiable, injective, and thus invertible on its range, which we henceforth denote $\mathcal{Y}$. We denote its inverse as $f^{-1}$. Using a change-of-variables formula, the density of $f(x_t)$ is given by
(B.2.19) |
where $\nabla f$ (recall, from Section A.2) is the Jacobian of $f$. Since from Lemma B.3 we know $\nabla f$ is a positive definite matrix, the determinant is positive and so the whole density is positive. Then it follows that
(B.2.20) | ||||
(B.2.21) | ||||
(B.2.22) | ||||
(B.2.23) | ||||
(B.2.24) |
We will study the last term (including the ), and show that it is negative.
By concavity, one has for every . Hence
(B.2.25) | ||||
(B.2.26) | ||||
(B.2.27) | ||||
(B.2.28) | ||||
(B.2.29) |
Now, by the AM-GM inequality applied to the eigenvalues, we have for any symmetric positive definite matrix $A \in \mathbb{R}^{D \times D}$ the bound
\[ \det(A) \le \left(\frac{\operatorname{tr}(A)}{D}\right)^{D}, \tag{B.2.30} \]
which we can apply to the above expression and obtain
(B.2.31) | |||
(B.2.32) | |||
(B.2.33) | |||
(B.2.34) |
From Lemma B.5, it holds (where, recall, $R$ is the radius of the support of $x_0$ as in Assumption B.1)
(B.2.35) |
Then it holds
(B.2.36) |
Meanwhile, the function is convex on , so for we have
(B.2.37) | ||||
(B.2.38) |
Here since . In the above bound, we need to verify that the lower bound for is . Indeed,
(B.2.39) | ||||
(B.2.40) |
Notice that this is if and only if , i.e., and , as granted by the assumptions.
Applying this bound, we obtain
(B.2.41) | |||
(B.2.42) | |||
(B.2.43) | |||
(B.2.44) | |||
(B.2.45) |
where the last few lines are the same as in the proof of Theorem B.2. Combining this result with our previous estimate,
(B.2.46) |
where the inequality is strict by the same argument as in Theorem B.2. ∎
Notice that the bounds for the admissible noise levels and step sizes depend on the radius $R$ of the data distribution, and are not as general as the bounds in Theorem B.2. The result is actually "as general as needed" in the following sense. Note that if $x_0$ has a twice continuously differentiable density supported on the ball of radius $R$ centered at $\mathbf{0}$, then the same is true for the ball of radius $2R$, and $3R$, and so on, i.e., for any ball of radius $R' \ge R$. Thus one strategy to get the appropriate denoising guarantee is: fix a data dimension and discretization schedule, and then set (in the analysis) the data radius to be large enough that each denoising step satisfies the requirements for entropy decrease given in Theorem B.3. Then each step of the denoising process will indeed reduce the entropy, as desired.
In this subsection we present technical results which power our two main conceptual theorems. Our presentation will be more or less standard for mathematics; we will start with the higher-level results first, and gradually move back to the more incremental results. The higher-level results will use the incremental results, and in this way we have an easy-to-read dependency ordering of the results: no result depends on any result stated before it. Results which do not depend on each other are generally ordered by the place they appear in the above pair of proofs.
We first show that the entropy exists along the stochastic process and is finite.
Let $x_0$ be any random variable, and let $(x_t)$ be the stochastic process (B.2.1).
For $t > 0$, the differential entropy $h(x_t)$ exists and is $> -\infty$.
If in addition Assumption B.1 holds for $x_0$, then $h(x_0) < \infty$ and $h(x_t) < \infty$.
To prove Lemma B.1.1, we use a classic yet tedious analysis argument. Since $x_t$ has a density, we can write
\[ h(x_t) = -\int_{\mathbb{R}^D} p_t(u)\, \log p_t(u)\, \mathrm{d}u. \tag{B.2.47} \]
Accordingly, let $\phi$ be defined as
\[ \phi(u) := -p_t(u)\, \log p_t(u). \tag{B.2.48} \]
As is usual when bounding the value of an integral in analysis, define the positive and negative parts
\[ \phi_+(u) := \max\{\phi(u), 0\}, \qquad \phi_-(u) := \max\{-\phi(u), 0\}. \tag{B.2.49} \]
Then
\[ h(x_t) = \int_{\mathbb{R}^D} \phi_+(u)\, \mathrm{d}u - \int_{\mathbb{R}^D} \phi_-(u)\, \mathrm{d}u, \tag{B.2.50} \]
and both integrals are guaranteed to be non-negative since their integrands are.
In order to show that $h(x_t)$ is well-defined, we need to show that $\int \phi_+\, \mathrm{d}u < \infty$ or $\int \phi_-\, \mathrm{d}u < \infty$. To show that moreover $h(x_t) > -\infty$, it merely suffices to show that $\int \phi_-\, \mathrm{d}u < \infty$. To bound the integral of $\phi_-$ we need to understand the quantity $\phi$, namely, we want to characterize when $\phi$ is negative.
\[ \phi(u) < 0 \iff p_t(u) > 1. \tag{B.2.51} \]
Thus, it holds that
\[ \phi_-(u) = p_t(u)\, \log p_t(u)\, \mathbb{1}\{p_t(u) > 1\}. \tag{B.2.52} \]
In order to bound the integral of $\phi_-$, we need to show that $p_t$ is "not too concentrated," namely that $\sup_u p_t(u)$ is not too large. To prove this, in this case we are lucky enough to be able to bound the function $p_t$ itself. Namely, notice that
\[ p_t(u) = \mathbb{E}_{x_0}\!\left[\gamma_{\sigma_t}(u - x_0)\right] \le \frac{1}{(2\pi\sigma_t^2)^{D/2}}, \tag{B.2.53} \]
which blows up as $\sigma_t \to 0$ but is finite for every $t > 0$. Therefore
\[ \log p_t(u) \le \frac{D}{2}\, \log \frac{1}{2\pi\sigma_t^2} \quad \text{for all } u \in \mathbb{R}^D. \tag{B.2.54} \]
Now there are two cases.
If $2\pi\sigma_t^2 \ge 1$, then $p_t(u) \le 1$ for all $u$, so the indicator in (B.2.52) is never $1$; hence $\phi_-$ is identically $0$ and its integral is also $0$.
If $2\pi\sigma_t^2 < 1$, then the upper bound $(2\pi\sigma_t^2)^{-D/2}$ exceeds $1$, so since the logarithm is monotonically increasing,
\begin{align*}
\int_{\mathbb{R}^D} \phi_-(u)\, \mathrm{d}u &= \int_{\{p_t > 1\}} p_t(u)\, \log p_t(u)\, \mathrm{d}u \tag{B.2.55} \\
&\le \frac{D}{2}\, \log \frac{1}{2\pi\sigma_t^2} \int_{\{p_t > 1\}} p_t(u)\, \mathrm{d}u \tag{B.2.56} \\
&\le \frac{D}{2}\, \log \frac{1}{2\pi\sigma_t^2} \int_{\mathbb{R}^D} p_t(u)\, \mathrm{d}u \tag{B.2.57} \\
&= \frac{D}{2}\, \log \frac{1}{2\pi\sigma_t^2} < \infty. \tag{B.2.58}
\end{align*}
Hence we have $\int \phi_-\, \mathrm{d}u < \infty$, so the differential entropy exists and is $> -\infty$.
To prove Lemma B.1.2, suppose that Assumption B.1 holds. We want to show that $h(x_0) < \infty$ and $h(x_t) < \infty$. The mechanism for doing this is the same in both cases, and involves the maximum entropy result Theorem B.1. Namely, since $x_0$ is absolutely bounded, it has a finite covariance, which we will denote $\Sigma_0$. Then the covariance of $x_t$ is $\Sigma_0 + \sigma_t^2 I_D$. Thus the entropies of $x_0$ and $x_t$ can be upper bounded by the entropies of normal distributions with the respective covariances, i.e., $\tfrac{1}{2}\log\det(2\pi e\, \Sigma_0)$ or $\tfrac{1}{2}\log\det(2\pi e\, (\Sigma_0 + \sigma_t^2 I_D))$, and both are $< \infty$. ∎
Finally, we fill in the integration-by-parts argument alluded to in the proofs of Theorems B.2 and B.3. The argument is conceptually pretty simple but requires some technical estimates to show that the boundary term in the integration-by-parts vanishes.
Let $x_0$ be any random variable such that Assumptions B.1 and B.2 hold, and let $(x_t)$ be the stochastic process (B.2.1). For $t > 0$, let $p_t$ be the density of $x_t$. Then, for every such $t$, it holds
\[ \int_{\mathbb{R}^D} \Delta p_t(u)\, \log p_t(u)\, \mathrm{d}u = -\int_{\mathbb{R}^D} \frac{\|\nabla p_t(u)\|_2^2}{p_t(u)}\, \mathrm{d}u. \tag{B.2.59} \]
The basic idea of this proof is in two steps:
First, apply Green’s theorem to do integration by parts over a compact set;
Second, send the radius of this compact set to $\infty$, to get integrals over all of $\mathbb{R}^D$.
Green's theorem says that for any compact set $\Omega \subseteq \mathbb{R}^D$ with piecewise smooth boundary, twice continuously differentiable $v$, and continuously differentiable $u$,
\[ \int_{\Omega} u(w)\, \Delta v(w)\, \mathrm{d}w = \int_{\partial\Omega} u(w)\, \langle \nabla v(w), \nu(w) \rangle\, \mathrm{d}S(w) - \int_{\Omega} \langle \nabla u(w), \nabla v(w) \rangle\, \mathrm{d}w, \tag{B.2.60} \]
where $\mathrm{d}S$ denotes an integral over the "surface measure," i.e., the inherited measure on $\partial\Omega$, namely the boundary of $\Omega$; accordingly, $w$ takes values on this surface and $\nu(w)$ is the unit outward normal vector to $\partial\Omega$ at the surface point $w$. Now, taking $u = \log p_t$ and $v = p_t$, over a ball $\Omega = B(\mathbf{0}, r)$ of radius $r$ centered at $\mathbf{0}$ (so that $\partial\Omega$ is the sphere of radius $r$ centered at $\mathbf{0}$ and $\nu(w) = w/\|w\|_2$):
(B.2.61) | |||
(B.2.62) | |||
(B.2.63) |
Sending $r \to \infty$, it holds that
(B.2.64) | |||
(B.2.65) | |||
(B.2.66) | |||
(B.2.67) |
where the first inequality follows by dominated convergence on the integrand. It remains to compute the last limit. For this, we take asymptotic expansions of each term. The main device is as follows: for , we have , so
(B.2.68) | ||||
(B.2.69) | ||||
(B.2.70) | ||||
(B.2.71) | ||||
(B.2.72) |
Note that because , we have by Cauchy-Schwarz that
(B.2.73) |
Recall that by Assumption B.1, $x_0$ is supported on a compact set of radius $R$. Thus
(B.2.74) |
In other words, it holds
(B.2.75) |
Now to compute the gradient, we can use Proposition B.1 and linearity of expectation to compute
(B.2.76) | ||||
(B.2.77) | ||||
(B.2.78) | ||||
(B.2.79) | ||||
(B.2.80) |
Using Cauchy-Schwarz and the representation again, it holds
(B.2.81) | |||
(B.2.82) | |||
(B.2.83) |
For $r$ sufficiently large compared to $R$ (as is suitable, because we are going to take the limit $r \to \infty$ while $R$ is fixed), both sides are negative. This makes sense: most of the probability mass is contained within the ball of radius $R$, and thus the score points inwards, having a negative inner product with the outward-pointing vector $\nu(w)$. Thus, using the appropriate bounds for $p_t$ and $\nabla p_t$,
(B.2.84) |
Then, noting that , we can compute
(B.2.85) |
So one can see that, letting the surface area of the sphere $\partial B(\mathbf{0}, r)$ be $c_D\, r^{D-1}$, where $c_D$ is a constant depending only on the dimension $D$, it holds
(B.2.86) |
and therefore the exponentially decaying tails mean
(B.2.87) |
Finally, plugging into (B.2.64), we have
(B.2.88) | ||||
(B.2.89) |
as claimed. ∎
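For intuition about why the boundary term vanishes, here is a one-dimensional numerical sketch (our own illustration; the Gaussian density below simply stands in for $p_t$): the quantity $\log p_t(w)\, p_t'(w)$ evaluated on the sphere of radius $r$ decays rapidly as $r$ grows.

```python
import numpy as np

# Boundary term |log p(w) * p'(w)| at w = +-r for p = N(0, s^2): the polynomial growth of
# |log p| is overwhelmed by the exponential decay of p', so the term vanishes as r grows.
s = 1.5
p  = lambda w: np.exp(-w ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)
dp = lambda w: -w / s ** 2 * p(w)
for r in [2.0, 5.0, 10.0, 20.0]:
    boundary = abs(np.log(p(r)) * dp(r)) + abs(np.log(p(-r)) * dp(-r))
    print(f"r = {r:5.1f}:  |boundary term| = {boundary:.3e}")
```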
Here we provide some results used in the proof of Theorem B.3 which are appropriate generalizations of corresponding results in [Gri11].
Let $x_0$ be any random variable such that Assumptions B.1 and B.2 hold, and let $(x_t)$ be the stochastic process (B.2.1). Let $t$ be an admissible step index, and let $f$ be the corresponding denoising map from (B.2.17). The Jacobian $\nabla f$ is symmetric positive definite.
We have
(B.2.90) |
Here we expand
(B.2.91) |
So we need to ensure that
(B.2.92) | ||||
(B.2.93) |
is symmetric positive semidefinite. Indeed, it is obviously symmetric (by Clairaut's theorem). To show its positive semidefiniteness, we plug in the expectation representation of $p_t$ given by (B.2.3) (and of its derivatives, by Proposition B.1) to obtain (where $x_0$ is as defined and $x_0'$ is an i.i.d. copy of $x_0$),
(B.2.94) | |||
(B.2.95) | |||
(B.2.96) | |||
(B.2.97) | |||
(B.2.98) | |||
(B.2.99) | |||
(B.2.100) | |||
(B.2.101) | |||
(B.2.102) | |||
(B.2.103) | |||
(B.2.104) | |||
(B.2.105) | |||
(B.2.106) | |||
(B.2.107) |
Since $x_0$ and $x_0'$ are i.i.d., the whole integral (i.e., the original quadratic form) is $0$ if and only if the test vector is nonzero and the support of $x_0$ is entirely contained in an affine subspace orthogonal to it. But this is ruled out by assumption (i.e., that $x_0$ has a density on $\mathbb{R}^D$), so the Jacobian is symmetric positive definite. ∎
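The positive semidefiniteness established here can also be spot-checked numerically: the key consequence is that the Hessian of $\log p_t$ is bounded below by $-\sigma_t^{-2} I$, since it equals $\sigma_t^{-4}\operatorname{Cov}(x_0 \mid x_t) - \sigma_t^{-2} I$ and conditional covariances are PSD. The following sketch (our own example; the additive-noise convention $x_t = x_0 + \sigma_t g$ is assumed) verifies this in one dimension for a Bernoulli $x_0$.

```python
import numpy as np

# p_t is a two-component Gaussian mixture; its log-density should satisfy
# (log p_t)''(u) >= -1 / sigma^2 everywhere, up to finite-difference error.
sigma = 0.3
u = np.linspace(-2.0, 3.0, 2001)
p = 0.5 * (np.exp(-u ** 2 / (2 * sigma ** 2)) + np.exp(-(u - 1) ** 2 / (2 * sigma ** 2)))
p /= np.sqrt(2 * np.pi * sigma ** 2)
second_derivative = np.gradient(np.gradient(np.log(p), u), u)
print(second_derivative.min(), ">=", -1 / sigma ** 2)
```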
Let $f \colon \mathbb{R}^D \to \mathbb{R}^D$ be any differentiable function whose Jacobian is everywhere symmetric positive definite. Then $f$ is injective, and hence invertible as a function $f \colon \mathbb{R}^D \to \mathcal{Y}$, where $\mathcal{Y}$ is the range of $f$.
Suppose that $f$ were not injective, i.e., there exist $x \ne y$ such that $f(x) = f(y)$. Define $v := y - x \ne \mathbf{0}$. Define the function $\varphi \colon [0, 1] \to \mathbb{R}$ as $\varphi(s) := \langle f(x + s v), v \rangle$. Then $\varphi(0) = \varphi(1)$. Since $f$ is differentiable, $\varphi$ is differentiable, so the derivative must vanish for some $s^\star \in (0, 1)$ by the mean value theorem. However,
\[ \varphi'(s) = \langle \nabla f(x + s v)\, v,\; v \rangle > 0 \quad \text{for every } s \in [0, 1], \tag{B.2.108} \]
since the Jacobian is positive definite. Thus we arrive at a contradiction, as claimed. ∎
Combining the above two results, we obtain the following crucial result.
Let $x_0$ be any random variable such that Assumptions B.1 and B.2 hold, and let $(x_t)$ be the stochastic process (B.2.1). Let $t$ be an admissible step index, and let $f$ be the corresponding denoising map from (B.2.17). Then $f$ is injective, and therefore invertible onto its range.
The only thing left to show is that $f$ is differentiable, but this is immediate from Tweedie's formula (Theorem 3.3), which shows that $f$ is differentiable if and only if $\nabla \log p_t$ is differentiable, and this is provided by Equation B.2.3. ∎
Finally, we develop a technical estimate which is required for the proof of Theorem B.3 and actually motivates the assumption on the viable noise levels.
Let $x_0$ be any random variable such that Assumptions B.1 and B.2 hold, and let $(x_t)$ be the stochastic process (B.2.1). Let $p_t$ be the density of $x_t$. Then, for $t > 0$, it holds
(B.2.109) |
By chain rule, a simple exercise computes
\[ \Delta \log p_t(u) = \frac{\Delta p_t(u)}{p_t(u)} - \frac{\|\nabla p_t(u)\|_2^2}{p_t(u)^2}. \tag{B.2.110} \]
Using Proposition B.1 to write the terms in (B.2.110), we obtain
(B.2.111) | ||||
(B.2.112) |
This looks like a Bayesian marginalization, so let us define the appropriate normalized density
(B.2.113) |
Then, defining , we can write
(B.2.114) |
Similarly, writing out the second term (non-squared) we obtain
(B.2.115) |
Letting , it holds
(B.2.116) |
Thus writing out fully, we have
(B.2.117) | ||||
(B.2.118) | ||||
(B.2.119) | ||||
(B.2.120) |
A trivial lower bound on this trace is $0$, since covariance matrices are positive semidefinite. To find an upper bound, note that the posterior distribution of $x_0$ given $x_t$ places all of its mass on the support of $x_0$ (since the density of $x_0$ appears as a factor of the posterior density), which by Assumption B.1 is a compact set with radius $R$. So
(B.2.121) |
Therefore
(B.2.122) |
which shows the claim. ∎
Here we calculate some useful derivatives which will be reused throughout the appendix.
Let $x_0$ be any random variable such that Assumptions B.1 and B.2 hold, and let $(x_t)$ be the stochastic process (B.2.1). For $t > 0$, let $p_t$ be the density of $x_t$. Then
(B.2.123) | ||||
(B.2.124) | ||||
(B.2.125) | ||||
(B.2.126) |
We use the convolution representation of $p_t$, namely (B.2.3). First taking the time derivative, a computation yields that Proposition B.3 applies, so we can bring the derivative inside the integral/expectation. (To apply Proposition B.3, we use the Gaussian kernel, noting that it is twice continuously differentiable in $t$ and (more than) twice continuously differentiable in the spatial variable; to check the local integrability of the time derivative, one computes it explicitly and observes that it is integrable, since the Gaussian has exponentially decaying tails, so the quadratic factor in the product poses no issue.) We thus obtain:
(B.2.127) |
Meanwhile, by properties of convolutions (Proposition B.4) and using the fact that $x_0$ is compactly supported (Assumption B.1),
(B.2.128) |
The rest of the computation follows from Proposition B.2. ∎
For and it holds
(B.2.129) | ||||
(B.2.130) | ||||
(B.2.131) | ||||
(B.2.132) |
Direct computation. ∎
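For concreteness, here is a worked version of the Gaussian derivative identities that a computation like this needs (stated in our assumed notation $\gamma_\sigma(u) = (2\pi\sigma^2)^{-D/2}\exp(-\|u\|_2^2/(2\sigma^2))$; the book's own symbols may differ):
\begin{align*}
\nabla_u \gamma_\sigma(u) &= -\frac{u}{\sigma^2}\,\gamma_\sigma(u), \\
\Delta_u \gamma_\sigma(u) &= \left(\frac{\|u\|_2^2}{\sigma^4} - \frac{D}{\sigma^2}\right)\gamma_\sigma(u), \\
\frac{\partial}{\partial\sigma}\gamma_\sigma(u) &= \left(\frac{\|u\|_2^2}{\sigma^3} - \frac{D}{\sigma}\right)\gamma_\sigma(u) = \sigma\,\Delta_u \gamma_\sigma(u).
\end{align*}
The last identity is what produces a heat-like equation after differentiating the convolution representation of $p_t$ under the integral sign.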
In this appendix, we differentiate under the integral sign many times, and it is important to know when we can do this. There are two kinds of differentiating under the integral sign:
Differentiating an integral with respect to the auxiliary parameter .
Differentiating a convolution with respect to the variable .
For the first category, we give a concrete result, stated without proof but attributable to the linked source, which derives the following result as a special case of a more general theorem about the interaction of differential operators and tempered distributions, much beyond the scope of the book. A full formal reference can be found in [Jon82].
Let be such that:
is a jointly measurable function of ;
For Lebesgue-almost every , the function is absolutely continuous;
is locally integrable, i.e., for every it holds
(B.2.133) |
Then is an absolutely continuous function on , and its derivative is
(B.2.134) |
defined for almost every .
For the second category, we give another concrete result, stated without proof but fully formalized in [BB11].
Let $\varphi$ be $k$-times continuously differentiable with compact support, and let $g$ be locally integrable. Then the convolution $\varphi * g$ defined by
\[ (\varphi * g)(u) := \int_{\mathbb{R}^D} \varphi(u - v)\, g(v)\, \mathrm{d}v \tag{B.2.135} \]
is $k$-times continuously differentiable, and its derivative of any order $\alpha$ with $|\alpha| \le k$ is
\[ \partial^{\alpha}(\varphi * g) = (\partial^{\alpha}\varphi) * g. \tag{B.2.136} \]
Although not in that book, a simple integration by parts argument shows that if $g$ is also $k$-times differentiable, then we can "trade off" the regularity:
\[ \partial^{\alpha}(\varphi * g) = \varphi * (\partial^{\alpha} g). \tag{B.2.137} \]
In this section, we prove Theorem 3.6. Following our conventions throughout this appendix, we write $S$ for the compact support of the random variable $x_0$.
As foreshadowed, we will make a regularity assumption on the support set $S$ to prove Theorem 3.6. One possibility for proceeding under minimal assumptions would be to instantiate the results of [RBK18, RKB23] in our setting, since these results apply to sets with very low regularity (e.g., Cantor-like sets with fractal structure). However, we have found precisely computing the constants in these results, a necessary endeavor to assert a conclusion like Theorem 3.6, to be somewhat onerous in our setting. Our approach is therefore to add a geometric regularity assumption on the set $S$ that sacrifices some generality, but allows us to develop a more transparent argument. To avoid sacrificing too much generality, we must ensure that low-dimensionality of the set is not prohibited. We therefore consider the running example we have used throughout the book, the mixture of low-rank Gaussian distributions. In this geometric setting, we enforce this by assuming that $S$ is a union of hyperspheres, which in high dimensions is equivalent to the Gaussian assumption with overwhelming probability.
The support $S$ of the random variable $x_0$ is a finite union of spheres, each of dimension lower than the ambient dimension $D$. The probability that $x_0$ is drawn from the $k$-th sphere is given by a mixing weight $\pi_k$, and conditional on being drawn from the $k$-th sphere, $x_0$ is uniformly distributed on that sphere. The spheres are mutually orthogonal to one another.
We proceed under the simplifying Assumption B.3 in order to avoid excessive technicality, and to connect to an important running example used throughout the monograph. We believe our results can be generalized to supports from the class of sets with positive reach with additional technical effort, but we leave this for the future.
We briefly sketch the proof, then proceed to establishing three fundamental lemmas, then give the proof. The proof will depend on notions introduced in the sketch below.
Obtaining an upper bound on the rate distortion function (3.3.3) is straightforward: by the rate characterization (i.e., the rate distortion function is the minimum rate of a code for $x_0$ achieving a given expected squared distortion), upper bounding it only requires demonstrating one code for $x_0$ that achieves the target distortion, and any $\epsilon$-covering of $S$ achieves this, with rate equal to the base-$2$ logarithm of the cardinality of the covering. The lower bound is more subtle. We make use of the Shannon lower bound, discussed in Remark 3.7: working out the constants in [LZ94, §III, (22)] gives a more precise version of the result quoted in Equation 3.3.10 (in bits, of course): for any random variable with compact support and a density, it holds
(B.3.1) |
with entropy (etc.) in nats in this expression. The constant can be easily estimated using Stirling’s approximation. A quantitative form of Stirling’s approximation which is often useful gives for any [Jam15]
(B.3.2) |
We will apply this bound to in Equation B.3.1. We get
(B.3.3) | ||||
(B.3.4) |
which we can take for the explicit value of the constant in Equation 3.3.10. Summarizing the fully quantified Shannon lower bound (in bits):
(B.3.5) |
Now, the important constraint for our current purposes is that the Shannon lower bound requires the random variable to have a density, which rules out many low-dimensional distributions of interest. But let us momentarily consider the situation where $x_0$ does admit a density. The assumption that $x_0$ is uniformly distributed on its support $S$ is easily formalized in this setting: for any Borel set $A \subseteq \mathbb{R}^D$, we have
\[ \mathbb{P}(x_0 \in A) = \frac{\operatorname{vol}(A \cap S)}{\operatorname{vol}(S)}. \tag{B.3.6} \]
Then the entropy is just
\[ h(x_0) = \log \operatorname{vol}(S). \tag{B.3.7} \]
The proof then concludes with a lemma that relates the ratio to the -covering number of by balls.
To extend the program above to degenerate distributions satisfying Assumption B.3, our proof of the lower bound in Theorem 3.6 will leverage an approximation of the actual low-dimensional distribution by "nearby" distributions which have densities, similar to (but not exactly the same as) the proof sketch preceding Theorem B.1. We will then link the thickening parameter $\varepsilon$ introduced in the approximating sequence to the distortion parameter in order to obtain the desired conclusion in Theorem 3.6.
Let $S \subseteq \mathbb{R}^D$ be a compact set. For any $\varepsilon > 0$, define the $\varepsilon$-thickening of $S$, denoted $S_\varepsilon$, by
\[ S_{\varepsilon} := \left\{ u \in \mathbb{R}^D : \operatorname{dist}(u, S) \le \varepsilon \right\}. \tag{B.3.8} \]
The distance function referenced in Definition B.1 is defined by
\[ \operatorname{dist}(u, S) := \inf_{v \in S} \|u - v\|_2. \tag{B.3.9} \]
For a compact set $S$, Weierstrass's theorem implies that for any $u$, there is always some $v \in S$ attaining the infimum in the distance function. Compactness of $S_\varepsilon$ follows readily from compactness of $S$, so $\operatorname{vol}(S_\varepsilon)$ is finite for any $\varepsilon > 0$. It is then possible to make the following definition of a thickened random variable, specialized to Assumption B.3.
Let $x_0$ be a random variable whose support $S$ is a union of hyperspheres, distributed as in Assumption B.3. Denote the support of the $k$-th component of the mixture by $S^k$. Define the thickened random variable $x_\varepsilon$ as the mixture of measures where the $k$-th component measure is uniform on the thickened set $S^k_\varepsilon$ (Definition B.1), with the same mixing weights $\pi_k$.
Suppose the random variable satisfies Assumption B.3. Then if , the thickened random variable (Definition B.2) satisfies for any
(B.3.10) |
The proof of Lemma B.6 is deferred to Section B.3.2. Using Lemma B.6, the above program can be realized, because the random variable $x_\varepsilon$ has a density that is uniform with respect to the Lebesgue measure.
The upper bound is readily shown. If $C$ is any $\epsilon$-cover of the support $S$ of $x_0$ with cardinality $N$, then consider the coding scheme assigning to each $x_0$ the reconstruction given by the nearest element of $C$, with ties broken arbitrarily. Then ties occur with probability zero, and the fact that $C$ covers $S$ at scale $\epsilon$ guarantees expected squared distortion no larger than $\epsilon^2$; the rate of this scheme is $\log_2 N$.
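The covering-based code is easy to simulate; the following sketch (a hypothetical instance, not from the text) takes $S$ to be the unit circle in $\mathbb{R}^2$, builds an $\epsilon$-cover from equally spaced points, and confirms that the nearest-codeword scheme attains expected squared distortion at most $\epsilon^2$ at rate $\log_2 |C|$.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.05
# Equally spaced points on the circle with adjacent spacing at most eps form an eps-cover.
n_codewords = int(np.ceil(np.pi / np.arcsin(eps / 2)))
angles = 2 * np.pi * np.arange(n_codewords) / n_codewords
codebook = np.stack([np.cos(angles), np.sin(angles)], axis=1)

theta = rng.uniform(0, 2 * np.pi, size=10_000)             # samples x_0 ~ Uniform(S)
x0 = np.stack([np.cos(theta), np.sin(theta)], axis=1)
nearest = np.argmin(((x0[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
distortion = ((x0 - codebook[nearest]) ** 2).sum(-1).mean()
print(f"rate = {np.log2(n_codewords):.2f} bits,  distortion = {distortion:.2e}  <=  eps^2 = {eps ** 2:.2e}")
```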
For the lower bound, let $\varepsilon > 0$ be a thickening parameter to be chosen below, and consider the thickened random variable $x_\varepsilon$. By Lemma B.6, we have
(B.3.11) |
Since $x_\varepsilon$ has a Lebesgue density that is uniform, we can then apply the Shannon lower bound, in the form (B.3.5), to get
(B.3.12) |
Finally, we need to lower bound the ratio
(B.3.13) |
in terms of the covering number. Since , where here denotes the Minkowski sum, a standard application of volume bound arguments (see e.g. [Ver18, Proposition 4.2.12]) gives
(B.3.14) |
Hence
(B.3.15) | ||||
(B.3.16) |
Choosing gives from the Shannon lower bound (B.3.12) and the above estimates
(B.3.17) |
as was to be shown.
∎
It suffices to show that any code for $x_0$ with a given expected squared distortion produces a code for $x_\varepsilon$ with the same rate and distortion not much larger, for a suitable choice of $\varepsilon$. So fix such a code for $x_0$, achieving rate $R$ and the given expected squared distortion. We write $\hat{x}_0$ for the reconstructed random variable using this code, and $\mathcal{C}$ for the associated encoding-decoding mapping (i.e., $\hat{x}_0 = \mathcal{C}(x_0)$).
Now let $S^k$ denote the $k$-th hypersphere in the support of $x_0$. There is an orthonormal basis such that $S^k$ is contained in the span of a subset of its elements. The following orthogonal decomposition of the support set will be used repeatedly throughout the proof. We have
(B.3.18) | ||||
(B.3.19) |
By orthogonal projection, any point can be written as the sum of its component in the subspace containing $S^k$ and its component orthogonal to that subspace. Then, for any point of $S^k$, we have
(B.3.20) | ||||
(B.3.21) | ||||
(B.3.22) |
Further, it is known that for any nonzero ,
(B.3.23) |
If is zero, it is clear that the above distance is equal to for every . Hence, if we define a projection mapping by
(B.3.24) |
for any with , then . We choose , so that the thickened set contains no points at which any of the projection maps is not well-defined. So the thickened set satisfies
(B.3.25) |
These distances can be rewritten in terms of the orthogonal decomposition as
(B.3.26) | ||||
(B.3.27) | ||||
(B.3.28) |
We are going to show next that every such can be uniquely associated to a projection onto a single subspace in the mixture, which will allow us to define a corresponding projection onto . Given a , by the above, we can find a subspace such that the orthogonal decomposition satisfies
(B.3.29) |
Consider the decomposition for some . We have
(B.3.30) | ||||
(B.3.31) | ||||
(B.3.32) |
where the second line uses the orthogonality assumption on the subspaces , and the third uses the fact that orthogonal projections are nonexpansive. Hence, the -th distance satisfies
(B.3.33) |
This implies that, if the thickening parameter $\varepsilon$ is small enough, every point of the thickened set has a unique closest subspace in the mixture. Hence, under this condition, the following mapping is well-defined:
(B.3.34) |
Now, we define a code for $x_\varepsilon$ by
(B.3.35) |
Clearly this is associated to a rate-$R$ code for $x_\varepsilon$, because it uses the encoding-decoding mappings from the rate-$R$ code for $x_0$. We have to show that it achieves small distortion. We calculate
(B.3.36) | ||||
(B.3.37) |
where the inequality uses the Minkowski inequality. Now, by Definitions B.1 and B.2, we have deterministically
(B.3.38) |
so the expectation also satisfies this estimate. For the second term, it will suffice to characterize the density of the random variable as being sufficiently close to the density of —which, as Assumption B.3 implies, is a mixture of uniform distributions on each sub-sphere . By the argument above, every point can be associated to one and only one subspace , which means that the mixture components in the definition of (recall Definition B.2) do not overlap. Hence, the density can be characterized by studying the effect of on the conditional random variable , conditioned on being drawn from . Denote this measure by . We claim that the pushforward of this measure under is uniform on . To see that this holds, we recall Equation B.3.28, which gives the characterization
(B.3.39) |
The conditional distribution in question is uniform on this set; we need to show that the projection applied to this conditional random variable yields a random variable that is uniform on . With respect to these coordinates, we have seen that . Hence, for any , we have that the preimage of in under is
(B.3.40) |
To show that the pushforward is uniform, we need to decompose the integral of the uniform density on the thickened component in a way that makes it clear that each of the fibers (one for each point of the sphere) "contributes" equally to the integral. (More rigorously, this corresponds to decomposing the uniform density on the thickened component into a regular conditional density associated with the projection, and showing that the corresponding density on the sphere is uniform. The proof makes it clear that this is true.) We have by Definition B.2
(B.3.41) |
In particular, the integration over the orthogonal coordinates factors. Let denote the uniform (Haar) measure on the sphere of radius in . Converting the integral to polar coordinates, we have
(B.3.42) |
Comparing to the fiber representation (B.3.40), we see that we need to “integrate out” over the and components of the preceding integral in order to verify that the pushforward is uniform. But this is evident, as the previous expression shows that the value of this integral is independent of —or, equivalently in context, the value of the spherical component .
Thus it follows from the above argument that is uniform. Because the assumption on implies that the mixture components in the distribution of do not overlap, the mixing weights are also preserved in the image , and in particular, the distribution of is equal to the distribution of . Hence the second term in Equation B.3.37 satisfies
(B.3.43) |
because is a distortion- code for .
We have thus shown that the hypothesized rate-$R$ code for $x_0$ with the given expected squared distortion produces a rate-$R$ code for $x_\varepsilon$ with only slightly larger expected squared distortion. This establishes that
(B.3.44) |
as was to be shown.
∎