In this section

Appendix B
Entropy, Diffusion, Denoising, and Lossy Coding

The increase of disorder or entropy with time is one example of what is called an arrow of time, something that distinguishes the past from the future, giving a direction to time.

\(~\) – A Brief History of Time, Stephen Hawking

In this appendix we provide proofs for several facts, mentioned in Chapter 3, which are related to differential entropy, how it evolves under diffusion processes, and its connections to lossy coding. We will make the following mild assumption about the random variable representing the data source, denoted \(\vx \).

Assumption B.1. \(\vx \) is supported on a compact set \(\cS \subseteq \R ^{D}\) of radius at most \(R\), i.e., \(R \doteq \sup _{\vxi \in \cS }\norm {\vxi }_{2}\).

In particular, since compact sets in Euclidean space are bounded, it holds \(R < \infty \). We will consistently use the notation \(B_{r}(\vxi ) \doteq \{\vu \in \R ^{D} \colon \norm {\vxi - \vu }_{2} \leq r\}\) to denote the Euclidean ball of radius \(r\) centered at \(\vxi \). In this sense, Assumption B.1 has \(\cS \subseteq B_{R}(\vzero )\).

Notice that this assumption holds for (almost) all variables we care about in practice, as it is (often) imposed by a normalization step during data pre-processing.

B.1 Differential Entropy of Low-Dimensional Distributions

In this short appendix we discuss the differential entropy of low-dimensional distributions. By definition, the differential entropy of a random variable \(\vx \) which does not have a density is \(-\infty \); this includes all random variables supported on low-dimensional sets. The objective of this section is to discuss why this is a “morally correct” value.

In fact, let \(\vx \) be any random variable such that Assumption B.1 holds, the support \(\cS \) of \(\vx \) has \(0\) volume.1 We will consider the case that \(\vx \) is uniform on \(\cS \).2 Our goal is to compute \(h(\vx )\).

In this case, \(\vx \) would not have a density; in the counterfactual world where we did not know \(h(\vx ) = -\infty \), we could not directly define it using the standard definition of differential entropy. Instead, as in the rest of analysis and information theory it would be reasonable to consider the limit of entropies of successively better approximations \(\vx _{\eps }\) of \(\vx \) which have densities, i.e.,

\begin{equation} h(\vx )\ \text {``=''}\ \lim _{\eps \searrow 0}h(\vx _{\eps }). \tag{B.1.1} \end{equation}

To this end, the basic idea is to take an \(\eps \)-thickening of \(\cS \), say \(\cS _{\eps }\) defined as

\begin{equation} \cS _{\eps } = \bigcup _{\vxi \in \cS }B_{\eps }(\vxi ) \tag{B.1.2} \end{equation}

and visualized in Figure B.1.

[Picture]

Figure B.1: Illustration of the \(\eps \)-thickening \(\cS _\eps \) of a curve \(\cS \subseteq \R ^{2}\).

We will work with random variables whose support is \(\cS _{\eps }\), which is fully-dimensional, and take the limit as \(\eps \to 0\). Indeed, define \(\vx _{\eps } \sim \dUnif (\cS _{\eps })\). Since \(\cS _{\eps }\) has positive volume, \(\vx _{\eps }\) has a density \(p_{\eps }\) equal to

\begin{equation} p_{\eps }(\vxi ) = \indvar (\vxi \in \cS _{\eps }) \cdot \frac {1}{\volume (\cS _{\eps })}. \tag{B.1.3} \end{equation}

Computing the entropy of \(\vx _{\eps }\) using the convention that \(0 \log 0 = 0\), it holds

\begin{align} h(\vx _{\eps }) &= -\int _{\R ^{D}}p_{\eps }(\vxi )\log p_{\eps }(\vxi )\odif {\vxi } \tag{B.1.4} \\ &= -\int _{\cS _{\eps }}\frac {1}{\volume (\cS _{\eps })} \log \rp {\frac {1}{\volume (\cS _{\eps })}}\odif {\vxi } \tag{B.1.5} \\ &= \frac {\log (\volume (\cS _{\eps }))}{\volume (\cS _{\eps })}\int _{\cS _{\eps }}\odif {\vxi } \tag{B.1.6} \\ &= \log (\volume (\cS _{\eps })). \tag{B.1.7} \end{align}

Since \(\cS \) is compact \(\volume (\cS _{\eps })\) is finite and tends to \(0\) as \(\eps \searrow 0\). Thus

\begin{equation} h(\vx ) = \lim _{\eps \searrow 0}h(\vx _{\eps }) = \lim _{\eps \searrow 0}\log (\volume (\cS _{\eps })) = -\infty , \tag{B.1.8} \end{equation}

as desired.

The above calculation is actually a corollary of a much more famous and celebrated set of results about the maximum possible entropy of \(\vx \) subject to certain constraints on the distribution of \(\vx \). We would be remiss to not provide the results here; the proofs are provided in Chapter 2 of [PW22], for example.

Theorem B.1. Let \(\vx \) be a random variable on \(\R ^{D}\).

1.
If \(\vx \) is supported on a compact set \(\cS \subseteq \R ^{D}\) (i.e., Assumption B.1) then
\begin{equation} h(\vx ) \leq h(\dUnif (\cS )) = \log \volume (\cS ). \tag{B.1.9} \end{equation}
2.
If \(\vx \) has finite covariance such that, for a PSD matrix \(\vSigma \in \PSD (D)\), it holds \(\Cov (\vx ) \preceq \vSigma \) (w.r.t. the PSD ordering, i.e., \(\vSigma - \Cov (\vx )\) is PSD), then
\begin{equation} h(\vx ) \leq h(\dNorm (\vzero , \vSigma )) = \frac {1}{2}\log ((2\pi e)^{D}\det \vSigma ). \tag{B.1.10} \end{equation}
3.
If \(\vx \) has finite second moment such that, for a constant \(a \geq 0\), it holds \(\Ex \norm {\vx }_{2}^{2} \leq a\), then
\begin{equation} h(\vx ) \leq h\rp {\dNorm \rp {\vzero , \frac {a}{D}\vI }} = \frac {D}{2}\log \frac {2\pi e a}{D}. \tag{B.1.11} \end{equation}

B.2 Diffusion and Denoising Processes

In the main body (Chapter 3), we considered a random variable \(\vx \), and a stochastic process defined by (3.2.1), i.e.,

\begin{equation}\tag{B.2.1}\label {eq:app_additive_gaussian_noise_model} \vx _{t} = \vx + t\vg , \qquad \forall t \in [0, T] \end{equation}

where \(\vg \sim \dNorm (\vzero , \vI )\) independently of \(\vx \).

The structure of this section is as follows. In Section B.2.1 we provide a formal theorem and crisp proof which shows that under Equation (B.2.1) the entropy increases, i.e., \(\frac{\mathrm{d}}{\mathrm{d}t} h(\vx _{t}) > 0\). In Section B.2.2 we provide a formal theorem and crisp proof which shows that under Equation (B.2.1), the entropy decreases during denoising, i.e., \(h(\Ex [\vx _{s} \given \vx _{t}]) < h(\vx _{t})\) for all \(s < t\). In Section B.2.3 we provide proofs for technical lemmas that are needed to establish the claims in the previous subsections.

Before we start, we introduce some key notations. First, let \(\phi _{t}\) be the density of \(\dNorm (\vzero , t^{2}\vI )\), i.e.,

\begin{equation}\tag{B.2.2}\label {eq:gaussian_noise_time_t} \phi _{t}(\vxi ) \doteq \frac {1}{(2\pi )^{D/2}t^{D}}\exp \rp {-\frac {\norm {\vxi }_{2}^{2}}{2t^{2}}}. \end{equation}

Next, \(\vx _{t}\) is supported on all of \(\R ^{D}\), so it has a density, which we denote \(p_{t}\) (as in the main body). A quick calculation shows that

\begin{equation}\tag{B.2.3}\label {eq:p_t_representation} p_{t}(\vxi ) = \Ex [\phi _{t}(\vxi - \vx )], \end{equation}

and from this representation it is possible to deduce (i.e., from Proposition B.4) that \(p_{t}\) is smooth (i.e., infinitely differentiable) in \(\vxi \), also smooth in \(t\), and positive everywhere. This fact is somewhat remarkable at first sight: even for a completely irregular random variable \(\vx \) (say, a Bernoulli random variable, which does not have a density), its Gaussian smoothing admits a density for every (arbitrarily small) \(t > 0\). The proof is left as an exercise for readers well-versed in mathematical analysis.

However, we also need to add an assumption about the smoothness of the distribution of \(\vx \), which will eliminate some technical problems that occur around \(t = 0\) with low-dimensional distributions.3 Despite this, we expect that our results hold under milder assumptions with additional work. For now, let us assume:

Assumption B.2. \(\vx \) has a twice continuously differentiable density, denoted \(p\).

B.2.1 Diffusion Process Increases Entropy Over Time

In this section we provide a proof of Theorem B.2. For convenience, we restate it as follows.

Theorem B.2 (Diffusion Increases Entropy). Let \(\vx \) be any random variable such that Assumptions B.1 and B.2 hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). Then

\begin{equation}\tag{B.2.4}\label {eq:diffusion_entropy_increases} h(\vx _{s}) < h(\vx _{t}), \qquad \forall s, t \colon 0 \leq s < t \leq T. \end{equation}

Proof. Before we start, we must ask: when does the inequality in (B.2.4) make sense? We will show in Lemma B.1 that under our assumptions, the differential entropy is well-defined, is never \(+\infty \), and for \(t > 0\) is finite, so the (strict) inequality in (B.2.4) makes sense.

The question of well-definedness aside, the crux of this proof is to show that the density \(p_{t}\) of \(\vx _{t}\) satisfies a particular partial differential equation, which is very similar to the heat equation. The heat equation is a famous PDE which describes the diffusion of heat through space. This intuitively should make sense, and paints a mental picture: as the time \(t\) increases, the probability from the original (perhaps tightly concentrated) \(\vx \) disperses across all of \(\R ^{D}\) like heat radiating from a source in a vacuum.

Such PDEs for \(p_{t}\), known as Fokker-Planck equations for more general stochastic processes, are very powerful tools, as they allow us to describe the instantaneous temporal derivatives of \(p_{t}\) in terms of the instantaneous spatial derivatives of \(p_{t}\), and vice versa, providing a concise description of the regularity and dynamics of \(p_{t}\). Once we obtain dynamics for \(p_{t}\), we can then use the system to obtain another one which describes the dynamics of \(h(\vx _{t})\), which after all is just a functional of \(p_{t}\).

The description of the PDE involves a mathematical object called the Laplacian \(\Delta \). Recall from your multivariable calculus class that the Laplacian operating on a differentiable-in-time and twice-differentiable-in-space function \(f \colon (0, T) \times \R ^{D} \to \R \) is given by

\begin{equation} \Delta f_{t}(\vxi ) = \tr (\nabla ^{2}f_{t}(\vxi )) = \sum _{i = 1}^{D}\pddv {f_{t}}{\xi _{i}}(\vxi ). \tag{B.2.5} \end{equation}

Namely, from using the integral representation of \(p_{t}\) and differentiating under the integral, we can compute the derivatives of \(p_{t}\) (which we do in Proposition B.1) and observe that \(p_{t}\) satisfies the heat-like PDE

\begin{equation} \pdv {p_{t}}{t}(\vxi ) = t\Delta p_{t}(\vxi ). \tag{B.2.6} \end{equation}
Then for finding the dynamics of \(h(\vx _{t})\), we can use Proposition B.3 again as well as the heat-like PDE to get
\begin{align} \frac{\mathrm{d}}{\mathrm{d}t} h(\vx _{t}) &= -\frac{\mathrm{d}}{\mathrm{d}t} \int _{\R ^{D}}p_{t}(\vxi )\log p_{t}(\vxi )\odif {\vxi } \tag{B.2.7} \\ &= -\int _{\R ^{D}}\frac{\partial}{\partial t} \bs {p_{t}(\vxi )\log p_{t}(\vxi )}\odif {\vxi } \tag{B.2.8} \\ &= -\int _{\R ^{D}}\pdv {p_{t}}{t}(\vxi )[1 + \log p_{t}(\vxi )]\odif {\vxi } \tag{B.2.9} \\ &= -t\int _{\R ^{D}}\Delta p_{t}(\vxi )[1 + \log p_{t}(\vxi )]\odif {\vxi }. \tag{B.2.10} \end{align}

By using a slightly involved integration by parts argument (Lemma B.2), we obtain

\begin{align} \frac{\mathrm{d}}{\mathrm{d}t} h(\vx _{t}) &= t\int _{\R ^{D}}\ip {\nabla \log p_{t}(\vxi )}{\nabla p_{t}(\vxi )}\odif {\vxi } \tag{B.2.11} \\ &= t\int _{\R ^{D}}\frac {\norm {\nabla p_{t}(\vxi )}_{2}^{2}}{p_{t}(\vxi )}\odif {\vxi } \tag{B.2.12} \\ &> 0 \tag{B.2.13} \end{align}

where strict inequality holds in the last line because, for it to not hold, \(\nabla p_{t}(\vxi )\) would need to vanish almost everywhere (i.e., everywhere except possibly on a set of zero volume), but this would imply that \(p_{t}\) would be constant almost everywhere, a contradiction with a fact that \(p_{t}\) is a density.

To complete the proof we just use the fundamental theorem of calculus

\begin{equation} h(\vx _{t}) = h(\vx _{s}) + \int _{s}^{t}\frac{\mathrm{d}}{\mathrm{d}u} h(\vx _{u})\odif {u} > h(\vx _{s}), \tag{B.2.14} \end{equation}
which proves the claim. (Note that this does not make sense when \(h(\vx _{s}) = -\infty \), which can only happen when \(s = 0\) and \(h(\vx ) = -\infty \), but in this case \(h(\vx _{t}) > -\infty \) so the claim is vacuously true anyways.) □

B.2.2 Denoising Process Reduces Entropy Over Time

Recall that in Section 3.2.1 we start with the random variable \(\vx _{T}\) and iteratively denoise it using iterations of the form

\begin{equation}\tag{B.2.15}\label {eq:app_denoising_iteration} \hat {\vx }_{s} \doteq \Ex [\vx _{s} \mid \vx _{t} = \hat {\vx }_{t}] = \frac {s}{t}\hat {\vx }_{t} + \bp {1 - \frac {s}{t}}\bar {\vx }^{\ast }(t, \hat {\vx }_{t}). \end{equation}

for \(s, t \in \{t_{0}, t_{1}, \dots , t_{L}\}\) with \(s < t\) and \(\vx _{T} = \hat {\vx }_{T}\). We wish to prove that \(h(\hat {\vx }_{s}) < h(\hat {\vx }_{t})\), showing that the denoising process actually reduces the entropy.

Before we go about doing this, we make several remarks about the problem statement. First, Tweedie’s formula (3.2.23) says that

\begin{equation} \bar {\vx }^{\ast }(t, \vx _{t}) = \vx _{t} + t^{2}\nabla p_{t}(\vx _{t}), \tag{B.2.16} \end{equation}

which likens a full denoising step from time \(t\) to time \(0\) to a gradient step on the log-density of \(\vx _{t}\). Can we get a similar result for the full denoising step from time \(t\) to time \(s\) in (B.2.15)? It turns out that indeed we can, and it is pretty simple. By using (B.2.15) and Tweedie’s formula (3.2.23), we obtain

\begin{equation}\tag{B.2.17}\label {eq:app_denoising_iteration_score} \Ex [\vx _{s} \mid \vx _{t}] = \frac {s}{t}\vx _{t} + \bp {1 - \frac {s}{t}}\bp {\vx _{t} + t^{2}\nabla _{\vx _{t}}\log p_{t}(\vx _{t})} = \vx _{t} + \bp {1 - \frac {s}{t}}t^{2}\nabla _{\vx _{t}}\log p_{t}(\vx _{t}). \end{equation}

So this iterative denoising step is again a gradient step on the perturbed log-density \(\log p_{t}\) with a shrunken step size. In particular, this step can be seen as a perturbation of the distribution of the random variable \(\vx _{t}\) by the score function vector field, suggesting a connection to stochastic differential equations (SDEs) and the theory of diffusion models [SSK+21]. Indeed, a proof of the following result Theorem B.3 can be developed using this powerful machinery and a limiting argument (e.g., following the technical approach in the exposition of [CCL+23]). We will give a simpler proof here, which will use only elementary tools and thereby illuminate some of the key quantities behind the process of entropy reduction via denoising. On the other hand, we will need to deal with some slightly technical calculations due to the fact that the denoising process in Theorem B.3 does not correspond exactly to the reverse process associated to the noise addition process that generates the observation \(\vx _{t}\).4

We want to prove that \(h(\Ex [\vx _{s} \mid \vx _{t}]) < h(\vx _{t})\), i.e., formally:

Theorem B.3. Let \(\vx \) be any random variable such that Assumptions B.1 and B.2 hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). For each \(t > 0\), write

\begin{equation} J(p_{t}) \doteq \int _{\R ^{D}}\frac {\norm {\nabla p_{t}(\vxi )}_{2}^{2}}{p_{t}(\vxi )}\odif {\vxi }, \qquad U_{t} \doteq \max \rp {\frac {D}{t^{2}}, \left\lvert \frac {R^{2}}{t^{4}} - \frac {D}{t^{2}}\right\rvert} \tag{B.2.18} \end{equation}
for the Fisher information of \(p_{t}\) and a uniform bound on \(\abs {\Delta \log p_{t}}\), respectively. Then
\begin{equation} h(\Ex [\vx _{s} \mid \vx _{t}]) < h(\vx _{t}) \tag{B.2.19} \end{equation}
for all \(0 \leq s < t \leq T\) such that
\begin{equation}\tag{B.2.20}\label {eq:conditioning_entropy_condition} \bp {1 - \frac {s}{t}}t^{2}U_{t}^{2} \exp \bp {\bp {1 - \frac {s}{t}}t^{2}U_{t}} < 2J(p_{t}). \end{equation}

Proof. This proof uses two main ideas:

1.
First, write down a density for \(\Ex [\vx _{s} \mid \vx _{t}]\) using a change-of-variables formula.
2.
Second, bound this density to control the entropy.

The change of variables is justified by Corollary B.1, which was originally derived in [Gri11].

We execute these ideas now. From Corollary B.1, we obtain that the function \(\bar {\vx }\) defined as \(\bar {\vx }(\vxi ) \doteq \Ex [\vx _{s} \given \vx _{t} = \vxi ]\) is differentiable, injective, and thus invertible on its range, which we henceforth denote \(\cX \subseteq \R ^{D}\). We denote its inverse as \(\bar {\vx }^{-1}\). Using a change-of-variables formula, the density \(\bar {p}\) of \(\bar {\vx }(\vx _{t})\) is given by

\begin{equation} \bar {p}(\vxi ) \doteq \frac {(p_{t} \circ \bar {\vx }^{-1})(\vxi )}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))}, \tag{B.2.21} \end{equation}
where (recall, from Section A.2) \(\bar {\vx }^{\prime }\) is the Jacobian of \(\bar {\vx }\). Since from Lemma B.3 we know \(\bar {\vx }^{\prime }\) is a positive definite matrix, the determinant is positive and so the whole density is positive. Then it follows that
\begin{align} h(\bar {\vx }(\vx _{t})) &= -\int _{\cX }\frac {(p_{t} \circ \bar {\vx }^{-1})(\vxi )}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))} \log \frac {(p_{t} \circ \bar {\vx }^{-1})(\vxi )}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))} \odif {\vxi } \tag{B.2.22} \\ &= -\int _{\cX }\frac {(p_{t} \circ \bar {\vx }^{-1})(\vxi )}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))} \log ((p_{t} \circ \bar {\vx }^{-1})(\vxi ))\odif {\vxi } \tag{B.2.23} \\ &\qquad + \int _{\cX }\frac {(p_{t} \circ \bar {\vx }^{-1})(\vxi )}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))}\log \det \rp {\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi ))}\odif {\vxi } \tag{B.2.24} \\ &= -\int _{\R ^{D}}p_{t}(\vxi )\log p_{t}(\vxi )\odif {\vxi } + \int _{\cX }\frac {(p_{t} \circ \bar {\vx }^{-1})(\vxi )}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))}\log \det \rp {\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi ))}\odif {\vxi } \tag{B.2.25} \\ &= h(\vx _{t}) - \int _{\cX }\frac {(p_{t} \circ \bar {\vx }^{-1})(\vxi )}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))}\log \rp {\frac {1}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))}}\odif {\vxi }. \tag{B.2.26} \end{align}

We will study the last term (including the \(-\)), and show that it is negative.

By concavity, one has \(-x\log x \leq 1 - x\) for every \(x \geq 0\). Hence

\begin{align} h(\bar {\vx }(\vx _{t})) - h(\vx _{t}) &= - \int _{\cX }\frac {(p_{t} \circ \bar {\vx }^{-1})(\vxi )}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))}\log \rp {\frac {1}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))}}\odif {\vxi } \tag{B.2.27} \\ &\leq \int _{\cX }(p_{t} \circ \bar {\vx }^{-1})(\vxi )\cdot \bp {1 - \frac {1}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))}}\odif {\vxi } \tag{B.2.28} \\ &= \int _{\cX }(p_{t} \circ \bar {\vx }^{-1})(\vxi )\odif {\vxi } - \int _{\cX }\frac {(p_{t} \circ \bar {\vx }^{-1})(\vxi )}{\det (\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi )))}\odif {\vxi } \tag{B.2.29} \\ &= \int _{\R ^{D}}p_{t}(\vxi )\det \rp {\bar {\vx }^{\prime }(\bar {\vx }^{-1}(\vxi ))}\odif {\vxi } - \int _{\cX }\bar {p}(\vxi )\odif {\vxi } \tag{B.2.30} \\ &= \int _{\R ^{D}}p_{t}(\vxi )\det \rp {\vI + \bp {1 - \frac {s}{t}}t^{2}\nabla ^{2}\log p_{t}(\vxi )}\odif {\vxi } - 1. \tag{B.2.31} \end{align}

Now, by the AM-GM inequality on eigenvalues, we have for any symmetric positive definite matrix \(\vM \in \PSD (D)\) the bound

\begin{equation} \det (\vM )^{1/D} = \prod _{i = 1}^{D}\lambda _{i}(\vM )^{1/D} \leq \frac {\sum _{i = 1}^{D}\lambda _{i}(\vM )}{D} = \frac {\tr (\vM )}{D}, \tag{B.2.32} \end{equation}
which we can apply to the above expression and obtain
\begin{align} &\int _{\R ^{D}}p_{t}(\vxi )\det \rp {\vI + \bp {1 - \frac {s}{t}}t^{2}\nabla ^{2}\log p_{t}(\vxi )}\odif {\vxi } \tag{B.2.33} \\ &\leq \int _{\R ^{D}} p_{t}(\vxi ) \tr \rp {\frac {1}{D}\bs {\vI + \bp {1 - \frac {s}{t}}t^{2}\nabla ^{2}\log p_{t}(\vxi )}}^{D}\odif {\vxi } \tag{B.2.34} \\ &= \int _{\R ^{D}} p_{t}(\vxi )\bp {1 + \frac {\bp {1 - \frac {s}{t}}t^{2}}{D}\tr (\nabla ^{2}\log p_{t}(\vxi ))}^{D}\odif {\vxi } \tag{B.2.35} \\ &= \int _{\R ^{D}} p_{t}(\vxi )\bp {1 + \frac {\bp {1 - \frac {s}{t}}t^{2}}{D}\Delta \log p_{t}(\vxi )}^{D}\odif {\vxi }. \tag{B.2.36} \end{align}

Since \(\bar {\vx }^{\prime }(\vxi )\) is positive definite (Corollary B.1), we have \(1 + \frac {(1 - s/t)t^{2}}{D}\Delta \log p_{t}(\vxi ) = \frac {\tr (\bar {\vx }^{\prime }(\vxi ))}{D} > 0\). By the inequality \(\log (1 + x) \leq x\) for \(x > -1\),

\begin{equation} \bp {1 + \frac {\bp {1 - \frac {s}{t}}t^{2}}{D}\Delta \log p_{t}(\vxi )}^{D} \leq \exp \bp {\bp {1 - \frac {s}{t}}t^{2}\Delta \log p_{t}(\vxi )}. \tag{B.2.37} \end{equation}
From Lemma B.5 (where, recall, \(R\) is the radius of the support of \(\vx \) as in Assumption B.1),
\begin{equation} \abs {\Delta \log p_{t}(\vxi )} \leq \max \rp {\frac {D}{t^{2}}, \left\lvert \frac {R^{2}}{t^{4}} - \frac {D}{t^{2}}\right\rvert} =: U_{t}. \tag{B.2.38} \end{equation}
Setting \(\varepsilon \doteq \bp {1 - \frac {s}{t}}t^{2}U_{t}\), we have \(\bp {1 - \frac {s}{t}}t^{2}\Delta \log p_{t}(\vxi ) \leq \varepsilon \) for all \(\vxi \). By Taylor’s theorem with Lagrange remainder, for any \(c \leq \varepsilon \),
\begin{equation} e^{c} \leq 1 + c + \frac {c^{2}}{2}e^{\varepsilon }, \tag{B.2.39} \end{equation}
since \(e^{c} = 1 + c + \frac {c^{2}}{2}e^{\theta c}\) for some \(\theta \in (0, 1)\), and \(e^{\theta c} \leq e^{\max (c, 0)} \leq e^{\varepsilon }\). Applying these bounds and integrating,
\begin{align} &\int _{\R ^{D}} p_{t}(\vxi )\bp {1 + \frac {\bp {1 - \frac {s}{t}}t^{2}}{D}\Delta \log p_{t}(\vxi )}^{D}\odif {\vxi } \tag{B.2.40} \\ &\leq \int _{\R ^{D}} p_{t}(\vxi )\exp \bp {\bp {1 - \frac {s}{t}}t^{2}\Delta \log p_{t}(\vxi )}\odif {\vxi } \tag{B.2.41} \\ &\leq 1 + \bp {1 - \frac {s}{t}}t^{2}\int _{\R ^{D}} p_{t}(\vxi )\Delta \log p_{t}(\vxi )\odif {\vxi } + \frac {e^{\varepsilon }}{2}\bp {1 - \frac {s}{t}}^{2}t^{4}\int _{\R ^{D}}p_{t}(\vxi )\bp {\Delta \log p_{t}(\vxi )}^{2}\odif {\vxi } \tag{B.2.42} \\ &= 1 - \bp {1 - \frac {s}{t}}t^{2}\int _{\R ^{D}}\frac {\norm {\nabla p_{t}(\vxi )}_{2}^{2}}{p_{t}(\vxi )}\odif {\vxi } + \frac {e^{\varepsilon }}{2}\bp {1 - \frac {s}{t}}^{2}t^{4}\int _{\R ^{D}}p_{t}(\vxi )\bp {\Delta \log p_{t}(\vxi )}^{2}\odif {\vxi }, \tag{B.2.43} \end{align}

where the identity \(\int p_{t}\Delta \log p_{t} = -\int \norm {\nabla p_{t}}_{2}^{2}/p_{t}\) is the same as in the proof of Theorem B.2. Writing \(J(p_{t}) \doteq \int _{\R ^{D}}\frac {\norm {\nabla p_{t}(\vxi )}_{2}^{2}}{p_{t}(\vxi )}\odif {\vxi }\) for the Fisher information of \(p_{t}\), and bounding \(\bp {\Delta \log p_{t}}^{2} \leq U_{t}^{2}\), we combine with our previous estimate to obtain

\begin{equation} h(\bar {\vx }(\vx _{t})) - h(\vx _{t}) \leq -\bp {1 - \frac {s}{t}}t^{2}J(p_{t}) + \frac {e^{\varepsilon }\varepsilon ^{2}}{2}. \tag{B.2.44} \end{equation}
This is strictly negative whenever \(\varepsilon ^{2}e^{\varepsilon } < 2\bp {1 - \frac {s}{t}}t^{2}J(p_{t})\). Substituting \(\varepsilon = \bp {1 - \frac {s}{t}}t^{2}U_{t}\), this is precisely condition (B.2.20). □

Notice that condition (B.2.20) is always satisfiable: for any fixed \(t > 0\), the Fisher information satisfies \(J(p_{t}) > 0\) (since \(p_{t}\) is not constant) and \(U_{t} < \infty \), so taking \(\bp {1 - \frac {s}{t}}t^{2}\) small enough ensures (B.2.20) holds.

To understand the quantitative implications, consider the behavior as \(t \to 0\) (near the data), where the condition is most restrictive. Setting \(s = (1 - \eps )t\), the condition becomes \(\eps t^{2}U_{t}^{2}\exp (\eps t^{2}U_{t}) < 2J(p_{t})\). For small \(t\), the Laplacian bound satisfies \(U_{t} \approx R^{2}/t^{4}\), so the polynomial prefactor scales as \(\eps t^{2}U_{t}^{2} \approx \eps R^{4}/t^{6}\). The critical scaling is therefore \(\eps \sim t^{6}/R^{4}\). Taking \(\eps = \alpha t^{6}/R^{4}\), the condition reduces to \(\alpha < 2J(p_{t})\). In other words, each denoising step of size \(t - s = \eps \, t \sim t^{7}/R^{4}\) reduces the entropy. The steps must become increasingly fine as \(t \to 0\), a consequence of the second-order remainder in the Taylor expansion used in the proof.

At the same time, it should be noted that our bounds are likely loose in terms of the precise \(t\)-dependence, and certainly so for data with greater structure. For instance, the \(U_{t} \sim R^{2}/t^{4}\) blowup reflects a worst-case Laplacian bound that can be considerably tightened under additional regularity assumptions on \(p\). Moreover, if \(p_{t}\) is log-concave (as holds, e.g., when \(p\) itself is log-concave), then \(\Delta \log p_{t} \leq 0\) everywhere, and a significantly simplified argument can be run with strengthened conclusions.

B.2.3 Technical Lemmas and Intermediate Results

In this subsection we present technical results which power our main two conceptual theorems. Our presentation will be more or less standard for mathematics; we will start with the higher-level results first, and gradually move back to the more incremental results. The higher-level results will use the incremental results, and in this way we have an easy-to-read dependency ordering of the results: no result depends on those before it. Results which do not depend on each other are generally ordered by the place they appear in the above pair of proofs.

Finiteness of the Differential Entropy

We first show that the entropy exists along the stochastic process and is finite.

Lemma B.1. Let \(\vx \) be any random variable, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1).

1.
For \(t > 0\), the differential entropy \(h(\vx _{t})\) exists and is \(> -\infty \).
2.
If in addition Assumption B.1 holds for \(\vx \), then \(h(\vx ) < \infty \) and \(h(\vx _{t})\ < \infty \).

Proof. To prove Lemma B.1.1, we use a classic yet tedious analysis argument. Since \(\vx _{t}\) has a density, we can write

\begin{equation} h(\vx _{t}) = -\int _{\R ^{D}}p_{t}(\vxi )\log p_{t}(\vxi ) \odif {\vxi }. \tag{B.2.45} \end{equation}
Accordingly, let \(g \colon \R ^{D} \to \R \) be defined as
\begin{equation} g(\vxi ) \doteq -p_{t}(\vxi )\log p_{t}(\vxi ) \implies h(\vx _{t}) = \int _{\R ^{D}}g(\vxi )\odif {\vxi }. \tag{B.2.46} \end{equation}
As usual to bound the value of an integral in analysis, define
\begin{eqnarray} &&g_{+}(\vxi ) \doteq \max (g(\vxi ), 0), \quad g_{-}(\vxi ) \doteq \max (-g(\vxi ), 0) \tag{B.2.47} \\ && \implies \quad g = g_{+} - g_{-}\quad \text {and} \quad g_{+}, g_{-} \geq 0. \tag{B.2.48} \end{eqnarray}

Then

\begin{equation} h(\vx _{t}) = \int _{\R ^{D}}g_{+}(\vxi )\odif {\vxi } - \int _{\R ^{D}}g_{-}(\vxi )\odif {\vxi }, \tag{B.2.49} \end{equation}
and both integrals are guaranteed to be non-negative since their integrands are.

In order to show that \(h(\vx _{t})\) is well-defined, we need to show that \(\int _{\R ^{D}}g_{+}(\vxi )\odif {\vxi } < \infty \) or \(\int _{\R ^{D}}g_{-}(\vxi ) \odif {\vxi } < \infty \). To show that \(h(\vx _{t}) > -\infty \), it merely suffices to show that \(\int _{\R ^{D}}g_{-}(\vxi )\odif {\vxi } < \infty \). To bound the integral of \(g_{-}\) we need to understand the quantity \(g_{-}\), namely, we want to characterize when \(g\) is negative.

\begin{equation} g(\vxi ) \leq 0 \iff p_{t}(\vxi )\log p_{t}(\vxi ) \geq 0 \iff \log p_{t}(\vxi ) \geq 0 \iff p_{t}(\vxi ) \geq 1. \tag{B.2.50} \end{equation}
Thus, it holds that
\begin{equation} g_{-}(\vxi ) = \indvar (p_{t}(\vxi ) \geq 1)\cdot (-g(\vxi )) = \indvar (p_{t}(\vxi ) \geq 1)p_{t}(\vxi )\log p_{t}(\vxi ). \tag{B.2.51} \end{equation}
In order to bound the integral of \(g_{-}(\vxi )\), we need to show that \(p_{t}\) is “not too concentrated,” namely that \(p_{t}\) is not too large. To prove this, in this case we are lucky enough to be able to bound the function \(g_{-}(\vxi )\) itself. Namely, notice that
\begin{equation} \max _{\vxi \in \R ^{D}}\phi _{t}(\vxi - \vx ) = \phi _{t}(\vzero ) = \frac {1}{(2\pi )^{D/2}t^{D}} =: C_{t}. \tag{B.2.52} \end{equation}
which blows up as \(t \to 0\) but is finite for all finite \(t\). Therefore
\begin{equation} p_{t}(\vxi ) = \Ex \phi _{t}(\vxi - \vx ) \leq \Ex C_{t} = C_{t}. \tag{B.2.53} \end{equation}
Now there are two cases.

Hence we have \(\int _{\R ^{D}}g_{-}(\vxi )\odif {\vxi } < \infty \), so the differential entropy \(h(\vx _{t})\) exists and is \(> -\infty \).

To prove Lemma B.1.2, suppose that Assumption B.1 holds. We want to show that \(h(\vx ) < \infty \) and \(h(\vx _{t}) < \infty \). The mechanism for doing this is the same, and involves the maximum entropy result Theorem B.1. Namely, since \(\vx \) is absolutely bounded, it has a finite covariance which we will denote \(\vSigma \). Then the covariance of \(\vx _{t}\) is \(\vSigma + t^{2}\vI \). Thus the entropy of \(\vx \) and \(\vx _{t}\) can be upper bounded by the entropy of normal distributions with the respective covariances, i.e., \(\log [(2\pi e)^{D}\det (\vSigma )]\) or \(\log [(2\pi e)^{D}\det (\vSigma + t^{2}\vI )]\), and both are \(< \infty \). □

Integration by Parts in De Bruijn Identity

Finally, we fill in the integration-by-parts argument alluded to in the proofs of Theorems B.2 and B.3. The argument is conceptually pretty simple but requires some technical estimates to show that the boundary term in the integration-by-parts vanishes.

Lemma B.2. Let \(\vx \) be any random variable such that Assumptions B.1 and B.2 hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). For \(t \geq 0\), let \(p_{t}\) be the density of \(\vx _{t}\). Then for a constant \(c \in \R \) it holds

\begin{equation} \int _{\R ^{D}}\Delta p_{t}(\vxi )[c + \log p_{t}(\vxi )]\odif {\vxi } = -\int _{\R ^{D}}\ip {\nabla \log p_{t}(\vxi )}{\nabla p_{t}(\vxi )}\odif {\vxi }. \tag{B.2.58} \end{equation}

Proof. The basic idea of this proof is in two steps:

Green’s theorem says that for any compact set \(\cK \subseteq \R ^{D}\), twice continuously differentiable \(\plainphi \colon \R ^{D} \to \R \), and continuously differentiable \(\psi \colon \R ^{D} \to \R \),

\begin{equation} \int _{\cK }\bc {\psi (\vxi ) \Delta \plainphi (\vxi ) + \ip {\nabla \psi (\vxi )}{\nabla \plainphi (\vxi )}}\odif {\vxi } = \int _{\partial \cK }\psi (\vxi )\ip {\nabla \plainphi (\vxi )}{\vn (\vxi )}\odif {\sigma (\vxi )} \tag{B.2.59} \end{equation}
where \(\odif {\sigma (\vxi )}\) denotes an integral over the “surface measure”, i.e., the inherited measure on \(\partial \cK \), namely the boundary of \(\cK \), and accordingly \(\vxi \) takes values on this surface and \(\vn (\vxi )\) is the unit normal vector to \(\cK \) at the surface point \(\vxi \). Now, taking \(\plainphi (\vxi ) \doteq p_{t}(\vxi )\) and \(\psi (\vxi ) \doteq c + \log p_{t}(\vxi )\), over a ball \(B_{r}(\vzero )\) of radius \(r > 0\) centered at \(\vzero \) (so that \(\partial B_{r}(\vzero )\) is the sphere of radius \(r\) centered at \(\vzero \) and \(\vn (\vxi ) = \vxi /\norm {\vxi }_{2} = \vxi /r\)):
\begin{align} &\int _{B_{r}(\vzero )}\bc {\Delta p_{t}(\vxi )[c + \log p_{t}(\vxi )] + \ip {\nabla \log p_{t}(\vxi )}{\nabla p_{t}(\vxi )}}\odif {\vxi } \tag{B.2.60} \\ &= \int _{\partial B_{r}(\vzero )}[c + \log p_{t}(\vxi )]\left\langle \nabla p_{t}(\vxi ), \frac {\vxi }{r}\right\rangle\odif {\sigma (\vxi )} \tag{B.2.61} \\ &= \frac {1}{r}\int _{\partial B_{r}(\vzero )}[c + \log p_{t}(\vxi )]\left\langle \nabla p_{t}(\vxi ), \vxi \right\rangle\odif {\sigma (\vxi )}. \tag{B.2.62} \end{align}

Sending \(r \to \infty \), it holds that

\begin{align} &\int _{\R ^{D}}\bc {\Delta p_{t}(\vxi )[c + \log p_{t}(\vxi )] + \ip {\nabla \log p_{t}(\vxi )}{\nabla p_{t}(\vxi )}}\odif {\vxi } \tag{B.2.63}\label {eq:app_diffusion_r_infinity_limit} \\ &= \lim _{r \to \infty }\int _{B_{r}(\vzero )}\bc {\Delta p_{t}(\vxi )[c + \log p_{t}(\vxi )] + \ip {\nabla \log p_{t}(\vxi )}{\nabla p_{t}(\vxi )}}\odif {\vxi } \tag{B.2.64} \\ &= \lim _{r \to \infty }\frac {1}{r}\int _{\partial B_{r}(\vzero )}[c + \log p_{t}(\vxi )]\left\langle \nabla p_{t}(\vxi ), \vxi \right\rangle\odif {\sigma (\vxi )}, \tag{B.2.65} \end{align}

where the first equality follows by dominated convergence on the integrand. It remains to compute the last limit. For this, we take asymptotic expansions of each term. The main device is as follows: for \(\vxi \in \partial B_{r}(\vzero )\), we have \(\norm {\vxi }_{2} = r\), so

\begin{align} p_{t}(\vxi ) &= \Ex [\phi _{t}(\vxi - \vx )] \tag{B.2.66} \\ &= \Ex \rs {\underbrace {\frac {1}{(2\pi )^{D/2}t^{D}}}_{\doteq C_{t}}e^{-\|\vxi - \vx \|_{2}^{2}/(2t^{2})}} \tag{B.2.67} \\ &= C_{t}\Ex \rs {e^{-(\norm {\vxi }_{2}^{2} - 2\ip {\vxi }{\vx } + \norm {\vx }_{2}^{2})/(2t^{2})}} \tag{B.2.68} \\ &= C_{t}\Ex \rs {e^{-(r^{2} - 2\ip {\vxi }{\vx } + \norm {\vx }_{2}^{2})/(2t^{2})}} \tag{B.2.69} \\ &= C_{t}e^{-r^{2}/(2t^{2})}\Ex [e^{(2\ip {\vxi }{\vx } - \norm {\vx }_{2}^{2})/(2t^{2})}]. \tag{B.2.70} \end{align}

Note that because \(\norm {\vxi }_{2} = r\), we have by Cauchy-Schwarz that

\begin{equation} -2r\norm {\vx }_{2} - \norm {\vx }_{2}^{2} \leq 2\ip {\vxi }{\vx } - \norm {\vx }_{2}^{2} \leq 2r\norm {\vx }_{2} - \norm {\vx }_{2}^{2}. \tag{B.2.71} \end{equation}
Recall that by Assumption B.1, \(\vx \) is supported on a compact set \(\cS \) of radius \(R\). Thus
\begin{equation} -2R(r + R) \leq 2\ip {\vxi }{\vx } - \norm {\vx }_{2}^{2} \leq 2Rr. \tag{B.2.72} \end{equation}
In other words, it holds
\begin{equation} C_{t}e^{-[r^{2} + 2R(r + R)]/(2t^{2})} \leq p_{t}(\vxi ) \leq C_{t}e^{[-r^{2} + 2Rr]/(2t^{2})}. \tag{B.2.73} \end{equation}
Now to compute the gradient, we can use Proposition B.1 and linearity of expectation to compute
\begin{align} \ip {\nabla p_{t}(\vxi )}{\vxi } &= \left\langle -\frac {1}{t^{2}}\Ex \rs {\bp {\vxi - \vx }\phi _{t}(\vxi - \vx )}, \vxi \right\rangle \tag{B.2.74} \\ &= -\frac {1}{t^{2}}\Ex \rs {\ip {\vxi - \vx }{\vxi }\phi _{t}(\vxi - \vx )} \tag{B.2.75} \\ &= -\frac {1}{t^{2}}\Ex \rs {\bp {\norm {\vxi }_{2}^{2} - \ip {\vxi }{\vx }}\phi _{t}(\vxi - \vx )} \tag{B.2.76} \\ &= -\frac {1}{t^{2}}\Ex \rs {\bp {r^{2} - \ip {\vxi }{\vx }}\phi _{t}(\vxi - \vx )} \tag{B.2.77} \\ &= \frac {1}{t^{2}}\Ex \rs {\bp {\ip {\vxi }{\vx } - r^{2}}\phi _{t}(\vxi - \vx )}. \tag{B.2.78} \end{align}

Using Cauchy-Schwarz and the representation \(p_{t}(\vxi ) \doteq \Ex [\phi _{t}(\vxi - \vx )]\) again, it holds

\begin{align} &\frac {1}{t^{2}}\Ex \rs {\bp {-Rr - r^{2}}\phi _{t}(\vxi - \vx )} \leq \ip {\nabla p_{t}(\vxi )}{\vxi } \leq \frac {1}{t^{2}}\Ex \rs {\bp {Rr - r^{2}}\phi _{t}(\vxi - \vx )} \tag{B.2.79} \\ &\frac {1}{t^{2}}\bp {-Rr - r^{2}}\Ex \rs {\phi _{t}(\vxi - \vx )} \leq \ip {\nabla p_{t}(\vxi )}{\vxi } \leq \frac {1}{t^{2}}\bp {Rr - r^{2}}\Ex \rs {\phi _{t}(\vxi - \vx )} \tag{B.2.80} \\ &-\frac {r(R + r)}{t^{2}}p_{t}(\vxi ) \leq \ip {\nabla p_{t}(\vxi )}{\vxi } \leq -\frac {r(r - R)}{t^{2}}p_{t}(\vxi ). \tag{B.2.81} \end{align}

For \(r > R > 0\) (as is suitable, because we are going to take the limit \(r \to \infty \) while \(R\) is fixed), both sides are negative. This makes sense: most of the probability mass is contained within the ball of radius \(R\) and thus the score points inwards, having a negative inner product with the outward-pointing vector \(\vxi \). Thus using the appropriate bounds for \(p_{t}(\vxi )\),

\begin{equation} -\frac {r(R + r)}{t^{2}}\cdot C_{t}e^{[-r^{2} + 2Rr]/(2t^{2})} \leq \ip {\nabla p_{t}(\vxi )}{\vxi } \leq -\frac {r(r - R)}{t^{2}}\cdot C_{t}e^{-[r^{2} + 2R(r + R)]/(2t^{2})}. \tag{B.2.82} \end{equation}
Then, noting that \(C_{t} = \mathrm {poly}(t^{-1})\), we can compute
\begin{equation} [c + \log p_{t}(\vxi )]\ip {\nabla p_{t}(\vxi )}{\vxi } = \mathrm {poly}(r, R, t^{-1}, c)e^{-\Theta _{r}(r^{2})} \tag{B.2.83} \end{equation}
So one can see that, letting the surface area of \(\partial B_{r}(\vzero )\) be \(\omega _{D} r^{D - 1}\) where \(\omega _{D}\) is a function of \(D\), it holds
\begin{equation} \frac {1}{r}\int _{\partial B_{r}(\vzero )}[c + \log p_{t}(\vxi )]\ip {\nabla p_{t}(\vxi )}{\vxi }\odif {\vxi } = \mathrm {poly}(r, R, t^{-1}, c)e^{-\Theta _{r}(r^{2})} \tag{B.2.84} \end{equation}
and therefore the exponentially decaying tails mean
\begin{equation} \lim _{r \to \infty }\frac {1}{r}\int _{\partial B_{r}(\vzero )}[c + \log p_{t}(\vxi )]\ip {\nabla p_{t}(\vxi )}{\vxi }\odif {\vxi } = 0. \tag{B.2.85} \end{equation}
Finally, plugging into (B.2.63), we have
\begin{align} &\int _{\R ^{D}}\bc {\Delta p_{t}(\vxi )[c + \log p_{t}(\vxi )] + \ip {\nabla \log p_{t}(\vxi )}{\nabla p_{t}(\vxi )}}\odif {\vxi } = 0 \tag{B.2.86} \\ \implies &\int _{\R ^{D}}\Delta p_{t}(\vxi )[c + \log p_{t}(\vxi )]\odif {\vxi } = -\int _{\R ^{D}}\ip {\nabla \log p_{t}(\vxi )}{\nabla p_{t}(\vxi )}\odif {\vxi } \tag{B.2.87} \end{align}

as claimed. □

Local Invertibility of the Denoiser \(\bar {\vx }\)

Here we provide some results used in the proof of Theorem B.3 which are appropriate generalizations of corresponding results in [Gri11].

Lemma B.3 (Generalization of [Gri11], Lemma A.1). Let \(\vx \) be any random variable such that Assumptions B.1 and B.2 hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). Let \(s, t \in [0, T]\) be such that \(0 \leq s < t \leq T\), and let \(\bar {\vx }(\vxi ) \doteq \Ex [\vx _{s} \mid \vx _{t} = \vxi ]\). The Jacobian \(\bar {\vx }^{\prime }(\vxi )\) is symmetric positive definite.

Proof. We have

\begin{equation} \bar {\vx }^{\prime }(\vxi ) = \vI + \bp {1 - \frac {s}{t}}t^{2}\nabla ^{2}\log p_{t}(\vxi ). \tag{B.2.88} \end{equation}
Here we expand
\begin{equation} \nabla ^{2}\log p_{t}(\vxi ) = \frac {p_{t}(\vxi )\nabla ^{2}p_{t}(\vxi ) - (\nabla p_{t}(\vxi ))(\nabla p_{t}(\vxi ))^{\top }}{p_{t}(\vxi )^{2}}. \tag{B.2.89} \end{equation}
So we need to ensure that
\begin{align} \bar {\vx }^{\prime }(\vxi ) &= \vI + \bp {1 - \frac {s}{t}}t^{2}\frac {p_{t}(\vxi )\nabla ^{2}p_{t}(\vxi ) - (\nabla p_{t}(\vxi ))(\nabla p_{t}(\vxi ))^{\top }}{p_{t}(\vxi )^{2}} \tag{B.2.90} \\ &= \frac {p_{t}(\vxi )^{2}\vI + \bp {1 - \frac {s}{t}}t^{2}\bs {p_{t}(\vxi )\nabla ^{2}p_{t}(\vxi ) - (\nabla p_{t}(\vxi ))(\nabla p_{t}(\vxi ))^{\top }}}{p_{t}(\vxi )^{2}} \tag{B.2.91} \end{align}

is symmetric positive semidefinite. Indeed it is obviously symmetric (by Clairaut’s theorem). To show its positive semidefiniteness, we plug in the expectation representation of \(p_{t}\) given by (B.2.3) (and \(\nabla p_{t}\), \(\Delta p_{t}\) by Proposition B.1) to obtain (where \(\vx \) is as defined and \(\vy \) is i.i.d. as \(\vx \)),

\begin{align} &\vv ^{\top }[\bar {\vx }^{\prime }(\vxi )]\vv \tag{B.2.92} \\ &= p_{t}(\vxi )^{-2}\vv ^{\top }\Bigg \{p_{t}(\vxi )^{2}\vI + \bp {1 - \frac {s}{t}}t^{2}\Ex [\phi _{t}(\vxi - \vx )]\Ex \rs {\phi _{t}(\vxi - \vx )\cdot \frac {(\vxi - \vx )(\vxi - \vx )^{\top } - t^{2}\vI }{t^{4}}} \tag{B.2.93} \\ & \qquad \qquad \qquad -\bp {1 - \frac {s}{t}}t^{2}\Ex \rs {-\phi _{t}(\vxi - \vx )\cdot \frac {\vxi - \vx }{t^{2}}}\Ex \rs {-\phi _{t}(\vxi - \vx )\cdot \frac {\vxi - \vx }{t^{2}}}^{\top }\Bigg \}\vv \tag{B.2.94} \\ &= p_{t}(\vxi )^{-2}\vv ^{\top }\Bigg \{\Ex [\phi _{t}(\vxi - \vx )\phi _{t}(\vxi - \vy )\vI ] \tag{B.2.95} \\ & \qquad \qquad \qquad + \bp {1 - \frac {s}{t}}t^{2}\Ex \rs {\phi _{t}(\vxi - \vx )\phi _{t}(\vxi - \vy )\cdot \frac {(\vxi - \vy )(\vxi - \vy )^{\top } - t^{2}\vI }{t^{4}}} \tag{B.2.96} \\ & \qquad \qquad \qquad -\bp {1 - \frac {s}{t}}t^{2}\Ex \rs {\phi _{t}(\vxi - \vx )\phi _{t}(\vxi - \vy )\cdot \frac {(\vxi - \vx )(\vxi - \vy )^{\top }}{t^{4}}}\Bigg \} \tag{B.2.97} \\ &= \frac {1 - s/t}{p_{t}(\vxi )^{2}}\vv ^{\top }\Ex \mathopen {}\Bigg [\phi _{t}(\vxi - \vx )\phi _{t}(\vxi - \vy )\bc {\frac {1}{1 - s/t}\vI + \frac {(\vxi - \vy )(\vxi - \vy )^{\top }}{t^{2}} - \vI - \frac {(\vxi - \vx )(\vxi - \vy )^{\top }}{t^{2}}}\Bigg ]\vv \tag{B.2.98} \\ &= \frac {t - s}{tp_{t}(\vxi )^{2}}\vv ^{\top }\Ex \rs {\frac {s}{t - s}\vI +\frac {(\vxi - \vx )(\vxi - \vx )^{\top }}{2t^{2}} + \frac {(\vxi - \vy )(\vxi - \vy )^{\top }}{2t^{2}} - \frac {(\vxi - \vx )(\vxi - \vy )^{\top }}{t^{2}}}\vv \tag{B.2.99} \\ &= \frac {t - s}{tp_{t}(\vxi )^{2}}\vv ^{\top }\Ex \rs {\frac {s}{t - s}\vI +\frac {1}{2t^{2}}\bp {(\vxi - \vx )(\vxi - \vx )^{\top } + (\vxi - \vy )(\vxi - \vy )^{\top } - 2(\vxi - \vx )(\vxi - \vy )^{\top }}}\vv \tag{B.2.100} \\ &= \frac {t - s}{tp_{t}(\vxi )^{2}}\Ex \rs {\frac {s}{t - s}\norm {\vv }_{2}^{2} +\frac {1}{2t^{2}}\bp {[(\vxi - \vx )^{\top }\vv ]^{2} + [(\vxi - \vy )^{\top }\vv ]^{2} - 2[(\vxi - \vx )^{\top }\vv ][(\vxi - \vy )^{\top }\vv ]}} \tag{B.2.101} \\ &= \frac {t - s}{tp_{t}(\vxi )^{2}}\Ex \rs {\frac {s}{t - s}\norm {\vv }_{2}^{2} +\frac {1}{2t^{2}}\bp {[(\vxi - \vx )^{\top }\vv ]^{2} + [(\vxi - \vy )^{\top }\vv ]^{2} - 2[(\vxi - \vx )^{\top }\vv ][(\vxi - \vy )^{\top }\vv ]}} \tag{B.2.102} \\ &= \frac {t - s}{tp_{t}(\vxi )^{2}}\Ex \rs {\frac {s}{t - s}\norm {\vv }_{2}^{2} +\frac {1}{2t^{2}}\bp {[(\vxi - \vx )^{\top }\vv ] - [(\vxi - \vy )^{\top }\vv ]}^{2}} \tag{B.2.103} \\ &= \frac {t - s}{tp_{t}(\vxi )^{2}}\Ex \rs {\frac {s}{t - s}\norm {\vv }_{2}^{2} +\frac {1}{2t^{2}}[(\vy - \vx )^{\top }\vv ]^{2}} \tag{B.2.104} \\ &= \frac {s}{tp_{t}(\vxi )^{2}}\norm {\vv }_{2}^{2} + \frac {t - s}{2t^{3}p_{t}(\vxi )}\Ex [\{(\vy - \vx )^{\top }\vv \}^{2}] \tag{B.2.105} \end{align}

Since \(\vx \) and \(\vy \) are i.i.d., the whole integral (i.e., the original quadratic form) is \(0\) if and only if \(s = 0\) and \(\vx \) has support entirely contained in an affine subspace which is orthogonal to \(\vv \). But this is ruled out by assumption (i.e., that \(\vx \) has a density on \(\R ^{D}\)), so the Jacobian \(\bar {\vx }^{\prime }(\vxi )\) is symmetric positive definite. □

Lemma B.4 (Generalization of [Gri11] Corollary A.2, Part 1). Let \(f \colon \R ^{D} \to \R ^{D}\) be any differentiable function whose Jacobian \(f^{\prime }(\vx )\) is symmetric positive definite. Then \(f\) is injective, and hence invertible as a function \(\R ^{D} \to \Range (f)\) where \(\Range (f)\) is the range of \(f\).

Proof. Suppose that \(f\) were not injective, i.e., there exists \(\vx , \vx ^{\prime }\) such that \(f(\vx ) = f(\vx ^{\prime })\) while \(\vx \neq \vx ^{\prime }\). Define \(\vv \doteq (\vx ^{\prime } - \vx )/\norm {\vx ^{\prime } - \vx }_{2}\). Define the function \(g \colon \R \to \R \) as \(g(t) \doteq \vv ^{\top }f(\vx + t\vv )\). Then \(g(0) = \vv ^{\top }f(\vx ) = \vv ^{\top }f(\vx ^{\prime }) = g(\norm {\vx ^{\prime } - \vx }_{2})\). Since \(f\) is differentiable, \(g\) is differentiable, so the derivative \(g^{\prime }\) must vanish for some \(t^{\ast } \in (0, \norm {\vx ^{\prime } - \vx }_{2})\) by the mean value theorem. However,

\begin{equation} g^{\prime }(t^{\ast }) \doteq \vv ^{\top }\bs {f^{\prime }(\vx + t^{\ast }\vv )}\vv > 0 \tag{B.2.106} \end{equation}
since the Jacobian is positive definite. Thus we arrive at a contradiction, as claimed. □

Combining the above two results, we obtain the following crucial result.

Corollary B.1 (Generalization of [Gri11] Corollary A.2, Part 2). Let \(\vx \) be any random variable such that Assumptions B.1 and B.2 hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). Let \(s, t \in [0, T]\) be such that \(0 \leq s < t \leq T\), and let \(\bar {\vx }(\vxi ) \doteq \Ex [\vx _{s} \mid \vx _{t} = \vxi ]\). Then \(\bar {\vx }\) is injective, and therefore invertible onto its range.

Proof. The only thing left to show is that \(\bar {\vx }\) is differentiable, but this is immediate from Tweedie’s formula (Theorem 3.2) which shows that \(\bar {\vx }\) is differentiable if and only if \(\nabla \log p_{t}\) is differentiable, and this is provided by Equation (B.2.3). □

Controlling the Laplacian \(\Delta \log p_{t}\)

Finally, we develop a technical estimate which is required for the proof of Theorem B.3 and actually motivates the assumption for the viable \(t\).

Lemma B.5. Let \(\vx \) be any random variable such that Assumptions B.1 and B.2 hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). Let \(p_{t}\) be the density of \(\vx _{t}\). Then, for \(t > 0\) it holds

\begin{equation} \sup _{\vxi \in \R ^{D}}\abs {\Delta \log p_{t}(\vxi )} \leq \max \rp {\frac {D}{t^{2}}, \left\lvert \frac {R^{2}}{t^{4}} - \frac {D}{t^{2}}\right\rvert}. \tag{B.2.107} \end{equation}

Proof. By chain rule, a simple exercise computes

\begin{equation} \Delta \log p_{t}(\vxi ) = \frac {\Delta p_{t}(\vxi )}{p_{t}(\vxi )} - \frac {\norm {\nabla p_{t}(\vxi )}_{2}^{2}}{p_{t}(\vxi )^{2}}. \tag{B.2.108} \end{equation}
Using Proposition B.1 to write the terms in \(\Delta p_{t}(\vxi )\), we obtain
\begin{align} \frac {\Delta p_{t}(\vxi )}{p_{t}(\vxi )} &= \frac {\Ex \rs {\frac {\norm {\vxi - \vx }_{2}^{2} - Dt^{2}}{t^{4}} \cdot \phi _{t}(\vxi - \vx )}}{\Ex [\phi _{t}(\vxi - \vx )]} \tag{B.2.109} \\ &= \frac {\int _{\R ^{D}}\bc {\frac {\norm {\vxi - \vu }_{2}^{2} - Dt^{2}}{t^{4}}}\phi _{t}(\vxi - \vu )p(\vu )\odif {\vu }}{\int _{\R ^{D}}\phi _{t}(\vxi - \vu )p(\vu )\odif {\vu }}. \tag{B.2.110} \end{align}

This looks like a Bayesian marginalization, so let us define the appropriate normalized density

\begin{equation} q_{\vxi }(\vu ) = \frac {\phi _{t}(\vxi - \vu )p(\vu )}{\int _{\R ^{D}}\phi _{t}(\vxi - \vv )p(\vv )\odif {\vv }} = \frac {\phi _{t}(\vxi - \vu )p(\vu )}{\Ex [\phi _{t}(\vxi - \vx )]} = \frac {\phi _{t}(\vxi - \vu )p(\vu )}{p_{t}(\vxi )} \tag{B.2.111} \end{equation}
Then, defining \(\vy _{\vxi } \sim q_{\vxi }\), we can write
\begin{equation} \frac {\Delta p_{t}(\vxi )}{p_{t}(\vxi )} = \int _{\R ^{D}}\bc {\frac {\norm {\vxi - \vu }_{2}^{2} - Dt^{2}}{t^{4}}}q_{\vxi }(\vu )\odif {\vu } = \frac {1}{t^{4}}\Ex [\norm {\vxi - \vy _{\vxi }}_{2}^{2}] - \frac {D}{t^{2}}. \tag{B.2.112} \end{equation}
Similarly, writing out the second term (non-squared) we obtain
\begin{equation} \frac {\nabla p_{t}(\vxi )}{p_{t}(\vxi )} = -\frac {\vxi - \Ex [\vy _{\vxi }]}{t^{2}}. \tag{B.2.113} \end{equation}
Letting \(\vz _{\vxi } \doteq \vy _{\vxi } - \vxi \), it holds
\begin{equation} \frac {\Delta p_{t}(\vxi )}{p_{t}(\vxi )} = \frac {\Ex [\norm {\vz _{\vxi }}_{2}^{2}]}{t^{4}} - \frac {D}{t^{2}}, \qquad \frac {\nabla p_{t}(\vxi )}{p_{t}(\vxi )} = \frac {\Ex [\vz _{\vxi }]}{t^{2}}. \tag{B.2.114} \end{equation}
Thus writing \(\Delta \log p_{t}\) out fully, we have
\begin{align} \Delta \log p_{t}(\vxi ) &= \frac {\Ex [\norm {\vz _{\vxi }}_{2}^{2}]}{t^{4}} - \frac {D}{t^{2}} - \frac {\norm {\Ex [\vz _{\vxi }]}_{2}^{2}}{t^{4}} \tag{B.2.115} \\ &= \frac {\Ex [\norm {\vz _{\vxi }}_{2}^{2}] - \norm {\Ex [\vz _{\vxi }]}_{2}^{2}}{t^{4}} - \frac {D}{t^{2}} \tag{B.2.116} \\ &= \frac {\tr (\Cov (\vz _{\vxi }))}{t^{4}} - \frac {D}{t^{2}} \tag{B.2.117} \\ &= \frac {\tr (\Cov (\vy _{\vxi }))}{t^{4}} - \frac {D}{t^{2}}. \tag{B.2.118} \end{align}

A trivial lower bound on this trace is \(0\), since covariance matrices are positive semidefinite. To find an upper bound, note that \(\vy _{\vxi }\) takes values only in the support of \(\vx \) (since \(p\) is a factor of the density \(q_{\vxi }\) of \(\vy _{\vxi }\)), which by Assumption B.1 is a compact set \(\cS \) with radius \(R \doteq \sup _{\vxi \in \cS }\norm {\vxi }_{2}\). So

\begin{equation} \tr (\Cov (\vy _{\vxi })) = \Ex [\norm {\vy _{\vxi }}_{2}^{2}] - \norm {\Ex [\vy _{\vxi }]}_{2}^{2} \leq \Ex [\norm {\vy _{\vxi }}_{2}^{2}] \leq R^{2}. \tag{B.2.119} \end{equation}
Therefore
\begin{equation} -\frac {D}{t^{2}} \leq \Delta \log p_{t}(\vxi ) \leq \frac {R^{2}}{t^{4}} - \frac {D}{t^{2}}, \tag{B.2.120} \end{equation}
which shows the claim. □
Derivative Computations

Here we calculate some useful derivatives which will be reused throughout the appendix.

Proposition B.1. Let \(\vx \) be any random variable such that Assumptions B.1 and B.2 hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). For \(t \geq 0\), let \(p_{t}\) be the density of \(\vx _{t}\). Then

\begin{align} \pdv {p_{t}}{t}(\vxi ) &= \Ex \rs {\phi _{t}(\vxi - \vx )\cdot \frac {\norm {\vxi - \vx }_{2}^{2} - Dt^{2}}{t^{3}}} \tag{B.2.121} \\ \nabla p_{t}(\vxi ) &= -\Ex \rs {\phi _{t}(\vxi - \vx )\cdot \frac {\vxi - \vx }{t^{2}}} \tag{B.2.122} \\ \nabla ^{2}p_{t}(\vxi ) &= \Ex \rs {\phi _{t}(\vxi - \vx )\cdot \frac {(\vxi - \vx )(\vxi - \vx )^{\top } - t^{2}\vI }{t^{4}}} \tag{B.2.123} \\ \Delta p_{t}(\vxi ) &= \Ex \rs {\phi _{t}(\vxi - \vx )\cdot \frac {\norm {\vxi - \vx }_{2}^{2} - Dt^{2}}{t^{4}}}. \tag{B.2.124} \end{align}

Proof. We use the convolution representation of \(p_{t}\), namely (B.2.3). First taking the time derivative, a computation yields that Proposition B.3 applies,5 so we can bring the derivative inside the integral/expectation as:

\begin{equation} \pdv {p_{t}}{t}(\vxi ) = \frac{\partial}{\partial t} \Ex [\phi _{t}(\vxi - \vx )] = \Ex \rs {\frac{\partial}{\partial t} \phi _{t}(\vxi - \vx )} = \pdv {\phi _{t}}{t} * p. \tag{B.2.125} \end{equation}
Meanwhile, by properties of convolutions (Proposition B.4) and using the fact that \(p\) is compactly supported (Assumption B.1),
\begin{equation} p_{t} = \phi _{t} * p \implies \nabla p_{t} = \nabla \phi _{t} * p \implies \nabla ^{2}p_{t} = \nabla ^{2}\phi _{t} * p \implies \Delta p_{t} = \Delta \phi _{t} * p. \tag{B.2.126} \end{equation}
The rest of the computation follows from Proposition B.2. □

Proposition B.2. For \(t > 0\) and \(\vxi \in \R ^{D}\) it holds

\begin{align} \frac{\partial}{\partial t} \phi _{t}(\vxi ) &= \phi _{t}(\vxi ) \cdot \frac {\norm {\vxi }_{2}^{2} - Dt^{2}}{t^{3}} \tag{B.2.127} \\ \nabla \phi _{t}(\vxi ) &= -\phi _{t}(\vxi )\cdot \frac {\vxi }{t^{2}} \tag{B.2.128} \\ \nabla ^{2} \phi _{t}(\vxi ) &= \phi _{t}(\vxi )\cdot \frac {\vxi \vxi ^{\top } - t^{2}\vI }{t^{4}} \tag{B.2.129} \\ \Delta \phi _{t}(\vxi ) &= \phi _{t}(\vxi ) \cdot \frac {\norm {\vxi }_{2}^{2} - Dt^{2}}{t^{4}}. \tag{B.2.130} \end{align}

Proof. Direct computation. □

Differentiating Under the Integral Sign

In this appendix, we differentiate under the integral sign many times, and it is important to know when we can do this. There are two kinds of differentiating under the integral sign:

1.
Differentiating an integral \(\int f_{t}(\vxi )\odif {\vxi }\) with respect to the auxiliary parameter \(t\).
2.
Differentiating a convolution \((f * g)(\vxi ) = \int f(\vxi )g(\vxi - \vu )\odif {\vu }\) with respect to the variable \(\vxi \).

For the first category, we give a concrete result, stated without proof but attributable to the linked source, which derives the following result as a special case of a more general theorem about the interaction of differential operators and tempered distributions, much beyond the scope of the book. A full formal reference can be found in [Jon82].

Proposition B.3 ([Jon82], Section 11.12). Let \(f \colon (0, T) \times \R ^{D} \to \R \) be such that:

Then \(t \mapsto \int _{\R ^{D}}f_{t}(\vxi )\odif {\vxi }\) is an absolutely continuous function on \((0, T)\), and its derivative is

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}t} \int _{\R ^{D}}f_{t}(\vxi )\odif {\vxi } = \int _{\R ^{D}}\frac{\partial}{\partial t} f_{t}(\vxi )\odif {\vxi }, \tag{B.2.132} \end{equation}
defined for almost every \(t \in (0, T)\).

For the second category, we give another concrete result, stated without proof but fully formalized in [BB11].

Proposition B.4 ([BB11], Proposition 4.20). Let \(f\) be \(k\)-times continuously differentiable with compact support, and let \(g\) be locally integrable. Then the convolution \(f * g\) defined by

\begin{equation} (f * g)(\vxi ) \doteq \int _{\R ^{D}}f(\vu )g(\vxi - \vu )\odif {\vu } \tag{B.2.133} \end{equation}
is \(k\)-times continuously differentiable, and its derivative of order \(k\) is
\begin{equation} \nabla ^{k}(f * g) =(\nabla ^{k}f) * g. \tag{B.2.134} \end{equation}

Although not in the book, a simple integration by parts argument shows that if \(g\) is also \(k\)-times differentiable, then we can “trade off” the regularity:

\begin{equation} \nabla ^{k}(f * g) = f * (\nabla ^{k} g). \tag{B.2.135} \end{equation}

B.3 Lossy Coding and Sphere Packing

In this section, we prove Theorem 4.1. Following our conventions throughout this appendix, we write \(\cS = \Supp (\vx )\) for the compact support of the random variable \(\vx \).

As foreshadowed, we will make a regularity assumption on the support set \(\cS \) to prove Theorem 4.1. One possibility for proceeding under minimal assumptions would be to instantiate the results of [RBK18; RKB23] in our setting, since these results apply to sets \(\cS \) with very low regularity (e.g., Cantor-like sets with fractal structure). However, we have found precisely computing the constants in these results, a necessary endeavor to assert a conclusion like Theorem 4.1, to be somewhat onerous in our setting. Our approach is therefore to add a geometric regularity assumption on the set \(\cS \) that sacrifices some generality, but allows us to develop a more transparent argument. To avoid sacrificing too much generality, we must ensure that low-dimensionality in the set \(\cS \) is not prohibited. We therefore consider the running example we have used throughout the book, the mixture of low-rank Gaussian distributions. In this geometric setting, we model \(\cS \) as a union of hyperspheres, each of dimension \(d_k\) (possibly much smaller than \(D\)), living in mutually orthogonal subspaces of \(\R ^D\). This is a geometric simplification of the Gaussian mixture model: each component’s support is approximated by a sphere in the subspace spanned by its principal directions. When the component dimensions \(d_k\) are large, this assumption is equivalent to a mixture of low-rank Gaussians assumption, by high-dimensional measure concentration. The CRATE models we develop and train in Chapters 5 and 8 have their subspace dimensions set compatibly with this assumption.

Assumption B.3. The support \(\cS \subset \R ^D\) of the random variable \(\vx \) is a finite union of \(K\) spheres, each with dimension \(d_k\), \(k \in [K]\). The probability that \(\vx \) is drawn from the \(k\)-th sphere is given by \(\pi _k \in [0, 1]\), and conditional on being drawn from the \(k\)-th sphere, \(\vx \) is uniformly distributed on that sphere. The supports satisfy that each sphere is mutually orthogonal with all others.

We proceed under the simplifying Assumption B.3 in order to simplify excessive technicality, and to connect to an important running example used throughout the monograph. We believe our results can be generalized to support \(\cS \) from the class of sets with positive reach with additional technical effort, but leave this for the future.

B.3.1 Proof of Relationship Between Rate Distortion and Covering

We briefly sketch the proof, then proceed to establishing three fundamental lemmas, then give the proof. The proof will depend on notions introduced in the sketch below.

Obtaining an upper bound on the rate distortion function (4.1.9) is straightforward: by the rate characterization (i.e., the rate distortion function is the minimum rate of a code for \(\vx \) with expected squared \(\ell ^2\) distortion \(\epsilon \)), upper bounding \(\cR _{\epsilon }(\x )\) only requires demonstrating one code for \(\x \) that achieves this target distortion, and any \(\epsilon \)-covering of \(\Supp (\x )\) achieves this, with rate equal to the base-\(2\) logarithm of the cardinality of the covering. The lower bound is more subtle. We make use of the Shannon lower bound, discussed in Remark 4.3: working out the constants in [LZ94, §III, (22)] gives a more precise version of the result quoted in Equation (4.1.14) (in bits, of course): for any random variable \(\vx \) with compact support and a density, it holds

\begin{equation}\tag{B.3.1}\label {eq:slb-appendix-1} \cR _{\epsilon }(\x ) \geq h(\x ) - \log \volume (B_{\epsilon }) + \log \left ( \frac { 2 } { D \Gamma (D/2) } \left ( \frac { D } { 2e } \right )^{D/2} \right ), \end{equation}

with entropy (etc.) in nats in this expression. The constant can be easily estimated using Stirling’s approximation. A quantitative form of Stirling’s approximation which is often useful gives for any \(x > 0\) [Jam15]

\begin{equation} \Gamma (x) \leq \sqrt {2\pi } x^{x - 1/2} e^{-x} e^{1/(12x)}. \tag{B.3.2} \end{equation}

We will apply this bound to \(\Gamma (D/2)\) in Equation (B.3.1). We get

\begin{align} \log \left ( \frac { 2 } { D \Gamma (D/2) } \left ( \frac { D } { 2e } \right )^{D/2} \right ) &\geq -\frac {1}{6D} + \log \left ( \frac {2}{D} \left ( \frac { D } { 2e } \right )^{D/2} \cdot \sqrt {\frac {D}{4\pi }} \left ( \frac {D}{2e} \right )^{-D/2} \right ) \tag{B.3.3} \\ &= -\frac {1}{6D} - \frac {1}{2}\log D\pi ,\tag{B.3.4}\label {eq:slb-constant-est} \end{align}

which we can take for the explicit value of the constant \(C_D\) in Equation (4.1.14). Summarizing the fully quantified Shannon lower bound (in bits):

\begin{equation}\tag{B.3.5}\label {eq:slb-appendix} \cR _{\epsilon }(\x ) \geq h(\x ) - \log _2 \volume (B_{\epsilon }) - O(\log D). \end{equation}

Now, the important constraint for our current purposes is that the Shannon lower bound requires the random variable \(\vx \) to have a density, which rules out many low-dimensional distributions of interest. But let us momentarily consider the situation when \(\vx \) does admit a density. The assumption that \(\vx \) is uniformly distributed on its support is easily formalized in this setting: for any Borel set \(A \subset \cS \), we have

\begin{equation} \Pr [\vx \in A] = \int _{A} \frac {1}{\volume (\cS )} \odif \vx . \tag{B.3.6} \end{equation}

Then the entropy \(h(\vx )\) is just

\begin{equation} h(\vx ) = \log _2 \volume (\cS ). \tag{B.3.7} \end{equation}

The proof then concludes with a lemma that relates the ratio \(\volume (\cS ) / \volume (B_{\epsilon })\) to the \(\epsilon \)-covering number of \(\cS \) by \(\epsilon \) balls.

To extend the program above to degenerate distributions satisfying Assumption B.3, our proof of the lower bound in Theorem 4.1 will leverage an approximation argument of the actual low-dimensional distribution \(\vx \) by “nearby” distributions which have densities, similarly but not exactly the same as the proof sketch preceding Theorem B.1. We will then link the parameter introduced in the approximating sequence to the distortion parameter \(\epsilon \) in order to obtain the desired conclusion in Theorem 4.1.

Definition B.1. Let \(\cS \) be a compact set. For any \(\delta > 0\), define the \(\delta \)-thickening of \(\cS \), denoted \(\cS _{\delta }\), by

\begin{equation} \cS _{\delta } = \bc{\vxi \in \R ^D \given \mathrm {dist}(\vxi , \cS ) \leq \delta }. \tag{B.3.8} \end{equation}

The distance function referenced in Definition B.1 is defined by

\begin{equation}\tag{B.3.9}\label {eq:dist-func} \dist (\vxi , \cS ) = \inf _{\vxi ' \in \cS } \left\Vert \vxi - \vxi '\right\Vert_2. \end{equation}

For a compact set \(\cS \), Weierstrass’s theorem implies that for any \(\vxi \in \R ^D\), there is always some \(\vxi ' \in \cS \) attaining the infimum in the distance function. Compactness of \(\cS _{\delta }\) follows readily from compactness of \(\cS \), so \(\volume (\cS _{\delta })\) is finite for any \(\delta > 0\). It is then possible to make the following definition of a thickened random variable, specialized to Assumption B.3.

Definition B.2. Let \(\vx \) be a random variable such that \(\Supp (\vx ) = \cS \) is a union of \(K\) hyperspheres, distributed as in Assumption B.3. Denote the support of each component of the mixture by \(\cS _k\). Define the thickened random variable \(\vx _{\delta }\) as the mixture of measures where each component measure is uniform on the thickened set \(\cS _{k, \delta }\) (Definition B.1), for \(k \in [K]\), with mixing weights \(\pi _k\).

Lemma B.6. Suppose the random variable \(\vx \) satisfies Assumption B.3. Then if \(0 < \delta < \tfrac {1}{2}\), the thickened random variable \(\vx _{\delta }\) (Definition B.2) satisfies for any \(\epsilon > 0\)

\begin{equation} R_{\delta + \epsilon }(\vx _{\delta }) \leq R_{\epsilon }(\vx ). \tag{B.3.10} \end{equation}

The proof of Lemma B.6 is deferred to Section B.3.2. Using Lemma B.6, the above program can be realized, because the random variable \(\vx _{\delta }\) has a density that is uniform with respect to the Lebesgue measure.

(Proof of Theorem 4.1). The upper bound is readily shown. If \(S\) is any \(\epsilon \)-cover of the support of \(\vx \) with cardinality \(\cN _{\epsilon }(\Supp (\vx ))\), then consider the coding scheme assigning to each \(\vxi \in \Supp (\vx )\) the reconstruction \(\hat {\vxi } = \argmin _{\vxi ' \in S}\, \norm {\vxi - \vxi '}_2\), with ties broken arbitrarily. Then ties occur with probability zero, and the fact that \(S\) covers \(\Supp (\vx )\) at scale \(\epsilon \) guarantees distortion no larger than \(\epsilon \); the rate of this scheme is \(\log _2 \cN _{\epsilon }(\Supp (\vx ))\).

For the lower bound, let \(0 < \delta < \tfrac {1}{2}\), and consider the thickened random variable \(\vx _{\delta }\). By Lemma B.6, we have

\begin{equation} R_{\delta + \epsilon }(\vx _{\delta }) \leq R_{\epsilon }(\vx ). \tag{B.3.11} \end{equation}
Since \(\vx _{\delta }\) has a Lebesgue density that is uniform, we can then apply the Shannon lower bound, in the form (B.3.5), to get
\begin{equation}\tag{B.3.12}\label {eq:slb-proof} \log _2 \volume (\Supp (\vx _{\delta })) - \log _2 \volume (B_{\delta + \epsilon }) - O(\log D) \leq R_{\epsilon }(\vx ). \end{equation}
Finally, we need to lower bound the ratio
\begin{equation} \frac { \volume (\Supp (\vx _{\delta })) } { \volume (B_{\delta + \epsilon }) } \tag{B.3.13} \end{equation}
in terms of the covering number. Since \(\Supp (\vx _{\delta }) = \Supp (\vx ) + B_{\delta }\), where \(+\) here denotes the Minkowski sum, a standard application of volume bound arguments (see e.g. [Ver18, Proposition 4.2.12]) gives
\begin{equation} \volume (\Supp (\vx _{\delta })) \geq \cN _{2\delta }(\Supp (\vx )) \volume (B_{\delta }). \tag{B.3.14} \end{equation}
Hence
\begin{align} \frac { \volume (\Supp (\vx _{\delta })) } { \volume (B_{\delta + \epsilon }) } &\geq \cN _{2\delta }(\Supp (\vx )) \frac { \volume (B_{\delta }) } { \volume (B_{\delta + \epsilon }) } \tag{B.3.15} \\ &= \cN _{2\delta }(\Supp (\vx )) \left ( \frac { \delta } { \delta + \epsilon } \right )^D. \tag{B.3.16} \end{align}

Choosing \(\delta = \epsilon / 2\) gives from the Shannon lower bound (B.3.12) and the above estimates

\begin{equation} \log _2 \cN _{\epsilon }(\Supp (\vx )) - O(D) \leq R_{\epsilon }(\vx ), \tag{B.3.17} \end{equation}
as was to be shown. □

B.3.2 Proof of Lemma B.6

(Proof of Lemma B.6). It suffices to show that any code for \(\vx \) with expected squared distortion \(\epsilon ^2\) produces a code for \(\vx _{\delta }\) with the same rate and distortion not much larger, for a suitable choice of \(\delta \). So fix such a code for \(\vx \), achieving rate \(R\) and expected squared distortion \(\epsilon ^2\). We write \(\hat {\vx }\) for the reconstructed random variable using this code, and \(\mathrm {q} : \Supp (\vx ) \to \Supp (\vx )\) for the associated encoding-decoding mapping (i.e., \(\hat {\vx } = \mathrm {q}(\vx )\)).

Now let \(\cS _k\) denote the \(k\)-th hypersphere in the support of \(\vx \). There is an orthonormal basis \(\vU _{k} \in \R ^{D \times d_k}\) such that \(\Span (\cS _k) = \Span (\vU _k)\). The following orthogonal decomposition of the support set \(\cS \) will be used repeatedly throughout the proof. We have

\begin{align} \cS _{\delta } &= \set {\vxi \in \R ^D \given \exists k \in [K] \::\: \dist (\vxi , \cS _k) \leq \delta } \tag{B.3.18} \\ &= \bigcup _{k \in [K]} \set {\vxi \in \R ^D \given \dist (\vxi , \cS _k) \leq \delta }. \tag{B.3.19} \end{align}

By orthogonal projection, for any \(k \in [K]\) any \(\vxi \in \R ^D\) can be written as \(\vxi = \vxi ^{\|} + \vxi ^{\perp }\), with \(\vxi ^{\|} \in \Span (\cS _k)\) and \(\ip {\vxi ^{\|}}{\vxi ^\perp } = 0\). Then for any \(\vxi ' \in \cS _k\), we have

\begin{align} \left\Vert \vxi - \vxi '\right\Vert_2^2 = \left\Vert \vxi ^{\|} + \vxi ^{\perp } - \vxi '\right\Vert_2^2 &= \left\Vert \vxi ^{\|}\right\Vert_2^2 + \left\Vert \vxi ^{\perp }\right\Vert_2^2 + \left\Vert \vxi '\right\Vert_2^2 - 2 \left\langle \vxi ^{\|}, \vxi '\right\rangle \tag{B.3.20} \\ &\geq \left\Vert \vxi ^{\|}\right\Vert_2^2 + \left\Vert \vxi '\right\Vert_2^2 - 2 \left\langle \vxi ^{\|}, \vxi '\right\rangle \tag{B.3.21} \\ &= \left\Vert \vxi ^{\|} - \vxi '\right\Vert_2^2. \tag{B.3.22} \end{align}

Further, it is known that for any nonzero \(\vxi ^{\|} \in \Span (\cS _k)\),

\begin{equation} \inf _{\vxi ' \in \cS _k}\, \left\Vert \vxi ^{\|} - \vxi '\right\Vert_2^2 = \left\Vert \vxi ^{\|} - \frac {\vxi ^{\|}}{\norm {\vxi ^{\|}}_2}\right\Vert_2^2. \tag{B.3.23} \end{equation}
If \(\vxi ^{\|}\) is zero, it is clear that the above distance is equal to \(1\) for every \(\vxi ' \in \cS _k\). Hence, if we define a projection mapping \(\pi _{\cS _k}(\vxi )\) by
\begin{equation} \pi _{S_k}(\vxi ) = \frac {\vU _k\vU _k^\top \vxi }{\norm {\vU _k^\top \vxi }_2} \tag{B.3.24} \end{equation}
for any \(\vxi \in \R ^D\) with \(\vU _k^\top \vxi \neq \mathbf {0}\), then \(\pi _{\cS _k}(\vxi ) = \argmin _{\vxi ' \in \cS _k}\left\Vert \vxi ' - \vxi \right\Vert_2\). We choose \(0 < \delta < 1\), so that the thickened set \(\cS _{\delta }\) contains no points \(\vxi \in \R ^D\) at which any of the projection maps \(\pi _{\cS _k}\) is not well-defined. So the thickened set \(\cS _{\delta }\) satisfies
\begin{align} \cS _{\delta } &= \bigcup _{k \in [K]} \bc{\vxi \in \R ^D \given \left\Vert \vxi - \frac {\vU _k\vU _k^\top \vxi }{\norm {\vU _k^\top \vxi }_2}\right\Vert_2 \leq \delta }. \tag{B.3.25} \end{align}

These distances can be rewritten in terms of the orthogonal decomposition as

\begin{align} \left\Vert \vxi - \frac {\vU _k\vU _k^\top \vxi }{\norm {\vU _k^\top \vxi }_2}\right\Vert_2^2 &= \left\Vert \vxi \right\Vert_2^2 - 2 \norm {\vU _k^\top \vxi }_2 + 1 \tag{B.3.26} \\ &= \left\Vert \vxi ^{\|}\right\Vert_2^2 + \left\Vert \vxi ^{\perp }\right\Vert_2^2 - 2 \norm {\vU _k^\top \vxi ^{\|}}_2 + 1 \tag{B.3.27} \\ &= \left\Vert \vxi ^{\perp }\right\Vert_2^2 + \left ( \left\Vert \vxi ^{\|}\right\Vert_2 - 1 \right )^2. \tag{B.3.28}\label {eq:distance-to-hypersphere} \end{align}

We are going to show next that every such \(\vxi \in \cS _{\delta }\) can be uniquely associated to a projection onto a single subspace in the mixture, which will allow us to define a corresponding projection onto \(\cS \). Given a \(\vxi \in \cS _{\delta }\), by the above, we can find a subspace \(\vU _k\) such that the orthogonal decomposition \(\vxi = \vxi ^{\|}_k + \vxi ^{\perp }_k\) satisfies

\begin{equation} \left\Vert \vxi ^{\perp }_k\right\Vert_2^2 + \left ( \left\Vert \vxi ^{\|}_k\right\Vert_2 - 1 \right )^2 \leq \delta ^2. \tag{B.3.29} \end{equation}
Consider the decomposition \(\vxi = \vxi _j^{\|} + \vxi _j^{\perp }\) for some \(j \neq k\). We have
\begin{align} \left\Vert \vxi _j^{\|}\right\Vert_2 = \left\Vert \vU _j \vU _j^\top \vxi \right\Vert_2 &= \left\Vert \vU _j \vU _j^\top ( \vU _k \vU _k^\top \vxi + (\vI - \vU _k \vU _k^\top ) \vxi ) \right\Vert_2 \tag{B.3.30} \\ &= \left\Vert \vU _j \vU _j^\top (\vI - \vU _k \vU _k^\top ) \vxi \right\Vert_2 \tag{B.3.31} \\ &\leq \left\Vert (\vI - \vU _k \vU _k^\top ) \vxi \right\Vert_2 = \left\Vert \vxi _k^{\perp } \right\Vert_2 \leq \delta , \tag{B.3.32} \end{align}

where the second line uses the orthogonality assumption on the subspaces \(\vU _k\), and the third uses the fact that orthogonal projections are nonexpansive. Hence, the \(j\)-th distance satisfies

\begin{equation} \left\Vert \vxi ^{\perp }_j\right\Vert_2^2 + \left ( \left\Vert \vxi ^{\|}_j\right\Vert_2 - 1 \right )^2 \geq (1 - \delta )^2. \tag{B.3.33} \end{equation}
This implies that if \(0 < \delta < 1/2\), every \(\vxi \in \cS _{\delta }\) has a unique closest subspace in the mixture. Hence, under this condition, the following mapping \(\pi _{\cS } : \cS _{\delta } \to \cS \) is well-defined:
\begin{equation} \pi _{\cS }(\vxi ) = \pi _{\cS _{k_\star }}(\vxi ), \enspace \text {where}\enspace k_{\star } = \argmin _{k \in [K]}\, \dist (\vxi , \cS _{k}). \tag{B.3.34} \end{equation}

Now, we define a code for \(\vx _{\delta }\) by

\begin{equation} \hat {\vx }_{\delta } = \mathrm {q}( \pi _{\cS }(\vx _{\delta }) ). \tag{B.3.35} \end{equation}
Clearly this is associated to a rate-\(R\) code for \(\vx _{\delta }\), because it uses the encoding-decoding mappings from the rate-\(R\) code for \(\vx \). We have to show that it achieves small distortion. We calculate
\begin{align} \bE \left [ \left\Vert \vx _{\delta } - \hat {\vx }_{\delta } \right\Vert_2^2 \right ] &= \bE \left [ \left\Vert \vx _{\delta } - \mathrm {q}(\pi _{\cS }(\vx _{\delta })) \right\Vert_2^2 \right ] \tag{B.3.36} \\ &\leq \left ( \bE \left [ \left\Vert \vx _{\delta } - \pi _{\cS }(\vx _{\delta }) \right\Vert_2^2 \right ]^{1/2} + \bE \left [ \left\Vert \pi _{\cS }(\vx _{\delta }) - \mathrm {q}(\pi _{\cS }(\vx _{\delta })) \right\Vert_2^2 \right ]^{1/2} \right )^2,\tag{B.3.37}\label {eq:distortion-decomposition} \end{align}

where the inequality uses the Minkowski inequality. Now, by Definitions B.1 and B.2, we have deterministically

\begin{equation} \left\Vert \vx _{\delta } - \pi _{\cS }(\vx _{\delta }) \right\Vert_2^2 \leq \delta ^2, \tag{B.3.38} \end{equation}
so the expectation also satisfies this estimate. For the second term, it will suffice to characterize the density of the random variable \(\pi _{\cS }(\vx _{\delta })\) as being sufficiently close to the density of \(\vx \)—which, as Assumption B.3 implies, is a mixture of uniform distributions on each sub-sphere \(\cS _k\). By the argument above, every point \(\vxi \in \cS _{\delta }\) can be associated to one and only one subspace \(\vU _k\), which means that the mixture components in the definition of \(\cS _{\delta }\) (recall Definition B.2) do not overlap. Hence, the density \(\pi _{\cS }(\vx _{\delta })\) can be characterized by studying the effect of \(\pi _{\cS _k}\) on the conditional random variable \(\vx _{\delta }\), conditioned on being drawn from \(\cS _{k, \delta }\). Denote this measure by \(\mu _{k, \delta }\). We claim that the pushforward of this measure under \(\pi _{\cS _k}\) is uniform on \(\cS _k\). To see that this holds, we recall Equation (B.3.21), which gives the characterization
\begin{equation} \cS _{k, \delta } = \bc{ \vxi ^{\|} + \vxi ^{\perp } \given \vxi ^{\|} \in \Span (\vU _k), \vxi ^{\perp } \in \Span (\vU _k)^\perp , \left\Vert \vxi ^{\perp }\right\Vert_2^2 + \left ( \left\Vert \vxi ^{\|}\right\Vert_2 - 1 \right )^2 \leq \delta ^{2} }. \tag{B.3.39} \end{equation}
The conditional distribution in question is uniform on this set; we need to show that the projection \(\pi _{\cS _k}\) applied to this conditional random variable yields a random variable that is uniform on \(\cS _k\). With respect to these coordinates, we have seen that \(\pi _{\cS _k}(\vxi ^\| + \vxi ^\perp ) = \vxi ^\| / \norm {\vxi ^{\|}}_2\). Hence, for any \(\vxi \in \cS _{k}\), we have that the preimage of \(\vxi \) in \(\cS _{k, \delta }\) under \(\pi _{\cS _k}\) is
\begin{equation} \pi _{\cS _k}^{-1}(\vxi ) = \bc{ r\vxi + \vxi ^\perp \given r > 0, \vxi ^{\perp } \in \Span (\vU _k)^\perp , \left\Vert \vxi ^{\perp }\right\Vert_2^2 + \left ( r - 1 \right )^2 \leq \delta ^{2} }. \tag{B.3.40}\label {eq:fiber-of-uos} \end{equation}
To show that \((\pi _{\cS _k})_{\sharp } \mu _{k, \delta }\) is uniform, we need to decompose the integral of the uniform density on \(\cS _{k, \delta }\) in a way that makes it clear that each of the fibers \(\pi _{\cS _k}^{-1}(\vxi )\) (for each \(\vxi \in \cS _k\)) “contributes” equally to the integral.6 We have by Definition B.2
\begin{equation} \volume (\cS _{k, \delta }) = \iint _{\Span (\vU _k) \times \Span (\vU _k)^\perp } \mathbf {1}_{ \left\Vert \vxi ^{\perp }\right\Vert_2^2 + \left ( \left\Vert \vxi ^{\|}\right\Vert_2 - 1 \right )^2 \leq \delta ^{2} } \odif \vxi ^{\|} \odif \vxi ^{\perp }. \tag{B.3.41} \end{equation}
In particular, the integration over the orthogonal coordinates factors. Let \(\odif \vtheta ^{d}\) denote the uniform (Haar) measure on the sphere of radius \(1\) in \(\R ^d\). Converting the \(\vxi ^{\|}\) integral to polar coordinates, we have
\begin{equation} \volume (\cS _{k, \delta }) = \int _{[0, \infty )} \int _{\bS ^{d_k-1}} \int _{\Span (\vU _k)^\perp } r^{d_k-1} \mathbf {1}_{ \left\Vert \vxi ^{\perp }\right\Vert_2^2 + \left ( r - 1 \right )^2 \leq \delta ^{2} } \odif r \odif \vtheta ^{d_k} \odif \vxi ^{\perp }. \tag{B.3.42} \end{equation}
Comparing to the fiber representation (B.3.40), we see that we need to “integrate out” over the \(r\) and \(\vxi ^\perp \) components of the preceding integral in order to verify that the pushforward is uniform. But this is evident, as the previous expression shows that the value of this integral is independent of \(\vxi ^\|\)—or, equivalently in context, the value of the spherical component \(\vtheta ^{d_k}\).

Thus it follows from the above argument that \(\pi _{\cS }(\vx _{\delta })\) is uniform. Because the assumption on \(\delta \) implies that the mixture components in the distribution of \(\vx _{\delta }\) do not overlap, the mixing weights \(\pi _k\) are also preserved in the image \(\pi _{\cS }(\vx _{\delta })\), and in particular, the distribution of \(\pi _{\cS }(\vx _{\delta })\) is equal to the distribution of \(\vx \). Hence the second term in Equation (B.3.27) satisfies

\begin{equation} \bE \left [ \left\Vert \pi _{\cS }(\vx _{\delta }) - \mathrm {q}(\pi _{\cS }(\vx _{\delta })) \right\Vert_2^2 \right ] = \bE \left [ \left\Vert \vx - \mathrm {q}(\vx ) \right\Vert_2^2 \right ] \leq \epsilon ^2, \tag{B.3.43} \end{equation}
because \(\mathrm {q}\) is a distortion-\(\epsilon \) code for \(\vx \).

We have thus shown that the hypothesized rate-\(R\), (expected squared) distortion-\(\epsilon ^2\) code for \(\vx \) produces a rate-\(R\), (expected squared) distortion \(\delta + \epsilon \) code for \(\vx _{\delta }\). This establishes that

\begin{equation} R_{\delta + \epsilon }(\vx _{\delta }) \leq R_{\epsilon }(\vx ), \tag{B.3.44} \end{equation}
as was to be shown. □