Appendix B Entropy, Diffusion, Denoising, and Lossy Coding
“The increase of disorder or entropy with time is one example of
what is called an arrow of time, something that distinguishes the past
from the future, giving a direction to time.”
\(~\) – A Brief History of Time, Stephen Hawking
In this appendix we provide proofs for several facts, mentioned in Chapter 3, which
are related to differential entropy, how it evolves under diffusion processes, and its
connections to lossy coding. We will make the following mild assumption about the
random variable representing the data source, denoted \(\vx \).
Assumption B.1.\(\vx \) is supported on a compact set \(\cS \subseteq \R ^{D}\) of radius at most \(R\), i.e., \(R \doteq \sup _{\vxi \in \cS }\norm {\vxi }_{2}\).
In particular, since compact sets in Euclidean space are bounded, it holds \(R < \infty \). We
will consistently use the notation \(B_{r}(\vxi ) \doteq \{\vu \in \R ^{D} \colon \norm {\vxi - \vu }_{2} \leq r\}\) to denote the Euclidean ball of radius \(r\) centered at
\(\vxi \). In this sense, Assumption B.1 has \(\cS \subseteq B_{R}(\vzero )\).
Notice that this assumption holds for (almost) all variables we care about
in practice, as it is (often) imposed by a normalization step during data
pre-processing.
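For instance, one common convention of this kind (a minimal sketch of a possible pre-processing step, not a prescription from the text) rescales a finite dataset so that every sample lies in the unit ball \(B_{1}(\vzero )\):
\begin{verbatim}
# Sketch of one common normalization enforcing Assumption B.1 with R = 1 on a
# finite dataset: rescale every sample by the largest sample norm.
import numpy as np

def normalize_to_unit_ball(X):
    """X: (n, D) array of samples; returns the samples rescaled into B_1(0)."""
    max_norm = np.linalg.norm(X, axis=1).max()
    return X / max_norm if max_norm > 0 else X
\end{verbatim}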
B.1 Differential Entropy of Low-Dimensional Distributions
In this short appendix we discuss the differential entropy of low-dimensional
distributions. By definition, the differential entropy of a random variable \(\vx \) which
does not have a density is \(-\infty \); this includes all random variables supported on
low-dimensional sets. The objective of this section is to discuss why this is a “morally
correct” value.
In fact, let \(\vx \) be any random variable such that Assumption B.1 holds and the support \(\cS \) of \(\vx \) has \(0\)
volume.1 We will consider
the case that \(\vx \) is uniform on \(\cS \).2
Our goal is to compute \(h(\vx )\).
In this case, \(\vx \) would not have a density; in the counterfactual world where we did
not know \(h(\vx ) = -\infty \), we could not directly define it using the standard definition of differential
entropy. Instead, as is common in analysis and information theory, it would be
reasonable to consider the limit of entropies of successively better approximations \(\vx _{\eps }\) of \(\vx \)
which have densities, i.e., to examine \(\lim _{\eps \to 0}h(\vx _{\eps })\), where \(\vx _{\eps }\) is supported on the \(\eps \)-thickening \(\cS _{\eps } \doteq \cS + B_{\eps }(\vzero )\) of \(\cS \), illustrated in Figure B.1.
Figure B.1: Illustration of the \(\eps \)-thickening \(\cS _\eps \) of a curve \(\cS \subseteq \R ^{2}\).
We will work with random variables whose support is \(\cS _{\eps }\), which is full-dimensional,
and take the limit as \(\eps \to 0\). Indeed, define \(\vx _{\eps } \sim \dUnif (\cS _{\eps })\). Since \(\cS _{\eps }\) has positive volume, \(\vx _{\eps }\) has a density \(p_{\eps }\)
equal to
\begin{equation*} p_{\eps }(\vxi ) = \begin{cases} 1/\volume (\cS _{\eps }), & \vxi \in \cS _{\eps }, \\ 0, & \text {otherwise}. \end{cases} \end{equation*}
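Concretely, the calculation is short: since \(\vx _{\eps } \sim \dUnif (\cS _{\eps })\),
\begin{equation*} h(\vx _{\eps }) = -\int _{\cS _{\eps }}\frac {1}{\volume (\cS _{\eps })}\log \frac {1}{\volume (\cS _{\eps })}\odif {\vxi } = \log \volume (\cS _{\eps }), \end{equation*}
and since the bounded sets \(\cS _{\eps }\) decrease to \(\cS \) as \(\eps \to 0\), continuity of the Lebesgue measure gives \(\volume (\cS _{\eps }) \to \volume (\cS ) = 0\), whence \(h(\vx _{\eps }) \to -\infty \).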
The above calculation is actually a corollary of a much more famous and
celebrated set of results about the maximum possible entropy of \(\vx \) subject
to certain constraints on the distribution of \(\vx \). We would be remiss to not
provide the results here; the proofs are provided in Chapter 2 of [PW22], for
example.
Theorem B.1.Let \(\vx \) be a random variable on \(\R ^{D}\).
1.
If \(\vx \) is supported on a compact set \(\cS \subseteq \R ^{D}\) (i.e., Assumption B.1) then
2.
If \(\vx \) has finite covariance such that, for a PSD matrix \(\vSigma \in \PSD (D)\), it holds \(\Cov (\vx ) \preceq \vSigma \) (w.r.t. the PSD
ordering, i.e., \(\vSigma - \Cov (\vx )\) is PSD), then
where \(\vg \sim \dNorm (\vzero , \vI )\) independently of \(\vx \).
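For reference, the standard forms of the bounds in items 1 and 2 (see [PW22, Chapter 2]) are \(h(\vx ) \leq \log \volume (\cS )\) in the compactly supported case and \(h(\vx ) \leq \frac {1}{2}\log \bp {(2\pi e)^{D}\det (\vSigma )}\) in the bounded-covariance case, the latter being attained by \(\vx \sim \dNorm (\vzero , \vSigma )\) when \(\vSigma \) is positive definite.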
The structure of this section is as follows. In Section B.2.1 we provide a formal
theorem and crisp proof which shows that under Equation (B.2.1) the entropy
increases, i.e., \(\frac{\mathrm{d}}{\mathrm{d}t} h(\vx _{t}) > 0\). In Section B.2.2 we provide a formal theorem and crisp proof which
shows that under Equation (B.2.1), the entropy decreases during denoising, i.e., \(h(\Ex [\vx _{s} \given \vx _{t}]) < h(\vx _{t})\) for
all \(s < t\). In Section B.2.3 we provide proofs for technical lemmas that are needed to
establish the claims in the previous subsections.
Before we start, we introduce some key notation. First, let \(\phi _{t}\) be the density of \(\dNorm (\vzero , t^{2}\vI )\),
i.e.,
\begin{equation*} \phi _{t}(\vxi ) \doteq \frac {1}{(2\pi t^{2})^{D/2}}\exp \bp {-\frac {\norm {\vxi }_{2}^{2}}{2t^{2}}}. \end{equation*}
Next, \(\vx _{t}\) is supported on all of \(\R ^{D}\), so it has a density, which we denote \(p_{t}\) (as in the
main body). A quick calculation shows that
\begin{equation}\tag{B.2.3} p_{t}(\vxi ) = \Ex [\phi _{t}(\vxi - \vx )], \qquad \forall \vxi \in \R ^{D}, \end{equation}
and from this representation
it is possible to deduce (namely, from Proposition B.4) that \(p_{t}\) is smooth (i.e.,
infinitely differentiable) in \(\vxi \), also smooth in \(t\), and positive everywhere. This
fact is somewhat remarkable at first sight: even for a completely irregular
random variable \(\vx \) (say, a Bernoulli random variable, which does not have a
density), its Gaussian smoothing admits a density for every (arbitrarily small)
\(t > 0\). The proof is left as an exercise for readers well-versed in mathematical
analysis.
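For instance, a one-dimensional sanity check of this claim: if \(\vx \) takes the values \(0\) and \(1\) with probability \(1/2\) each, then
\begin{equation*} p_{t}(\xi ) = \frac {1}{2}\phi _{t}(\xi ) + \frac {1}{2}\phi _{t}(\xi - 1), \end{equation*}
which is positive and infinitely differentiable in \(\xi \) (and in \(t > 0\)), even though \(\vx \) itself has no density.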
However, we also need to add an assumption about the smoothness of the distribution
of \(\vx \), which will eliminate some technical problems that occur around \(t = 0\) with low-dimensional
distributions.3
Despite this, we expect that our results hold under milder assumptions with
additional work. For now, let us assume:
Assumption B.2.\(\vx \) has a twice continuously differentiable density, denoted \(p\).
B.2.1 Diffusion Process Increases Entropy Over Time
In this section we provide a proof of Theorem B.2. For convenience, we restate it as
follows.
Theorem B.2 (Diffusion Increases Entropy).Let \(\vx \) be any random variable such thatAssumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). Then
\begin{equation}\tag{B.2.4}\label {eq:diffusion_entropy_increases} h(\vx _{s}) < h(\vx _{t}), \qquad \forall s, t \colon 0 \leq s < t \leq T. \end{equation}
Proof.Before we start, we must ask: when does the inequality in (B.2.4) make
sense? We will show in Lemma B.1 that under our assumptions, the differential
entropy is well-defined, is never \(+\infty \), and for \(t > 0\) is finite, so the (strict) inequality in
(B.2.4) makes sense.
The question of well-definedness aside, the crux of this proof is to show that
the density \(p_{t}\) of \(\vx _{t}\) satisfies a particular partial differential equation, which is very
similar to the heat equation. The heat equation is a famous PDE which describes
the diffusion of heat through space. This intuitively should make sense, and
paints a mental picture: as the time \(t\) increases, the probability from the original
(perhaps tightly concentrated) \(\vx \) disperses across all of \(\R ^{D}\) like heat radiating from
a source in a vacuum.
Such PDEs for \(p_{t}\), known as Fokker-Planck equations for more general
stochastic processes, are very powerful tools, as they allow us to describe the
instantaneous temporal derivatives of \(p_{t}\) in terms of the instantaneous spatial
derivatives of \(p_{t}\), and vice versa, providing a concise description of the regularity
and dynamics of \(p_{t}\). Once we obtain dynamics for \(p_{t}\), we can then use the system
to obtain another one which describes the dynamics of \(h(\vx _{t})\), which after all is just
a functional of \(p_{t}\).
The description of the PDE involves a mathematical object called the Laplacian \(\Delta \).
Recall from your multivariable calculus class that the Laplacian operating on a
differentiable-in-time and twice-differentiable-in-space function \(f \colon (0, T) \times \R ^{D} \to \R \) is given by
\begin{equation*} \Delta f(t, \vxi ) \doteq \sum _{i = 1}^{D}\frac {\partial ^{2}f}{\partial \xi _{i}^{2}}(t, \vxi ), \end{equation*}
i.e., \(\Delta \) acts only on the spatial variable \(\vxi \).
Namely, using the integral representation of \(p_{t}\) and differentiating under the
integral, we can compute the derivatives of \(p_{t}\) (which we do in Proposition B.1) and
observe that \(p_{t}\) satisfies the heat-like PDE
\begin{equation*} \pdv {p_{t}}{t}(\vxi ) = t\,\Delta p_{t}(\vxi ), \qquad \forall \vxi \in \R ^{D},\ t \in (0, T). \end{equation*}
Differentiating \(h(\vx _{t}) = -\int _{\R ^{D}}p_{t}\log p_{t}\) under the integral once more, substituting this PDE, and integrating by parts (the boundary term vanishes, as established in Lemma B.2), we obtain
\begin{align*} \frac {\mathrm {d}}{\mathrm {d}t}h(\vx _{t}) &= -\int _{\R ^{D}}\pdv {p_{t}}{t}(\vxi )\bp {1 + \log p_{t}(\vxi )}\odif {\vxi } = -t\int _{\R ^{D}}\Delta p_{t}(\vxi )\bp {1 + \log p_{t}(\vxi )}\odif {\vxi } \\ &= t\int _{\R ^{D}}\frac {\norm {\nabla p_{t}(\vxi )}_{2}^{2}}{p_{t}(\vxi )}\odif {\vxi } > 0, \end{align*}
where strict inequality holds in the last line because, were the inequality to fail, \(\nabla p_{t}(\vxi )\) would
need to vanish almost everywhere (i.e., everywhere except possibly on a set of zero
volume), which would imply that \(p_{t}\) is constant almost everywhere, a
contradiction with the fact that \(p_{t}\) is a density.
To complete the proof we just use the fundamental theorem of calculus,
\begin{equation*} h(\vx _{t}) - h(\vx _{s}) = \int _{s}^{t}\frac {\mathrm {d}}{\mathrm {d}\tau }h(\vx _{\tau })\odif {\tau } > 0, \end{equation*}
which
proves the claim. (Note that this does not make sense when \(h(\vx _{s}) = -\infty \), which can only
happen when \(s = 0\) and \(h(\vx ) = -\infty \), but in this case \(h(\vx _{t}) > -\infty \) so the claim is vacuously true anyways.)
□
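As a purely numerical illustration of Theorem B.2 (ours, not part of the formal development), the following sketch assumes the process takes the form \(\vx _{t} = \vx + t\vg \) with \(\vg \sim \dNorm (0, 1)\) in \(D = 1\), takes \(\vx \) to be the two-point source from the earlier example, and estimates \(h(\vx _{t})\) by quadrature for several values of \(t\); the printed values should be strictly increasing.
\begin{verbatim}
# Numerical sanity check of Theorem B.2 for a 1D two-point source x in {0, 1}:
# x_t = x + t*g has density p_t(xi) = 0.5*phi_t(xi) + 0.5*phi_t(xi - 1); we
# estimate h(x_t) = -integral of p_t*log(p_t) by a Riemann sum on a wide grid.
import numpy as np

def entropy_of_smoothed_two_point(t, half_width=40.0, n=400_000):
    xi, dx = np.linspace(-half_width, half_width, n, retstep=True)
    phi = lambda z: np.exp(-z**2 / (2 * t**2)) / np.sqrt(2 * np.pi * t**2)
    p = 0.5 * phi(xi) + 0.5 * phi(xi - 1.0)
    mask = p > 0                 # avoid log(0) where the density underflows
    return float(-(p[mask] * np.log(p[mask])).sum() * dx)

for t in [0.05, 0.1, 0.5, 1.0, 3.0]:
    print(f"t = {t:4.2f}   h(x_t) ~ {entropy_of_smoothed_two_point(t):+.4f} nats")
# The printed values should be strictly increasing in t.
\end{verbatim}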
B.2.2 Denoising Process Reduces Entropy Over Time
Recall that in Section 3.2.1 we start with the random variable \(\vx _{T}\) and iteratively
denoise it using iterations of the form
\begin{equation}\tag{B.2.15} \hat {\vx }_{s} \doteq \Ex [\vx _{s} \given \vx _{t} = \hat {\vx }_{t}] \end{equation}
for \(s, t \in \{t_{0}, t_{1}, \dots , t_{L}\}\) with \(s < t\) and \(\vx _{T} = \hat {\vx }_{T}\). We wish to prove that \(h(\hat {\vx }_{s}) < h(\hat {\vx }_{t})\),
showing that the denoising process actually reduces the entropy.
Before we go about doing this, we make several remarks about the problem
statement. First, Tweedie’s formula (3.2.23) says that
\begin{equation*} \Ex [\vx \given \vx _{t}] = \vx _{t} + t^{2}\nabla \log p_{t}(\vx _{t}), \end{equation*}
which likens a full denoising
step from time \(t\) to time \(0\) to a gradient step on the log-density of \(\vx _{t}\). Can we get a
similar result for the full denoising step from time \(t\) to time \(s\) in (B.2.15)? It turns out
that indeed we can, and it is pretty simple. By using (B.2.15) and Tweedie’s formula
(3.2.23), we obtain
\begin{equation*} \Ex [\vx _{s} \given \vx _{t}] = \frac {s}{t}\,\vx _{t} + \bp {1 - \frac {s}{t}}\Ex [\vx \given \vx _{t}] = \vx _{t} + \bp {1 - \frac {s}{t}}t^{2}\nabla \log p_{t}(\vx _{t}). \end{equation*}
So this iterative denoising step is again a gradient step on the
perturbed log-density \(\log p_{t}\) with a shrunken step size. In particular, this step can be
seen as a perturbation of the distribution of the random variable \(\vx _{t}\) by the
score function vector field, suggesting a connection to stochastic differential
equations (SDEs) and the theory of diffusion models [SSK+21]. Indeed, a proof
of the following result Theorem B.3 can be developed using this powerful
machinery and a limiting argument (e.g., following the technical approach in the
exposition of [CCL+23]). We will give a simpler proof here, which will use only
elementary tools and thereby illuminate some of the key quantities behind the
process of entropy reduction via denoising. On the other hand, we will need
to deal with some slightly technical calculations due to the fact that the
denoising process in Theorem B.3 does not correspond exactly to the reverse
process associated to the noise addition process that generates the observation
\(\vx _{t}\).4
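To make the preceding discussion concrete, here is a minimal numerical sketch (ours, not from the text) of the gradient-step form of the denoising iteration, \(\hat {\vx }_{s} = \hat {\vx }_{t} + \bp {1 - \frac {s}{t}}t^{2}\nabla \log p_{t}(\hat {\vx }_{t})\), again for the one-dimensional two-point source, for which the score \(\nabla \log p_{t}\) is available in closed form via Tweedie’s formula.
\begin{verbatim}
# Sketch (not from the text) of the denoising gradient step
#   x_hat_s = x_hat_t + (1 - s/t) * t^2 * score_t(x_hat_t)
# for a 1D source on {0, 1}, so p_t = 0.5*N(0, t^2) + 0.5*N(1, t^2) and the score
# has the closed form (E[x | x_t] - x_t) / t^2 given by Tweedie's formula.
import numpy as np

def score_t(xi, t):
    w0 = np.exp(-xi**2 / (2 * t**2))           # unnormalized weight of the component at 0
    w1 = np.exp(-(xi - 1.0)**2 / (2 * t**2))   # unnormalized weight of the component at 1
    posterior_mean = w1 / (w0 + w1)            # E[x | x_t = xi], since x takes values in {0, 1}
    return (posterior_mean - xi) / t**2

ts = np.linspace(3.0, 0.05, 60)                # decreasing noise schedule t_L > ... > t_0
x_hat = 3.0                                    # a noisy starting point, playing the role of x_T
for t, s in zip(ts[:-1], ts[1:]):
    x_hat = x_hat + (1 - s / t) * t**2 * score_t(x_hat, t)
print(x_hat)                                   # ends up close to the support {0, 1}
\end{verbatim}
The iterate is pulled toward the support of \(\vx \), which is the qualitative mechanism behind the entropy decrease established below.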
We want to prove that \(h(\Ex [\vx _{s} \mid \vx _{t}]) < h(\vx _{t})\), i.e., formally:
Theorem B.3.Let \(\vx \) be any random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). For each \(t > 0\), write \(U_{t} \doteq \sup _{\vxi \in \R ^{D}}\left |\Delta \log p_{t}(\vxi )\right |\), and suppose \(0 \leq s < t\) satisfy condition (B.2.20). Then \(h(\Ex [\vx _{s} \given \vx _{t}]) < h(\vx _{t})\).
Proof.The proof proceeds in two steps:
1.
First, write down a density for \(\Ex [\vx _{s} \mid \vx _{t}]\) using a change-of-variables formula.
2.
Second, bound this density to control the entropy.
The change of variables is justified by Corollary B.1, which was originally derived in
[Gri11].
We execute these ideas now. From Corollary B.1, we obtain that the function \(\bar {\vx }\)
defined as \(\bar {\vx }(\vxi ) \doteq \Ex [\vx _{s} \given \vx _{t} = \vxi ]\) is differentiable, injective, and thus invertible on its range, which we
henceforth denote \(\cX \subseteq \R ^{D}\). We denote its inverse as \(\bar {\vx }^{-1}\). Using a change-of-variables
formula, the density \(\bar {p}\) of \(\bar {\vx }(\vx _{t})\) is given by
\begin{equation*} \bar {p}(\vxi ) = \frac {p_{t}\bp {\bar {\vx }^{-1}(\vxi )}}{\det \bp {\bar {\vx }^{\prime }\bp {\bar {\vx }^{-1}(\vxi )}}}, \qquad \vxi \in \cX , \end{equation*}
where (recall, from Section A.2) \(\bar {\vx }^{\prime }\) is the
Jacobian of \(\bar {\vx }\). Since from Lemma B.3 we know \(\bar {\vx }^{\prime }\) is a positive definite matrix, the
determinant is positive and so the whole density is positive. Then it follows that
Setting \(\varepsilon \doteq \bp {1 - \frac {s}{t}}t^{2}U_{t}\), we have \(\bp {1 - \frac {s}{t}}t^{2}\Delta \log p_{t}(\vxi ) \leq \varepsilon \) for all \(\vxi \). By Taylor’s theorem with Lagrange
remainder, for any \(c \leq \varepsilon \),
since \(e^{c} = 1 + c + \frac {c^{2}}{2}e^{\theta c}\) for some \(\theta \in (0, 1)\), and \(e^{\theta c} \leq e^{\max (c, 0)} \leq e^{\varepsilon }\). Applying these bounds and integrating,
where the identity \(\int p_{t}\Delta \log p_{t} = -\int \norm {\nabla p_{t}}_{2}^{2}/p_{t}\) is the same as in the proof of Theorem B.2. Writing \(J(p_{t}) \doteq \int _{\R ^{D}}\frac {\norm {\nabla p_{t}(\vxi )}_{2}^{2}}{p_{t}(\vxi )}\odif {\vxi }\) for the
Fisher information of \(p_{t}\), and bounding \(\bp {\Delta \log p_{t}}^{2} \leq U_{t}^{2}\), we combine with our previous estimate to
obtain
This is strictly negative whenever \(\varepsilon ^{2}e^{\varepsilon } < 2\bp {1 - \frac {s}{t}}t^{2}J(p_{t})\). Substituting \(\varepsilon = \bp {1 - \frac {s}{t}}t^{2}U_{t}\), this is precisely condition
(B.2.20). □
Notice that condition (B.2.20) is always satisfiable: for any fixed \(t > 0\), the Fisher
information satisfies \(J(p_{t}) > 0\) (since \(p_{t}\) is not constant) and \(U_{t} < \infty \), so taking \(\bp {1 - \frac {s}{t}}t^{2}\) small enough ensures
(B.2.20) holds.
To understand the quantitative implications, consider the behavior as \(t \to 0\) (near the
data), where the condition is most restrictive. Setting \(s = (1 - \eps )t\), the condition becomes \(\eps t^{2}U_{t}^{2}\exp (\eps t^{2}U_{t}) < 2J(p_{t})\). For
small \(t\), the Laplacian bound satisfies \(U_{t} \approx R^{2}/t^{4}\), so the polynomial prefactor scales as \(\eps t^{2}U_{t}^{2} \approx \eps R^{4}/t^{6}\). The
critical scaling is therefore \(\eps \sim t^{6}/R^{4}\). Taking \(\eps = \alpha t^{6}/R^{4}\), the condition reduces to \(\alpha < 2J(p_{t})\). In other words, each
denoising step of size \(t - s = \eps \, t \sim t^{7}/R^{4}\) reduces the entropy. The steps must become increasingly fine
as \(t \to 0\), a consequence of the second-order remainder in the Taylor expansion used in the
proof.
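To put rough (purely illustrative) numbers on this scaling: with \(R = 1\), at \(t = 10^{-1}\) the admissible step size is of order \(t - s \sim \alpha t^{7} = \alpha \times 10^{-7}\), whereas at \(t = 1\) the same analysis permits constant-order steps \(t - s \sim \alpha \).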
At the same time, it should be noted that our bounds are likely loose in
terms of the precise \(t\)-dependence, and certainly so for data with greater
structure. For instance, the \(U_{t} \sim R^{2}/t^{4}\) blowup reflects a worst-case Laplacian bound that
can be considerably tightened under additional regularity assumptions on \(p\).
Moreover, if \(p_{t}\) is log-concave (as holds, e.g., when \(p\) itself is log-concave), then \(\Delta \log p_{t} \leq 0\)
everywhere, and a significantly simplified argument can be run with strengthened
conclusions.
B.2.3 Technical Lemmas and Intermediate Results
In this subsection we present technical results which power our main two conceptual
theorems. Our presentation will be more or less standard for mathematics; we will
start with the higher-level results first, and gradually move back to the more
incremental results. The higher-level results will use the incremental results, and in
this way we have an easy-to-read dependency ordering of the results: no
result depends on those before it. Results which do not depend on each
other are generally ordered by the place they appear in the above pair of
proofs.
Finiteness of the Differential Entropy
We first show that the entropy exists along the stochastic process and is
finite.
Lemma B.1.Let \(\vx \) be any random variable, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process
(B.2.1).
1.
For \(t > 0\), the differential entropy \(h(\vx _{t})\) exists and is \(> -\infty \).
2.
If in addition Assumption B.1holds for \(\vx \), then \(h(\vx ) < \infty \) and \(h(\vx _{t}) < \infty \).
Proof.To prove Lemma B.1.1, we use a classic yet tedious analysis argument. Since \(\vx _{t}\)
has a density, we can write
\begin{equation*} h(\vx _{t}) = \int _{\R ^{D}}g_{+}(\vxi )\odif {\vxi } - \int _{\R ^{D}}g_{-}(\vxi )\odif {\vxi }, \qquad g(\vxi ) \doteq -p_{t}(\vxi )\log p_{t}(\vxi ), \end{equation*}
where \(g_{+} \doteq \max \{g, 0\}\) and \(g_{-} \doteq \max \{-g, 0\}\) are the positive and negative parts of \(g\),
and both integrals are guaranteed to be non-negative since their
integrands are.
In order to show that \(h(\vx _{t})\) is well-defined, we need to show that \(\int _{\R ^{D}}g_{+}(\vxi )\odif {\vxi } < \infty \) or \(\int _{\R ^{D}}g_{-}(\vxi ) \odif {\vxi } < \infty \). To show that \(h(\vx _{t}) > -\infty \), it
merely suffices to show that \(\int _{\R ^{D}}g_{-}(\vxi )\odif {\vxi } < \infty \). To bound the integral of \(g_{-}\) we need to understand the
quantity \(g_{-}\), namely, we want to characterize when \(g\) is negative.
In order to bound the integral of \(g_{-}(\vxi )\), we need to show that \(p_{t}\) is “not too
concentrated,” namely that \(p_{t}\) is not too large. To prove this, in this case we are
lucky enough to be able to bound the function \(g_{-}(\vxi )\) itself. Namely, notice that \(p_{t}(\vxi ) = \Ex [\phi _{t}(\vxi - \vx )] \leq (2\pi t^{2})^{-D/2}\), so that
\begin{equation*} g_{-}(\vxi ) = \max \{p_{t}(\vxi )\log p_{t}(\vxi ), 0\} \leq p_{t}(\vxi )\max \left \{0, \tfrac {D}{2}\log \tfrac {1}{2\pi t^{2}}\right \}, \end{equation*}
and integrating over \(\R ^{D}\) gives \(\int _{\R ^{D}}g_{-}(\vxi )\odif {\vxi } \leq \max \left \{0, \tfrac {D}{2}\log \tfrac {1}{2\pi t^{2}}\right \} < \infty \). Hence the differential entropy \(h(\vx _{t})\) exists and is \(> -\infty \).
To prove Lemma B.1.2, suppose that Assumption B.1 holds. We want to
show that \(h(\vx ) < \infty \) and \(h(\vx _{t}) < \infty \). The mechanism for doing this is the same, and involves
the maximum entropy result Theorem B.1. Namely, since \(\vx \) is absolutely
bounded, it has a finite covariance which we will denote \(\vSigma \). Then the covariance of
\(\vx _{t}\) is \(\vSigma + t^{2}\vI \). Thus the entropies of \(\vx \) and \(\vx _{t}\) can be upper bounded by the entropies of
normal distributions with the respective covariances, i.e., \(\frac {1}{2}\log [(2\pi e)^{D}\det (\vSigma )]\) and \(\frac {1}{2}\log [(2\pi e)^{D}\det (\vSigma + t^{2}\vI )]\) respectively, and both are \(< \infty \).
□
Integration by Parts in De Bruijn Identity
Finally, we fill in the integration-by-parts argument alluded to in the proofs of
Theorems B.2 and B.3. The argument is conceptually pretty simple but requires
some technical estimates to show that the boundary term in the integration-by-parts
vanishes.
Lemma B.2.Let \(\vx \) be any random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). For \(t \geq 0\), let \(p_{t}\) be the density of \(\vx _{t}\). Then for
a constant \(c \in \R \) it holds
where \(\odif {\sigma (\vxi )}\) denotes an integral over the “surface measure”,
i.e., the inherited measure on \(\partial \cK \), namely the boundary of \(\cK \), and accordingly \(\vxi \) takes
values on this surface and \(\vn (\vxi )\) is the unit normal vector to \(\cK \) at the surface point \(\vxi \). Now,
taking \(\plainphi (\vxi ) \doteq p_{t}(\vxi )\) and \(\psi (\vxi ) \doteq c + \log p_{t}(\vxi )\), over a ball \(B_{r}(\vzero )\) of radius \(r > 0\) centered at \(\vzero \) (so that \(\partial B_{r}(\vzero )\) is the sphere of radius \(r\)
centered at \(\vzero \) and \(\vn (\vxi ) = \vxi /\norm {\vxi }_{2} = \vxi /r\)):
where the first equality follows by dominated convergence on the integrand. It
remains to compute the last limit. For this, we take asymptotic expansions of each
term. The main device is as follows: for \(\vxi \in \partial B_{r}(\vzero )\), we have \(\norm {\vxi }_{2} = r\), so
For \(r > R > 0\) (as is suitable, because we are going to take the limit \(r \to \infty \) while \(R\) is fixed), both
sides are negative. This makes sense: most of the probability mass is contained within
the ball of radius \(R\) and thus the score points inwards, having a negative inner product
with the outward-pointing vector \(\vxi \). Thus using the appropriate bounds for \(p_{t}(\vxi )\),
So one can see that, letting the
surface area of \(\partial B_{r}(\vzero )\) be \(\omega _{D} r^{D - 1}\) where \(\omega _{D}\) is a function of \(D\), it holds
Local Invertibility of the Denoiser \(\bar {\vx }\)
Here we provide some results used in the proof of Theorem B.3 which are
appropriate generalizations of corresponding results in [Gri11].
Lemma B.3 (Generalization of [Gri11], Lemma A.1).Let \(\vx \) be any random
variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic
process (B.2.1). Let \(s, t \in [0, T]\) be such that \(0 \leq s < t \leq T\), and let \(\bar {\vx }(\vxi ) \doteq \Ex [\vx _{s} \mid \vx _{t} = \vxi ]\). The Jacobian \(\bar {\vx }^{\prime }(\vxi )\) is symmetric positive
definite.
is symmetric positive semidefinite. Indeed it is obviously symmetric (by Clairaut’s
theorem). To show its positive semidefiniteness, we plug in the expectation
representation of \(p_{t}\) given by (B.2.3) (and \(\nabla p_{t}\), \(\Delta p_{t}\) by Proposition B.1) to obtain (where \(\vx \) is as
defined and \(\vy \) is i.i.d. as \(\vx \)),
Since \(\vx \) and \(\vy \) are i.i.d., the whole integral (i.e., the original quadratic form) is \(0\) if
and only if \(s = 0\) and \(\vx \) has support entirely contained in an affine subspace which is
orthogonal to \(\vv \). But this is ruled out by assumption (i.e., that \(\vx \) has a density on \(\R ^{D}\)), so
the Jacobian \(\bar {\vx }^{\prime }(\vxi )\) is symmetric positive definite. □
Lemma B.4 (Generalization of [Gri11] Corollary A.2, Part 1).Let \(f \colon \R ^{D} \to \R ^{D}\) be any
differentiable function whose Jacobian \(f^{\prime }(\vx )\) is symmetric positive definite. Then \(f\) is
injective, and hence invertible as a function \(\R ^{D} \to \Range (f)\) where \(\Range (f)\) is the range of \(f\).
Proof.Suppose that \(f\) were not injective, i.e., there exists \(\vx , \vx ^{\prime }\) such that \(f(\vx ) = f(\vx ^{\prime })\) while \(\vx \neq \vx ^{\prime }\). Define \(\vv \doteq (\vx ^{\prime } - \vx )/\norm {\vx ^{\prime } - \vx }_{2}\).
Define the function \(g \colon \R \to \R \) as \(g(t) \doteq \vv ^{\top }f(\vx + t\vv )\). Then \(g(0) = \vv ^{\top }f(\vx ) = \vv ^{\top }f(\vx ^{\prime }) = g(\norm {\vx ^{\prime } - \vx }_{2})\). Since \(f\) is differentiable, \(g\) is differentiable, so the
derivative \(g^{\prime }\) must vanish for some \(t^{\ast } \in (0, \norm {\vx ^{\prime } - \vx }_{2})\) by the mean value theorem. However,
\begin{equation*} g^{\prime }(t^{\ast }) = \vv ^{\top }f^{\prime }(\vx + t^{\ast }\vv )\vv > 0, \end{equation*}
since the
Jacobian is positive definite. Thus we arrive at a contradiction, as claimed.
□
Combining the above two results, we obtain the following crucial result.
Corollary B.1 (Generalization of [Gri11] Corollary A.2, Part 2).Let \(\vx \) be any
random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the
stochastic process (B.2.1). Let \(s, t \in [0, T]\) be such that \(0 \leq s < t \leq T\), and let \(\bar {\vx }(\vxi ) \doteq \Ex [\vx _{s} \mid \vx _{t} = \vxi ]\). Then \(\bar {\vx }\) is injective, and
therefore invertible onto its range.
Proof.The only thing left to show is that \(\bar {\vx }\) is differentiable, but this is immediate
from Tweedie’s formula (Theorem 3.2) which shows that \(\bar {\vx }\) is differentiable if and
only if \(\nabla \log p_{t}\) is differentiable, and this is provided by Equation (B.2.3). □
Controlling the Laplacian \(\Delta \log p_{t}\)
Finally, we develop a technical estimate which is required for the proof of
Theorem B.3 and actually motivates the assumption for the viable \(t\).
Lemma B.5.Let \(\vx \) be any random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). Let \(p_{t}\) be the density of \(\vx _{t}\). Then, for \(t > 0\) it
holds
A trivial lower bound on this trace is \(0\), since covariance matrices are positive
semidefinite. To find an upper bound, note that \(\vy _{\vxi }\) takes values only in the support of \(\vx \)
(since \(p\) is a factor of the density \(q_{\vxi }\) of \(\vy _{\vxi }\)), which by Assumption B.1 is a compact set \(\cS \)
with radius \(R \doteq \sup _{\vxi \in \cS }\norm {\vxi }_{2}\). So \(\mathrm {tr}\bp {\Cov (\vy _{\vxi })} = \Ex \norm {\vy _{\vxi } - \Ex [\vy _{\vxi }]}_{2}^{2} \leq \Ex \norm {\vy _{\vxi }}_{2}^{2} \leq R^{2}\), which gives the claimed upper bound on \(\Delta \log p_{t}(\vxi )\).
Here we calculate some useful derivatives which will be reused throughout the
appendix.
Proposition B.1.Let \(\vx \) be any random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). For \(t \geq 0\), let \(p_{t}\) be the density of \(\vx _{t}\). Then
Proof.We use the convolution representation of \(p_{t}\), namely (B.2.3). First
taking the time derivative, a computation yields that Proposition B.3
applies,5
so we can bring the derivative inside the integral/expectation as:
In this appendix, we differentiate under the integral sign many times, and it is
important to know when we can do this. There are two kinds of differentiating under
the integral sign:
1.
Differentiating an integral \(\int f_{t}(\vxi )\odif {\vxi }\) with respect to the auxiliary parameter \(t\).
2.
Differentiating a convolution \((f * g)(\vxi ) = \int f(\vu )g(\vxi - \vu )\odif {\vu }\) with respect to the variable \(\vxi \).
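As a quick numerical spot check of both kinds (ours, for the one-dimensional two-point source used earlier and the process \(\vx _{t} = \vx + t\vg \)), the following sketch verifies by finite differences that the time derivative and the spatial second derivative of the smoothed density are consistent with the heat-like PDE used in Section B.2.1.
\begin{verbatim}
# Numerical spot check: for the two-point source x in {0, 1} and x_t = x + t*g,
# the time derivative of p_t (differentiating under the integral in t) agrees
# with t times the spatial second derivative (differentiating the convolution),
# i.e., the heat-like PDE from Section B.2.1.  Both sides via finite differences.
import numpy as np

def p_t(xi, t):
    phi = lambda z: np.exp(-z**2 / (2 * t**2)) / np.sqrt(2 * np.pi * t**2)
    return 0.5 * phi(xi) + 0.5 * phi(xi - 1.0)

xi, t, h = 0.3, 0.7, 1e-4
dp_dt = (p_t(xi, t + h) - p_t(xi, t - h)) / (2 * h)
lap_p = (p_t(xi + h, t) - 2 * p_t(xi, t) + p_t(xi - h, t)) / h**2
print(dp_dt, t * lap_p)   # the two printed numbers should agree to several digits
\end{verbatim}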
For the first category, we give a concrete result, stated without proof; the cited source
derives it as a special case of a more general theorem about the interaction of
differential operators and tempered distributions, much beyond the scope of the book.
A full formal reference can be found in [Jon82].
Proposition B.3 ([Jon82], Section 11.12).Let \(f \colon (0, T) \times \R ^{D} \to \R \) be such that:
\(f\) is a jointly measurable function of \((t, \vxi )\);
For Lebesgue-almost every \(\vxi \in \R ^{D}\), the function \(t \mapsto f_{t}(\vxi )\) is absolutely continuous;
\(\pdv {f_{t}}{t}\) is locally integrable, i.e., for every \([t_{\min }, t_{\max }] \subseteq (0, T)\) it holds \(\int _{t_{\min }}^{t_{\max }}\int _{\R ^{D}}\left |\pdv {f_{t}}{t}(\vxi )\right |\odif {\vxi }\odif {t} < \infty \).
Then, for (Lebesgue-)almost every \(t \in (0, T)\), \(\frac {\mathrm {d}}{\mathrm {d}t}\int _{\R ^{D}}f_{t}(\vxi )\odif {\vxi } = \int _{\R ^{D}}\pdv {f_{t}}{t}(\vxi )\odif {\vxi }\).
For the second category, we give another concrete result, stated without proof but
fully formalized in [BB11].
Proposition B.4 ([BB11], Proposition 4.20).Let \(f\) be \(k\)-times continuously
differentiable with compact support, and let \(g\) be locally integrable. Then the
convolution \(f * g\) defined by \((f * g)(\vxi ) \doteq \int _{\R ^{D}}f(\vu )g(\vxi - \vu )\odif {\vu }\) is \(k\)-times continuously differentiable, with \(\partial ^{\alpha }(f * g) = (\partial ^{\alpha }f) * g\) for every multi-index \(\alpha \) with \(|\alpha | \leq k\).
Although not in the book, a simple integration by parts argument shows
that if \(g\) is also \(k\)-times differentiable, then we can “trade off” the regularity, e.g., \(\partial ^{\alpha + \beta }(f * g) = (\partial ^{\alpha }f) * (\partial ^{\beta }g)\), placing some of the derivatives on \(f\) and the rest on \(g\).
In this section, we prove Theorem 4.1. Following our conventions throughout
this appendix, we write \(\cS = \Supp (\vx )\) for the compact support of the random variable
\(\vx \).
As foreshadowed, we will make a regularity assumption on the support set \(\cS \) to
prove Theorem 4.1. One possibility for proceeding under minimal assumptions would
be to instantiate the results of [RBK18; RKB23] in our setting, since these results
apply to sets \(\cS \) with very low regularity (e.g., Cantor-like sets with fractal
structure). However, we have found precisely computing the constants in these
results, a necessary endeavor to assert a conclusion like Theorem 4.1, to
be somewhat onerous in our setting. Our approach is therefore to add a
geometric regularity assumption on the set \(\cS \) that sacrifices some generality,
but allows us to develop a more transparent argument. To avoid sacrificing
too much generality, we must ensure that low-dimensionality in the set \(\cS \) is
not prohibited. We therefore consider the running example we have used
throughout the book, the mixture of low-rank Gaussian distributions. In this
geometric setting, we model \(\cS \) as a union of hyperspheres, each of dimension \(d_k\)
(possibly much smaller than \(D\)), living in mutually orthogonal subspaces of
\(\R ^D\). This is a geometric simplification of the Gaussian mixture model: each
component’s support is approximated by a sphere in the subspace spanned by
its principal directions. When the component dimensions \(d_k\) are large, this
assumption is equivalent to a mixture of low-rank Gaussians assumption, by
high-dimensional measure concentration. The CRATE models we develop and train
in Chapters 5 and 8 have their subspace dimensions set compatibly with this
assumption.
Assumption B.3. The support \(\cS \subset \R ^D\) of the random variable \(\vx \) is a finite union of
\(K\) spheres, each with dimension \(d_k\), \(k \in [K]\). The probability that \(\vx \) is drawn from the \(k\)-th
sphere is given by \(\pi _k \in [0, 1]\), and conditional on being drawn from the \(k\)-th sphere, \(\vx \) is
uniformly distributed on that sphere. The spheres lie in mutually orthogonal
subspaces of \(\R ^D\).
We proceed under the simplifying Assumption B.3 in order to simplify excessive
technicality, and to connect to an important running example used throughout the
monograph. We believe our results can be generalized to support \(\cS \) from the class of
sets with positive reach with additional technical effort, but leave this for the
future.
B.3.1 Proof of Relationship Between Rate Distortion and Covering
We briefly sketch the proof, then establish three fundamental lemmas, and
then give the full proof. The proof will depend on notions introduced in the sketch
below.
Obtaining an upper bound on the rate distortion function (4.1.9) is straightforward:
by the rate characterization (i.e., the rate distortion function is the minimum rate
of a code for \(\vx \) with expected squared \(\ell ^2\) distortion \(\epsilon \)), upper bounding \(\cR _{\epsilon }(\vx )\) only
requires demonstrating one code for \(\vx \) that achieves this target distortion, and
any \(\epsilon \)-covering of \(\Supp (\vx )\) achieves this, with rate equal to the base-\(2\) logarithm of
the cardinality of the covering. The lower bound is more subtle. We make
use of the Shannon lower bound, discussed in Remark 4.3: working out
the constants in [LZ94, §III, (22)] gives a more precise version of the result
quoted in Equation (4.1.14) (which is stated in bits): for any random variable \(\vx \) with
compact support and a density, it holds
with entropy (etc.) in nats in this
expression. The constant can be easily estimated using Stirling’s approximation. A
quantitative form of Stirling’s approximation which is often useful gives for
any \(x > 0\) [Jam15]
Now, the important constraint for our current purposes is that the Shannon
lower bound requires the random variable \(\vx \) to have a density, which rules
out many low-dimensional distributions of interest. But let us momentarily
consider the situation when \(\vx \) does admit a density. The assumption that \(\vx \)
is uniformly distributed on its support is easily formalized in this setting:
for any Borel set \(A \subset \cS \), we have \(\Pr [\vx \in A] = \volume (A)/\volume (\cS )\); equivalently, \(\vx \) has density \(p(\vxi ) = 1/\volume (\cS )\) for \(\vxi \in \cS \) and \(0\) otherwise.
The proof then
concludes with a lemma that relates the ratio \(\volume (\cS ) / \volume (B_{\epsilon })\) to the \(\epsilon \)-covering number of \(\cS \) by \(\epsilon \)
balls.
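For reference, the volumetric comparison underlying this last step is standard: if \(\{\vxi _{1}, \dots , \vxi _{N}\}\) is an \(\epsilon \)-cover of \(\cS \), then \(\cS \subseteq \bigcup _{i = 1}^{N}B_{\epsilon }(\vxi _{i})\), so \(\volume (\cS ) \leq N\volume (B_{\epsilon })\), and hence \(\cN _{\epsilon }(\cS ) \geq \volume (\cS )/\volume (B_{\epsilon })\).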
To extend the program above to degenerate distributions satisfying Assumption B.3,
our proof of the lower bound in Theorem 4.1 will leverage an approximation
argument of the actual low-dimensional distribution \(\vx \) by “nearby” distributions which
have densities, similarly but not exactly the same as the proof sketch preceding
Theorem B.1. We will then link the parameter introduced in the approximating
sequence to the distortion parameter \(\epsilon \) in order to obtain the desired conclusion in
Theorem 4.1.
Definition B.1. Let \(\cS \) be a compact set. For any \(\delta > 0\), define the \(\delta \)-thickening of \(\cS \), denoted
\(\cS _{\delta }\), by
\begin{equation*} \cS _{\delta } \doteq \left \{\vxi \in \R ^{D} \colon \inf _{\vxi ' \in \cS }\norm {\vxi - \vxi '}_{2} \leq \delta \right \}. \end{equation*}
For a
compact set \(\cS \), Weierstrass’s theorem implies that for any \(\vxi \in \R ^D\), there is always some \(\vxi ' \in \cS \)
attaining the infimum in the distance function. Compactness of \(\cS _{\delta }\) follows
readily from compactness of \(\cS \), so \(\volume (\cS _{\delta })\) is finite for any \(\delta > 0\). It is then possible to
make the following definition of a thickened random variable, specialized to
Assumption B.3.
Definition B.2. Let \(\vx \) be a random variable such that \(\Supp (\vx ) = \cS \) is a union of \(K\)
hyperspheres, distributed as in Assumption B.3. Denote the support of each
component of the mixture by \(\cS _k\). Define the thickened random variable \(\vx _{\delta }\) as the
mixture of measures where each component measure is uniform on the thickened
set \(\cS _{k, \delta }\) (Definition B.1), for \(k \in [K]\), with mixing weights \(\pi _k\).
Lemma B.6.Suppose the random variable \(\vx \) satisfies Assumption B.3. Then if \(0 < \delta < \tfrac {1}{2}\), the
thickened random variable \(\vx _{\delta }\) (Definition B.2) satisfies for any \(\epsilon > 0\)
The proof of Lemma B.6 is deferred to Section B.3.2. Using Lemma B.6, the
above program can be realized, because the random variable \(\vx _{\delta }\) has a density that is
uniform with respect to the Lebesgue measure.
(Proof of Theorem 4.1). The upper bound is readily shown. If \(S\) is any \(\epsilon \)-cover
of the support of \(\vx \) with cardinality \(\cN _{\epsilon }(\Supp (\vx ))\), then consider the coding scheme assigning
to each \(\vxi \in \Supp (\vx )\) the reconstruction \(\hat {\vxi } = \argmin _{\vxi ' \in S}\, \norm {\vxi - \vxi '}_2\), with ties broken arbitrarily. Then ties occur with
probability zero, and the fact that \(S\) covers \(\Supp (\vx )\) at scale \(\epsilon \) guarantees distortion no
larger than \(\epsilon \); the rate of this scheme is \(\log _2 \cN _{\epsilon }(\Supp (\vx ))\).
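As a toy numerical illustration of this covering-based coding scheme (ours, not from the text): for \(\vx \) uniform on the unit circle in \(\R ^{2}\), snapping each sample to the nearest point of an \(\epsilon \)-net of the circle gives a code whose rate is the base-\(2\) logarithm of the net size and whose distortion is at most \(\epsilon \) by construction.
\begin{verbatim}
# Toy illustration (not from the text) of the covering-based upper bound: encode
# points of the unit circle in R^2 by snapping to an eps-net of the circle; the
# rate is log2(#codewords) and the distortion is at most eps by construction.
import numpy as np

eps = 0.1
n_codewords = int(np.ceil(2 * np.pi / eps))           # arc spacing <= eps, so chords are too
angles = 2 * np.pi * np.arange(n_codewords) / n_codewords
codebook = np.stack([np.cos(angles), np.sin(angles)], axis=1)

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=10_000)
x = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # samples of x, uniform on the circle

d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
x_hat = codebook[d2.argmin(axis=1)]                   # nearest-codeword reconstruction

rate_bits = np.log2(n_codewords)
distortion = np.linalg.norm(x - x_hat, axis=1).max()
print(f"rate = {rate_bits:.2f} bits, worst-case distortion = {distortion:.4f} (eps = {eps})")
\end{verbatim}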
For the lower bound, let \(0 < \delta < \tfrac {1}{2}\), and consider the thickened random variable \(\vx _{\delta }\).
By Lemma B.6, we have
in terms of the covering number.
Since \(\Supp (\vx _{\delta }) = \Supp (\vx ) + B_{\delta }\), where \(+\) here denotes the Minkowski sum, a standard application of
volume bound arguments (see e.g. [Ver18, Proposition 4.2.12]) gives
(Proof of Lemma B.6). It suffices to show that any code for \(\vx \) with expected
squared distortion \(\epsilon ^2\) produces a code for \(\vx _{\delta }\) with the same rate and distortion not
much larger, for a suitable choice of \(\delta \). So fix such a code for \(\vx \), achieving rate \(R\) and
expected squared distortion \(\epsilon ^2\). We write \(\hat {\vx }\) for the reconstructed random variable
using this code, and \(\mathrm {q} : \Supp (\vx ) \to \Supp (\vx )\) for the associated encoding-decoding mapping (i.e., \(\hat {\vx } = \mathrm {q}(\vx )\)).
Now let \(\cS _k\) denote the \(k\)-th hypersphere in the support of \(\vx \). There is an orthonormal
basis \(\vU _{k} \in \R ^{D \times d_k}\) such that \(\Span (\cS _k) = \Span (\vU _k)\). The following orthogonal decomposition of the support set \(\cS \) will be
used repeatedly throughout the proof. We have
By orthogonal projection, for any \(k \in [K]\) any \(\vxi \in \R ^D\) can be written as \(\vxi = \vxi ^{\|} + \vxi ^{\perp }\), with \(\vxi ^{\|} \in \Span (\cS _k)\) and \(\ip {\vxi ^{\|}}{\vxi ^\perp } = 0\). Then for
any \(\vxi ' \in \cS _k\), we have
\begin{equation*} \norm {\vxi - \vxi '}_{2}^{2} = \norm {\vxi ^{\|} - \vxi '}_{2}^{2} + \norm {\vxi ^{\perp }}_{2}^{2}. \end{equation*}
If \(\vxi ^{\|}\) is zero, it is clear that the above
distance is the same for every \(\vxi ' \in \cS _k\), so the minimizer is not unique. Hence, if we define a projection mapping \(\pi _{\cS _k}(\vxi ) \doteq \vxi ^{\|}/\norm {\vxi ^{\|}}_{2}\) for any
\(\vxi \in \R ^D\) with \(\vU _k^\top \vxi \neq \mathbf {0}\), then \(\pi _{\cS _k}(\vxi ) = \argmin _{\vxi ' \in \cS _k}\left\Vert \vxi ' - \vxi \right\Vert_2\). We choose \(0 < \delta < 1\), so that the thickened set \(\cS _{\delta }\) contains no points \(\vxi \in \R ^D\) at which any
of the projection maps \(\pi _{\cS _k}\) is not well-defined. So the thickened set \(\cS _{\delta }\) satisfies
We are going to show next that every such \(\vxi \in \cS _{\delta }\) can be uniquely associated to a
projection onto a single subspace in the mixture, which will allow us to define a
corresponding projection onto \(\cS \). Given a \(\vxi \in \cS _{\delta }\), by the above, we can find a subspace \(\vU _k\) such
that the orthogonal decomposition \(\vxi = \vxi ^{\|}_k + \vxi ^{\perp }_k\) satisfies
where the second line uses the orthogonality assumption on the subspaces \(\vU _k\), and
the third uses the fact that orthogonal projections are nonexpansive. Hence, the \(j\)-th
distance satisfies
This implies that if \(0 < \delta < 1/2\), every \(\vxi \in \cS _{\delta }\) has a unique closest subspace in the
mixture. Hence, under this condition, the following mapping \(\pi _{\cS } : \cS _{\delta } \to \cS \) is well-defined:
Clearly this is associated to a rate-\(R\) code for \(\vx _{\delta }\),
because it uses the encoding-decoding mappings from the rate-\(R\) code for \(\vx \). We have to
show that it achieves small distortion. We calculate
so the expectation also satisfies this estimate.
For the second term, it will suffice to characterize the density of the random variable \(\pi _{\cS }(\vx _{\delta })\)
as being sufficiently close to the density of \(\vx \)—which, as Assumption B.3 implies, is a
mixture of uniform distributions on each sub-sphere \(\cS _k\). By the argument above, every
point \(\vxi \in \cS _{\delta }\) can be associated to one and only one subspace \(\vU _k\), which means that
the mixture components in the definition of \(\cS _{\delta }\) (recall Definition B.2) do not
overlap. Hence, the density \(\pi _{\cS }(\vx _{\delta })\) can be characterized by studying the effect of \(\pi _{\cS _k}\)
on the conditional random variable \(\vx _{\delta }\), conditioned on being drawn from \(\cS _{k, \delta }\).
Denote this measure by \(\mu _{k, \delta }\). We claim that the pushforward of this measure
under \(\pi _{\cS _k}\) is uniform on \(\cS _k\). To see that this holds, we recall Equation (B.3.21),
which gives the characterization
The conditional distribution in question is
uniform on this set; we need to show that the projection \(\pi _{\cS _k}\) applied to this
conditional random variable yields a random variable that is uniform on
\(\cS _k\). With respect to these coordinates, we have seen that \(\pi _{\cS _k}(\vxi ^\| + \vxi ^\perp ) = \vxi ^\| / \norm {\vxi ^{\|}}_2\). Hence, for any \(\vxi \in \cS _{k}\),
we have that the preimage of \(\vxi \) in \(\cS _{k, \delta }\) under \(\pi _{\cS _k}\) is
To show that \((\pi _{\cS _k})_{\sharp } \mu _{k, \delta }\) is uniform, we
need to decompose the integral of the uniform density on \(\cS _{k, \delta }\) in a way that
makes it clear that each of the fibers \(\pi _{\cS _k}^{-1}(\vxi )\) (for each \(\vxi \in \cS _k\)) “contributes” equally to the
integral.6
We have by Definition B.2
In particular, the integration over the orthogonal
coordinates factors. Let \(\odif \vtheta ^{d}\) denote the uniform (Haar) measure on the sphere of radius \(1\)
in \(\R ^d\). Converting the \(\vxi ^{\|}\) integral to polar coordinates, we have
Comparing to the fiber
representation (B.3.40), we see that we need to “integrate out” over the \(r\) and \(\vxi ^\perp \)
components of the preceding integral in order to verify that the pushforward is
uniform. But this is evident, as the previous expression shows that the value of this
integral is independent of \(\vxi ^\|\)—or, equivalently in context, the value of the spherical
component \(\vtheta ^{d_k}\).
Thus it follows from the above argument that \(\pi _{\cS }(\vx _{\delta })\) is uniform on each sphere \(\cS _{k}\), conditionally on the component from which \(\vx _{\delta }\) is drawn. Because the
assumption on \(\delta \) implies that the mixture components in the distribution of \(\vx _{\delta }\) do
not overlap, the mixing weights \(\pi _k\) are also preserved in the image \(\pi _{\cS }(\vx _{\delta })\), and in
particular, the distribution of \(\pi _{\cS }(\vx _{\delta })\) is equal to the distribution of \(\vx \). Hence the
second term in Equation (B.3.27) satisfies
because \(\mathrm {q}\) is a distortion-\(\epsilon \) code for
\(\vx \).
We have thus shown that the hypothesized rate-\(R\), (expected squared) distortion-\(\epsilon ^2\)
code for \(\vx \) produces a rate-\(R\) code for \(\vx _{\delta }\) with expected squared distortion at most \((\delta + \epsilon )^{2}\). This establishes
that