Appendix B Entropy, Diffusion, Denoising, and Lossy Coding
“The increase of disorder or entropy with time is one example of
what is called an arrow of time, something that distinguishes the past
from the future, giving a direction to time.”
\(~\) – A Brief History of Time, Stephen Hawking
In this appendix we provide proofs for several facts, mentioned in Chapter 3, which
are related to differential entropy, how it evolves under diffusion processes, and its
connections to lossy coding. We will make the following mild assumption about the
random variable representing the data source, denoted \(\vx \).
Assumption B.1.\(\vx \) is supported on a compact set \(\cS \subseteq \R ^{D}\) of radius at most \(R\), i.e., \(R \doteq \sup _{\vxi \in \cS }\norm {\vxi }_{2}\).
In particular, since compact sets in Euclidean space are bounded, it holds \(R < \infty \). We
will consistently use the notation \(B_{r}(\vxi ) \doteq \{\vu \in \R ^{D} \colon \norm {\vxi - \vu }_{2} \leq r\}\) to denote the Euclidean ball of radius \(r\) centered at
\(\vxi \). In this sense, Assumption B.1 has \(\cS \subseteq B_{R}(\vzero )\).
Notice that this assumption holds for (almost) all variables we care about
in practice, as it is (often) imposed by a normalization step during data
pre-processing.
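For instance, one common convention of this kind (a minimal sketch of a possible pre-processing step, not a prescription from the text) rescales a finite dataset so that every sample lies in the unit ball \(B_{1}(\vzero )\):
\begin{verbatim}
# Sketch of one common normalization enforcing Assumption B.1 with R = 1 on a
# finite dataset: rescale every sample by the largest sample norm.
import numpy as np

def normalize_to_unit_ball(X):
    """X: (n, D) array of samples; returns the samples rescaled into B_1(0)."""
    max_norm = np.linalg.norm(X, axis=1).max()
    return X / max_norm if max_norm > 0 else X
\end{verbatim}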
B.1 Differential Entropy of Low-Dimensional Distributions
In this short appendix we discuss the differential entropy of low-dimensional
distributions. By definition, the differential entropy of a random variable \(\vx \) which
does not have a density is \(-\infty \); this includes all random variables supported on
low-dimensional sets. The objective of this section is to discuss why this is a “morally
correct” value.
In fact, let \(\vx \) be any random variable such that Assumption B.1 holds and the support \(\cS \) of \(\vx \) has \(0\)
volume.1 We will consider
the case that \(\vx \) is uniform on \(\cS \).2
Our goal is to compute \(h(\vx )\).
In this case, \(\vx \) would not have a density; in the counterfactual world where we did
not know \(h(\vx ) = -\infty \), we could not directly define it using the standard definition of differential
entropy. Instead, as is common in analysis and information theory, it would be
reasonable to consider the limit of entropies of successively better approximations \(\vx _{\eps }\) of \(\vx \)
which have densities, i.e., to examine \(\lim _{\eps \to 0}h(\vx _{\eps })\), where \(\vx _{\eps }\) is supported on the \(\eps \)-thickening \(\cS _{\eps } \doteq \cS + B_{\eps }(\vzero )\) of \(\cS \), illustrated in Figure B.1.
Figure B.1: Illustration of the \(\eps \)-thickening \(\cS _\eps \) of a curve \(\cS \subseteq \R ^{2}\).
We will work with random variables whose support is \(\cS _{\eps }\), which is full-dimensional,
and take the limit as \(\eps \to 0\). Indeed, define \(\vx _{\eps } \sim \dUnif (\cS _{\eps })\). Since \(\cS _{\eps }\) has positive volume, \(\vx _{\eps }\) has a density \(p_{\eps }\)
equal to
\begin{equation*} p_{\eps }(\vxi ) = \begin{cases} 1/\volume (\cS _{\eps }), & \vxi \in \cS _{\eps }, \\ 0, & \text {otherwise}. \end{cases} \end{equation*}
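Concretely, the calculation is short: since \(\vx _{\eps } \sim \dUnif (\cS _{\eps })\),
\begin{equation*} h(\vx _{\eps }) = -\int _{\cS _{\eps }}\frac {1}{\volume (\cS _{\eps })}\log \frac {1}{\volume (\cS _{\eps })}\odif {\vxi } = \log \volume (\cS _{\eps }), \end{equation*}
and since the bounded sets \(\cS _{\eps }\) decrease to \(\cS \) as \(\eps \to 0\), continuity of the Lebesgue measure gives \(\volume (\cS _{\eps }) \to \volume (\cS ) = 0\), whence \(h(\vx _{\eps }) \to -\infty \).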
The above calculation is actually a corollary of a much more famous and
celebrated set of results about the maximum possible entropy of \(\vx \) subject
to certain constraints on the distribution of \(\vx \). We would be remiss to not
provide the results here; the proofs are provided in Chapter 2 of [PW22], for
example.
Theorem B.1.Let \(\vx \) be a random variable on \(\R ^{D}\).
1.
If \(\vx \) is supported on a compact set \(\cS \subseteq \R ^{D}\) (i.e., Assumption B.1) then
2.
If \(\vx \) has finite covariance such that, for a PSD matrix \(\vSigma \in \PSD (D)\), it holds \(\Cov (\vx ) \preceq \vSigma \) (w.r.t. the PSD
ordering, i.e., \(\vSigma - \Cov (\vx )\) is PSD), then
where \(\vg \sim \dNorm (\vzero , \vI )\) independently of \(\vx \).
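For reference, the standard forms of the bounds in items 1 and 2 (see [PW22, Chapter 2]) are \(h(\vx ) \leq \log \volume (\cS )\) in the compactly supported case and \(h(\vx ) \leq \frac {1}{2}\log \bp {(2\pi e)^{D}\det (\vSigma )}\) in the bounded-covariance case, the latter being attained by \(\vx \sim \dNorm (\vzero , \vSigma )\) when \(\vSigma \) is positive definite.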
The structure of this section is as follows. In Section B.2.1 we provide a formal
theorem and crisp proof which shows that under Equation (B.2.1) the entropy
increases, i.e., \(\frac{\mathrm{d}}{\mathrm{d}t} h(\vx _{t}) > 0\). In Section B.2.2 we provide a formal theorem and crisp proof which
shows that under Equation (B.2.1), the entropy decreases during denoising, i.e., \(h(\Ex [\vx _{s} \given \vx _{t}]) < h(\vx _{t})\) for
all \(s < t\). In Section B.2.3 we provide proofs for technical lemmas that are needed to
establish the claims in the previous subsections.
Before we start, we introduce some key notation. First, let \(\phi _{t}\) be the density of \(\dNorm (\vzero , t^{2}\vI )\),
i.e.,
\begin{equation*} \phi _{t}(\vxi ) \doteq \frac {1}{(2\pi t^{2})^{D/2}}\exp \bp {-\frac {\norm {\vxi }_{2}^{2}}{2t^{2}}}. \end{equation*}
Next, \(\vx _{t}\) is supported on all of \(\R ^{D}\), so it has a density, which we denote \(p_{t}\) (as in the
main body). A quick calculation shows that
\begin{equation}\tag{B.2.3} p_{t}(\vxi ) = \Ex [\phi _{t}(\vxi - \vx )], \qquad \forall \vxi \in \R ^{D}, \end{equation}
and from this representation
it is possible to deduce (namely, from Proposition B.4) that \(p_{t}\) is smooth (i.e.,
infinitely differentiable) in \(\vxi \), also smooth in \(t\), and positive everywhere. This
fact is somewhat remarkable at first sight: even for a completely irregular
random variable \(\vx \) (say, a Bernoulli random variable, which does not have a
density), its Gaussian smoothing admits a density for every (arbitrarily small)
\(t > 0\). The proof is left as an exercise for readers well-versed in mathematical
analysis.
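For instance, a one-dimensional sanity check of this claim: if \(\vx \) takes the values \(0\) and \(1\) with probability \(1/2\) each, then
\begin{equation*} p_{t}(\xi ) = \frac {1}{2}\phi _{t}(\xi ) + \frac {1}{2}\phi _{t}(\xi - 1), \end{equation*}
which is positive and infinitely differentiable in \(\xi \) (and in \(t > 0\)), even though \(\vx \) itself has no density.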
However, we also need to add an assumption about the smoothness of the distribution
of \(\vx \), which will eliminate some technical problems that occur around \(t = 0\) with low-dimensional
distributions.3
Despite this, we expect that our results hold under milder assumptions with
additional work. For now, let us assume:
Assumption B.2.\(\vx \) has a twice continuously differentiable density, denoted \(p\).
B.2.1 Diffusion Process Increases Entropy Over Time
In this section we provide a proof of Theorem B.2. For convenience, we restate it as
follows.
Theorem B.2 (Diffusion Increases Entropy).Let \(\vx \) be any random variable such thatAssumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). Then
\begin{equation}\tag{B.2.4}\label {eq:diffusion_entropy_increases} h(\vx _{s}) < h(\vx _{t}), \qquad \forall s, t \colon 0 \leq s < t \leq T. \end{equation}
Proof.Before we start, we must ask: when does the inequality in (B.2.4) make
sense? We will show in Lemma B.1 that under our assumptions, the differential
entropy is well-defined, is never \(+\infty \), and for \(t > 0\) is finite, so the (strict) inequality in
(B.2.4) makes sense.
The question of well-definedness aside, the crux of this proof is to show that
the density \(p_{t}\) of \(\vx _{t}\) satisfies a particular partial differential equation, which is very
similar to the heat equation. The heat equation is a famous PDE which describes
the diffusion of heat through space. This intuitively should make sense, and
paints a mental picture: as the time \(t\) increases, the probability from the original
(perhaps tightly concentrated) \(\vx \) disperses across all of \(\R ^{D}\) like heat radiating from
a source in a vacuum.
Such PDEs for \(p_{t}\), known as Fokker-Planck equations for more general
stochastic processes, are very powerful tools, as they allow us to describe the
instantaneous temporal derivatives of \(p_{t}\) in terms of the instantaneous spatial
derivatives of \(p_{t}\), and vice versa, providing a concise description of the regularity
and dynamics of \(p_{t}\). Once we obtain dynamics for \(p_{t}\), we can then use the system
to obtain another one which describes the dynamics of \(h(\vx _{t})\), which after all is just
a functional of \(p_{t}\).
The description of the PDE involves a mathematical object called the Laplacian \(\Delta \).
Recall from your multivariable calculus class that the Laplacian operating on a
differentiable-in-time and twice-differentiable-in-space function \(f \colon (0, T) \times \R ^{D} \to \R \) is given by
\begin{equation*} \Delta f(t, \vxi ) \doteq \sum _{i = 1}^{D}\frac {\partial ^{2}f}{\partial \xi _{i}^{2}}(t, \vxi ), \end{equation*}
i.e., \(\Delta \) acts only on the spatial variable \(\vxi \).
Namely, using the integral representation of \(p_{t}\) and differentiating under the
integral, we can compute the derivatives of \(p_{t}\) (which we do in Proposition B.1) and
observe that \(p_{t}\) satisfies the heat-like PDE
\begin{equation*} \pdv {p_{t}}{t}(\vxi ) = t\,\Delta p_{t}(\vxi ), \qquad \forall \vxi \in \R ^{D},\ t \in (0, T). \end{equation*}
Differentiating \(h(\vx _{t}) = -\int _{\R ^{D}}p_{t}\log p_{t}\) under the integral once more, substituting this PDE, and integrating by parts (the boundary term vanishes, as established in Lemma B.2), we obtain
\begin{align*} \frac {\mathrm {d}}{\mathrm {d}t}h(\vx _{t}) &= -\int _{\R ^{D}}\pdv {p_{t}}{t}(\vxi )\bp {1 + \log p_{t}(\vxi )}\odif {\vxi } = -t\int _{\R ^{D}}\Delta p_{t}(\vxi )\bp {1 + \log p_{t}(\vxi )}\odif {\vxi } \\ &= t\int _{\R ^{D}}\frac {\norm {\nabla p_{t}(\vxi )}_{2}^{2}}{p_{t}(\vxi )}\odif {\vxi } > 0, \end{align*}
where strict inequality holds in the last line because, were the inequality to fail, \(\nabla p_{t}(\vxi )\) would
need to vanish almost everywhere (i.e., everywhere except possibly on a set of zero
volume), which would imply that \(p_{t}\) is constant almost everywhere, a
contradiction with the fact that \(p_{t}\) is a density.
To complete the proof we just use the fundamental theorem of calculus,
\begin{equation*} h(\vx _{t}) - h(\vx _{s}) = \int _{s}^{t}\frac {\mathrm {d}}{\mathrm {d}\tau }h(\vx _{\tau })\odif {\tau } > 0, \end{equation*}
which
proves the claim. (Note that this does not make sense when \(h(\vx _{s}) = -\infty \), which can only
happen when \(s = 0\) and \(h(\vx ) = -\infty \), but in this case \(h(\vx _{t}) > -\infty \) so the claim is vacuously true anyways.)
□
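As a purely numerical illustration of Theorem B.2 (ours, not part of the formal development), the following sketch assumes the process takes the form \(\vx _{t} = \vx + t\vg \) with \(\vg \sim \dNorm (0, 1)\) in \(D = 1\), takes \(\vx \) to be the two-point source from the earlier example, and estimates \(h(\vx _{t})\) by quadrature for several values of \(t\); the printed values should be strictly increasing.
\begin{verbatim}
# Numerical sanity check of Theorem B.2 for a 1D two-point source x in {0, 1}:
# x_t = x + t*g has density p_t(xi) = 0.5*phi_t(xi) + 0.5*phi_t(xi - 1); we
# estimate h(x_t) = -integral of p_t*log(p_t) by a Riemann sum on a wide grid.
import numpy as np

def entropy_of_smoothed_two_point(t, half_width=40.0, n=400_000):
    xi, dx = np.linspace(-half_width, half_width, n, retstep=True)
    phi = lambda z: np.exp(-z**2 / (2 * t**2)) / np.sqrt(2 * np.pi * t**2)
    p = 0.5 * phi(xi) + 0.5 * phi(xi - 1.0)
    mask = p > 0                 # avoid log(0) where the density underflows
    return float(-(p[mask] * np.log(p[mask])).sum() * dx)

for t in [0.05, 0.1, 0.5, 1.0, 3.0]:
    print(f"t = {t:4.2f}   h(x_t) ~ {entropy_of_smoothed_two_point(t):+.4f} nats")
# The printed values should be strictly increasing in t.
\end{verbatim}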
B.2.2 Denoising Process Reduces Entropy Over Time
Recall that in Section 3.2.1 we start with the random variable \(\vx _{T}\) and iteratively
denoise it using iterations of the form
\begin{equation}\tag{B.2.15} \hat {\vx }_{s} \doteq \Ex [\vx _{s} \given \vx _{t} = \hat {\vx }_{t}] \end{equation}
for \(s, t \in \{t_{0}, t_{1}, \dots , t_{L}\}\) with \(s < t\) and \(\vx _{T} = \hat {\vx }_{T}\). We wish to prove that \(h(\hat {\vx }_{s}) < h(\hat {\vx }_{t})\),
showing that the denoising process actually reduces the entropy.
Before we go about doing this, we make several remarks about the problem
statement. First, Tweedie’s formula (3.2.23) says that
\begin{equation*} \Ex [\vx \given \vx _{t}] = \vx _{t} + t^{2}\nabla \log p_{t}(\vx _{t}), \end{equation*}
which likens a full denoising
step from time \(t\) to time \(0\) to a gradient step on the log-density of \(\vx _{t}\). Can we get a
similar result for the full denoising step from time \(t\) to time \(s\) in (B.2.15)? It turns out
that indeed we can, and it is pretty simple. By using (B.2.15) and Tweedie’s formula
(3.2.23), we obtain
\begin{equation*} \Ex [\vx _{s} \given \vx _{t}] = \frac {s}{t}\,\vx _{t} + \bp {1 - \frac {s}{t}}\Ex [\vx \given \vx _{t}] = \vx _{t} + \bp {1 - \frac {s}{t}}t^{2}\nabla \log p_{t}(\vx _{t}). \end{equation*}
So this iterative denoising step is again a gradient step on the
perturbed log-density \(\log p_{t}\) with a shrunken step size. In particular, this step can be
seen as a perturbation of the distribution of the random variable \(\vx _{t}\) by the
score function vector field, suggesting a connection to stochastic differential
equations (SDEs) and the theory of diffusion models [SSK+21]. Indeed, a proof
of the following result Theorem B.3 can be developed using this powerful
machinery and a limiting argument (e.g., following the technical approach in the
exposition of [CCL+23]). We will give a simpler proof here, which will use only
elementary tools and thereby illuminate some of the key quantities behind the
process of entropy reduction via denoising. On the other hand, we will need
to deal with some slightly technical calculations due to the fact that the
denoising process in Theorem B.3 does not correspond exactly to the reverse
process associated to the noise addition process that generates the observation
\(\vx _{t}\).4
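To make the preceding discussion concrete, here is a minimal numerical sketch (ours, not from the text) of the gradient-step form of the denoising iteration, \(\hat {\vx }_{s} = \hat {\vx }_{t} + \bp {1 - \frac {s}{t}}t^{2}\nabla \log p_{t}(\hat {\vx }_{t})\), again for the one-dimensional two-point source, for which the score \(\nabla \log p_{t}\) is available in closed form via Tweedie’s formula.
\begin{verbatim}
# Sketch (not from the text) of the denoising gradient step
#   x_hat_s = x_hat_t + (1 - s/t) * t^2 * score_t(x_hat_t)
# for a 1D source on {0, 1}, so p_t = 0.5*N(0, t^2) + 0.5*N(1, t^2) and the score
# has the closed form (E[x | x_t] - x_t) / t^2 given by Tweedie's formula.
import numpy as np

def score_t(xi, t):
    w0 = np.exp(-xi**2 / (2 * t**2))           # unnormalized weight of the component at 0
    w1 = np.exp(-(xi - 1.0)**2 / (2 * t**2))   # unnormalized weight of the component at 1
    posterior_mean = w1 / (w0 + w1)            # E[x | x_t = xi], since x takes values in {0, 1}
    return (posterior_mean - xi) / t**2

ts = np.linspace(3.0, 0.05, 60)                # decreasing noise schedule t_L > ... > t_0
x_hat = 3.0                                    # a noisy starting point, playing the role of x_T
for t, s in zip(ts[:-1], ts[1:]):
    x_hat = x_hat + (1 - s / t) * t**2 * score_t(x_hat, t)
print(x_hat)                                   # ends up close to the support {0, 1}
\end{verbatim}
The iterate is pulled toward the support of \(\vx \), which is the qualitative mechanism behind the entropy decrease established below.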
We want to prove that \(h(\Ex [\vx _{s} \mid \vx _{t}]) < h(\vx _{t})\), i.e., formally:
Theorem B.3.Let \(\vx \) be any random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). For each \(t > 0\), write \(U_{t} \doteq \sup _{\vxi \in \R ^{D}}\left |\Delta \log p_{t}(\vxi )\right |\), and suppose \(0 \leq s < t\) satisfy condition (B.2.20). Then \(h(\Ex [\vx _{s} \given \vx _{t}]) < h(\vx _{t})\).
Proof.The proof proceeds in two steps:
1.
First, write down a density for \(\Ex [\vx _{s} \mid \vx _{t}]\) using a change-of-variables formula.
2.
Second, bound this density to control the entropy.
The change of variables is justified by Corollary B.1, which was originally derived in
[Gri11].
We execute these ideas now. From Corollary B.1, we obtain that the function \(\bar {\vx }\)
defined as \(\bar {\vx }(\vxi ) \doteq \Ex [\vx _{s} \given \vx _{t} = \vxi ]\) is differentiable, injective, and thus invertible on its range, which we
henceforth denote \(\cX \subseteq \R ^{D}\). We denote its inverse as \(\bar {\vx }^{-1}\). Using a change-of-variables
formula, the density \(\bar {p}\) of \(\bar {\vx }(\vx _{t})\) is given by
\begin{equation*} \bar {p}(\vxi ) = \frac {p_{t}\bp {\bar {\vx }^{-1}(\vxi )}}{\det \bp {\bar {\vx }^{\prime }\bp {\bar {\vx }^{-1}(\vxi )}}}, \qquad \vxi \in \cX , \end{equation*}
where (recall, from Section A.2) \(\bar {\vx }^{\prime }\) is the
Jacobian of \(\bar {\vx }\). Since from Lemma B.3 we know \(\bar {\vx }^{\prime }\) is a positive definite matrix, the
determinant is positive and so the whole density is positive. Then it follows that
Setting \(\varepsilon \doteq \bp {1 - \frac {s}{t}}t^{2}U_{t}\), we have \(\bp {1 - \frac {s}{t}}t^{2}\Delta \log p_{t}(\vxi ) \leq \varepsilon \) for all \(\vxi \). By Taylor’s theorem with Lagrange
remainder, for any \(c \leq \varepsilon \),
since \(e^{c} = 1 + c + \frac {c^{2}}{2}e^{\theta c}\) for some \(\theta \in (0, 1)\), and \(e^{\theta c} \leq e^{\max (c, 0)} \leq e^{\varepsilon }\). Applying these bounds and integrating,
where the identity \(\int p_{t}\Delta \log p_{t} = -\int \norm {\nabla p_{t}}_{2}^{2}/p_{t}\) is the same as in the proof of Theorem B.2. Writing \(J(p_{t}) \doteq \int _{\R ^{D}}\frac {\norm {\nabla p_{t}(\vxi )}_{2}^{2}}{p_{t}(\vxi )}\odif {\vxi }\) for the
Fisher information of \(p_{t}\), and bounding \(\bp {\Delta \log p_{t}}^{2} \leq U_{t}^{2}\), we combine with our previous estimate to
obtain
This is strictly negative whenever \(\varepsilon ^{2}e^{\varepsilon } < 2\bp {1 - \frac {s}{t}}t^{2}J(p_{t})\). Substituting \(\varepsilon = \bp {1 - \frac {s}{t}}t^{2}U_{t}\), this is precisely condition
(B.2.20). □
Notice that condition (B.2.20) is always satisfiable: for any fixed \(t > 0\), the Fisher
information satisfies \(J(p_{t}) > 0\) (since \(p_{t}\) is not constant) and \(U_{t} < \infty \), so taking \(\bp {1 - \frac {s}{t}}t^{2}\) small enough ensures
(B.2.20) holds.
To understand the quantitative implications, consider the behavior as \(t \to 0\) (near the
data), where the condition is most restrictive. Setting \(s = (1 - \eps )t\), the condition becomes \(\eps t^{2}U_{t}^{2}\exp (\eps t^{2}U_{t}) < 2J(p_{t})\). For
small \(t\), the Laplacian bound satisfies \(U_{t} \approx R^{2}/t^{4}\), so the polynomial prefactor scales as \(\eps t^{2}U_{t}^{2} \approx \eps R^{4}/t^{6}\). The
critical scaling is therefore \(\eps \sim t^{6}/R^{4}\). Taking \(\eps = \alpha t^{6}/R^{4}\), the condition reduces to \(\alpha < 2J(p_{t})\). In other words, each
denoising step of size \(t - s = \eps \, t \sim t^{7}/R^{4}\) reduces the entropy. The steps must become increasingly fine
as \(t \to 0\), a consequence of the second-order remainder in the Taylor expansion used in the
proof.
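To put rough (purely illustrative) numbers on this scaling: with \(R = 1\), at \(t = 10^{-1}\) the admissible step size is of order \(t - s \sim \alpha t^{7} = \alpha \times 10^{-7}\), whereas at \(t = 1\) the same analysis permits constant-order steps \(t - s \sim \alpha \).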
At the same time, it should be noted that our bounds are likely loose in
terms of the precise \(t\)-dependence, and certainly so for data with greater
structure. For instance, the \(U_{t} \sim R^{2}/t^{4}\) blowup reflects a worst-case Laplacian bound that
can be considerably tightened under additional regularity assumptions on \(p\).
Moreover, if \(p_{t}\) is log-concave (as holds, e.g., when \(p\) itself is log-concave), then \(\Delta \log p_{t} \leq 0\)
everywhere, and a significantly simplified argument can be run with strengthened
conclusions.
B.2.3 Technical Lemmas and Intermediate Results
In this subsection we present technical results which power our main two conceptual
theorems. Our presentation will be more or less standard for mathematics; we will
start with the higher-level results first, and gradually move back to the more
incremental results. The higher-level results will use the incremental results, and in
this way we have an easy-to-read dependency ordering of the results: no
result depends on those before it. Results which do not depend on each
other are generally ordered by the place they appear in the above pair of
proofs.
Finiteness of the Differential Entropy
We first show that the entropy exists along the stochastic process and is
finite.
Lemma B.1.Let \(\vx \) be any random variable, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process
(B.2.1).
1.
For \(t > 0\), the differential entropy \(h(\vx _{t})\) exists and is \(> -\infty \).
2.
If in addition Assumption B.1holds for \(\vx \), then \(h(\vx ) < \infty \) and \(h(\vx _{t}) < \infty \).
Proof.To prove Lemma B.1.1, we use a classic yet tedious analysis argument. Since \(\vx _{t}\)
has a density, we can write
\begin{equation*} h(\vx _{t}) = \int _{\R ^{D}}g_{+}(\vxi )\odif {\vxi } - \int _{\R ^{D}}g_{-}(\vxi )\odif {\vxi }, \qquad g(\vxi ) \doteq -p_{t}(\vxi )\log p_{t}(\vxi ), \end{equation*}
where \(g_{+} \doteq \max \{g, 0\}\) and \(g_{-} \doteq \max \{-g, 0\}\) are the positive and negative parts of \(g\),
and both integrals are guaranteed to be non-negative since their
integrands are.
In order to show that \(h(\vx _{t})\) is well-defined, we need to show that \(\int _{\R ^{D}}g_{+}(\vxi )\odif {\vxi } < \infty \) or \(\int _{\R ^{D}}g_{-}(\vxi ) \odif {\vxi } < \infty \). To show that \(h(\vx _{t}) > -\infty \), it
merely suffices to show that \(\int _{\R ^{D}}g_{-}(\vxi )\odif {\vxi } < \infty \). To bound the integral of \(g_{-}\) we need to understand the
quantity \(g_{-}\), namely, we want to characterize when \(g\) is negative.
In order to bound the integral of \(g_{-}(\vxi )\), we need to show that \(p_{t}\) is “not too
concentrated,” namely that \(p_{t}\) is not too large. To prove this, in this case we are
lucky enough to be able to bound the function \(g_{-}(\vxi )\) itself. Namely, notice that \(p_{t}(\vxi ) = \Ex [\phi _{t}(\vxi - \vx )] \leq (2\pi t^{2})^{-D/2}\), so that
\begin{equation*} g_{-}(\vxi ) = \max \{p_{t}(\vxi )\log p_{t}(\vxi ), 0\} \leq p_{t}(\vxi )\max \left \{0, \tfrac {D}{2}\log \tfrac {1}{2\pi t^{2}}\right \}, \end{equation*}
and integrating over \(\R ^{D}\) gives \(\int _{\R ^{D}}g_{-}(\vxi )\odif {\vxi } \leq \max \left \{0, \tfrac {D}{2}\log \tfrac {1}{2\pi t^{2}}\right \} < \infty \). Hence the differential entropy \(h(\vx _{t})\) exists and is \(> -\infty \).
To prove Lemma B.1.2, suppose that Assumption B.1 holds. We want to
show that \(h(\vx ) < \infty \) and \(h(\vx _{t}) < \infty \). The mechanism for doing this is the same, and involves
the maximum entropy result Theorem B.1. Namely, since \(\vx \) is absolutely
bounded, it has a finite covariance which we will denote \(\vSigma \). Then the covariance of
\(\vx _{t}\) is \(\vSigma + t^{2}\vI \). Thus the entropies of \(\vx \) and \(\vx _{t}\) can be upper bounded by the entropies of
normal distributions with the respective covariances, i.e., \(\frac {1}{2}\log [(2\pi e)^{D}\det (\vSigma )]\) and \(\frac {1}{2}\log [(2\pi e)^{D}\det (\vSigma + t^{2}\vI )]\) respectively, and both are \(< \infty \).
□
Integration by Parts in De Bruijn Identity
Finally, we fill in the integration-by-parts argument alluded to in the proofs of
Theorems B.2 and B.3. The argument is conceptually pretty simple but requires
some technical estimates to show that the boundary term in the integration-by-parts
vanishes.
Lemma B.2.Let \(\vx \) be any random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). For \(t \geq 0\), let \(p_{t}\) be the density of \(\vx _{t}\). Then for
a constant \(c \in \R \) it holds
where \(\odif {\sigma (\vxi )}\) denotes an integral over the “surface measure”,
i.e., the inherited measure on \(\partial \cK \), namely the boundary of \(\cK \), and accordingly \(\vxi \) takes
values on this surface and \(\vn (\vxi )\) is the unit normal vector to \(\cK \) at the surface point \(\vxi \). Now,
taking \(\plainphi (\vxi ) \doteq p_{t}(\vxi )\) and \(\psi (\vxi ) \doteq c + \log p_{t}(\vxi )\), over a ball \(B_{r}(\vzero )\) of radius \(r > 0\) centered at \(\vzero \) (so that \(\partial B_{r}(\vzero )\) is the sphere of radius \(r\)
centered at \(\vzero \) and \(\vn (\vxi ) = \vxi /\norm {\vxi }_{2} = \vxi /r\)):
where the first equality follows by dominated convergence on the integrand. It
remains to compute the last limit. For this, we take asymptotic expansions of each
term. The main device is as follows: for \(\vxi \in \partial B_{r}(\vzero )\), we have \(\norm {\vxi }_{2} = r\), so
For \(r > R > 0\) (as is suitable, because we are going to take the limit \(r \to \infty \) while \(R\) is fixed), both
sides are negative. This makes sense: most of the probability mass is contained within
the ball of radius \(R\) and thus the score points inwards, having a negative inner product
with the outward-pointing vector \(\vxi \). Thus using the appropriate bounds for \(p_{t}(\vxi )\),
So one can see that, letting the
surface area of \(\partial B_{r}(\vzero )\) be \(\omega _{D} r^{D - 1}\) where \(\omega _{D}\) is a function of \(D\), it holds
Local Invertibility of the Denoiser \(\bar {\vx }\)
Here we provide some results used in the proof of Theorem B.3 which are
appropriate generalizations of corresponding results in [Gri11].
Lemma B.3 (Generalization of [Gri11], Lemma A.1).Let \(\vx \) be any random
variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic
process (B.2.1). Let \(s, t \in [0, T]\) be such that \(0 \leq s < t \leq T\), and let \(\bar {\vx }(\vxi ) \doteq \Ex [\vx _{s} \mid \vx _{t} = \vxi ]\). The Jacobian \(\bar {\vx }^{\prime }(\vxi )\) is symmetric positive
definite.
is symmetric positive semidefinite. Indeed it is obviously symmetric (by Clairaut’s
theorem). To show its positive semidefiniteness, we plug in the expectation
representation of \(p_{t}\) given by (B.2.3) (and \(\nabla p_{t}\), \(\Delta p_{t}\) by Proposition B.1) to obtain (where \(\vx \) is as
defined and \(\vy \) is i.i.d. as \(\vx \)),
Since \(\vx \) and \(\vy \) are i.i.d., the whole integral (i.e., the original quadratic form) is \(0\) if
and only if \(s = 0\) and \(\vx \) has support entirely contained in an affine subspace which is
orthogonal to \(\vv \). But this is ruled out by assumption (i.e., that \(\vx \) has a density on \(\R ^{D}\)), so
the Jacobian \(\bar {\vx }^{\prime }(\vxi )\) is symmetric positive definite. □
Lemma B.4 (Generalization of [Gri11] Corollary A.2, Part 1).Let \(f \colon \R ^{D} \to \R ^{D}\) be any
differentiable function whose Jacobian \(f^{\prime }(\vx )\) is symmetric positive definite. Then \(f\) is
injective, and hence invertible as a function \(\R ^{D} \to \Range (f)\) where \(\Range (f)\) is the range of \(f\).
Proof.Suppose that \(f\) were not injective, i.e., there exists \(\vx , \vx ^{\prime }\) such that \(f(\vx ) = f(\vx ^{\prime })\) while \(\vx \neq \vx ^{\prime }\). Define \(\vv \doteq (\vx ^{\prime } - \vx )/\norm {\vx ^{\prime } - \vx }_{2}\).
Define the function \(g \colon \R \to \R \) as \(g(t) \doteq \vv ^{\top }f(\vx + t\vv )\). Then \(g(0) = \vv ^{\top }f(\vx ) = \vv ^{\top }f(\vx ^{\prime }) = g(\norm {\vx ^{\prime } - \vx }_{2})\). Since \(f\) is differentiable, \(g\) is differentiable, so the
derivative \(g^{\prime }\) must vanish for some \(t^{\ast } \in (0, \norm {\vx ^{\prime } - \vx }_{2})\) by the mean value theorem. However,
\begin{equation*} g^{\prime }(t^{\ast }) = \vv ^{\top }f^{\prime }(\vx + t^{\ast }\vv )\vv > 0, \end{equation*}
since the
Jacobian is positive definite. Thus we arrive at a contradiction, as claimed.
□
Combining the above two results, we obtain the following crucial result.
Corollary B.1 (Generalization of [Gri11] Corollary A.2, Part 2).Let \(\vx \) be any
random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the
stochastic process (B.2.1). Let \(s, t \in [0, T]\) be such that \(0 \leq s < t \leq T\), and let \(\bar {\vx }(\vxi ) \doteq \Ex [\vx _{s} \mid \vx _{t} = \vxi ]\). Then \(\bar {\vx }\) is injective, and
therefore invertible onto its range.
Proof.The only thing left to show is that \(\bar {\vx }\) is differentiable, but this is immediate
from Tweedie’s formula (Theorem 3.2) which shows that \(\bar {\vx }\) is differentiable if and
only if \(\nabla \log p_{t}\) is differentiable, and this is provided by Equation (B.2.3). □
Controlling the Laplacian \(\Delta \log p_{t}\)
Finally, we develop a technical estimate which is required for the proof of
Theorem B.3 and actually motivates the assumption for the viable \(t\).
Lemma B.5.Let \(\vx \) be any random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). Let \(p_{t}\) be the density of \(\vx _{t}\). Then, for \(t > 0\) it
holds
A trivial lower bound on this trace is \(0\), since covariance matrices are positive
semidefinite. To find an upper bound, note that \(\vy _{\vxi }\) takes values only in the support of \(\vx \)
(since \(p\) is a factor of the density \(q_{\vxi }\) of \(\vy _{\vxi }\)), which by Assumption B.1 is a compact set \(\cS \)
with radius \(R \doteq \sup _{\vxi \in \cS }\norm {\vxi }_{2}\). So \(\mathrm {tr}\bp {\Cov (\vy _{\vxi })} = \Ex \norm {\vy _{\vxi } - \Ex [\vy _{\vxi }]}_{2}^{2} \leq \Ex \norm {\vy _{\vxi }}_{2}^{2} \leq R^{2}\), which gives the claimed upper bound on \(\Delta \log p_{t}(\vxi )\).
Here we calculate some useful derivatives which will be reused throughout the
appendix.
Proposition B.1.Let \(\vx \) be any random variable such that Assumptions B.1and B.2hold, and let \((\vx _{t})_{t \in [0, T]}\) be the stochastic process (B.2.1). For \(t \geq 0\), let \(p_{t}\) be the density of \(\vx _{t}\). Then
Proof.We use the convolution representation of \(p_{t}\), namely (B.2.3). First
taking the time derivative, a computation yields that Proposition B.3
applies,5
so we can bring the derivative inside the integral/expectation as:
In this appendix, we differentiate under the integral sign many times, and it is
important to know when we can do this. There are two kinds of differentiating under
the integral sign:
1.
Differentiating an integral \(\int f_{t}(\vxi )\odif {\vxi }\) with respect to the auxiliary parameter \(t\).
2.
Differentiating a convolution \((f * g)(\vxi ) = \int f(\vu )g(\vxi - \vu )\odif {\vu }\) with respect to the variable \(\vxi \).
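As a quick numerical spot check of both kinds (ours, for the one-dimensional two-point source used earlier and the process \(\vx _{t} = \vx + t\vg \)), the following sketch verifies by finite differences that the time derivative and the spatial second derivative of the smoothed density are consistent with the heat-like PDE used in Section B.2.1.
\begin{verbatim}
# Numerical spot check: for the two-point source x in {0, 1} and x_t = x + t*g,
# the time derivative of p_t (differentiating under the integral in t) agrees
# with t times the spatial second derivative (differentiating the convolution),
# i.e., the heat-like PDE from Section B.2.1.  Both sides via finite differences.
import numpy as np

def p_t(xi, t):
    phi = lambda z: np.exp(-z**2 / (2 * t**2)) / np.sqrt(2 * np.pi * t**2)
    return 0.5 * phi(xi) + 0.5 * phi(xi - 1.0)

xi, t, h = 0.3, 0.7, 1e-4
dp_dt = (p_t(xi, t + h) - p_t(xi, t - h)) / (2 * h)
lap_p = (p_t(xi + h, t) - 2 * p_t(xi, t) + p_t(xi - h, t)) / h**2
print(dp_dt, t * lap_p)   # the two printed numbers should agree to several digits
\end{verbatim}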
For the first category, we give a concrete result, stated without proof; the cited source
derives it as a special case of a more general theorem about the interaction of
differential operators and tempered distributions, much beyond the scope of the book.
A full formal reference can be found in [Jon82].
Proposition B.3 ([Jon82], Section 11.12).Let \(f \colon (0, T) \times \R ^{D} \to \R \) be such that:
\(f\) is a jointly measurable function of \((t, \vxi )\);
For Lebesgue-almost every \(\vxi \in \R ^{D}\), the function \(t \mapsto f_{t}(\vxi )\) is absolutely continuous;
\(\pdv {f_{t}}{t}\) is locally integrable, i.e., for every \([t_{\min }, t_{\max }] \subseteq (0, T)\) it holds \(\int _{t_{\min }}^{t_{\max }}\int _{\R ^{D}}\left |\pdv {f_{t}}{t}(\vxi )\right |\odif {\vxi }\odif {t} < \infty \).
Then, for (Lebesgue-)almost every \(t \in (0, T)\), \(\frac {\mathrm {d}}{\mathrm {d}t}\int _{\R ^{D}}f_{t}(\vxi )\odif {\vxi } = \int _{\R ^{D}}\pdv {f_{t}}{t}(\vxi )\odif {\vxi }\).
For the second category, we give another concrete result, stated without proof but
fully formalized in [BB11].
Proposition B.4 ([BB11], Proposition 4.20).Let \(f\) be \(k\)-times continuously
differentiable with compact support, and let \(g\) be locally integrable. Then the
convolution \(f * g\) defined by \((f * g)(\vxi ) \doteq \int _{\R ^{D}}f(\vu )g(\vxi - \vu )\odif {\vu }\) is \(k\)-times continuously differentiable, with \(\partial ^{\alpha }(f * g) = (\partial ^{\alpha }f) * g\) for every multi-index \(\alpha \) with \(|\alpha | \leq k\).
Although not in the book, a simple integration by parts argument shows
that if \(g\) is also \(k\)-times differentiable, then we can “trade off” the regularity, e.g., \(\partial ^{\alpha + \beta }(f * g) = (\partial ^{\alpha }f) * (\partial ^{\beta }g)\), placing some of the derivatives on \(f\) and the rest on \(g\).
In this section, we prove Theorem 4.1. Following our conventions throughout
this appendix, we write \(\cS = \Supp (\vx )\) for the compact support of the random variable
\(\vx \).
As foreshadowed, we will make a regularity assumption on the support set \(\cS \) to
prove Theorem 4.1. One possibility for proceeding under minimal assumptions would
be to instantiate the results of [RBK18; RKB23] in our setting, since these results
apply to sets \(\cS \) with very low regularity (e.g., Cantor-like sets with fractal
structure). However, we have found precisely computing the constants in these
results, a necessary endeavor to assert a conclusion like Theorem 4.1, to
be somewhat onerous in our setting. Our approach is therefore to add a
geometric regularity assumption on the set \(\cS \) that sacrifices some generality,
but allows us to develop a more transparent argument. To avoid sacrificing
too much generality, we must ensure that low-dimensionality in the set \(\cS \) is
not prohibited. We therefore consider the running example we have used
throughout the book, the mixture of low-rank Gaussian distributions. In this
geometric setting, we model \(\cS \) as a union of hyperspheres, each of dimension \(d_k\)
(possibly much smaller than \(D\)), living in mutually orthogonal subspaces of
\(\R ^D\). This is a geometric simplification of the Gaussian mixture model: each
component’s support is approximated by a sphere in the subspace spanned by
its principal directions. When the component dimensions \(d_k\) are large, this
assumption is equivalent to a mixture of low-rank Gaussians assumption, by
high-dimensional measure concentration. The CRATE models we develop and train
in Chapters 5 and 8 have their subspace dimensions set compatibly with this
assumption.
Assumption B.3. The support \(\cS \subset \R ^D\) of the random variable \(\vx \) is a finite union of
\(K\) spheres, each with dimension \(d_k\), \(k \in [K]\). The probability that \(\vx \) is drawn from the \(k\)-th
sphere is given by \(\pi _k \in [0, 1]\), and conditional on being drawn from the \(k\)-th sphere, \(\vx \) is
uniformly distributed on that sphere. The spheres lie in mutually orthogonal
subspaces of \(\R ^D\).
We proceed under the simplifying Assumption B.3 in order to simplify excessive
technicality, and to connect to an important running example used throughout the
monograph. We believe our results can be generalized to support \(\cS \) from the class of
sets with positive reach with additional technical effort, but leave this for the
future.
B.3.1 Proof of Relationship Between Rate Distortion and Covering
We briefly sketch the proof, then establish three fundamental lemmas, and
then give the full proof. The proof will depend on notions introduced in the sketch
below.
Obtaining an upper bound on the rate distortion function (4.1.9) is straightforward:
by the rate characterization (i.e., the rate distortion function is the minimum rate
of a code for \(\vx \) with expected squared \(\ell ^2\) distortion \(\epsilon \)), upper bounding \(\cR _{\epsilon }(\vx )\) only
requires demonstrating one code for \(\vx \) that achieves this target distortion, and
any \(\epsilon \)-covering of \(\Supp (\vx )\) achieves this, with rate equal to the base-\(2\) logarithm of
the cardinality of the covering. The lower bound is more subtle. We make
use of the Shannon lower bound, discussed in Remark 4.3: working out
the constants in [LZ94, §III, (22)] gives a more precise version of the result
quoted in Equation (4.1.14) (which is stated in bits): for any random variable \(\vx \) with
compact support and a density, it holds
with entropy (etc.) in nats in this
expression. The constant can be easily estimated using Stirling’s approximation. A
quantitative form of Stirling’s approximation which is often useful gives for
any \(x > 0\) [Jam15]
Now, the important constraint for our current purposes is that the Shannon
lower bound requires the random variable \(\vx \) to have a density, which rules
out many low-dimensional distributions of interest. But let us momentarily
consider the situation when \(\vx \) does admit a density. The assumption that \(\vx \)
is uniformly distributed on its support is easily formalized in this setting:
for any Borel set \(A \subset \cS \), we have \(\Pr [\vx \in A] = \volume (A)/\volume (\cS )\); equivalently, \(\vx \) has density \(p(\vxi ) = 1/\volume (\cS )\) for \(\vxi \in \cS \) and \(0\) otherwise.
The proof then
concludes with a lemma that relates the ratio \(\volume (\cS ) / \volume (B_{\epsilon })\) to the \(\epsilon \)-covering number of \(\cS \) by \(\epsilon \)
balls.
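For reference, the volumetric comparison underlying this last step is standard: if \(\{\vxi _{1}, \dots , \vxi _{N}\}\) is an \(\epsilon \)-cover of \(\cS \), then \(\cS \subseteq \bigcup _{i = 1}^{N}B_{\epsilon }(\vxi _{i})\), so \(\volume (\cS ) \leq N\volume (B_{\epsilon })\), and hence \(\cN _{\epsilon }(\cS ) \geq \volume (\cS )/\volume (B_{\epsilon })\).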
To extend the program above to degenerate distributions satisfying Assumption B.3,
our proof of the lower bound in Theorem 4.1 will leverage an approximation
argument of the actual low-dimensional distribution \(\vx \) by “nearby” distributions which
have densities, similarly but not exactly the same as the proof sketch preceding
Theorem B.1. We will then link the parameter introduced in the approximating
sequence to the distortion parameter \(\epsilon \) in order to obtain the desired conclusion in
Theorem 4.1.
Definition B.1. Let \(\cS \) be a compact set. For any \(\delta > 0\), define the \(\delta \)-thickening of \(\cS \), denoted
\(\cS _{\delta }\), by
\begin{equation*} \cS _{\delta } \doteq \left \{\vxi \in \R ^{D} \colon \inf _{\vxi ' \in \cS }\norm {\vxi - \vxi '}_{2} \leq \delta \right \}. \end{equation*}
For a
compact set \(\cS \), Weierstrass’s theorem implies that for any \(\vxi \in \R ^D\), there is always some \(\vxi ' \in \cS \)
attaining the infimum in the distance function. Compactness of \(\cS _{\delta }\) follows
readily from compactness of \(\cS \), so \(\volume (\cS _{\delta })\) is finite for any \(\delta > 0\). It is then possible to
make the following definition of a thickened random variable, specialized to
Assumption B.3.
Definition B.2. Let \(\vx \) be a random variable such that \(\Supp (\vx ) = \cS \) is a union of \(K\)
hyperspheres, distributed as in Assumption B.3. Denote the support of each
component of the mixture by \(\cS _k\). Define the thickened random variable \(\vx _{\delta }\) as the
mixture of measures where each component measure is uniform on the thickened
set \(\cS _{k, \delta }\) (Definition B.1), for \(k \in [K]\), with mixing weights \(\pi _k\).
Lemma B.6.Suppose the random variable \(\vx \) satisfies Assumption B.3. Then if \(0 < \delta < \tfrac {1}{2}\), the
thickened random variable \(\vx _{\delta }\) (Definition B.2) satisfies for any \(\epsilon > 0\)
The proof of Lemma B.6 is deferred to Section B.3.2. Using Lemma B.6, the
above program can be realized, because the random variable \(\vx _{\delta }\) has a density that is
uniform with respect to the Lebesgue measure.
(Proof of Theorem 4.1). The upper bound is readily shown. If \(S\) is any \(\epsilon \)-cover
of the support of \(\vx \) with cardinality \(\cN _{\epsilon }(\Supp (\vx ))\), then consider the coding scheme assigning
to each \(\vxi \in \Supp (\vx )\) the reconstruction \(\hat {\vxi } = \argmin _{\vxi ' \in S}\, \norm {\vxi - \vxi '}_2\), with ties broken arbitrarily. Then ties occur with
probability zero, and the fact that \(S\) covers \(\Supp (\vx )\) at scale \(\epsilon \) guarantees distortion no
larger than \(\epsilon \); the rate of this scheme is \(\log _2 \cN _{\epsilon }(\Supp (\vx ))\).
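As a toy numerical illustration of this covering-based coding scheme (ours, not from the text): for \(\vx \) uniform on the unit circle in \(\R ^{2}\), snapping each sample to the nearest point of an \(\epsilon \)-net of the circle gives a code whose rate is the base-\(2\) logarithm of the net size and whose distortion is at most \(\epsilon \) by construction.
\begin{verbatim}
# Toy illustration (not from the text) of the covering-based upper bound: encode
# points of the unit circle in R^2 by snapping to an eps-net of the circle; the
# rate is log2(#codewords) and the distortion is at most eps by construction.
import numpy as np

eps = 0.1
n_codewords = int(np.ceil(2 * np.pi / eps))           # arc spacing <= eps, so chords are too
angles = 2 * np.pi * np.arange(n_codewords) / n_codewords
codebook = np.stack([np.cos(angles), np.sin(angles)], axis=1)

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=10_000)
x = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # samples of x, uniform on the circle

d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
x_hat = codebook[d2.argmin(axis=1)]                   # nearest-codeword reconstruction

rate_bits = np.log2(n_codewords)
distortion = np.linalg.norm(x - x_hat, axis=1).max()
print(f"rate = {rate_bits:.2f} bits, worst-case distortion = {distortion:.4f} (eps = {eps})")
\end{verbatim}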
For the lower bound, let \(0 < \delta < \tfrac {1}{2}\), and consider the thickened random variable \(\vx _{\delta }\).
By Lemma B.6, we have
in terms of the covering number.
Since \(\Supp (\vx _{\delta }) = \Supp (\vx ) + B_{\delta }\), where \(+\) here denotes the Minkowski sum, a standard application of
volume bound arguments (see e.g. [Ver18, Proposition 4.2.12]) gives
(Proof of Lemma B.6). It suffices to show that any code for \(\vx \) with expected
squared distortion \(\epsilon ^2\) produces a code for \(\vx _{\delta }\) with the same rate and distortion not
much larger, for a suitable choice of \(\delta \). So fix such a code for \(\vx \), achieving rate \(R\) and
expected squared distortion \(\epsilon ^2\). We write \(\hat {\vx }\) for the reconstructed random variable
using this code, and \(\mathrm {q} : \Supp (\vx ) \to \Supp (\vx )\) for the associated encoding-decoding mapping (i.e., \(\hat {\vx } = \mathrm {q}(\vx )\)).
Now let \(\cS _k\) denote the \(k\)-th hypersphere in the support of \(\vx \). There is an orthonormal
basis \(\vU _{k} \in \R ^{D \times d_k}\) such that \(\Span (\cS _k) = \Span (\vU _k)\). The following orthogonal decomposition of the support set \(\cS \) will be
used repeatedly throughout the proof. We have
By orthogonal projection, for any \(k \in [K]\) any \(\vxi \in \R ^D\) can be written as \(\vxi = \vxi ^{\|} + \vxi ^{\perp }\), with \(\vxi ^{\|} \in \Span (\cS _k)\) and \(\ip {\vxi ^{\|}}{\vxi ^\perp } = 0\). Then for
any \(\vxi ' \in \cS _k\), we have
\begin{equation*} \norm {\vxi - \vxi '}_{2}^{2} = \norm {\vxi ^{\|} - \vxi '}_{2}^{2} + \norm {\vxi ^{\perp }}_{2}^{2}. \end{equation*}
If \(\vxi ^{\|}\) is zero, it is clear that the above
distance is the same for every \(\vxi ' \in \cS _k\), so the minimizer is not unique. Hence, if we define a projection mapping \(\pi _{\cS _k}(\vxi ) \doteq \vxi ^{\|}/\norm {\vxi ^{\|}}_{2}\) for any
\(\vxi \in \R ^D\) with \(\vU _k^\top \vxi \neq \mathbf {0}\), then \(\pi _{\cS _k}(\vxi ) = \argmin _{\vxi ' \in \cS _k}\left\Vert \vxi ' - \vxi \right\Vert_2\). We choose \(0 < \delta < 1\), so that the thickened set \(\cS _{\delta }\) contains no points \(\vxi \in \R ^D\) at which any
of the projection maps \(\pi _{\cS _k}\) is not well-defined. So the thickened set \(\cS _{\delta }\) satisfies
We are going to show next that every such \(\vxi \in \cS _{\delta }\) can be uniquely associated to a
projection onto a single subspace in the mixture, which will allow us to define a
corresponding projection onto \(\cS \). Given a \(\vxi \in \cS _{\delta }\), by the above, we can find a subspace \(\vU _k\) such
that the orthogonal decomposition \(\vxi = \vxi ^{\|}_k + \vxi ^{\perp }_k\) satisfies
where the second line uses the orthogonality assumption on the subspaces \(\vU _k\), and
the third uses the fact that orthogonal projections are nonexpansive. Hence, the \(j\)-th
distance satisfies
This implies that if \(0 < \delta < 1/2\), every \(\vxi \in \cS _{\delta }\) has a unique closest subspace in the
mixture. Hence, under this condition, the following mapping \(\pi _{\cS } : \cS _{\delta } \to \cS \) is well-defined:
Clearly this is associated to a rate-\(R\) code for \(\vx _{\delta }\),
because it uses the encoding-decoding mappings from the rate-\(R\) code for \(\vx \). We have to
show that it achieves small distortion. We calculate
so the expectation also satisfies this estimate.
For the second term, it will suffice to characterize the density of the random variable \(\pi _{\cS }(\vx _{\delta })\)
as being sufficiently close to the density of \(\vx \)—which, as Assumption B.3 implies, is a
mixture of uniform distributions on each sub-sphere \(\cS _k\). By the argument above, every
point \(\vxi \in \cS _{\delta }\) can be associated to one and only one subspace \(\vU _k\), which means that
the mixture components in the definition of \(\cS _{\delta }\) (recall Definition B.2) do not
overlap. Hence, the density \(\pi _{\cS }(\vx _{\delta })\) can be characterized by studying the effect of \(\pi _{\cS _k}\)
on the conditional random variable \(\vx _{\delta }\), conditioned on being drawn from \(\cS _{k, \delta }\).
Denote this measure by \(\mu _{k, \delta }\). We claim that the pushforward of this measure
under \(\pi _{\cS _k}\) is uniform on \(\cS _k\). To see that this holds, we recall Equation (B.3.21),
which gives the characterization
The conditional distribution in question is
uniform on this set; we need to show that the projection \(\pi _{\cS _k}\) applied to this
conditional random variable yields a random variable that is uniform on
\(\cS _k\). With respect to these coordinates, we have seen that \(\pi _{\cS _k}(\vxi ^\| + \vxi ^\perp ) = \vxi ^\| / \norm {\vxi ^{\|}}_2\). Hence, for any \(\vxi \in \cS _{k}\),
we have that the preimage of \(\vxi \) in \(\cS _{k, \delta }\) under \(\pi _{\cS _k}\) is
To show that \((\pi _{\cS _k})_{\sharp } \mu _{k, \delta }\) is uniform, we
need to decompose the integral of the uniform density on \(\cS _{k, \delta }\) in a way that
makes it clear that each of the fibers \(\pi _{\cS _k}^{-1}(\vxi )\) (for each \(\vxi \in \cS _k\)) “contributes” equally to the
integral.6
We have by Definition B.2
In particular, the integration over the orthogonal
coordinates factors. Let \(\odif \vtheta ^{d}\) denote the uniform (Haar) measure on the sphere of radius \(1\)
in \(\R ^d\). Converting the \(\vxi ^{\|}\) integral to polar coordinates, we have
Comparing to the fiber
representation (B.3.40), we see that we need to “integrate out” over the \(r\) and \(\vxi ^\perp \)
components of the preceding integral in order to verify that the pushforward is
uniform. But this is evident, as the previous expression shows that the value of this
integral is independent of \(\vxi ^\|\)—or, equivalently in context, the value of the spherical
component \(\vtheta ^{d_k}\).
Thus it follows from the above argument that \(\pi _{\cS }(\vx _{\delta })\) is uniform on each sphere \(\cS _{k}\), conditionally on the component from which \(\vx _{\delta }\) is drawn. Because the
assumption on \(\delta \) implies that the mixture components in the distribution of \(\vx _{\delta }\) do
not overlap, the mixing weights \(\pi _k\) are also preserved in the image \(\pi _{\cS }(\vx _{\delta })\), and in
particular, the distribution of \(\pi _{\cS }(\vx _{\delta })\) is equal to the distribution of \(\vx \). Hence the
second term in Equation (B.3.27) satisfies
because \(\mathrm {q}\) is a distortion-\(\epsilon \) code for
\(\vx \).
We have thus shown that the hypothesized rate-\(R\), (expected squared) distortion-\(\epsilon ^2\)
code for \(\vx \) produces a rate-\(R\) code for \(\vx _{\delta }\) with expected squared distortion at most \((\delta + \epsilon )^{2}\). This establishes
that