In this chapter

Chapter 4
Representation Learning via Lossy Compression

We compress to learn, and we learn to compress.”

\(~\)High-Dimensional Data Analysis, Wright and Ma, 2022

Let us recap what we have covered so far. We have shown that it is possible to identify and access a distribution with arbitrary low-dimensional support, such as the one illustrated in Figure 4.1 (left), by gradually transforming the distribution of pure noise (high entropy) into the distribution to be learned via iterative denoising. We have shown that the denoiser at each iteration is directly connected to the gradient of the distribution’s log-density via Tweedie’s formula (3.2.23). We showed that in practice such a denoiser \(\bar {\vx }_{\theta }\) can be estimated using finite samples. Thus we developed a general way of learning or pursuing a data distribution.

Nevertheless, in this methodology the encoding of the distribution is implicit in the denoiser’s functional form and parameters, if any. In fact, acute readers might have noticed that for a general distribution we have never explicitly specified what the functional form for the denoiser should be. In practice, people typically model it with some deep neural network that has an empirically designed architecture. In addition, although we know the above denoising process reduces the entropy, we do not know by how much, nor do we know the entropy of the intermediate and resulting distributions. Hence it would be very useful to know the (optimal) form of the denoising operation and how to measure the change in entropy. Finally, we showed at the end of the last Chapter that if the denoising process reduces the entropy too much it leads to learning only the training data, with no possible interpolation or extrapolation. The methodological innovation which resolves all of these issues will be developed in the Chapter to come.

Measure of information. Recall that our general goal is to model data from a continuous distribution with low-dimensional support. If we want to identify the simplest model that generates the data, we might consider three typical measures of parsimony: dimension, volume, or differential entropy. If we use dimension, then obviously the best model for a given dataset is the empirical distribution itself, which is zero-dimensional.1 If we use the entropy of the data instead, then for all distributions with low-dimensional support, the differential entropy is always negative infinity. Or if we use the volume of the space spanned by the data, the volume of the support of any low-dimensional model is always zero. So, among all distributions with low-dimensional support that could have generated the same data samples, how can we decide which one is better based on these measures of parsimony that cannot distinguish among themselves at all?

In this Chapter, we discuss a framework that alleviates the above technical difficulty by associating the learned distribution with an explicit computable encoding and decoding scheme, following and making more explicit the second approach suggested at the end of Section 3.1.3. As we will see, such an approach essentially allows us to accurately approximate the entropy of the learned distributions in terms of a lossy coding rate associated with the coding scheme, as described and motivated (but not formalized) at the end of Section 3.3. Roughly speaking, such a measure can be viewed as a generalized notion of volume of the empirical data distribution. With such a measure, we can accurately quantify how much the entropy is reduced through a transformation of the distribution, say by a denoiser. In addition, with such a measure, we can derive, at least in principle, the optimal denoising operator that reduces the entropy the fastest.

Representation learning. Even though the diffusion and denoising process introduced in the previous chapter allows us, in principle, to learn an arbitrary low-dimensional distribution, it does not give an explicit representation of the distribution. As we will see, the learned distribution is to be repeatedly accessed and used for subsequent tasks (such as used as the prior for Bayesian inference in Chapter 7). Hence it is desirable that we can transform the data distribution, say by a mapping \(f(\vx ,\theta )\), to some standard structured form that facilitates efficient access, say a mixture of (incoherent) linear subspaces or low-rank Gaussians, as illustrated in Figure 4.1 right. We will formalize this process as “representation learning” in this Chapter.

PIC

Figure 4.1: Learn a structured representation (right) for the data distribution (left). Notice that in addition to learning (and sampling) the distribution of the original data as we did in the previous Chapter 3, we here want to learn a transform \(f(\vx , \theta )\) for the data distribution so as to result in a representation that has desired better geometric or statistical properties, say linear and incoherent.

As we will also see, for an arbitrary distribution, the above measure of information (coding rate) is not (easily) computable. But for a mixture of subspaces or low-rank Gaussians, we can derive an accurate closed form for the measure. By optimizing this measure, it will allow us to find the optimal representation. As we will see in the next Chapter 5, optimizing the measure also allows us to derive explicit forms of the operator that performs the transformation \(f(\vx ,\theta )\) in the most efficient way. This will lead to a principled explanation for the architecture of deep networks, as well as to more efficient deep-architecture designs.

4.1 Compression via Lossy Coding

In this section, we develop the lossy coding rate as a computable measure of complexity for distributions with low-dimensional support—a setting where classical measures such as entropy, dimension, and volume all fail to distinguish among candidate models. We first motivate the necessity of lossy coding with respect to these pathologies (Section 4.1.1), then show that the rate distortion function connects this coding-theoretic measure to the geometry of the data support via sphere covering (Theorem 4.1). This connection is the bridge between information-theoretic and geometric approaches to learning low-dimensional distributions. Finally, we develop a tight closed-form approximation to the lossy coding rate for Gaussians (Section 4.1.3) and an effective algorithm for clustering mixtures of low-dimensional Gaussians (Section 4.1.4), which will be generalized to nonlinear distributions in the following section.

4.1.1 Necessity of Lossy Coding

We have previously, multiple times, discussed a difficulty: if we learn the distribution from finite samples in the end, and our function class of denoisers contains enough functions, how do we ensure that we sample from the true distribution (with low-dimensional supports), instead of any other distribution that may produce those finite samples with high probability? Let us reveal some of the conceptual and technical difficulties with some concrete examples.

Example 4.1 (Volume, Dimension, and Entropy). For the example shown on the top of Figure 4.2, suppose we have taken some samples from a uniform distribution on a line (say in a 2D plane). The volume of the line or the sample sets is zero. Geometrically, the empirical distribution on the produced finite sample set is the minimum-dimension one which can produce the finite sample set.2 But this is in seemingly contrast with yet another measure of complexity: entropy. The (differential) entropy of the line is negative infinity but the (discrete) entropy of this sample set is finite and positive. So we seem to have a paradoxical situation according to these common measures of parsimony or complexity: they cannot properly differentiate among (models for) distributions of low-dimensional supports at all, and some seem to differentiate them even in exactly opposite manners.3 \(\blacksquare \)

Example 4.2 (Density). Consider the two sets of sampled data points shown in Figure 4.2. Geometrically, they are essentially the same: each set consists of eight points and each point has occurred with equal frequency \(1/8\)th. The only difference is that for the second data set, some points are “close” enough to be viewed as having a higher density around their respective “cluster.” Which one is more relevant to the true distribution that may have generated the samples? How can we reconcile such ambiguity in interpreting this kind of (empirical) distributions?

PIC
Figure 4.2: Eight points observed on a line with different geometric configurations; evenly distributed (top) or closely clustered (bottom).
\(\blacksquare \)

There is yet another technical difficulty associated with constructing an explicit encoding and decoding scheme for a data set. Given a sampled data set in \(\vX = [\vx _1, \ldots , \vx _N]\), how to design a coding scheme that is implementable on machines with finite memory and computing resources? Note that even representing a general real number requires an infinite number of digits or bits. Therefore, one may wonder whether the entropy of a distribution is a direct measure for the complexity of its (optimal) coding scheme. We examine this matter with another simple example.

Example 4.3 (Precision). Consider a discrete distribution \(\vX = [e, \pi ]\) with equal probability \(1/2\) taking the values of the Euler number \(e \approx 2.71828\) or the number \(\pi \approx 3.14159\). The entropy of this distribution is \(H =1\), which suggests that one may encode the two numbers by a one-bit digit \(0\) or \(1\), respectively. But can you realize a decoding scheme for this code on a finite-state machine? The answer is actually no, as it takes infinitely many bits to describe either number precisely. \(\blacksquare \)

Hence, it is generally impossible to have an encoding and decoding scheme that can precisely reproduce samples from an arbitrary real-valued distribution.4 But there would be little practical value to encode a distribution without being able to decode for samples drawn from the same distribution.

So to ensure that any encoding/decoding scheme is computable and implementable with finite memory and computational resources, we need to quantify the sample \(\vx \) and encode it only up to a certain precision, say \(\epsilon > 0\). By doing so, in essence, we treat any two data points as equivalent if their distance is less than \(\epsilon \). More precisely, we would like to consider coding schemes

\begin{equation} \vx \mapsto \hat \vx \tag{4.1.1} \end{equation}

such that the expected error caused by the quantization is bounded by \(\epsilon \). It is mathematically more convenient, and conceptually almost identical, to bound the expected squared error by \(\epsilon ^{2}\), i.e.,

\begin{equation} \Ex [d(\vx , \hat \vx )^{2}] \le \epsilon ^{2}. \tag{4.1.2} \end{equation}

Typically, the distance \(d\) is chosen to be the Euclidean distance, or the 2-norm.5 We will adopt this choice in the sequel.

4.1.2 Rate Distortion and Data Geometry

Of course, among all encoding schemes that satisfy the above constraint, we would like to choose the one that minimizes the resulting coding rate. For a given random variable \(\vx \) and a precision \(\epsilon \), this rate is known as the rate distortion, denoted as \(\cR _{\epsilon }(\vx )\).

To understand this rate distortion better, we turn to a deep theorem of information theory, originally proved in Shannon [Sha59], which relates this rate distortion to other information-theoretic quantities. To understand this theorem, we first need to define a measure of difference between two distributions known as the Kullback-Leibler (KL) divergence, or relative entropy, which is denoted by \(\KL (\cdot \mmid \cdot )\) and defined in relation to the entropy and cross-entropy (see Section 3.1.3) as

\begin{align} \KL (p \mmid q) &= \CE (p \mmid q) - h(p) = \Ex _{\vx \sim p}[-\log q(\vx )] - \Ex _{\vx \sim p}[-\log p(\vx )] \tag{4.1.3} \\ &= \Ex _{\vx \sim p}[\log (p(\vx )/q(\vx ))]. \tag{4.1.4} \end{align}

The KL divergence is always non-negative, and equals zero if and only if \(p = q\);6 in the case where \(p\) and \(q\) have the same support and both are probability densities, this can be proved by the following calculation based on Jensen’s inequality and the fact that the function \(\log (\cdot )\) is strictly concave:

\begin{align} - \KL (p\mmid q) &= -\Ex _{\vx \sim p}[\log (p(\vx )/q(\vx ))] = \Ex _{\vx \sim p}[\log (q(\vx )/p(\vx ))] \tag{4.1.5} \\ &\leq \log \Ex _{\vx \sim p}[q(\vx )/p(\vx )] = \log \int _{\R ^{D}} \frac {q(\vxi )}{p(\vxi )}p(\vxi )\odif {\vxi } \tag{4.1.6} \\ &= \log \int _{\R ^{D}}q(\vxi )\odif {\vxi } = \log \Ex _{\vx \sim q}[1] = \log 1 = 0. \tag{4.1.7} \end{align}

Then, we define the mutual information \(I(\cdot ; \cdot )\), which is a measure of dependence between two random variables, as

\begin{equation}\tag{4.1.8}\label {eq:mutual-information} I(\vx ; \hat {\vx }) = \KL (p(\vx , \hat {\vx }) \mmid p(\vx ) p(\hat {\vx })), \end{equation}

With the previous definitions in place, we may write the rate distortion \(\cR _{\epsilon }\) in purely probabilistic terms as

\begin{equation} \cR _{\epsilon }(\vx ) = \min _{p(\hat \vx \mid \vx ): \Ex [\norm {\vx -\hat \vx }_2^{2}] \le \epsilon ^{2}} I(\vx ; \hat {\vx }). \tag{4.1.9}\label {eqn:rate-distortion-general} \end{equation}

Note that the minimization in (4.1.9) is over all conditional distributions \(p(\hat \vx \mid \vx )\) that satisfy the distortion constraint \(\Ex [\norm {\vx - \hat \vx }_2^{2}] \le \epsilon ^{2}\). Each such conditional distribution induces a joint distribution \(p(\vx , \hat {\vx }) = p(\hat \vx \mid \vx ) p(\vx )\), which determines the mutual information (4.1.8). Many convenient properties of the mutual information (and hence the rate distortion) are implied by corresponding properties of the KL divergence. From the definition, we know that \(\cR _{\epsilon }(\vx )\) is a non-increasing function in \(\epsilon \).

Example 4.4 (Rate distortion of Gaussian source (Theorem 13.3.3, [CT91])). Consider a Gaussian random variable \(\vx \sim \dNorm (0, \sigma ^{2})\). For this case, it is well-known that the rate distortion function can be computed in closed form as:

\begin{equation} \cR _{\epsilon }(x) = \casework {\frac {1}{2}\log \left ( {\sigma ^{2}}/{\epsilon ^{2}}\right ), & 0 \leq \epsilon \leq \sigma , \\ 0, & \epsilon > \sigma }. \tag{4.1.10} \end{equation}
This extends in a particularly interesting way to the case of \(D\) dimensions. Suppose that \(\vx \sim \dNorm (\vzero , \vSigma )\), where \(\vSigma \in \PSD (D)\) has eigenvalues \(\lambda _{1} \geq \cdots \geq \lambda _{D} \geq 0\). Then the rate distortion is obtainable via “water filling” as
\begin{equation} \cR _{\epsilon }(\vx ) = \frac {1}{2} \sum _{i = 1}^{D}\log \rp {\frac {\lambda _{i}}{\min \{\kappa , \lambda _{i}\}}}, \tag{4.1.11} \end{equation}
where the “water level” \(\kappa \) satisfies \(\sum _{i = 1}^{D}\min \{\kappa , \lambda _{i}\} = \epsilon ^{2}\). Intuitively, directions with variance \(\lambda _{i}\) larger than \(\kappa \) are effectively encoded, while directions with smaller variance are discarded, ensuring that the distortion constraint is satisfied with minimal mutual information. Please refer to Exercise 4.4 for a detailed derivation of this closed form. \(\blacksquare \)

Remark 4.1. As it turns out, the rate distortion is an implementable approximation to the entropy of \(\vx \) in the following sense. Assume that \(\vx \) and \(\hat {\vx }\) are continuous random vectors. Then the mutual information can be written as

\begin{equation}\tag{4.1.12}\label {eq:information-entropy} I(\vx ; \hat \vx ) = h(\vx ) - h(\vx \mid \hat \vx ), \end{equation}
where \(h(\vx \mid \hat \vx ) = -\bE [\log _2 p(\vx \mid \hat \vx )]\) is the conditional entropy of \(\vx \) given \(\hat \vx \). Hence, the minimal coding rate is achieved when the difference between the entropy of \(\vx \) and the conditional entropy of \(\vx \) given \(\hat \vx \) is minimized among all distributions that satisfy the constraint: \(\mathbb {E} [\norm {\vx - \hat \vx }_2^2] \le \epsilon ^2\).

In fact, it is not necessary to assume that \(\vx \) and \(\hat \vx \) are continuous to obtain the above type of conclusion. For example, if both random vectors are instead discrete, we have after a suitable interpretation of the KL divergence for discrete-valued random vectors that

\begin{equation}\tag{4.1.13}\label {eq:mutual} I(\vx ; \hat \vx ) = H(\vx ) - H(\vx \mid \hat \vx ), \end{equation}
where \(H(\vx )\) is defined in (3.1.1) of Section 3.1.3. More generally, advanced mathematical notions from measure theory can be used to define the mutual information (and hence the rate distortion) for arbitrary random variables \(\vx \) and \(\hat \vx \), including those with rather exotic low-dimensional distributions; see Cover and Thomas [CT91, §8.5].

Remark 4.2. Given a set of data points in \(\vX = [\vx _1,\ldots , \vx _N]\), one can always interpret them as samples from a uniform discrete distribution with equal probability \(1/N\) on these \(N\) vectors. The entropy for such a distribution is \(H(\vX ) = \frac {1}{N}\log _2 N\).7 However, even if \(\vX \) is a uniform distribution on its samples, the coding rate \(\cR _{\epsilon }(\vX )\) achievable with a lossy coding scheme could be significantly lower than \(H(\vX )\) if these samples are not so evenly distributed and many are clustered closely to each other. Therefore, for the second distribution shown in Figure 4.2, for a properly chosen quantization error \(\epsilon \), the achievable lossy coding rate can be significantly lower than coding it as a uniform distribution.8 Also notice that, with the notion of rate distortion, the difficulty discussed in Example 4.3 also disappears: We can choose two rational numbers that are close enough to each of the two irrational numbers. The resulting coding scheme will have a finite complexity.

Example 4.5. Sometimes, one may face an opposite situation when we want to fix the coding rate first and try to find a coding scheme that minimizes the distortion. For example, suppose that we only want to use a fixed number of codes for points sampled from a distribution, and we want to know how to design the codes such that the average or maximum distortion is minimized during the encoding/decoding scheme. For example, given a uniform distribution on a unit square, we wonder how precisely we can encode points drawn from this distribution, with say \(n\) bits. This problem is equivalent to asking what is the minimum radius (i.e., distortion) such that we can cover the unit square with \(2^n\) discs of this radius. Figure 4.3 shows approximately optimal coverings of a square with \(n = 4, 6, 8\), so that \(2^{n} = 16, 64, 256\) discs, respectively. Notice that the optimal radii of the discs decreases as the number of discs \(2^{n}\) increases.

PIC PIC PIC
Figure 4.3: Approximations to the optimal solutions for \(2^{4}\), \(2^{6}\), and \(2^{8}\) discs covering a square, along with the corresponding radii, calculated using a heuristic optimization algorithm.
\(\blacksquare \)

PIC

Figure 4.4: The approximation of a low-dimensional distribution by \(\epsilon \) balls. We can see that as the \(\epsilon \) parameter shrinks, the union of \(\epsilon \)-balls approximates the support of the true distribution (black) increasingly well. Furthermore, the associated denoisers (whose input-output mapping is given by the provided arrows) obtained by approximating the true distribution by a mixture of Gaussians, each with covariance \((\epsilon ^{2}/D)\vI \), increasingly well-approximate the true denoisers. At large \(\epsilon \), such denoisers do not point near the true distribution at all, whereas at small \(\epsilon \) they closely approximate the true denoisers. Theorem 4.1 establishes that this approximation characterizes the rate distortion function at small distortions \(\epsilon \), unifying the parallel approaches of coding rate minimization and denoising for learning low-dimensional distributions without pathologies.

It turns out to be a notoriously hard problem to obtain closed-form expressions for the rate distortion function (4.1.9) for general distributions \(p(\vx )\). However, as Example 4.5 suggests, there are important special cases where the geometry of the support of the distribution \(p(\vx )\) can be linked to the rate distortion function and hence to the optimal coding rate at distortion level \(\epsilon \). In fact, this example can be generalized to any setting where the support of \(p(\vx )\) is a sufficiently regular compact set—including low-dimensional distributions—and \(p(\vx )\) is uniformly distributed on its support. This covers a vast number of cases of practical interest. We formalize this notion in the following result, which establishes this property for a special case.

Theorem 4.1. Suppose that \(\vx \) is a random variable such that its support \(K \doteq \Supp (\vx )\) is a compact set. Define the covering number \(\cN _{\epsilon }(K)\) as the minimum number of balls of radius \(\epsilon \) that can cover \(K\), i.e.,

\begin{equation} \cN _{\epsilon }(K) \doteq \min \left \{n \in \bN \colon \exists \vp _{1}, \dots , \vp _{n} \in K\ \text {s.t.}\ K \subseteq \bigcup _{i = 1}^{n}B_{\epsilon }(\vp _{i})\right \}, \tag{4.1.14} \end{equation}
where \(B_{\epsilon }(\vp ) = \set {\vxi \in \R ^D \given \norm {\vxi - \vp }_2 \leq \epsilon }\) is the Euclidean ball of radius \(\epsilon \) centered at \(\vp \). Then it holds
\begin{equation} \cR _{\epsilon }(\vx ) \leq \log _{2} \cN _{\epsilon }(K). \tag{4.1.15} \end{equation}
If, in addition, \(\vx \) is uniformly distributed on \(K\) and \(K\) is a union of mutually orthogonal low-dimensional hyperspheres,9 then a matching lower bound holds:
\begin{equation} \cR _{\epsilon }(\vx ) \geq \log _{2} \cN _{\epsilon }(K) - O(D). \tag{4.1.16} \end{equation}

Proof. A proof of this theorem is beyond the scope of this book and we defer it to Section B.3. □

The upper bound in Theorem 4.1 requires only that \(\Supp (\vx )\) is a compact set, while our proof of the lower bound requires stronger regularity assumptions: specifically, that \(\vx \) is uniformly distributed on \(\Supp (\vx )\) and that \(\Supp (\vx )\) is a mixture of low-rank Gaussians (with appropriate normalization).10 Under these assumptions, the implication of Theorem 4.1 can be summarized as follows: for sufficiently accurate coding of the distribution of \(\vx \) (i.e., as \(\epsilon \to 0\)), the minimum rate distortion coding framework is equivalent to the sphere packing problem on the support of \(\vx \). So the rate distortion can be thought of as a “probability-aware” way to approximate the support of the distribution of \(\vx \) by a mixture of many small balls (Figure 4.4).

We now discuss another connection between this and the denoising-diffusion-entropy complexity hierarchy we discussed earlier in this chapter.

Remark 4.3. The key ingredient in the proof of the lower bound in Theorem 4.1 is an important result from information theory known as the Shannon lower bound for the rate distortion, named after Claude Shannon, who first derived it in a special case [Sha59]. It asserts the following estimate for the rate distortion function, for any random variable \(\vx \) with a density \(p(\vx )\) and finite expected squared norm [LZ94]:

\begin{equation}\tag{4.1.17}\label {eq:slb} \cR _{\epsilon }(\vx ) \geq h(\vx ) - \log _2 \volume (B_{\epsilon }) - C_D, \end{equation}
where \(C_D > 0\) is a constant depending only on \(D\). Moreover, this lower bound is actually sharp as \(\epsilon \to 0\): that is,
\begin{equation} \lim _{\epsilon \to 0} \cR _{\epsilon }(\vx ) - \left [ h(\vx ) - \log _2 \volume (B_{\epsilon }) - C_D\right ] = 0. \tag{4.1.18} \end{equation}
So when the distortion \(\epsilon \) is small, we can think solely in terms of the Shannon lower bound, rather than the (generally intractable) optimization problem defining the rate distortion (4.1.9).

The Shannon lower bound is the bridge between the coding rate, entropy minimization/denoising, and geometric sphere packing approaches for learning low-dimensional distributions. Notice that in the special case of a uniform density \(p(\vx )\), (4.1.17) becomes

\begin{align} \cR _{\epsilon }(\vx ) &\geq -\int _K \frac {1}{\volume (K)} \log _2 \frac {1}{\volume (K)} \odif \vxi - \log _2 \volume (B_{\epsilon }) - C_D \tag{4.1.19} \\ &= \log _2 \volume (K) / \volume (B_{\epsilon }) - C_D. \tag{4.1.20}\label {eq:shannon-lower-bound-uniform} \end{align}

The ratio \(\volume (K) / \volume (B_{\epsilon })\) approximates the number of \(\epsilon \)-balls needed to cover \(K\) by a worst-case argument, which is accurate for sufficiently regular sets \(K\) when \(\epsilon \) is small (see Section B.3 for details). Meanwhile, recall the Gaussian denoising model \(\vx _{\epsilon } = \vx + \epsilon \vg \) from earlier in the Chapter, where \(\vg \sim \cN (\vzero , \vI )\) is independent of \(\vx \). Interestingly, the differential entropy of the joint distribution \((\vx , \epsilon \vg )\) can be calculated as

\begin{align} h(\vx , \epsilon \vg ) &= -\int p(\vxi ) p(\vgamma ) \log _2 p(\vxi ) p(\vgamma ) \odif \vxi \odif \vgamma \tag{4.1.21} \\ &= h(\vx ) + h(\epsilon \vg ). \tag{4.1.22} \end{align}

We have seen the Gaussian entropy calculated in Equation (3.1.4): when \(\epsilon \) is small, it is equal, up to additive constants, to the quantity \(-\log _2 \volume (B_{\epsilon })\) we have seen in the Shannon lower bound. In certain special cases (e.g., data supported on incoherent low-rank subspaces), when \(\epsilon \) is small and the support of \(\vx \) is sufficiently regular, the distribution of \(\vx _{\epsilon }\) can even be well-approximated locally by the product of the distributions \(p(\vx )\) and \(p(\vg )\), justifying the above computation. Hence the Gaussian denoising process yields yet another interpretation of the Shannon lower bound, as arising from the entropy of a noisy version of \(\vx \), with noise level proportional to the distortion level \(\epsilon \).

Thus, this finite rate distortion approach via sphere covering re-enables or generalizes all previous measures of complexity of the distribution, allowing us to differentiate between and rank different distributions in a unified way. These interrelated viewpoints are visualized in Figure 4.4.

For a general distribution at finite distortion levels, it is typically impossible to find its rate distortion function in an analytical form. One must often resort to numerical computation11. Nevertheless, as we will see, in our context we often need to know the rate distortion as an explicit function of a set of data points or their representations. This is because we want to use the coding rate as a measure of the goodness of the representations. An explicit analytical form makes it easy to determine how to transform the data distribution to improve the representation. So, we should work with distributions whose rate distortion functions take explicit analytical forms. To this end, we start with the simplest, and also the most important, family of distributions.

4.1.3 Lossy Coding Rate for a Low-Dimensional Gaussian

Now suppose we are given a set of data samples in \(\vX = [\vx _1, \ldots , \vx _N]\) from any distribution.12 We would like to come up with a constructive scheme that can encode the data up to certain precision, say

\begin{equation} \vx _i \mapsto \hat \vx _i, \quad \mbox {subject to} \quad \|\vx _i - \hat \vx _i\|_2 \le \epsilon . \tag{4.1.23} \end{equation}

Notice that this is a sufficient, explicit, and interpretable condition which ensures that the data are encoded such that \(\frac {1}{N}\sum _{i = 1}^{N} \norm {\vx _i- \hat \vx _i}_2^{2} \le \epsilon ^{2}\). This latter inequality is exactly the rate distortion constraint for the provided empirical distribution and its encoding. For example, in Example 4.5, we used this simplified criterion to explicitly find the minimum distortion and explicit coding scheme for a given coding rate.

Without loss of generality, let us assume the mean of \(\vX \) is zero, i.e., \(\frac {1}{N} \sum _{i = 1}^{N} \vx _i = \boldsymbol {0}\). Without any prior knowledge about the nature of the distribution behind \(\vX \), we may view \(\vX \) as sampled from a Gaussian distribution \(\mathcal {N}(\boldsymbol {0}, {\boldsymbol {\Sigma }})\) with the covariance13

\begin{equation} {\boldsymbol {\Sigma }} = \frac {1}{N} \vX \vX ^\top . \tag{4.1.24} \end{equation}

Notice that geometrically \(\boldsymbol {\Sigma }\) characterizes an ellipsoidal region where most of the samples \(\vx _i\) reside.

We may view \(\hat \vX = [\hat \vx _1,\ldots , \hat \vx _N]\) as a noisy version of \(\vX = [\vx _1, \ldots , \vx _N]\):

\begin{equation} \hat \vx _i = \vx _i + \vw _i, \tag{4.1.25} \end{equation}

where \(\vw _i \sim \mathcal {N}(\boldsymbol {0} , {\epsilon ^2} \boldsymbol {I}/{D})\) is a Gaussian noise independent of \(\vx _i\). Then the covariance of \(\hat \vx _i\) is given by

\begin{equation} \hat {\boldsymbol {\Sigma }} = \mathbb {E}\left [\hat \vx _i \hat \vx _i^\top \right ] = \frac {\epsilon ^2}{D} \boldsymbol {I} + \frac {1}{N} \vX \vX ^\top . \tag{4.1.26} \end{equation}

Note that the volume of the region spanned by the vectors \(\vx _i\) is proportional to the square root of the determinant of the covariance matrix

\begin{equation} \mbox {volume}(\hat \vx _i) \propto \sqrt {\det \big (\hat {\boldsymbol {\Sigma }}\big )} = \sqrt {\det \left (\frac {\epsilon ^2}{D} \boldsymbol {I} + \frac {1}{N} \vX \vX ^\top \right )}. \tag{4.1.27} \end{equation}

The volume spanned by each random vector \(\boldsymbol {w}_i\) is proportional to

\begin{equation} \mbox {volume}(\boldsymbol {w}_i) \propto \sqrt {\det \left (\frac {\epsilon ^2}{D} \boldsymbol {I} \right )}. \tag{4.1.28} \end{equation}

PIC

Figure 4.5: Covering the region spanned by the data vectors using \(\epsilon \)-balls. The larger the volume of the space, the more balls are needed, hence the more bits are needed to encode and enumerate the balls. Each real-valued vector \(\vx \) can be encoded as the number of the ball which it falls into.

To encode vectors that fall into the region spanned by \(\hat \vx _i\), we can cover the region with non-overlapping balls of radius \(\epsilon \), as illustrated in Figure 4.5. When the volume of the region spanned by \(\hat \vx _i\) is significantly larger than the volume of the \(\epsilon \)-ball, the total number of balls that we need to cover the region is approximately equal to the ratio of the two volumes:

\begin{equation} \# \,\epsilon \mbox {-balls} \approx \frac {\mbox {volume}(\hat \vx _i)}{\mbox {volume}(\vw _i)} = \sqrt {\det \left (\boldsymbol {I} + \frac {D}{N\epsilon ^2} \vX \vX ^\top \right )}. \tag{4.1.29} \end{equation}

If we use binary numbers to label all the \(\epsilon \)-balls in the region of interest, the total number of binary bits needed is thus

\begin{equation} \cR _{\epsilon }(\vX ) \approx \log _2 (\# \,\epsilon \mbox {-balls}) \approx R_{\epsilon }(\vX ) \doteq \frac {1}{2} \log \det \left (\boldsymbol {I} + \frac {D}{N\epsilon ^2} \vX \vX ^\top \right ). \tag{4.1.30}\label {eqn:rate-Gaussian} \end{equation}

Notice that from Example 4.4 one can show that the right-hand side of this formula is the same as the rate distortion (up to precision \(\epsilon \)) of a Gaussian source with covariance matrix \(\frac {1}{N} \vX \vX ^\top + (\epsilon ^{2}/D) \vI \) (e.g., the empirical covariance matrix of the data plus a noise term). In this way, a geometric argument about the volume of the enclosed vectors and sphere packing translates into an information-theoretic argument about the rate distortion of a perturbed Gaussian source.

Example 4.6. Figure 4.5 shows an example of a 2D distribution with an ellipsoidal support – approximating the support of a 2D Gaussian distribution. The region is covered by small balls of size \(\epsilon \). All the balls are numbered from \(1\) to say \(n\). Then given any vector \(\vx \) in this region, we only need to determine to which \(\epsilon \)-ball center it is the closest, denoted as \(\operatorname {ball}_{\epsilon }(\vx )\). To remember \(\vx \), we only need to remember the number of this ball, which takes \(\log (n)\) bits to store. If we need to decode \(\vx \) from this number, we simply take \(\hat \vx \) as the center of the ball. This leads to an explicit encoding and decoding scheme:

\begin{equation} \vx \longrightarrow \operatorname {ball}_{\epsilon }(\vx ) \longrightarrow \hat \vx = \mbox {center of} \operatorname {ball}_{\epsilon }(\vx ). \tag{4.1.31} \end{equation}
One may refer to these ball centers as “codes” of a code book or a dictionary for the encoding scheme. It is easy to see that the accuracy of this (lossy) encoding-decoding scheme is about the radius of the ball \(\epsilon \). Clearly \(\cR _{\epsilon }(\vZ )\) is the average number of bits required to encode the ball number of each vector \(\vz \) with this coding scheme, and hence can be called the coding rate associated with this scheme. \(\blacksquare \)

From the above derivation, we know that the coding rate \(\cR _{\epsilon }(\vX )\) is (approximately) achievable with an explicit encoding (and decoding) scheme. It has two interesting properties:

In our context, the closed-form expression \(R_{\epsilon }(\vX )\) is rather fundamental: it is the coding rate associated with an explicit and natural lossy coding scheme for data drawn from either a Gaussian distribution or a linear subspace. As we will see in the next chapter, this formula plays an important role in understanding the architecture of deep neural networks.

4.1.4 Clustering a Mixture of Low-Dimensional Gaussians

As we have discussed before, the given dataset \(\vX \) often has low-dimensional intrinsic structures. Hence, encoding it as a general Gaussian would be very redundant. If we can identify those intrinsic structures in \(\vX \), we could design much better coding schemes that give much lower coding rates. Or equivalently, the codes used to encode such \(\vX \) can be compressed. We will see that compression gives a unifying computable way to identify such structures. In this section, we demonstrate this important idea with the most basic family of low-dimensional structures: a mixture of (low-dimensional) Gaussians or subspaces.

Example 4.7. Figure 4.6 shows an example in which the data \(\vX \) are distributed around two subspaces (or low-dimensional Gaussians). If they are viewed and coded together as one single Gaussian, the associated discrete (lossy) code book, represented by all the blue balls, is obviously very redundant. We can try to identify the locations of the two subspaces, denoted by \(S_1\) and \(S_2\), and design a code book that only covers the two subspaces, i.e., the green balls. If we can correctly partition samples in the data \(\vX \) into the two subspaces: \(\vX = [\vX _1, \vX _2]\bm \Pi \) with \(\vX _1 \in S_1\) and \(\vX _2 \in S_2\), where \(\bm \Pi \) denotes a permutation matrix, then the resulting coding rate for the data will be much lower. This gives a more parsimonious, hence more desirable, representation of the data. \(\blacksquare \)

PIC

Figure 4.6: Comparison of two lossy coding schemes for data that are distributed around two subspaces. One is to pack (blue) \(\epsilon \)-balls for the entire space spanned by the two subspaces; the other is to pack balls only in a tabular neighborhood around the two subspaces. The latter obviously has a much smaller code book and results in a much lower coding rate for samples on the subspaces.

So, more generally speaking, if the data are drawn from any mixture of subspaces or low-dimensional Gaussians, it would be desirable to identify those components and encode the data based on the intrinsic dimensions of those components. It turns out that we do not lose much generality by assuming that the data are drawn from a mixture of low-dimensional Gaussians. This is because a mixture of Gaussians can closely approximate most general distributions [BDS16].

The clustering problem. Now for this specific family of distributions, how can we effectively and efficiently identify those low-dimensional components from a set of samples

\begin{equation} \vX = \left [\vx _1, \vx _2, \ldots , \vx _N\right ], \tag{4.1.32} \end{equation}

drawn from them? In other words, given the whole data set \(\vX \), we want to partition, or cluster, it into multiple, say \(K\), subsets:

\begin{equation} \vX \bm \Pi = [\vX _1, \vX _2, \dots , \vX _K], \tag{4.1.33} \end{equation}

where each subset consists of samples drawn from only one low-dimensional Gaussian or subspace and \(\bm \Pi \) is a permutation matrix to indicate membership of the partition. Note that, depending on the situation, the partition could be either deterministic or probabilistic. As shown in [MDH+07b], for a mixture of Gaussians, probabilistic partition does not lead to a lower coding rate. So for simplicity, we here consider a deterministic partition only.

Clustering via lossy compression. The main difficulty in solving the above clustering problem is that we normally do not know the number of clusters \(K\), nor do we know the dimension of each component. There has been a long history for the study of this clustering problem. The textbook [VMS16] gives a systematic and comprehensive coverage of different approaches to this problem. To find an effective approach to this problem, we first need to understand and clarify why we want to cluster. In other words, what exactly do we gain from clustering the data, compared with not to? How do we measure the gain? From the perspective of data compression, a correct clustering should lead to a more efficient encoding (and decoding) scheme.

For any given data set \(\vX \), there are already two obvious encoding schemes as the baseline. They represent two extreme ways to encode the data:

Note that either coding scheme can become the “optimal” solution for certain (extreme) choice of the quantization error \(\epsilon \):

1.
Lazy Regime: If we choose \(\epsilon \) to be extremely large, all samples in \(\vX \) can be covered by a single ball. The rate is \(\lim _{\epsilon \rightarrow \infty } \cR _\epsilon \rightarrow \frac {1}{2}\log \det (\boldsymbol {I}) = 0\).
2.
Memorization Regime: If \(\epsilon \) is extremely small, every sample in \(\vX \) is covered by a different \(\epsilon \)-ball, hence the total is \(N\). The rate is \(\lim _{\epsilon \rightarrow 0} \cR _\epsilon \rightarrow \log (N)\).

Note that the first scheme corresponds to the scenario when one does not care about anything interesting about the distribution at all. One does not want to spare any bit for anything informative. We call this the “lazy regime.” The second scheme corresponds to the scenario when one wants to decode every sample with an extremely high precision. So one would better “memorize” every sample. We call this the “memorization regime.”

Notice that these asymptotics in terms of \(\epsilon \) are exactly those discussed when introducing the lossy coding rate as a quantity that realistically-trained diffusion models approximate, as in the end of Section 3.3. In the current situation, we explicitly code the distribution up to precision \(\epsilon \); in the case of diffusion models, we implicitly code the distribution. Certain properties of the scheme such as memorization and generalization manifest similarly in both cases, and indeed one can study one case to understand the other.

PIC

Figure 4.7: A number of random samples on a 2D plane. Consider an \(\epsilon \)-disc assigned to each sample with the sample as its center. The density of the samples increases from left to right.

Example 4.8. To see when the memorization regime is preferred or not, let us consider a number, say \(N\), of samples randomly distributed in a unit area on a 2D plane.14 Imagine we try to design a lossy coding scheme with a fixed quantization error \(\epsilon \). This is equivalent to putting an \(\epsilon \)-disc around each sample, as shown in Figure 4.7. When \(N\) is small, the chance that all the discs overlap with each other is zero. A codebook of size \(N\) is necessary and optimal in this case. When \(N\) or the density reaches a certain critical value \(N_c\), with high probability all the discs start to overlap and connect into one cluster that covers the whole plane—this phenomenon is known as continuum “percolation” [Gil61; MM12]. When \(N\) becomes larger than this value, the discs overlap heavily. The number \(N\) of discs becomes very redundant because we only want to encode points on the plane up to the given precision \(\epsilon \). The number of discs needed to cover all the samples is much less than \(N\).15 \(\blacksquare \)

Both the lazy and memorization regimes are somewhat trivial and perhaps are of little theoretical or practical interest. Either scheme would be far from optimal when used to encode a large number of samples drawn from a distribution that has a compact and low-dimensional support. The interesting regime exists in between these two.

Example 4.9.

PIC PIC
Figure 4.8: Top: 358 noisy samples drawn from two lines and one plane in \(\R ^3\). Bottom: the effect of varying \(\epsilon \) on the clustering result and the coding rate. The red line marks the variance \(\epsilon _0\) of the Gaussian noise added to the samples. Note that the bottom left and middle plots show the same quantity with different axis locations and scales.

Figure 4.8 shows an example with noisy samples drawn from two lines and one plane in \(\R ^3\). As we notice from the plot (c) on the right, the optimal coding rate decreases monotonically as we increase \(\epsilon \), as anticipated from the property of the rate distortion function. The plots (a) and (b) show, when varying \(\epsilon \) from very small (near zero) to very large (towards infinite), the optimal number of clusters when the coding rate is minimal. We can clearly see the lazy regime and the memorization regime on the two ends of the plots. But one can also notice in plot (b), when the quantization error \(\epsilon \) is chosen to be around the level of the true noise variance \(\epsilon _0\), the optimal number of clusters is the “correct” number three that represents two lines and one plane. We informally refer to this middle regime as the “generalization regime”. Notice that a sharp phase transition takes place between these regimes.16 \(\blacksquare \)

From the above discussion and examples, we see that, when the quantization error relative to the sample density17 is in a proper range, minimizing the lossy coding rate would allow us to uncover the underlying (low-dimensional) distribution of the sampled data. Hence, quantization, started as a choice of practicality, seems to be becoming necessary for learning a continuous distribution from its empirical distribution with finite samples. Although a rigorous theory for explaining this phenomenon remains elusive, here, for learning purposes, we care about how to exploit the phenomenon to design algorithms that can find the correct distribution.

Let us use the simple example shown in Figure 4.6 to illustrate the basic ideas. If one can partition all samples in \(\vX \) into two clusters in \(\vX _1\) and \(\vX _2\), with \(N_1\) and \(N_2\) samples respectively, then the associated coding rate would be18

\begin{equation} R_{\epsilon }^c(\vX \mid \boldsymbol {\Pi }) = \frac {N_1}{N}R_{\epsilon }(\vX _1) + \frac {N_2}{N}R_{\epsilon }(\vX _2), \tag{4.1.36} \end{equation}

where we use \(\boldsymbol {\Pi }\) to indicate membership of the partition. If the partition respects the low-dimensional structures of the distribution, in this case \(\vX _1\) and \(\vX _2\) belonging to the two subspaces respectively, then the resulting coding rate should be significantly smaller than the above two basic schemes:

\begin{equation} R_{\epsilon }^c(\vX \mid \boldsymbol {\Pi }) \ll R_{\epsilon }(\vX ), \quad R_{\epsilon }^c(\vX \mid \boldsymbol {\Pi }) \ll R_0(\vX ). \tag{4.1.37} \end{equation}

In general, we can cast the clustering problem into an optimization problem that minimizes the coding rate:

\begin{equation} \min _{\boldsymbol {\Pi }} \bc { R_{\epsilon }^c(\vX \mid \boldsymbol {\Pi }) \doteq \sum _{k=1}^K \frac {N_k}{N}R_{\epsilon }(\vX _k)}. \tag{4.1.38} \end{equation}

Optimization strategies to cluster. The remaining question is how we optimize the above coding rate objective to find the optimal clusters. There are three natural approaches to this objective:

1.
We may start with the whole set \(\vX \) as a single cluster (i.e. the lazy regime) and then search (say randomly) to partition it so that it would lead to a smaller coding rate.
2.
Inversely, we may start with each sample \(\vx _i\) as its own cluster (i.e. the memorization regime) and search to merge clusters that would result in a smaller coding rate.
3.
Alternatively, if we could represent (or approximate) the membership \(\boldsymbol {\Pi }\) as some continuous parameters, we may use optimization methods such as gradient descent (GD).

The first approach is not so appealing computationally as the number of possible partitions that one needs to try is exponential in the number of samples. For example, the number of partitions of \(\vX \) into two subsets of equal size is \(N \choose N/2\) which explodes as \(N\) becomes large. We will explore the third approach in the next Chapter 5. There, we will see how the role of deep neural networks, transformers in particular, is connected with the coding rate objective.

The second approach was originally suggested in the work of [MDH+07b]. It demonstrates the benefit of being able to evaluate the coding rate efficiently (say with an analytical form). With it, the (low-dimensional) clusters of the data can be found rather efficiently and effectively via the principle of minimizing coding length (MCL). Note that for a cluster \(\vX _k\) with \(N_k\) samples, the length of binary bits needed to encode all the samples in \(\vX _k\) is given by:19

\begin{equation} L(\vX _k) = N_k R_\epsilon (\vX _k). \tag{4.1.39} \end{equation}

If we have two clusters \(\vX _k\) and \(\vX _l\), if we want to code the samples as two separate clusters, the length of binary bits needed is

\begin{equation*} L^c(\vX _k, \vX _l) = N_k R_\epsilon (\vX _k) + N_l R_\epsilon (\vX _l) - N_k \log \frac {N_k}{N_k + N_l} - N_l \log \frac {N_l}{N_k + N_l}. \end{equation*}

The last two terms are the number of bits needed to encode the memberships of samples according to the Huffman code.

Then, given any two separate clusters \(\vX _1\) and \(\vX _2\), we can decide whether to merge them or not based on the difference between the two coding lengths:

\begin{equation} L(\vX _k \cup \vX _l) - L^c(\vX _k, \vX _l) \tag{4.1.40} \end{equation}

is positive or negative and \(\vX _k \cup \vX _l\) denotes the union of the sets of samples in \(\bm X_k\) and \(\bm X_l\). If it is positive , it means the coding length would become smaller if we merge the two clusters into one. This simple fact leads to the following clustering algorithm proposed by [MDH+07b]:

Algorithm 4.1: Pairwise Steepest Descent of Coding Length
Input:
\(N\) data points \(\{\vx _{i}\}_{i = 1}^{N}\)
Output:
A set \(\cC \) of clusters
1:
procedure PairwiseSteepestDescentOfCodingLength(\(\{\vx _{i}\}_{i = 1}^{N}\))
# Initialize \(N\) clusters \(\vX _{k}\) with one element each
2:
\(\cC \gets \{\{\bm x_{i}\}\}_{i = 1}^{N}\)
3:
while \(\abs {\cC } > 1\) do
# If no bits are saved by any merging
4:
if \(\displaystyle \min _{\vX _{k}, \vX _{l} \in \cC }[L(\vX _{k} \cup \vX _{l}) -L^{c}(\vX _{k}, \vX _{l})] \geq 0\) then
# Early return \(\cC \) and exit
5:
return \(\cC \)
6:
else
# Merge clusters which save the most bits
7:
\(\displaystyle \vX _{k^{\ast }}, \vX _{l^{\ast }} \gets \argmin _{\vX _{k}, \vX _{l} \in \cC }[L(\vX _{k} \cup \vX _{l}) - L^{c}(\vX _{k}, \vX _{l})]\)
# Remove unmerged clusters and add back the merged one
8:
\(\displaystyle \cC \gets [\cC \setminus \{\vX _{k^{\ast }}, \vX _{l^{\ast }}\}] \cup \{\vX _{k^{\ast }} \cup \vX _{l^{\ast }}\}\)
9:
end if
10:
end while
# If all merges yield savings, return one cluster
11:
return \(\cC \)
12:
end procedure

Note that this algorithm is tractable as the total number of (pairwise) comparisons and merges is about \(O(N^2\log N)\). However, due to its greedy nature, there is no theoretical guarantee that the process will converge to the globally optimal clustering solution. Nevertheless, as reported in [MDH+07b], in practice, this seemingly simple algorithm works extremely well. The clustering results plotted in Figure 4.8 were actually computed by this algorithm.

Example 4.10 (Image Segmentation). The above measure of coding length and the associated clustering algorithm assume the data distribution is a mixture of (low-dimensional) Gaussians. Although this seems somewhat idealistic, the measure and algorithm can already be very useful and even powerful in scenarios when the model is (approximately) valid.

For example, a natural image typically consists of multiple regions with nearly homogeneous textures. If we take many small windows from each region, they should resemble samples drawn from a (low-dimensional) Gaussian, as illustrated in Figure 4.9. Figure 4.10 shows the results of image segmentation based on applying the above clustering algorithm to the image patches directly. More technical details regarding customizing the algorithm to the image segmentation problem can be found in [MRY+11]. \(\blacksquare \)

PIC

Figure 4.9: Image patches with a size of \(w\times w\) pixels.

PIC

Figure 4.10: Segmentation results based on the clustering algorithm applied to the image patches.

4.2 Transformation via Maximizing Information Gain

So far in this chapter, we have discussed how to identify a distribution with low-dimensional structures through the principle of compression. As we have seen from the previous two sections, computational compression can be realized through either the denoising operation or clustering. Figure 4.11 illustrates this concept with our favorite example.

It is important to note that the clustering algorithm and the coding rate formulas developed in the previous section assumed that the data distribution is already (approximately) a mixture of low-dimensional Gaussians. In practice, the raw data—images, audio, text—rarely satisfy this assumption directly: their intrinsic structures are typically nonlinear. The purpose of this section is to develop a framework for transforming data so that their representations do take the form of a mixture of incoherent low-dimensional subspaces, at which point the tools from the preceding section become directly applicable. That is, the preceding section answers “how to compress and cluster data that are already Gaussian,” while this section answers “how to transform data so that they become Gaussian.” As we will see in Chapter 5, deep networks are precisely the computational mechanism for realizing such transformations, with the network’s architecture derived from optimizing the coding rate objective developed here.

Indeed, the ultimate goal for identifying a data distribution is to use it to facilitate certain subsequent tasks such as segmentation, classification, or generation (of images). Hence, how the resulting distribution is “represented” matters tremendously with respect to how information related to these subsequent tasks can be efficiently and effectively retrieved and utilized. This naturally raises a fundamental question: what makes a representation truly “good” for downstream use? In the following, we will explore the essential properties that a meaningful and useful representation should possess, and how these properties can be explicitly characterized and pursued via maximizing information gain.

How to measure the goodness of representations. One may view a given dataset as samples of a random vector \(\vx \) with a certain distribution in a high-dimensional space, say \(\R ^D\). Typically, the distribution of \(\vx \) has a much lower intrinsic dimension than the ambient space. Generally speaking, learning a representation refers to learning a continuous mapping, say \(f(\cdot )\), that transforms \(\vx \) to a so-called feature vector \(\vz \) in another (typically lower-dimensional) space, say \(\R ^d\), where \(d < D\). It is hopeful that through such a mapping

\begin{equation} \vx \in \R ^D \xrightarrow { f(\vx )} \vz \in \R ^d, \tag{4.2.1}\label {eqn:chap4-1-encoding} \end{equation}

the low-dimensional intrinsic structures of \(\vx \) are identified and represented by \(\vz \) in a more compact and structured way so as to facilitate subsequent tasks such as classification or generation. The feature \(\vz \) can be viewed as a (learned) compact code for the original data \(\vx \), so the mapping \(f\) is also called an encoder. The fundamental question of representation learning is

What is a principled and effective measure for the goodness of representations?

PIC

Figure 4.11: Identifying a low-dimensional distribution with two subspaces (left) via denoising or clustering, starting from a generic random Gaussian distribution (right).

Conceptually, the quality of a representation \(\vz \) depends on how well it identifies the most relevant and sufficient information of \(\vx \) for subsequent tasks and how efficiently it represents this information. For a long time, it was believed and argued that the “sufficiency” or “goodness” of a learned feature representation should be defined in terms of a specific task. For example, \(\vz \) just needs to be sufficient for predicting the class label \(\y \) in a classification problem. Below, let us start with the classic problem of image classification and argue why such a notion of a task-specific “representation” is limited and needs to be generalized.

4.2.1 Representation Learning from Classified Data

Suppose that \(\bm {x} \in \R ^D\) is a random vector drawn from a mixture of \(K\) (component) distributions \(\mathcal {D} = \{\mathcal {D}_k\}_{k=1}^K\). Given a finite set of i.i.d. samples \(\vX = [\vx _1, \vx _2, \ldots , \vx _N] \in \Re ^{D\times N}\) of the random vector \(\bm x\), we seek a good representation through a continuous mapping \(f(\vx ): \R ^D \rightarrow \R ^d\) that captures intrinsic structures of \(\vx \) and best facilitates the subsequent classification task.20 To ease the task of learning distribution \(\mathcal {D}\), in the popular supervised classification setting, a true class label (or a code word for each class), usually represented by a one-hot vector \(\y _i \in \R ^K\), is given for each sample \(\vx _i\).

Encoding class information via cross entropy. Extensive studies have shown that for many practical datasets (e.g., images, audio, and natural languages), the (encoding) mapping from the data \(\bm {x}\) to its class label \(\bm {y}\) can be effectively modeled by training a deep network,21 here denoted as

\[f(\vx , \theta ):\vx \mapsto \y \]

with network parameters \(\theta \in \Theta \), where \(\Theta \) denotes the parameter space. For the output \(f(\vx , \theta )\) to match well with the label \(\y \), we like to minimize the cross-entropy loss over a training set \(\{(\vx _i, \y _i)\}_{i=1}^N\):22

\begin{equation} \min _{\theta \in \Theta } \; \mathbb {E}[\CE (\vy \mmid f(\vx , \theta ))] \, \approx - \frac {1}{N}\sum _{i=1}^N \langle \y _i, \log \left (f(\vx _i, \theta )\right ) \rangle . \tag{4.2.2}\label {chap4-eqn:cross-entropy} \end{equation}

The optimal network parameters \(\theta \) are typically found by optimizing the above objective through an efficient gradient descent scheme, with gradients computed via back propagation, as described in Section A.2.3 of Chapter A.

Despite its effectiveness and enormous popularity, there are two serious limitations with this approach: (1) It aims only to predict the labels \(\y \) even if they might be mislabeled. Empirical studies show that deep networks, used as a “black box,” can even fit random labels [ZBH+17]. (2) With such an end-to-end data fitting, despite plenty of empirical efforts in trying to interpret the so-learned features, it is not clear to what extent the intermediate features learned by the network capture the intrinsic structures of the data that make meaningful classification possible in the first place. The precise geometric and statistical properties of the learned features are also often obscured, which leads to the lack of interpretability and subsequent performance guarantees (e.g., generalizability, transferability, and robustness, etc.) in deep learning. Therefore, one of the goals of this section is to address such limitations by reformulating the objective towards learning explicitly meaningful and useful representations for the data \(\vx \), not limited to classification.

PIC

Figure 4.12: Evolution of penultimate layer outputs of a VGG13 neural network when trained on the CIFAR-10 dataset with 3 randomly selected classes. Figure from [PHD20].

Minimal discriminative features via information bottleneck. One popular approach to interpret the role of deep networks is to view outputs of intermediate layers of the network as selecting certain latent features \(\vz = f(\vx , \theta ) \in \Re ^d\) of the data that are discriminative among multiple classes. Learned representations \(\vz \) then facilitate the subsequent classification task for predicting the class label \(\y \) by optimizing a classifier \(g(\vz )\):

\begin{equation} \vx \xrightarrow { f(\vx , \theta )} \vz \xrightarrow { g(\vz ) } \y . \tag{4.2.3} \end{equation}

We know from (4.1.13) that the mutual information between two random variables, say \(\vx ,\vz \), is defined to be

\begin{equation} I(\vx ; \vz ) = H(\vx ) - H(\vx \mid \vz ), \tag{4.2.4} \end{equation}

where \(H(\vx \vert \vz )\) is the conditional entropy of \(\vx \) given \(\vz \). The mutual information is also known as the information gain: It measures how much the entropy of the random variable \(\vx \) can be reduced once \(\vz \) is given. Or equivalently, it measures how much information \(\vz \) contains about \(\vx \). The information bottleneck (IB) formulation [TZ15] further hypothesizes that the role of the network is to learn \(\vz \) as the minimal sufficient statistics for predicting \(\y \). Formally, it seeks to maximize the mutual information \(I(\vz , \y )\) between \(\vz \) and \(\y \) while minimizing the mutual information \(I(\vx , \vz )\) between \(\vx \) and \(\vz \):

\begin{equation} \max _{\theta \in \Theta }\; \mbox {IB}(\vx , \y , \vz ) \doteq I(\vz ; \y ) - \beta I(\vx ; \vz ) \quad \ \mathrm {s.t.}\ \vz = f(\vx , \theta ), \tag{4.2.5}\label {chap4-eqn:information-bottleneck} \end{equation}

where \(\beta >0\).

Given one can overcome some caveats associated with this framework [KTV18], such as how to accurately evaluate mutual information with finite samples of degenerate distributions, this framework can be helpful in explaining certain behaviors of deep networks. For example, recent work [PHD20] indeed shows that the representations learned via the cross-entropy loss (4.2.2) exhibit a neural collapse phenomenon. That is, features of each class are mapped to a one-dimensional vector whereas all other information of the class is suppressed, as illustrated in Figure 4.12.23

Remark 4.4 (Neural Collapse). Neural collapse refers to a phenomenon observed in deep neural networks trained for classification, where the learned feature representations and classifier weights exhibit highly symmetric and structured behavior during the terminal phase of training [PHD20; YWZ+22; ZDZ+21]. Specifically, within each class, features collapse to their class mean, and across classes, these means become maximally separated, forming a simplex equiangular configuration. The linear classifier aligns with the class mean up to rescaling. Additionally, the last-layer classifier converges to choosing whichever class has the nearest train class mean. Neural collapse reveals deep connections between optimization dynamics, generalization, and geometric structures arising in supervised learning.

From the above example of classification, we see that the so-learned representation \(\vz \) of \(\vx \) gives a very simple encoder that essentially maps each class of data to only one code word: the one-hot vector representing each class. From the lossy compression perspective, such an encoder is too lossy to preserve information in the data distribution. Other information, such as that useful for tasks such as image segmentation, completion, or generation, is severely lost in such a supervised learning process.

To remedy this situation, we want to learn a different encoding scheme such that the resulting feature representation \(\vz \) can capture as much information about the data distribution as possible, not limited to that useful for classification alone. More precisely, we want the representation \(\vz = f(\vx , \theta )\) to be such that there exists a decoder \(g(\vz ,\eta )\) so that:

\begin{equation} \|\vx - \hat \vx \|_2 \le \epsilon \quad \text {s.t.} \quad \hat {\vx } = g(f(\vx )). \tag{4.2.6} \end{equation}

We will formally introduce such an encoding-decoding scheme as an “autoencoder” and such a representation \(\vz \) as a “consistent” representation in Chapter 6. For now, the reader may simply view this as a generalization to the quantization code \(\hat \vx \) introduced in the preceding section for the rate distortion (in the native data \(\vx \) domain).

In the formulation of information bottleneck (4.2.5), it does not require the learned feature representation \(\vz \) to be consistent. Hence, at least it should be fixed with an objective for a (consistent) representation learning (RL):

\begin{eqnarray} \max _{\theta ,\eta }\; \mbox {RL}(\vx , \y , \vz ) \doteq I(\vz ; \y ) - \beta I(\vx ; \vz ) \nonumber \\ \mathrm {s.t.}\quad \vz = f(\vx ,\theta ),\quad \|\vx - g(\vz , \eta ) \|_2 \le \epsilon . \tag{4.2.7}\label {chap4-eqn:representation-learning} \end{eqnarray}

Solving the above problem in the most general setting is a very daunting task. In general, we do not even know what classes of encoders or decoders could ensure the consistency24 Also, since we do not know the distribution of the representation \(\vz \), computing the mutual information \(I(\vz ;\y )\) can be very challenging, if not impossible.

Before we study how to approach this problem in general settings (in later Chapters), for the rest of this chapter, we will demonstrate how to solve it, in some limited way, for the special setting when the data distribution consists of clear (low-dimensional) classes.

4.2.2 Linear Discriminative Representations

Whether the given data \(\vX \) of a mixed distribution \(\mathcal {D}\) can be effectively classified or clustered depends on how separable (or discriminative) the component distributions \(\mathcal {D}_k\) are (or can be made). One popular working assumption is that the distribution of each class has relatively low-dimensional intrinsic structures. Hence we may assume that the distribution \(\mathcal {D}_k\) of each class has a support on a low-dimensional submanifold, say \(\mathcal {M}_k\) with dimension \(d_k \ll D\), and the distribution \(\mathcal D\) of \(\vx \) is supported on the mixture of those submanifolds, \(\mathcal M = \cup _{k=1}^K \mathcal {M}_k\), in the high-dimensional ambient space \(\Re ^D\), as illustrated before in Figure 4.1.

Not only do we need to identify the low-dimensional distribution, but we also want to represent the distribution in a form that best facilitates subsequent tasks such as classification, clustering, and conditioned generation (as we will see in the future). To do so, we require our learned feature representations to have the following properties:

1.
Within-Class Compressible: Features of samples from the same class should be strongly correlated in the sense that they belong to a low-dimensional linear subspace.
2.
Between-Class Discriminative: Features of samples from different classes should be highly uncorrelated and belong to different incoherent low-dimensional linear subspaces.
3.
Maximally Diverse Representation: Dimension (or variance) of the features of each class should be as large as possible as long as they are incoherent to the other classes.

We refer to such a representation the linear discriminative representation (LDR).

Notice that the first property aligns well with the objective of the classic principal component analysis (PCA) that we have discussed in Section 2.1.1. It essentially says that features of samples that belong to the same class25 should become or be made highly correlated. In other words, feature representation should be contractive for samples that are similar.

PIC

Figure 4.13: The distribution \(\mathcal D\) of high-dimensional data \(\vx \in \Re ^D\) is supported on a manifold \(\mathcal {M}\) and its classes on low-dimensional submanifolds \(\mathcal {M}_k\). We aim to learn a mapping \(f(\vx , \theta )\) parameterized by \(\theta \) such that \(\vz _i = f(\vx _i, \theta )\) lie on a union of maximally uncorrelated subspaces \(\{\mathcal {S}_k\}\).

The second property requires that for samples that belong to different classes or are believed to be dissimilar, their features should be made (geometrically) incoherent or (statistically) independent, as illustrated before in Figure 4.1. Figure 4.13 illustrates when the data distribution is actually a mixture of low-dimensional submanifolds. Through compression (denoising or clustering), we first identify that the true data distribution (left). We then would like to transform the distribution so that the submanifolds eventually become mutually incoherent/independent linear subspaces (right).

Remark 4.5 (Linear Discriminative Analysis). One may notice that the second criterion resembles that of the classic linear discriminant analysis (LDA) [HTF09]. Linear discriminant analysis (LDA) [HTF09] is a supervised dimensionality reduction technique that aims to find a linear projection of data that maximizes class separability. Specifically, given labeled data, LDA seeks a linear transformation that projects high-dimensional inputs onto a lower-dimensional space where the classes are maximally separated. Note that PCA is an unsupervised method that projects data onto directions of maximum variance without considering class labels. While PCA focuses purely on preserving global variance structure, LDA explicitly exploits label information to enhance discriminative power; see the comparison in Figure 4.14.

PIC
Figure 4.14: Comparison between PCA and LDA. Figures adopted from https://sebastianraschka.com/Articles/2014_python_lda.html.

But LDA only performs a linear projection of the original data. As we will see LDR can be more general: it seeks a, probably nonlinear, transformation to make the resulting representation linear and incoherent.

Remark 4.6 (Contrastive Learning). One may notice that the conventional contrastive learning [HCL06; HFW+19; OLV18] aims to achieve similar goals for feature representation as the first two properties stipulate. For data that belong to \(k\) different classes, a randomly chosen pair \((\vx _i, \vx _j)\) is of high probability belonging to different classes if \(k\) is large.26 Hence it is desirable that the representation \(\vz _i = f(\vx _i, \bm \theta )\) of a sample \(\vx _i\) should be highly incoherent to those \(\vz _j\) of other samples \(\vx _j\) whereas coherent to feature of its transformed version \(\tau (\vx _i)\), denoted as \(\vz (\tau (\vx _i))\) for \(\tau \) in certain augmentation set \(\mathcal T\) in consideration. Hence it was proposed heuristically that, to promote discriminativeness of the learned representation, one may seek to minimize the so-called contrastive loss:

\begin{equation} \min _{\bm \theta } - \log \frac {\exp (\langle \vz _i, \vz (\tau (\vx _i))\rangle )}{\sum _{j\not = i} \exp (\langle \vz _i, \vz _j\rangle )}, \tag{4.2.8}\label {eqn:contrastive-loss} \end{equation}
which is small whenever the inner product \(\langle \vz _i, \vz (\tau (\vx _i))\rangle \) is large and \(\langle \vz _i, \vz _j\rangle \) is small for \(i \not = j\). Such contrastive loss, and many of its variants, has been widely used in popular (self-supervised) feature learning methods such as the DINO-type features [CMM+21; ODM+24] for visual data. As we will see, based on the lossy coding framework studied in the previous section, we will be able to arrive at a principled objective function that promotes the two properties and at the same time significantly simplifies the practice such as learning the DINO-type features (as we will feature in great detail in Chapter 8).

The third property is also important because we want the learned features to reveal all possible causes of why one class is different from all other classes. For example, to tell “apple” from “orange”, we care not only about color but also shape and the leaves. Ideally, the dimension of each subspace \(\{\mathcal {S}_k\}\) should be equal to that of the corresponding submanifold \(\mathcal {M}_k\). This property will be important if we would like the map \(f(\vx ,\theta )\) to be invertible for tasks such as image generation. For example, if we draw different sample points from the feature subspace for “apple”, we should be able to decode them, via an inverse map \(g(\vz , \eta )\), to generate diverse images of apples. Such a pair of mutually inverse mappings \(f\) and \(g\) makes an autoencoding which we will study in great detail in Chapter 6. Notice that feature representations learned from minimizing the cross entropy (4.2.2) normally do not have this property, as they suffer the so-called neural collapse phenomenon discussed earlier (see Remark 4.4).

In general, although the intrinsic structures of each class/cluster may be low-dimensional, they are by no means simply linear (or Gaussian) in their original representation \(\vx \) and they need to be made linear first, through some nonlinear transformation.27 Therefore, overall, we use the nonlinear transformation \(f(\vx ,\theta )\) to seek a representation of the data such that the subspaces that represent all the classes are maximally incoherent linear subspaces. To be more precise, we want to learn a mapping \(\vz = f(\vx ,\theta )\) that maps each of the submanifolds \(\mathcal {M}_k \subset \Re ^D\) (Figure 4.13 left) to a linear subspace \(\mathcal {S}_k \subset \Re ^d\) (Figure 4.13 right). To some extent, the resulting multiple subspaces \(\{\mathcal {S}_k\}\) can be viewed as discriminative generalized principal components [VMS16] or, if orthogonal, independent components [HO00a] of the resulting features \(\vz \) for the original data \(\bm x\). As we will see in the next Chapter 5, deep networks precisely play the role of modeling and realizing this nonlinear transformation from the data distribution to linear discriminative representations.

4.2.3 The Principle of Maximal Coding Rate Reduction

Although the three properties—between-class discriminative, within-class compressible, and maximally diverse representation—for linear discriminative representations (LDRs) are all highly desired properties of the learned representation \(\vz \), they are by no means easy to obtain: Are these properties compatible so that we can expect to achieve them all at once? If so, is there a simple but principled objective that can measure the goodness of the resulting representations in terms of all these properties? The key to these questions is to find a principled “measure of compactness” or “information gain” for the distribution of a random variable \(\vz \) or from its finite samples \(\{\bm z_i\}_{i=1}^N\). Such a measure should directly and accurately characterize intrinsic geometric or statistical properties of the distribution, in terms of its intrinsic dimension or volume. Unlike the cross entropy (4.2.2) or information bottleneck (4.2.5), such a measure should not depend exclusively on class labels so that it can work in more general settings such as supervised, self-supervised, semi-supervised, and unsupervised settings (as we will see in later sections).

Without loss of generality, assume that the distribution \(\mathcal D\) of the random vector \(\vx \) is supported on a mixture of distributions, i.e., \(\mathcal D = \cup _{k=1}^K \mathcal {D}_k\), where each \(\mathcal {D}_k \subset \Re ^D\) has a low intrinsic dimension in the high-dimensional ambient space \(\Re ^D\). Let \(\bm X_k \in \Re ^{D\times N_k}\) denote the data matrix whose columns are samples drawn from the distribution \(\mathcal {D}_k\), where \(N_k\) denotes the number of samples for each \(k=1,\dots ,K\). Then, we use \(\bm X=[\bm X_1,\dots ,\bm X_K] \in \Re ^{D\times N}\) to denote all the samples, where \(N=\sum _{k=1}^K N_k\). Recall that we also use \(\vx _i\) to denote the \(i\)-th sample of \(\vX \), i.e., \(\vX =[\vx _1,\dots ,\vx _N] \). Under an encoding mapping:

\begin{equation} \vx \xrightarrow { f(\vx )} \vz , \tag{4.2.9} \end{equation}

the input samples are mapped to \(\vz _i = f(\vx _i)\) for each \(i=1,\dots ,N\). With an abuse of notation, we also write \(\vZ _k = f(\vX _k)\) and \(\vZ = f(\vX )\). Therefore, we have \(\vZ = [\bm Z_1,\dots ,\bm Z_K]\) and \(\vZ = [\vz _1,\dots \vz _N]\).

On one hand, for learned features to be discriminative, features of different classes/clusters are preferred to be maximally incoherent to each other. Hence, they together should span a space of the largest possible volume (or dimension) and the coding rate of the whole set \(\vZ \) should be as large as possible. On the other hand, learned features of the same class/cluster should be highly correlated and coherent. Hence, each class/cluster should only span a space (or subspace) of a very small volume and the coding rate should be as small as possible. Now, we will introduce how to measure the coding rate of the learned features.

Coding rate of features. Notably, a practical challenge in evaluating the coding rate is that the underlying distribution of the feature representations \(\vZ \) is typically unknown. To address this, we may approximate the features \(\vZ = [\vz _1, \ldots , \vz _N]\) as samples drawn from a multivariate Gaussian distribution. Under this assumption, as discussed in Section 4.1.3, the compactness of the features \(\vZ \) as a whole can be measured in terms of the average coding length per sample, referred to as the coding rate, subject to a precision level \(\epsilon > 0\) (see (4.1.30)) defined as follows:

\begin{equation} R_{\epsilon }(\vZ ) = \frac {1}{2}\logdet \left (\I + \frac {d}{N\epsilon ^{2}}\vZ \vZ ^{\top }\right ). \tag{4.2.10}\label {chap4-eqn:coding-length-eval} \end{equation}

On the other hand, we hope that a nonlinear transformation \(f(\vx )\) maps each class-specific submanifold \(\mathcal {M}_k \subset \R ^D\) to a maximally incoherent linear subspace \(\mathcal {S}_k \subset \R ^d\) such that the learned features \(\vZ \) lie in a union of low-dimensional subspaces. This structure allows for a more accurate evaluation of the coding rate by analyzing each subspace separately. Recall that the columns of \(\vZ _k\) denotes the features of the samples in \(\vX _k\) for each \(k=1,\dots ,K\). The coding rate for the features in \(\bm Z_k\) can be computed as follows:

\begin{align} R_{\epsilon }(\vZ _k) = \frac {N_k}{2N}\log \det \left (\I + \frac {d}{N_k\epsilon ^{2}}\vZ _k\vZ _k^{\top }\right ) \tag{4.2.11} \end{align}

Then, the sum of the average coding rates of features in each class is

\begin{equation} R_{\epsilon }^c(\vZ ) \doteq \sum _{k=1}^K R_{\epsilon }(\vZ _k), \tag{4.2.12}\label {chap4-eqn:compress-loss-eval} \end{equation}

Therefore, a good representation \(\vZ \) of \(\vX \) is the one that achieves a large difference between the coding rate for the whole and that for all the classes:

\begin{equation} \Delta R_{\epsilon }(\vZ ) \doteq R_{\epsilon }(\vZ ) - R_{\epsilon }^c(\vZ ). \tag{4.2.13}\label {chap4-eqn:coding-length-reduction} \end{equation}

Notice that, as per our discussions earlier in this chapter, this difference can be interpreted as the amount of “information gained” by identifying the correct low-dimensional clusters \(\vZ _k\) within the overall set \(\vZ \).

If we choose our feature mapping \(f(\cdot )\) to be a deep neural network \(f(\cdot ,\theta )\) with network parameters \(\theta \), the overall process of the feature representation and the resulting rate reduction can be illustrated by the following diagram:

\begin{equation} \vX \xrightarrow { f(\vx , \theta )} \vZ \xrightarrow { \epsilon } \Delta R_{\epsilon }(\vZ ). \tag{4.2.14}\label {chap4-eqn:flow} \end{equation}

Note that \(\Delta R_{\epsilon }\) is monotonic in the scale of the features \(\vZ \). To ensure fair comparison across different representations, it is essential to normalize the scale of the learned features. This can be achieved by either imposing the Frobenius norm of each class \(\vZ _k\) to scale with the number of features in \(\vZ _k \in \mathbb R^{d \times N_k}\), i.e., \(\|\vZ _k\|_F^2 = N_k\), or by normalizing each feature to be on the unit sphere, i.e., \(\vz _i \in \mathbb {S}^{d-1}\), where \(N_k\) denotes the number of samples in the \(k\)-th class. This formulation offers a natural justification for the need for “batch normalization” in the practice of training deep neural networks [IS15].

Once the representations are comparable, the goal becomes to learn a set of features \(\vZ = f(\vX , \theta )\) such that they maximize the reduction between the coding rate of all features and that of the sum of features w.r.t. their classes:

\begin{equation} \begin {aligned} \max _{\theta } & \; \Delta R_{\epsilon }\big (\vZ \big ) \doteq R_{\epsilon }(\vZ ) - R_{\epsilon }^c(\vZ ), \\ \mbox {s.t.} & \ \ \, \vZ = f(\bm X, \theta ),\ \|\vZ _k\|_F^2 = N_k,\ k=1,\dots ,K. \end {aligned} \tag{4.2.15}\label {eqn:maximal-rate-reduction} \end{equation}

We refer to this as the principle of maximal coding rate reduction (MCR\(^2\)), as an embodiment of Aristotle’s famous quote:

The whole is greater than the sum of its parts.

To learn the best representation, we require that the whole is maximally greater than the sum of its parts. To a large extent, the quantity \(\Delta R\) can be viewed as a computable realization to the mutual information term \(I(\vz ; \y )\) introduced in the objective (4.2.7).

Example 4.11. To further elucidate the ideas, let us examine the example shown in Figure 4.15.

PIC
Figure 4.15: After identifying the low-dimensional data distribution, we would like to further transform the data distribution to a more informative structured representation: \(R\) is the number of \(\epsilon \)-balls covering the whole space and \(R^c\) is the sum of the numbers for all the subspaces (the green balls). \(\Delta R\) is their difference (the number of blue balls).

From a compression perspective, the representation on the right is the most compact one in the sense that the difference between the coding rate when all features are encoded as a single Gaussian (blue) and that when the features are properly clustered and encoded as two separate subspaces (green) is maximal.28 \(\blacksquare \)

Note that the above MCR\(^2\) principle is designed for supervised learning problems, where the group memberships (or class labels) are known. There are a variety of ways to use the principle, or suitable generalizations, in ways which circumvent this requirement for class labels; we will discuss these later in this Chapter and in the next Chapter 5. To get a first taste, we extend the principle to unsupervised learning problems by introducing a membership matrix, which encodes the (potentially soft) assignment of each data point to latent groups or clusters. Specifically, let \(\bm \Pi = \{\bm \Pi _k\}_{k=1}^K \subset \R ^{N\times N}\) be a set of diagonal matrices whose diagonal entries encode the membership of the \(N\) samples into \(K\) classes. That is, \(\bm \Pi \) lies in a simplex \(\Omega \doteq \{\bm \Pi : \bm \Pi _k \ge \bm 0: \sum _{k=1}^K \bm \Pi _k = \bm I_N\}\). Then, we can define the average coding rate with respect to the partition \(\bm \Pi \) as

\begin{align}\tag{4.2.16}\label {eq:MCRc} R_{\epsilon }^c(\vZ \mid \bm \Pi ) \doteq \sum _{k=1}^K \frac {\mathrm {tr}(\bm \Pi _k)}{2N}\log \det \left (\bm I + \frac {d}{\mathrm {tr}(\bm \Pi _k)\epsilon ^2}\bm Z\bm \Pi _k \bm Z^\top \right ). \end{align}

When \(\bm Z\) is given, \(R_{\epsilon }^c(\vZ \vert \bm \Pi )\) is a concave function of \(\bm \Pi \). Then the MCR\(^2\) principle for unsupervised learning problems becomes as follows:

\begin{align}\label {eq:MCR pi} \max _{\bm \Pi , \theta } & \ \Delta R_{\epsilon }\big (\vZ \mid \bm \Pi ) \doteq R_{\epsilon }(\bm Z) - R_{\epsilon }^c(\vZ \mid \bm \Pi ) \notag \\ \mathrm {s.t.} & \ \ \ \bm Z = f(\bm X, \theta ),\ \|\bm Z \bm \Pi _k\|_F^2 = N_k,\ k = 1,\dots ,K,\ \bm \Pi \in \Omega . \tag{4.2.17} \end{align}

Compared to (4.2.15), the formulation here allows for the joint optimization of both the group memberships and the network parameters. In particular, when \(\bm \Pi \) is fixed to a group membership matrix that assigns \(N\) data points into \(K\) groups, Problem (4.2.17) can recover Problem (4.2.15).

PIC

Figure 4.16: Local optimization landscape: According to Theorem 4.2, the global maximum of the rate reduction objective corresponds to a solution with mutually incoherent subspaces.

Remark 4.7 (Computing Coding Rates). It is a classical fact that computing the log-determinant of an \(n \times n\) matrix takes \(\cO (n^{3})\) time to compute. Hence computing the coding rates \(\Delta R_{\epsilon }\) or \(R_{\epsilon }\) (etc.) seem intractable at the scale of modern data analysis. There are several ways around this, which we will use without comment in the experiments that power the rest of the book.

4.2.4 Optimization Properties of Coding Rate Reduction

In this subsection, we study the optimization properties of the MCR\(^2\) function by analyzing its optimal solutions and the structure of its optimization landscape. To get around the technical difficulty introduced by the neural networks, we consider a simplified version of Problem (4.2.15) as follows:

\begin{align}\tag{4.2.21}\label {eqn1:maximal-rate-reduction} \max _{\bm Z}\ R_{\epsilon }(\vZ ) - R_{\epsilon }^c(\vZ )\qquad \mathrm {s.t.}\quad \|\bm Z_k\|_F^2 = N_k,\ k =1,\dots ,K. \end{align}

In theory, the MCR\(^2\) principle (4.2.21) benefits from great generalizability and can be applied to representations \(\vZ \) of any distributions as long as the rates \(R_\epsilon \) and \(R^c_\epsilon \) for the distributions can be accurately and efficiently evaluated. The optimal representation \(\vZ ^{\ast }\) should have some interesting geometric and statistical properties. We here reveal nice properties of the optimal representation with the special case of subspaces, which have many important use cases in machine learning. When the desired representation for \(\vZ \) is multiple subspaces, the rates \(R_\epsilon \) and \(R^c_\epsilon \) in (4.2.21) are given by (4.2.10) and (4.2.12), respectively. At the maximal rate reduction, MCR\(^2\) achieves its optimal representations, denoted as \(\vZ ^{\ast } = [\vZ _1^*,\dots ,\vZ _K^*]\) with \(\rank {(\vZ _{k}^*)}\le d_k\). One can show that \(\vZ ^{\ast }\) has the following desired properties (see [YCY+20] for a formal statement and detailed proofs).

Theorem 4.2 (Characterization of Global Optimal Solutions). Suppose \(\vZ ^{\ast } = [\vZ _1^*,\dots ,\vZ _K^*]\) is a global optimal solution of Problem (4.2.21). The following statements hold:

This theorem indicates that the MCR\(^2\) principle promotes embedding of data into multiple independent subspaces (as illustrated in Figure 4.16), with features distributed isotropically in each subspace (except for possibly one dimension). Notably, this theorem also confirms that the features learned by the MCR\(^2\) principle exhibit the desired low-dimensional discriminative properties discussed in Section 4.2.2. In addition, among all such discriminative representations, it prefers the one with the highest dimensions in the ambient space. This is substantially different from the objective of information bottleneck (4.2.5).

Regularized MCR\(^2\). The above theorem characterizes properties of the global optima of the rate reduction objectives. What about other optima, such as local ones? Due to the constraints of the Frobenius norm, it is a difficult task to analyze Problem (4.2.21) from an optimization-theoretic perspective. Therefore, we consider the Lagrangian formulation of (4.2.21). This can be viewed as a tight relaxation or even an equivalent problem of (4.2.21) whose optimal solutions agree under specific settings of the regularization parameter; see [WLP+24, Proposition 3.2]. Specifically, the formulation we study, referred to henceforth as the regularized MCR\(^2\) problem, is as follows:

\begin{align}\tag{4.2.22}\label {eq:MCR-reg} \max _{\bm Z}\ R_{\epsilon }(\vZ ) - R_{\epsilon }^c(\vZ ) - \frac {\lambda }{2}\|\bm Z\|_F^2, \end{align}

where \(\lambda > 0\) is the regularization parameter. Although the program (4.2.22) is highly nonconcave and involves matrix inverses in its gradient computation, we can still explicitly characterize its local and global optima as follows (see [WLP+24, Theorem 3.1] for detailed proofs).

Theorem 4.3 (Local and Global Optima). Let \(N_k\) denote the number of training samples in the \(k\)-th class for each \(k \in \{1,\dots ,K\}\), \(N_{\max } \doteq \max \{N_1,\dots ,N_K\}\), \(\alpha =d/(N\epsilon ^2)\), and \(\alpha _{k} = d/(N_k\epsilon ^2)\) for each \(k \in \{1,\dots ,K\}\). Given a coding precision \(\epsilon > 0\), if the regularization parameter satisfies

\begin{align}\tag{4.2.23}\label {eq:lambda} \lambda \in \left (0, \frac {d(\sqrt {N/N_{\max }}-1)}{N(\sqrt {N/N_{\max }}+1)\epsilon ^2} \right ], \end{align}

then the following statements hold:
(i) (Local maximizers) \(\bm Z^* = \left [\bm Z_1^*,\dots ,\bm Z_K^* \right ]\) is a local maximizer of Problem (4.2.22) if and only if the \(k\)-th block admits the following decomposition

\begin{align}\tag{4.2.24}\label {eq:Zk opti} \bm Z_k^* = \left (\frac { \eta _k + \sqrt {\eta _k^2 - 4\lambda ^2N/N_k}}{2\lambda \alpha _{k}}\right )^{1/2} \bm U_k \bm V_k^\top , \end{align}

where (a) \(r_k = \mathrm {rank}(\bm Z_k^*)\) satisfies \(r_k \in [0,\min \{N_k,d\})\) and \(\sum _{k=1}^K r_k \le \min \{N,d\}\), (b) \(\bm U_k \in \mathcal {O}^{d \times r_k}\) satisfies \(\bm U_k^{\top }\bm U_l = \bm 0\) for all \(k \neq l\), \(\bm V_k \in \mathcal {O}^{N_k \times r_k}\), and (c) \(\eta _k := (\alpha _k-\alpha ) - \lambda (N/N_k+1)\) for each \(k\in \{1,\dots ,K\}\).
(ii) (Global maximizers) \(\bm Z^* = \left [\bm Z_1^*,\dots ,\bm Z_K^* \right ]\) is a global maximizer of Problem (4.2.22) if and only if (a) it satisfies all the above conditions and \(\sum _{k=1}^K r_k = \min \{N,d\}\), and (b) for all \(k \neq l \in [K]\) satisfying \(N_k < N_l\) and \(r_l > 0\), we have \(r_k = \min \{N_k,d\}\).

PIC

Figure 4.17: Global optimization landscape: According to [LSJ+16; SQW15], Theorems 4.3 and 4.4, both global and local maxima of the (regularized) rate reduction objective correspond to a solution with mutually incoherent subspaces. All other critical points are strict saddle points.

This theorem explicitly characterizes the local and global optima of problem (4.2.22). Intuitively, this shows that the features represented by each local maximizer of Problem (4.2.22) are low-dimensional and discriminative. Although we have characterized the local and global optimal solutions in Theorem 4.3, it remains unknown whether these solutions can be efficiently computed using GD to solve the problem (4.2.22), since GD may get stuck at other critical points such as a saddle point. Fortunately, [LSJ+16; SQW15] showed that if a function is twice continuously differentiable and satisfies the strict saddle property, i.e., each critical point is either a local maximizer or a strict saddle point29, GD converges to a local maximizer almost surely with random initialization. We investigate the global optimization landscape of the problem (4.2.22) by characterizing all its critical points as follows (see [WLP+24, Theorem 3.3] for detailed proofs).

Theorem 4.4 (Benign Global Optimization Landscape). Given a coding precision \(\epsilon > 0\), if the regularization parameter satisfies (4.2.23), it holds that any critical point \(\bm Z\) of the problem (4.2.22) is either a local maximizer or a strict saddle point.

Together, the above two theorems show that the learned features associated with each local maximizer of the rate reduction objective — not just global maximizers — are structured as incoherent low-dimensional subspaces. Furthermore, the (regularized) rate reduction objective (4.2.15) has a very benign landscape with only local maxima and strict saddles as critical points, as illustrated in Figure 4.17. According to [LSJ+16; SQW15], Theorems 4.3 and 4.4 imply that low-dimensional and discriminative representations (LDRs) can be efficiently found by applying (stochastic) gradient descent to the rate reduction objective (4.2.15) from random initialization. These results also indirectly explain why in Section 4.3.1, if the chosen network is expressive enough and trained well, the resulting representation typically gives an incoherent linear representation that likely corresponds to the globally optimal solution. Interested readers are referred to [WLP+24] for proofs.

4.3 Learning Representations for Imagery Data

4.3.1 Supervised Representation Learning

We here present how the MCR\(^2\) objective (4.2.15) helps learn better representations than the cross entropy (CE) (4.2.2) for image classification. Here, for simplicity,30 we adopt the popular neural network architecture, the ResNet-18 [HZR+16b], to model the feature mapping \(\vz = f(\vx ,\theta )\). We optimize the neural network parameters \(\theta \) via the backpropagation algorithm (detailed in Section A.2.3) to maximize the coding rate reduction. More implementation details of this experiment can be found in [CYY+22].

Figure 4.18 illustrates how the two rates and their difference (for both training and test data) evolve over epochs of training: After an initial phase, \(R_\epsilon \) gradually increases while \(R^c_\epsilon \) decreases, indicating that features \(\bm Z\) are expanding as a whole while each class \(\bm Z_k\) is being compressed.

We first evaluate the performance with the CIFAR-10 image classification dataset [KH+09] and the results are shown in the table of Figure 4.19. As we can see, the MCR\(^2\) objective achieves a comparable performance as the cross entropy (CE) loss. In addition, the MCR\(^2\) objective is more robust to corruptions in the class labels since it learns a joint (subspace) representation for each class instead of trying to fit the labels of individual samples.

PIC

Figure 4.18: Evolution of \(R_\epsilon \), \(R^c_\epsilon \), \(\Delta R_\epsilon \) during the training process.

PIC

Figure 4.19: Comparison of classification accuracy between training with the cross entropy (CE) loss or the MCR\(^2\) objective, with different ratios of the class labels being randomly corrupted.

PIC

(a) PCAs of features for individual classes learned by MCR\(^2\).

PIC

(b) PCAs of features for individual classes learned by Cross Entropy.
Figure 4.20: Comparison of the principal components of learned features from MCR\(^2\) versus those from cross entropy.

PIC PIC

Figure 4.21: Cosine similarity between learned features by using the MCR\(^2\) objective (left) and CE loss (right).

Figure 4.20 shows the distribution of singular values of the learned features \(\vZ _k\) per class. From the distribution of the singular values shown in Figure 4.19a, we see that the features \(\vz \) learned from MCR\(^2\) for each class are distributed on a subspace of dimension around 10-12, whereas the features from cross entropy Figure 4.19b are of much lower dimension.31

Figure 4.21 shows the cosine similarities between the learned features sorted by class. We also compare the similarities of the learned features by using the cross-entropy (CE) (4.2.2) and the MCR\(^2\) objective (4.2.15). From the plots, one can clearly see that the features learned by the MCR\(^2\) objective for different classes are much more incoherent (or independent) than features learned by using cross-entropy loss. Within each class, the features also are less coherent hence more diverse, as indicated by the distribution of their singular values shown in Figure 4.19a.

We visualize the images that are mapped along different singular vectors in the feature subspace of each class, which we call the “principal” images. Figure 4.22 shows these principal images along each of the top-10 singular vectors. As we can see, the images along each singular vector are very similar in their appearances and structures and images along different singular vectors are very distinct from each other.

PIC

Figure 4.22: Top-10 “principal” images for class - “Bird” and “Ship” in the CIFAR-10 image dataset.

Despite the fact that the MCR\(^2\) objective seems to allow us to learn better representations, there has been an apparent lack of justification of the network architectures used in the above experiments. It is yet unclear why the network adopted here (the ResNet-18) is suitable for representing the map \(f(\vx , \theta )\), let alone for interpreting the layer operators and parameters \(\theta \) learned inside. In Chapter 5, we will show how to derive network architectures and components entirely as a “white box” from the desired objective (say the rate reduction).

4.3.2 Unsupervised Representation Learning

As we have seen in the above section, the class information makes learning a structured representation of the (mixed) data distribution relatively easy. However, such information is often not available, at least not at a massive scale.

Now we present how to extend the MCR\(^2\) principle to the unsupervised setting. Contrary to supervised learning where the class labels are known, in unsupervised learning, the group memberships of the data samples are unknown. In this case, we can apply certain class of augmentations or transformations, say \(\tau \) with a distribution \(P_{\tau }\), to each sample. For example, for images, augmentations typically include translations, rotations, or cropping of each image sample. We may view the augmented samples of each data point as belonging to the same group or cluster.

More specifically, given a set of \(N\) data samples \(\vX = [\vx _1,\dots ,\vx _N]\), we generate \(m\) augmented views \(\{\tau _i\}_{i=1}^m\) for each data point \(\vx _k\) , denoted as \(\tilde {\vX }_{k} = [\tau _1(\vx _k), \tau _2(\vx _k), \dots , \tau _m(\vx _k)]\). We treat the augmented samples \(\tilde {\vX }_{i}\) as the \(i\)-th class with \(K=N\) groups/clusters in total. Now, we can apply the MCR\(^2\) principle as we did before in the supervised learning case. Specifically, we learn the feature mapping \(f(\vx ,\theta )\) by maximizing the coding rate reduction as follows:

\begin{equation} \begin {aligned} \max _{\theta } & \; \Delta R_{\epsilon }\big (\vZ \big ) \doteq R_{\epsilon }(\vZ ) - R_{\epsilon }^c(\vZ ), \\ \mbox {s.t.} & \ \ \, \vZ = f(\tilde {\vX }, \theta ),\ \|\vZ _k\|_F^2 = m,\ k=1,\dots ,N. \end {aligned} \tag{4.3.1}\label {eqn:maximal-rate-reduction-unsupervised} \end{equation}

Here, \(\tilde {\vX } = [\tilde {\vX }_1,\dots ,\tilde {\vX }_N]\) and \(\vZ _k = f(\tilde {\vX }_k,\theta )\) denotes the features of the augmented samples of \(\vx _k\) for each \(k=1,\dots ,N\).

Let us again adopt ResNet-18 as the feature mapping \(f(\vx ,\theta )\) and optimize the network parameters \(\theta \) to maximize the unsupervised objective (4.3.1). After training on the CIFAR-10 dataset, we visualize the evolution of the two rates and their difference (for both training and test data) over epochs of training in Figure 4.22a. Similar to the supervised learning case, after an initial phase, \(R_\epsilon \) gradually increases while \(R^c_\epsilon \) decreases, indicating that features \(\bm Z\) are expanding as a whole while each cluster \(\bm Z_k\) is being compressed. Compared to the supervised case shown in Figure 4.18, we can observe that, in the unsupervised case, the total coding rate \(R_\epsilon \) expands rapidly and the overall MCR\(^2\) objective saturates at a local optimum. To alleviate this, we can alter the optimization strategy by controlling the dynamics of \(R_\epsilon \) and \(R^c_\epsilon \). Specifically, we can rescale the rates by replacing the coding rate \(R_\epsilon (\vZ )\) in (4.3.1) with

\[\tilde {R}_\epsilon (\vZ ) \doteq \frac {1}{2\gamma _1} \log \det \left (\I + \frac {\gamma _2 d}{N m \epsilon ^{2}}\vZ \vZ ^{\top }\right ).\]

By setting \(\gamma _1 = \gamma _2 = N\), we arrive at a modified learning objective, referred to as MCR\(^2\)-CTRL, whose training dynamics are shown in Figure 4.22b. As we can see, with this controlled MCR\(^2\) objective, features are first compressed and then gradually expanded, leading to a higher value of the overall objective \(\Delta R_{\epsilon }(\vZ )\).

To assess the quality of the learned representations and verify whether they have good subspace structures, we adopt a subspace clustering algorithm EnSC [YLR+16]. For evaluation metrics, we consider normalized mutual information (NMI), clustering accuracy (ACC), and adjusted rand index (ARI) 32. For comparison, we consider methods including JULE [YPB16], RTM [NMM19], DEC [XGF16], DAC [CWM+17], and DCCM [WLW+19]. The clustering results are summarized in Figure 4.24. As we can see, the features learned by maximizing the MCR\(^2\)-CTRL objective achieve better clustering accuracy than those learned by other algorithms.

PIC

(a) MCR\(^2\)

PIC

(b) MCR\(^2\)-ctrl
Figure 4.23: Evolution of the rates of MCR\(^2\) in the training process for unsupervised learning on CIFAR-10.

PIC

Figure 4.24: Comparison of clustering performance accuracy between training with the MCR\(^2\)-CTRL objective and other unsupervised learning algorithms on CIFAR-10, CIFAR-100, and STL-10 datasets.

There is one main drawback though: the MCR\(^2\)-based unsupervised learning method requires a costly \(\log \det \) operation of \(\O (d^3)\) complexity per sample when computing \(R^c_\epsilon \). This quickly becomes computationally prohibitive as we try to scale to larger datasets. Instead of minimizing the exact coding rate for each group of augmented samples \(\tilde {\vX }_{i}\), we can approximate the effect of compressing features in each group by maximizing the cosine similarity between the features of the augmented samples:33

\begin{equation} \begin {aligned} \max _{\theta } & \; R_{\epsilon }(\vZ ) + \frac {\gamma }{N}\sum _{k=1}^N \sum _{i \neq j} \vz _{k,j}^\top \vz _{k,i}, \\ \mbox {s.t.} & \ \ \, \vz _{i,j} = f(\tau _j(\vx _i),\theta ),\ \|\vz _{i,j}\|_2^2 = 1, \end {aligned} \tag{4.3.2}\label {eqn:maximal-rate-reduction-unsupervised-simdino} \end{equation}

where \(\gamma > 0\) is a hyperparameter that controls the strength of the cosine similarity term, as it differs from the coding rate in magnitude. This objective avoids the costly \(\log \det \) operation and is much more scalable. In fact, this objective is closely related to popular (unsupervised) contrastive learning methods such as SimCLR [CKN+20] and DINO [CTM+21], which have been engineered to learn high-quality representations from massive amount of images.

As we will show later in Section 8.2.2 of Chapter 8, this objective (4.3.2) can be implemented very efficiently to learn a highly structured representation \(\vZ \) that is highly linear and discriminative. The learned feature representation leads to state-of-the-art performance on large real-world datasets comparable or even better than features learned by DINO, but without the heavy engineering efforts needed for DINO. Hence, we refer to this objective as Simplified DINO (SimDINO). As a preview, Table 4.1 show that SimDINO outperforms DINO on ImageNet-1k classification using both \(k\)-nearest neighbor accuracy and linear probing (i.e. training a linear classifier on pretrained models). This demonstrates that the objective leads to highly structured and linearized representations. We can also utilize these representations for downstream tasks such as unsupervised segmentation34, as illustrated in Figure 4.25. More implementation details and experimental results of SimDINO will be presented in Chapter 8.

Method Model Epochs 20-NN Linear Probing
DINO ViT-B 100 72.9 76.3
SimDINO ViT-B 100 74.9 77.3
DINO ViT-L 100
SimDINO ViT-L 100 75.6 77.4
Table 4.1: Classification performance on ImageNet-1k test set for DINO and SimDINO, using both \(k\)-nearest neighbor accuracy (\(k = 20\)) and linear probing. SimDINO outperforms DINO in both accuracy and training stability. DINO training on ViT-L is unstable and quickly diverges, while SimDINO trains stably.

PIC

Figure 4.25: Visualization of unsupervised segmentation results from DINO ViT-B/16 (row 1), SimDINO ViT-B/16 (row 2) and SimDINO ViT-L/16 (row 3).

4.4 Summary and Notes

Key messages. In this Chapter, we have studied basic concepts and ideas behind how to learn a distribution from finite samples and obtain a computable encoding and decoding scheme for the distribution. For distributions that are continuous and possibly low-dimensional, we have argued that such a scheme is necessarily lossy – hence the importance of the concept of rate distortion. The noise introduced through distortion, due to quantization, plays a related but different role from the noise we have studied in the preceding chapter. One may think of the noise considered in Chapter 3 as “outside” of the submanifold of the support of the distribution whereas the noise considered in this chapter is “inside” the support. Informally, we may refer to the diffusion/denoising process outside of the support as “continuation” and the diffusion/denoising process inside the support as “percolation.” A more unified analysis and understanding of their compounded effect is due for future study.

In this chapter, we introduced for the first time the concept of learning a representation for a distribution. We have considered arguably the simplest case, the case of classification. If one views the classification label \(\y \) as a fixed simple (grossly lossy) code for the data \(\vx \), we may seek a consistent35 representation \(\vz = f(\vx )\) of the data \(\vx \) such that it tends to maximize the mutual information between the representation and this simple code:

\begin{equation} \max I(\vz , \y ) = H(\vz ) - H(\vz \mid \y ). \tag{4.4.1} \end{equation}

In addition, we require such a representation \(\vz \) is not only consistent to \(\vx \) but also much more structured than the native data distribution of \(\vx \) – classes are linearly discriminative. This would make the learned representation much more efficient to use than the native data distribution for subsequent tasks such as recognition, segmentation and completion. The maximal coding rate reduction objective introduced is a natural and computable realization of these requirements. In Chapter 7, we will see how to generalize this objective to a much broader context when \(\y \) could be a more general lossy code (say image caption) or a random variable related to \(\vx \).

4.5 Exercises and Extensions

Exercise 4.1. Please show the following properties of the \(\log \det (\cdot )\) function.

1.
Show that
\begin{align*} f(\vX ) = \log \det \left (\vX \right ) \end{align*}

is a concave function. (Hint: The function \(f(\vx )\) is concave if and only if the one-dimensional function \(f(\vx +t\vh )\) is concave for all \(\vx \) and \(\vh \).)

2.
Show that
\begin{align*} \log \det (\vI + \vX ^\top \vX ) = \log \det (\vI + \vX \vX ^\top ) \end{align*}
3.
Let \(\vA \in \R ^{n\times n}\) be a positive definite matrix. Please show that:
\begin{align} \log \det \left (\vA \right ) = \sum _{i=1}^n \log (\lambda _i), \tag{4.5.1} \end{align}

where \(\lambda _1,\lambda _2,\dots ,\lambda _n\) are the eigenvalues of \(\vA \).

4.
Show that the gradient of the \(\log \det (\cdot )\) function is given by
\begin{align*} \nabla \log \det (\vA ) = \vA ^{-1}, \end{align*}

for a positive definite matrix \(\vA \).

Exercise 4.2. Let \(\bm Z = [\bm Z_1, \dots , \bm Z_K] \in \mathbb {R}^{d \times N}\), where each \(\bm Z_k \in \mathbb {R}^{d \times N_k}\) and \(N = \sum _{k=1}^K N_k\).

1.
Please show that
\begin{align*} \log \det \left ( \bm I + \alpha \bm Z \bm Z^\top \right ) \le \sum _{k=1}^K \log \det \left ( \bm I + \alpha \bm Z_k \bm Z_k^\top \right ), \end{align*}

where \(\alpha > 0\).

2.
Please derive the necessary and sufficient condition under which the equality holds.

Exercise 4.3. Let \(\vz _1\) and \(\vz _2\) be two vectors of unit length. Let \(\vZ = [\vz _1, \vz _2]\) Show that the following three objectives are equivalent:

However, they may no longer be exactly equivalent when there are more than two vectors in a high-dimensional space. The \(\log \det \) is more general as it measures the total volume spanned by all the vectors in the high-dimensional space.

Exercise 4.4. Please derive the rate-distortion function for the Gaussian random vector introduced in Example 4.4.