“What I cannot create, I do not understand.”
\(~\) — Richard Feynman
In previous chapters, we have shown how to identify low-dimensional structures in high-dimensional spaces, mainly focusing on linear structures. For example, we introduced principal component analysis (PCA) to learn the linear denoiser \(\hat {\bm {U}}^{\top }\) when the observed data \(\bm {x}\) follow the statistical model \(\bm {x} = \bm {U}\bm {z} + \bm {\varepsilon }\). In this setting, the learned representations are linearly transformed input data \(\hat {\bm {U}}^{\top }\bm {x}\). Under the linear model assumption, one can learn the low-dimensional linear structure with efficient optimization algorithms and strong theoretical guarantees. Moreover, the linear model assumption covers a wide range of applications and problems, including face recognition, magnetic resonance image recovery, and structure texture recovery [WM22].
On the other hand, the linear model can be limited when dealing with real-world applications, especially when the input data \(\bm {x}\) are complex, such as speech and natural language, images and videos, and robotic motions. The low-dimensional distributions of such data are typically nonlinear. How to deal with nonlinearity has a long history across different disciplines such as control theory, signal processing, and pattern recognition. There have been considerable efforts to extend methods and solutions for linear models to handle nonlinearity, including early efforts to extend PCA to nonlinear PCA (as we will study in more detail in Chapter 6). In most cases, the methods are designed based on certain assumptions about the data distributions and tailored to specific problems.
More recently, deep neural networks have achieved remarkable success across a wide range of data and applications. A neural network
can learn effective features/representations for downstream applications. For example, a trained deep neural network \(f(\cdot , \bm \theta )\) can be applied to map images to feature vectors, that is, \(\bm {z}_i = f(\bm {x}_i,\bm \theta )\), while a linear classifier can be learned on top of such representations \(\{\bm {z}_i\}\). One notable breakthrough is AlexNet [KSH12], a deep convolutional neural network trained with more than a million natural images, outperforming (on predictive tasks) all previous approaches that were based on hand-crafted features. One of the key differences between AlexNet and previous approaches is that the former learns parameters of the nonlinear transformation from massive amounts of data trained with back-propagation (BP) [RHW86b], as detailed in Section A.2.3 of Chapter A.
Subsequent popular practice models the mapping \(f\) with other empirically designed artificial deep neural networks and learns the parameters \(\bm \theta \) from random initialization via BP. Starting with AlexNet [KSH12], the architectures of modern deep networks have been continually revised and improved through empirical studies. Network architectures such as VGG [SZ14], ResNet [HZR+16b], DenseNet [HLV+17], CNN, RNN or LSTM [HS97], Transformer [VSP+17b], and mixtures of experts (MoE) [FZS22; SMM+17], etc. have continued to push the performance envelope. As part of the effort to improve the performance of deep networks, almost every component of the networks has been empirically scrutinized, and various revisions and improvements have been proposed. These include, but are not limited to, nonlinear activation functions [KUM+17; MHN13; NIG+18; XWC+15], skip connections [HZR+16b; RFB15b], normalizations [BKH16; IS15; MKK+18; UVL16; WH20], up/down sampling or pooling [SMB10], convolutions [KSH12; LBB+98b], etc. However, almost all such modifications have been developed through years of empirical trial and error or ablation studies. Some recent practice even takes this to the extreme, searching for effective network structures and training strategies through extensive random search techniques, such as Neural Architecture Search [BGN+17; ZL17], AutoML [HKV19], and Learning to Learn [ADG+16].
Despite the wide application of deep neural networks, it is not clear what the underlying design principles of such a constructed network are. In particular, it is not clear what mathematical function each layer of the network performs. In this chapter, based on the results from previous chapters, we develop a principled framework that will provide a fully rigorous mathematical interpretation of the role of a deep network, including its individual layers and the network as a whole.
To understand deep networks and how they should be better designed, we must start with the objective of representation learning. In previous chapters, we have argued that the objective is to identify the intrinsically low-dimensional data distribution and then transform it to a compact and structured (say piecewise linear) representation. As we have seen in the previous chapter, the general approach to identifying a low-dimensional data distribution is through a compression process that progressively minimizes the entropy or coding rate of the distribution. However, up to this point, we have been using empirically designed deep networks to model or approximate the operations that aim to optimize these objectives, such as the score function for denoising (in Equation (1.3.21)) or the transformation that maximizes the rate reduction (in Section 4.2.4).
As we have argued in the previous chapter, Section 4.2 in particular, one can measure the goodness of the resulting representation by the information it gains over a “lazy” representation which models all data as one big Gaussian.1 In particular, if we use a mixture of Gaussians (subspaces)2 as prototypical distributions to approximate the non-linear distribution of interest, then we can efficiently measure the coding rate of such a representation using the (sum of) rate distortion functions of the associated Gaussians. The amount of information gained, or (relative) entropy reduced, by such modeling can then be measured by the difference between the coding rate for the lazy representation and that for the more refined representation. The objective of representation learning is then to maximize this information gain, also known as the rate reduction objective.
As we will see in this chapter, once the objective of representation learning is clear, the role of a deep neural network is precisely to help optimize the objective iteratively. Each layer of a deep neural network can be naturally derived as an iterative optimization step to incrementally maximize the information gain, including the popular architectures of ResNet, CNN, and Transformer, and other more advanced variants. In particular, this chapter aims to answer the following questions about deep networks:
Now, if we agree that maximizing the rate reduction or information gain leads to the desired representation as discussed in Section 4.2, the remaining question is how to construct and learn a (nonlinear) mapping from the data \(\X \) to the optimal representation \(\Z ^*\). This involves designing a network architecture and learning algorithm that can effectively capture the underlying structures in the data and faithfully realize the optimal representation.
In the previous chapter, we presented the rate reduction objective (4.2.15) as a principled objective for learning linear discriminative representations of the data. We have, however, not specified the architecture of the feature mapping \(\z = f(\x , \bm \theta )\) for extracting such representations from input data \(\x \). A straightforward choice is to use a conventional deep network, such as ResNet, for implementing \(f(\x , \bm \theta )\). As we have seen in Section 4.3.1, such a choice often leads to decent performance empirically. Nonetheless, several questions remain unanswered when adopting an arbitrary deep network. Although the learned feature representation is now more interpretable, the network itself is still not. It is unclear why any chosen “black-box” network is able to optimize the desired MCR\(^2\) objective at all. The good empirical results (say with a ResNet) do not necessarily justify the particular choice of architectures and operators of the network: Why is a deep layered model even necessary? What do additional layers try to improve or simplify? How wide and deep is adequate? Is there any rigorous justification for the convolutions (in a popular multi-channel form) and nonlinear operators (e.g., ReLU or softmax) used?
In this chapter, we show that using gradient ascent to maximize the rate reduction \(\Delta R_{\epsilon }(\Z \mid \bm \Pi )\) as defined in (4.2.15) naturally leads to a “white-box” deep network that realizes the desired mapping. All network layers, linear/nonlinear operators, and parameters are explicitly constructed in a purely forward propagation fashion. Moreover, such network architectures resemble existing empirically-designed deep networks, providing principled justifications for their design.
Gradient ascent for coding rate reduction. From the previous chapter, we see that to seek a linear discriminative representation (LDR), mathematically, we are essentially seeking a continuous mapping \(f(\cdot ): \x \mapsto \z \) from the data \(\X = [\x _1, \ldots , \x _N] \in \Re ^{D \times N}\) (or initial features extracted from the data3) to an optimal representation \(\Z = [\z _1, \ldots , \z _N] \in \Re ^{d \times N}\) that maximizes the following coding rate reduction objective (also see (4.2.17)):
where \(\epsilon > 0\) is a prescribed quantization error and for simplicity we denote4
The question really boils down to this: is there a constructive way of finding such a continuous mapping \(f(\cdot ,\bm \theta )\) from \(\bm x\) to \(\bm z\)? To this end, let us consider incrementally maximizing the objective \(\Delta R_{\epsilon }\) as a function of \(\Z \subseteq \mathbb {S}^{d-1}\). Although there might be many optimization schemes to choose from, for simplicity we first consider the arguably simplest projected gradient ascent (PGA) scheme:5
for some step size \(\eta >0\) and the iterate starts with the given data \(\bm Z^{0} = \bm X\).6 This scheme can be interpreted as how one should incrementally adjust locations of the current features \(\Z ^\ell \), initialized as the input data \(\bm X\), in order for the resulting \(\Z ^{\ell +1}\) to improve the rate reduction \(\Delta R_{\epsilon }\), as illustrated in Figure 5.1.
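Before deriving the gradient, the following sketch (in Python/NumPy) evaluates the two coding rates and the rate reduction \(\Delta R_{\epsilon}(\Z \mid \bm \Pi)\) for a given feature matrix and class memberships. It is a minimal sketch assuming the usual MCR\(^2\) scaling \(\alpha = d/(N\epsilon^2)\); the exact constants and normalizations should be read off from (4.2.15)–(4.2.17).

```python
import numpy as np

def coding_rate(Z, eps=0.1):
    """Lossy coding rate R_eps(Z) = 1/2 logdet(I + d/(N eps^2) Z Z^T) (assumed scaling)."""
    d, N = Z.shape
    alpha = d / (N * eps**2)
    return 0.5 * np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)[1]

def rate_reduction(Z, Pi, eps=0.1):
    """Delta R_eps(Z | Pi): whole-set coding rate minus the class-wise rates.
    Pi is given here as a list of boolean masks, one per class (illustrative)."""
    d, N = Z.shape
    R = coding_rate(Z, eps)
    Rc = 0.0
    for mask in Pi:
        Zk, Nk = Z[:, mask], mask.sum()
        alpha_k = d / (Nk * eps**2)
        Rc += (Nk / (2 * N)) * np.linalg.slogdet(np.eye(d) + alpha_k * Zk @ Zk.T)[1]
    return R - Rc
```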
Simple calculation shows that the gradient \({\partial \Delta R_{\epsilon }}/{\partial \bm Z}\) entails evaluating the following derivatives of the two terms in \(\Delta R_{\epsilon }\):
Notice that in the above, the matrix \(\bm E^{\ell }\) only depends on \(\Z ^{\ell }\) and aims to expand all the features so as to increase the overall coding rate, whereas the matrix \(\bm C^{\ell }_{k}\) depends on the features of the \(k\)-th class and aims to compress them so as to reduce the coding rate of each class. The complete gradient \(\frac {\partial \Delta R_{\epsilon }}{\partial \bm Z}(\Z ^\ell ) \in \Re ^{d\times N}\) is then of the form:
Remark 5.1 (Interpretation of \(\bm E^\ell \) and \(\bm C_j^\ell \) as linear operators). For any \(\bm z^\ell \in \mathbb {R}^d\),
Essentially, the linear operations \(\bm E^\ell \) and \(\bm C_k^\ell \) in gradient ascent for rate reduction are determined by the training data through “auto-regressions”. The recent renewed understanding of ridge regression in the over-parameterized setting [WX20; YYY+20] indicates that using seemingly redundantly sampled data (from each subspace) as regressors does not lead to overfitting.
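As a concrete illustration of these operators, the following sketch constructs \(\bm E^\ell\) and \(\{\bm C^\ell_k\}\) from the current features and assembles the gradient of the form (5.1.6). The closed forms used here (ridge-regression-type inverses with scalings \(\alpha = d/(N\epsilon^2)\), \(\alpha_k = d/(N_k\epsilon^2)\), weights \(\gamma_k = N_k/N\)) are assumed; the precise constants are those accompanying (5.1.6).

```python
import numpy as np

def redunet_operators(Z, Pi, eps=0.1):
    """Construct the expansion operator E and compression operators C_k from
    the current features Z (d x N) and class masks Pi (assumed scalings)."""
    d, N = Z.shape
    alpha = d / (N * eps**2)
    E = alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T)
    Cs, gammas = [], []
    for mask in Pi:
        Nk = mask.sum()
        alpha_k = d / (Nk * eps**2)
        Zk = Z[:, mask]
        Cs.append(alpha_k * np.linalg.inv(np.eye(d) + alpha_k * Zk @ Zk.T))
        gammas.append(Nk / N)
    return E, Cs, gammas

def rate_reduction_gradient(Z, Pi, eps=0.1):
    """Gradient of Delta R_eps w.r.t. Z (up to an overall constant):
    expand all features globally, compress the features of each class."""
    E, Cs, gammas = redunet_operators(Z, Pi, eps)
    grad = E @ Z
    for mask, C, gamma in zip(Pi, Cs, gammas):
        grad[:, mask] -= gamma * (C @ Z[:, mask])
    return grad
```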
Gradient-guided feature map increment. Notice that in the above, the gradient ascent considers all the features \(\Z ^{\ell } = [\z ^{\ell }_{1}, \dots , \z ^{\ell }_{N}]\) as free variables. The increment \(\Z ^{\ell +1} - \Z ^{\ell } = \eta \frac {\partial \Delta R_{\epsilon }}{\partial \bm Z}(\Z ^\ell )\) does not yet give a transformation on the entire feature domain \(\z ^\ell \in \Re ^d\). According to equation (5.1.6), the gradient cannot be evaluated at a point whose membership is not known, as illustrated in Figure 5.1. Hence, in order to find the optimal \(f(\x ,\bm \theta )\) explicitly, we may consider constructing a small increment transform \(g(\cdot , \theta ^{\ell })\) on the \(\ell \)-th layer feature \(\z ^\ell \) to emulate the above (projected) gradient scheme:
such that \(\big [g(\z _1^{\ell }, \bm \theta ^{\ell }), \ldots , g(\z _N^{\ell }, \bm \theta ^{\ell }) \big ] \approx \frac {\partial \Delta R_{\epsilon }}{\partial \bm Z}(\Z ^\ell ).\) That is, we need to approximate the gradient flow \(\frac {\partial \Delta R_{\epsilon }}{\partial \bm Z}\) that locally deforms all (training) features \(\{\z _i^\ell \}_{i=1}^N\) with a continuous mapping \(g(\z ,\bm \theta )\) defined on the entire feature space \(\z ^\ell \in \Re ^d\). Notice that one may interpret the increment (5.1.9) as a discretized version of a continuous differential equation:
Hence the (deep) network so constructed can be interpreted as a certain neural ODE [CRB+18]. Nevertheless, unlike neural ODE where the flow \(g\) is chosen to be some generic structures, here our \(g(\z , \bm \theta )\) is to emulate the gradient flow of the rate reduction on the feature set (as shown in Figure 5.1):
and its structure is entirely derived and fully determined from this objective, without any other priors or heuristics.
By inspecting the structure of the gradient (5.1.6), it suggests that a natural candidate for the increment transform \(g(\z ^\ell , \bm \theta ^\ell )\) is of the form:
where \(\pi _k(\z ^\ell ) \in [0,1]\) indicates the probability of \(\z ^{\ell }\) belonging to the \(k\)-th class. The increment map parameters \(\bm \theta ^\ell \) depend on two things: first, a set of linear maps represented by \(\bm E^{\ell }\) and \(\{ \bm C^{\ell }_{k}\}_{k=1}^{K}\) that depend only on statistics of the training features \(\Z ^\ell \); second, the membership \(\{ \pi _k(\z ^\ell )\}_{k=1}^K\) of any feature \(\z ^\ell \). Notice that on the training samples \(\Z ^\ell \), for which the memberships \(\bm \Pi _k\) are known, the so-defined \(g(\z ^\ell , \bm \theta )\) gives exactly the values of the gradient \(\frac {\partial \Delta R_{\epsilon }}{\partial \bm Z}(\Z ^\ell )\).
Since we only have the membership for the training samples, the function \(g(\cdot )\) defined in (5.1.11) can only be evaluated on the training samples. To extrapolate \(g(\cdot )\) to the entire feature space, we need to estimate \(\pi _k(\z ^\ell )\) in its second term. In conventional deep learning, this map is typically modeled as a deep network and learned from the training data, say via back propagation. Nevertheless, our goal here is not yet to learn a precise classifier \(\pi _{k}(\z ^\ell )\). Instead, we only need a good enough estimate of the class information in order for \(g(\cdot )\) to approximate the gradient \(\frac {\partial \Delta R_{\epsilon }}{\partial \bm Z}\) well.
From the geometric interpretation of the linear maps \(\bm E^\ell \) and \(\bm C_k^\ell \) given by Remark 5.1, the term \(\bm p_{k}^{\ell } \doteq \bm C^{\ell }_k \z ^{\ell }\) can be viewed as (approximately) the projection of \(\z ^{\ell }\) onto the orthogonal complement of class \(k\). Therefore, \(\|\bm p_{k}^{\ell }\|_2\) is small if \(\z ^\ell \) is in class \(k\) and large otherwise. This motivates us to estimate its membership based on the following softmax function:
Hence, the second term of (5.1.11) can be approximated by this estimated membership:
which is denoted as a nonlinear operator \(\bm \sigma (\cdot )\) on outputs of the feature \(\z ^\ell \) through \(K\) groups of filters: \([\bm {C}^{\ell }_{1}, \dots , \bm {C}^{\ell }_{K}]\). Notice that the nonlinearity arises from a “soft” assignment of class membership based on the feature responses to those filters.
Overall, combining (5.1.9), (5.1.11), and (5.1.14), the increment feature transform from \(\z ^{\ell }\) to \(\z ^{\ell +1}\) now becomes
with the nonlinear function \(\bm \sigma (\cdot )\) defined above and \(\bm \theta ^\ell \) collecting all the layer-wise parameters, that is, \(\bm \theta ^\ell =\left \{\bm E^\ell , \bm {C}^{\ell }_{1}, \dots , \bm {C}^{\ell }_{K}, \gamma _{k}, \lambda \right \}\). Note that features at each layer are always “normalized” by projecting onto the unit sphere \(\mathbb S^{d-1}\), denoted as \(\mathcal P_{\mathbb S^{d-1}}\), i.e., dividing each feature by its norm.7 The form of the increment in (5.1.15) is illustrated by the diagram in Figure 5.3(a).
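To make one layer concrete, here is a minimal sketch of the increment (5.1.15) applied to a single feature vector. It assumes the operators \(\bm E^\ell\), \(\{\bm C^\ell_k\}\) and weights \(\gamma_k\) have been constructed as in the earlier sketch; \texttt{lam} plays the role of the softmax temperature \(\lambda\), and the exact placement of the \(\gamma_k\) weights follows (5.1.11).

```python
import numpy as np

def redunet_layer(z, E, Cs, gammas, eta=0.5, lam=500.0):
    """One ReduNet increment for a single feature z: expand, softly compress
    according to estimated membership, then project back onto the sphere."""
    p = np.stack([C @ z for C in Cs])                 # K x d residuals C_k z
    scores = -lam * np.linalg.norm(p, axis=1)         # small residual -> large score
    pi_hat = np.exp(scores - scores.max())
    pi_hat /= pi_hat.sum()                            # estimated memberships (softmax)
    sigma = sum(g * w * pk for g, w, pk in zip(gammas, pi_hat, p))
    z_next = z + eta * (E @ z - sigma)
    return z_next / np.linalg.norm(z_next)            # projection onto S^{d-1}
```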
Deep network for optimizing rate reduction. Notice that the increment is constructed to emulate the gradient ascent for the rate reduction \(\Delta R_\epsilon \). Hence by transforming the features iteratively via the above process, we expect the rate reduction to increase, as we will see in the experimental section. This iterative process, once converged, say after \(L\) iterations, gives the desired feature map \(f(\x , \bm \theta )\) on the input \(\x = \z ^0\), precisely in the form of a deep network, in which each layer has the structure shown in Figure 5.3 left:
As this deep network is derived from maximizing the rate reduction, we call it the ReduNet. Comparing the architecture of the ReduNet with those of popular empirically designed networks such as ResNet and ResNeXt, shown in Figure 5.3, the similarity is somewhat uncanny. Conceptually, the ReduNet could also be used to justify the popular mixture of experts (MoE) architecture [SMM+17], as each parallel channel, \(\bm {C}^{\ell }_k\), can be viewed as an “expert” trained for each class of objects.
We summarize the training and evaluation of ReduNet in Algorithm 5.1 and Algorithm 5.2, respectively. Notice that all parameters of the network are explicitly constructed layer by layer in a forward propagation fashion. The construction does not need any back propagation! The so-learned features can be directly used for classification, say via a nearest subspace classifier.
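The following sketch mirrors the forward construction in the spirit of Algorithm 5.1: at each layer the operators are computed from the current training features and stored, the features are updated with the exact gradient (memberships are known on training data), and no back propagation is involved. The scaling constants are the same assumed ones as in the earlier sketches.

```python
import numpy as np

def build_redunet(X, Pi, L=100, eta=0.5, eps=0.1):
    """Forward (back-propagation-free) construction of a ReduNet.
    X: d x N training data; Pi: list of boolean class masks."""
    Z = X / np.linalg.norm(X, axis=0, keepdims=True)
    d, N = Z.shape
    layers = []
    for _ in range(L):
        alpha = d / (N * eps**2)
        E = alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T)
        Cs, gammas = [], []
        for mask in Pi:
            Nk = mask.sum()
            ak = d / (Nk * eps**2)
            Cs.append(ak * np.linalg.inv(np.eye(d) + ak * Z[:, mask] @ Z[:, mask].T))
            gammas.append(Nk / N)
        layers.append((E, Cs, gammas))     # parameters of this layer, stored for later use
        # update training features with the exact gradient (memberships known)
        G = E @ Z
        for mask, C, g in zip(Pi, Cs, gammas):
            G[:, mask] -= g * (C @ Z[:, mask])
        Z = Z + eta * G
        Z /= np.linalg.norm(Z, axis=0, keepdims=True)
    return layers, Z
```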
Example 5.1. To provide some intuition on how ReduNet transforms the features, we provide a simple example with mixed 3D Gaussians and visualize how the features are transformed in Figure 5.5. Consider a mixture of three Gaussian distributions in \(\R ^{3}\) that is projected onto \(\mathbb {S}^2\). We first generate data points for 3 classes: for \(k=1,2,3\), \(\bm {X}_{k}=[\bm {x}_{k,1}, \ldots , \bm {x}_{k,m}] \in \R ^{3\times m}\), \(\bm {x}_{k,i} \sim \mathcal {N}(\bm {\mu }_{k}, \sigma _{k}^{2} \I )\), and \({\pi }(\x _{k,i}) = k\). We set \(m=500, \sigma _{1}=\sigma _{2}=\sigma _{3}=0.1\), and \(\bm {\mu }_{1}, \bm {\mu }_{2}, \bm {\mu }_{3} \in \mathbb {S}^2\). Then we project all the data points onto \(\mathbb {S}^{2}\), i.e., \(\bm {x}_{k,i}/\|\bm {x}_{k,i}\|_{2}\). To construct the network (computing \(\bm {E}^{\ell }, \bm {C}^{\ell }_{k}\) for the \(\ell \)-th layer), we set the number of iterations/layers \(L=2,000\), step size \(\eta =0.5\), and precision \(\epsilon =0.1\). We do this only to demonstrate that our framework leads to stable deep networks even with thousands of layers. In practice, thousands of layers may not be necessary and one can stop whenever adding new layers gives diminishing returns. For this example, a couple of hundred layers is sufficient. Hence, the clear optimization objective gives a natural criterion for the depth of the network needed.
As shown in Figure 5.5, we can observe that after the mapping \(f(\cdot , \theta )\), samples from the same class are highly compressed and converge to a single cluster, and the angle between two different clusters is approximately \(\pi /2\), which is well aligned with the optimal solution \(\Z ^{\star }\) of the MCR\(^2\) loss in \(\mathbb {S}^2\). The MCR\(^2\) loss of features at different layers is shown in Figure 5.5(c). Empirically, we find that the constructed ReduNet maximizes the MCR\(^2\) loss and converges stably: samples from the same class converge to one cluster, and points in different clusters are orthogonal to each other.8 Moreover, when sampling new data points from the same distributions, we find that new samples from the same class consistently converge to the same cluster center as the training samples.
In the previous section, we derived the layer-wise architecture of a deep network, the ReduNet, using unrolled optimization for the rate reduction objective. Specifically, the compression term \(R^c_\epsilon (\Z \mid \bm \Pi )\) in (5.1.1) is designed to compress representations from the same class. However, this formulation does not account for possible domain transformation or deformation of the input data. For instance, shifting an object slightly to the right does not change the semantic label of an image. In this section, we will demonstrate how convolutional layers can be derived by maximizing a rate reduction objective that is invariant to certain domain deformations, such as image rotations and translations.
For many clustering or classification tasks (such as object detection in images), we consider two samples as equivalent if they differ by certain classes of domain deformations or augmentations \(\cT = \{\tau \}\). Hence, we are only interested in low-dimensional structures that are invariant to such deformations (i.e., \(\x \in \mathcal {M}\) iff \(\tau (\x ) \in \mathcal {M}\) for all \(\tau \in \cT \) ), which are known to have sophisticated geometric and topological structures and can be difficult to learn precisely in practice, even with rigorously designed CNNs [CW16a]. In this framework, this can be formulated in a very natural way: all equivariant instances are to be embedded into the same subspace, so that the subspace itself is invariant to the transformations under consideration.
In many applications, such as serial data or imagery data, the semantic meaning (labels) of the data is invariant to certain transformations \(\mathfrak {g} \in \mathbb {G}\) (for some group \(\mathbb {G}\)) [CW16c; ZKR+17]. For example, the meaning of an audio signal is invariant to shifts in time; and the identity of an object in an image is invariant to translation in the image plane. Hence, we prefer that the feature mapping \(f(\x ,\bm \theta )\) be rigorously invariant to such transformations:
where “\(\sim \)” indicates two features belonging to the same equivalence class. Although convolutional operators have become common practice in deep networks for ensuring invariance or equivariance [CW16c], it remains challenging in practice to train an (empirically designed) convolutional network from scratch that can guarantee invariance even to simple transformations such as translation and rotation [AW18; ETT+17]. An alternative approach is to carefully design the convolution filters of each layer so as to ensure translational invariance for a wide range of signals, say using wavelets as in ScatteringNet [BM13] and follow-up works [WB18]. However, in order to ensure invariance to generic signals, the number of convolutions needed usually grows exponentially with network depth. This is why this type of network cannot be constructed very deep, usually only several layers.
Now, we show that the MCR\(^2\) principle is compatible with invariance in a natural and precise way: we only need to assign all transformed versions \(\{\x \circ \mathfrak {g} \mid \mathfrak {g} \in \mathbb G\}\) into the same class as the data \(\x \) and map their features \(\z \) all to the same subspace \(\mathcal S\). Hence, all group equivariant information is encoded only inside the subspace, and any classifier defined on the resulting set of subspaces will be automatically invariant to such group transformations. See Figure 5.6 for an illustration of the examples of 1D rotation and 2D translation. Next, we will rigorously show that when the group \(\mathbb G\) is circular 1D shifting, the resulting deep network naturally becomes a multi-channel convolution network. Because the so-constructed network only needs to ensure invariance for the given data \(\X \) or their features \(\Z \), the number of convolutions needed actually remains constant through a very deep network, as opposed to the ScatteringNet.
1D serial data and shift invariance To classify one-dimensional data \(\bm x = [x(0), x(1), \ldots , x(D-1)] \in \Re ^D\) invariant under shifting, we take \(\mathbb {G}\) to be the group of all circular shifts. Each observation \(\bm x_i\) generates a family \(\{ \x _i \circ \mathfrak {g} \, | \, \mathfrak {g} \in \mathbb G \}\) of shifted copies, which are the columns of the circulant matrix \(\circm (\bm x_i) \in \Re ^{D \times D}\) given by
We refer the reader to [KS12] for properties of circulant matrices. For simplicity, let \(\bm Z^1 \doteq [ \z _{1}^1, \dots , \z _{N}^1 ] = \X \in \Re ^{d \times N}\).9 Then what happens if we construct the ReduNet from their circulant families \(\circm (\bm Z^1) = \left [ \circm (\z _{1}^1), \dots , \circm (\z _{N}^1) \right ] \in \Re ^{d \times (dN)}\)? That is, we want to compress and map all these into the same subspace by the ReduNet.
Notice that now the data covariance matrix:
associated with this family of samples is automatically a (symmetric) circulant matrix. Moreover, because the circulant property is preserved under sums, inverses, and products, the matrices \(\bm E^1\) and \(\bm C^1_k\) are also automatically circulant matrices, whose application to a feature vector \(\bm z \in \Re ^d\) can be implemented using circular convolution “\(\circledast \)”. Specifically, we have the following proposition.
Proposition 5.1 (Convolution structures of \(\bm E^1\) and \(\bm C^1_k\)). The matrix
Not only do the first-layer parameters \(\bm E^1\) and \(\bm {C}^1_k\) of the ReduNet become circulant convolutions but also the next-layer features remain circulant matrices. That is, the incremental feature transform in (5.1.15) applied to all shifted versions of a \(\z ^1 \in \Re ^d\), given by
is a circulant matrix. This implies that there is no need to construct circulant families from the second layer features as we did for the first layer. By denoting
the features at the next level can be written as
Continuing inductively, we see that all matrices \(\bm E^\ell \) and \(\bm C^\ell _k\) based on such \(\circm (\bm Z^\ell )\) are circulant, and so are all features. By virtue of the properties of the data, ReduNet has taken the form of a convolutional network, with no need to explicitly choose this structure!
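A small numerical check of this structure (under the conventions of Proposition 5.1; the scaling constant here is schematic): the expansion operator built from the circulant family of a single signal is itself circulant, so applying it to any feature reduces to a circular convolution with its first column, computable with the FFT.

```python
import numpy as np

def circulant(x):
    """Matrix whose columns are all circular shifts of x, i.e., circ(x)."""
    return np.stack([np.roll(x, s) for s in range(len(x))], axis=1)

rng = np.random.default_rng(0)
d, eps = 16, 0.1
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
Cx = circulant(x)                               # d "augmented" shifted samples
alpha = 1.0 / eps**2                            # schematic scaling
E = alpha * np.linalg.inv(np.eye(d) + alpha * Cx @ Cx.T / d)

z = rng.standard_normal(d)
direct = E @ z                                  # dense matrix-vector product
kernel = E[:, 0]                                # first column acts as a filter
via_fft = np.real(np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(z)))
print(np.allclose(direct, via_fft))             # True: E acts by circular convolution
```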
A fundamental trade-off between invariance and sparsity. There is one problem though: In general, the set of all circular permutations of a vector \(\z \) gives a full-rank matrix. That is, the \(d\) “augmented” features associated with each sample (hence each class) typically already span the entire space \(\Re ^d\). For instance, all shifted versions of a delta function \(\delta (d)\) can generate any other signal as their (dense) weighted superposition. The MCR\(^2\) objective (4.2.15) will not be able to distinguish classes as different subspaces.
One natural remedy is to improve the separability of the data by “lifting” the original signal to a higher-dimensional space, e.g., by taking their responses to multiple filters \(\bm k_1, \ldots , \bm k_C \in \Re ^d\):
The filters can be pre-designed invariance-promoting filters,10 or adaptively learned from the data,11 or randomly selected as we do in our experiments. This operation lifts each original signal \(\x \in \Re ^d\) to a \(C\)-channel feature, denoted as \(\bar {\z } \doteq [\z [1], \ldots , \z [C]]^\top \in \Re ^{C\times d}\). Then, we may construct the ReduNet on vector representations of \(\bar {\z }\), denoted as \(\vec (\bar \z ) \doteq [\z [1]^\top , \ldots , \z [C]^\top ] \in \Re ^{dC}\). The associated circulant version \( \circm (\bar {\z })\) and its data covariance matrix, denoted as \(\bar {\bm \Sigma }(\bar \z )\), for all its shifted versions are given as:
where \(\circm (\z [c]) \in \Re ^{d\times d}\) with \(c \in [C]\) is the circulant version of the \(c\)-th channel of the feature \(\bar \z \). Then the columns of \(\circm (\bar \z )\) will only span at most a \(d\)-dimensional proper subspace in \(\Re ^{dC}\). However, this simple lifting operation (if linear) is not sufficient to render the classes separable yet—features associated with other classes will span the same \(d\)-dimensional subspace. This reflects a fundamental conflict between invariance and linear (subspace) modeling: one cannot hope for arbitrarily shifted and superposed signals to belong to the same class.
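The lifting step itself can be sketched as follows: each 1D signal is circularly convolved with \(C\) filters to produce a \(C\)-channel feature \(\bar\z\). The random Gaussian filter bank and the sizes used here are purely illustrative.

```python
import numpy as np

def lift(x, kernels):
    """Lift a 1D signal x (length d) into C channels by circular convolution
    with C filters (a sketch of the multi-channel lifting described above)."""
    x_hat = np.fft.fft(x)
    return np.stack([np.real(np.fft.ifft(np.fft.fft(k, len(x)) * x_hat))
                     for k in kernels])          # C x d multi-channel feature

rng = np.random.default_rng(0)
d, C, ksize = 64, 8, 5
kernels = rng.standard_normal((C, ksize))        # random lifting filters (illustrative)
x = rng.standard_normal(d)
z_bar = lift(x, kernels)
print(z_bar.shape)                               # (8, 64)
```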
One way of resolving this conflict is to leverage additional structure within each class, in the form of sparsity: signals within each class are not generated as an arbitrary linear superposition of some base atoms (or motifs), but only sparse combinations of them and their shifted versions, as shown in Figure 5.7. More precisely, let \(\bm D_k = [\bm d_{k,1}, \ldots , \bm d_{k,c}]\) denote a matrix with a collection of atoms associated for class \(k\), also known as a dictionary, then each signal \(\x \) in this class is sparsely generated as:
for some sparse vector \(\z \). Signals in different classes are then generated by different dictionaries whose atoms (or motifs) are incoherent from one another. Due to incoherence, signals in one class are unlikely to be sparsely represented by atoms in any other class. Hence all signals can be represented as
where \(\bar \z \) is sparse.12 There is a vast literature on how to learn the most compact and optimal sparsifying dictionaries from sample data, e.g. [LB19; QLZ19] and subsequently solve the inverse problem and compute the associated sparse code \(\z \) or \(\bar \z \). Recent studies of [QLZ20; QZL+20a] even show that under broad conditions the convolution dictionary learning problem can be solved effectively and efficiently.
Nevertheless, for tasks such as classification, we are not necessarily interested in the precise optimal dictionary nor the precise sparse code for each individual signal. We are mainly interested in whether, collectively, the sparse codes of each class are adequately separable from those of the other classes. Under the assumption of the sparse generative model, if the convolution kernels \(\{\bm k_c\}_{c=1}^C\) match well with the “transpose” or “inverse” of the above sparsifying dictionaries \(\bm D = [\bm D_1, \ldots , \bm D_K]\), also known as the analysis filters [NDE+13; RE14], signals in one class will have high responses only to a small subset of those filters and low responses to others (due to the incoherence assumption). Nevertheless, in practice, a sufficiently large number of, say \(C\), random filters \(\{\bm k_c\}_{c=1}^C\) often suffices to ensure that the extracted \(C\)-channel features
for different classes have different response patterns to different filters and hence make different classes separable [CJG+15].
Therefore, in our framework, to a large extent the number of channels (or the width of the network) plays the role of the statistical resource, whereas the number of layers (the depth of the network) plays the role of the computational resource. The theory of compressive sensing precisely characterizes how many measurements are needed in order to preserve the intrinsic low-dimensional structures (including separability) of the data [WM22].
The multi-channel responses \(\bar \z \) should be sparse. So, to approximate the sparse code \(\bar \z \), we may apply an entry-wise sparsity-promoting nonlinear thresholding, say \(\bm \tau (\cdot )\), to the above filter outputs, setting low (say absolute value below \(\epsilon \)) or negative responses to zero:
Figure 5.8 illustrates the basic ideas. One may refer to [RE14] for a more systematic study on the design of the sparsifying thresholding operator. Nevertheless, here we are not so interested in obtaining the best sparse codes as long as the codes are sufficiently separable. Hence the nonlinear operator \(\bm \tau \) can simply be chosen to be a soft thresholding or a ReLU. These presumably sparse features \(\bar \z \) can be assumed to lie on a lower-dimensional (nonlinear) submanifold of \(\mathbb {R}^{dC}\), which can be linearized and separated from the other classes by subsequent ReduNet layers, as illustrated later in Figure 5.9.
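A minimal sketch of such a sparsifying nonlinearity \(\bm\tau(\cdot)\): either a soft threshold or a one-sided (ReLU-like) threshold can serve, since we only need the resulting codes to remain sufficiently separable.

```python
import numpy as np

def soft_threshold(z_bar, tau):
    """Soft thresholding: shrink entries toward zero and kill small ones."""
    return np.sign(z_bar) * np.maximum(np.abs(z_bar) - tau, 0.0)

def relu_threshold(z_bar, tau=0.0):
    """One-sided variant: keep only sufficiently positive responses."""
    return np.maximum(z_bar - tau, 0.0)
```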
The ReduNet constructed from circulant version of these multi-channel features \(\bar \Z \doteq [\bar \z _1, \ldots , \bar \z _N] \in \Re ^{C \times d \times N}\), i.e., \(\circm (\bar \Z ) \doteq [ \circm (\bar \z _1), \dots , \circm (\bar \z _N)] \in \Re ^{dC \times dN}\), retains the good invariance properties described above: the linear operators, now denoted as \(\bar {\bm E}\) and \(\bar {\bm C}_k\), remain block circulant, and represent multi-channel 1D circular convolutions. Specifically, we have the following result.
Proposition 5.2 (Multi-channel convolution structures of \(\bar {\bm E}\) and \(\bar {\bm C}_k\)). The matrix
From Equation (5.1.30), the shift-invariant ReduNet is, by construction, a deep convolutional network for multi-channel 1D signals. Notice that even if the initial lifting kernels are separated (5.1.32), the matrix inverse in (5.1.33) for computing \(\bar {\bm E}\) (similarly for \(\bar {\bm C_k}\)) introduces “cross talk” among all \(C\) channels. Hence, these multi-channel convolutions are in general not depth-wise separable, unlike the Xception nets [Cho17] that were once suggested to simplify multi-channel convolutional neural networks.13
Remark 5.2 (Reducing Computational Complexity in the Frequency Domain). The calculation of \(\bar {\bm E}\) in (5.1.33) requires inverting a matrix of size \(dC \times dC\), which in general has complexity \(O(d^3C^3)\). Nevertheless, using the fact that a circulant matrix can be diagonalized by the Discrete Fourier Transform (DFT) matrix, the complexity can be significantly reduced. As shown in [CYY+22], to compute \(\bar {\bm E}\) and \(\bar {\bm C}_k \in \Re ^{dC \times dC}\), we only need to invert, in the frequency domain, \(d\) blocks of size \(C\times C\); hence the overall complexity becomes \(O(dC^3)\).
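The following sketch illustrates this frequency-domain shortcut: each channel of each training feature is transformed by the FFT, one small \(C\times C\) system is inverted per frequency, and applying \(\bar{\bm E}\) then amounts to mixing channels independently at each frequency. The normalization constants here are schematic; the precise ones follow (5.1.33).

```python
import numpy as np

def expansion_operator_freq(Z_bar, eps=0.1):
    """Per-frequency construction of the expansion operator \\bar{E}.
    Z_bar: N x C x d array of multi-channel training features."""
    N, C, d = Z_bar.shape
    Z_hat = np.fft.fft(Z_bar, axis=2)              # channel spectra, N x C x d
    alpha = 1.0 / eps**2                           # schematic scaling
    E_hat = np.empty((d, C, C), dtype=complex)
    for f in range(d):                             # one small C x C inverse per frequency
        Zf = Z_hat[:, :, f].T                      # C x N slice at frequency f
        cov = Zf @ Zf.conj().T / (N * d)           # per-frequency channel covariance
        E_hat[f] = alpha * np.linalg.inv(np.eye(C) + alpha * cov)
    return E_hat

def apply_expansion(E_hat, z_bar):
    """Apply \\bar{E} to one C x d feature: mix channels independently at each
    frequency, then transform back to the signal domain."""
    z_hat = np.fft.fft(z_bar, axis=1)              # C x d
    out_hat = np.einsum('fij,jf->if', E_hat, z_hat)
    return np.real(np.fft.ifft(out_hat, axis=1))
```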
Overall network architecture and comparison. Following the above derivation, we see that, in order to find a linear discriminative representation (LDR) for multiple classes of signals/images that is invariant to translation, sparse coding, a multi-layer architecture with multi-channel convolutions, nonlinear activations, and spectral computation all become necessary components for achieving the objective effectively and efficiently. Figure 5.9 illustrates the overall process of learning such a representation via invariant rate reduction on the input sparse codes.
Example 5.2 (Invariant Classification of Digits). We next evaluate the empirical performance of the ReduNet on learning rotation-invariant features on the real 10-class MNIST dataset. We impose a polar grid on the image \(\bm {x}\in \mathbb {R}^{H\times W}\), with its geometric center being the center of the 2D polar grid (as illustrated in Figure 5.10). For each radius \(r_i\), \(i \in [C]\), we can sample \(\Gamma \) pixels with respect to each angle \(\gamma _l =l\cdot ({2\pi }/\Gamma )\) with \(l \in [\Gamma ]\). Then, given a sample image \(\bm {x}\) from the dataset, we represent the image in the (sampled) polar coordinates as a multi-channel signal \(\bm {x}_p \in \R ^{\Gamma \times C}\). The goal here is to learn a rotation-invariant representation, i.e., we expect to learn \(f(\cdot , \theta )\) such that \(\{f(\bm {x}_p \circ \mathfrak {g}, \theta )\}_{\mathfrak {g} \in \mathbb {G}}\) lie in the same subspace, where \(\mathfrak {g}\) is the cyclic shift in polar angle. We use \(N=100\) training samples (\(10\) from each class) and set \(\Gamma =200\), \(C=15\) for polar sampling. By performing the above sampling in polar coordinates, we obtain the data matrix \(\bm {X}_p \in \mathbb {R}^{(\Gamma \cdot C) \times N}\). For the ReduNet, we set the number of layers/iterations \(L=40\), precision \(\epsilon =0.1\), and step size \(\eta =0.5\). Before the first layer, we lift the input by 1D circular convolution with 20 random Gaussian kernels of size 5.
To evaluate the learned representation, each training sample is augmented by 20 of its rotated versions, each shifted with stride 10. We compute the cosine similarities among the \(m \times 20\) augmented training inputs \(\bm {X}_{\text {rotation}}\); the results are shown in Figure 5.11 (a). We compare the cosine similarities among the learned features of all the augmented versions, i.e., \(\bar {\bm {Z}}_{\text {rotation}}\), and summarize the results in Figure 5.11 (b). As we see, the so-constructed rotation-invariant ReduNet is able to map the training data (as well as all its rotated versions) from the 10 different classes into 10 nearly orthogonal subspaces. That is, the learned subspaces are truly invariant to shifts in polar angle. Next, we randomly draw another \(100\) test samples and apply the same augmentation procedure. In Figure 5.11 (c), we visualize the MCR\(^{2}\) loss of the \(\ell \)-th layer representation of the ReduNet on the training and test datasets. From these results, we find that the constructed ReduNet is indeed able to maximize the MCR\(^{2}\) loss as well as generalize to the test data.
As we have seen in the previous section, we used the problem of classification to provide a rigorous interpretation of the main architectural characteristics of popular deep networks such as the ResNet and the CNN: each layer of such networks can be viewed as imitating a gradient step that increases the rate reduction (or information gain) objective. This perspective also leads to a somewhat surprising fact: the parameters and operators of the layers of such a deep network, the ReduNet, can be computed in a purely forward fashion.
Despite the theoretical and conceptual importance of the ReduNet, several factors limit it from being very practical. First, as we have discussed above, the computational cost of computing the matrix operators in each layer in a forward fashion can be very high. Second, the so-computed operators may not be so effective in optimizing the objective, and it might take thousands of iterations (hence layers). As we have seen in Section 2.3.3 for LISTA, these two issues can be addressed by making those operators learnable and optimizing them via back-propagation.14
The supervised classification setting in which the ReduNet was derived is also somewhat limiting. In practice, an image might not belong to a single class, as it may contain multiple objects. Hence it would be more general to assume that different regions of the image belong to different low-dimensional models (say a Gaussian or a subspace). As we will see, such a generalization leads to an architecture that is both simple and general, unifying the rate reduction and denoising operations that we have seen in the previous chapter. Moreover, the so-obtained architecture resembles the popular Transformer architecture.
For the past several years, as the amount and quality of data available for training deep networks have increased dramatically, the field has moved from considering inputs as “atoms” to considering inputs as “sequences of tokens”, where each token is a part of the overall input. This shift in perspective brings both philosophical and pragmatic improvements: philosophically, it allows us to measure and model each part of the input and their interactions; pragmatically, it allows us to use similar deep networks, sequence-to-sequence models such as the ubiquitous Transformers, to learn every kind of data distribution. Now, it falls to us to understand representation learning in the “token-centric” model of data as token sequences, and to cope with the aforementioned challenges (such as, in the context of the rate reduction framework, not having labels for each token).
Formally, let \(\X = \mat {\x _{1}, \dots , \x _{N}} \in \bR ^{D \times N}\) denote random variables representing our data source. In vision tasks, each \(\x _{i} \in \bR ^{D}\) is interpreted as a token, typically corresponding to an image patch. In language tasks, each \(\x _{i} \in \bR ^{D}\) is interpreted as a token embedding, i.e., a continuous vector representation of a discrete token such as a word or subword.15 The \(\x _{i}\)’s may have arbitrary correlation structures. We use \(\Z = \mat {\z _{1}, \dots , \z _{N}} \in \bR ^{d \times N}\) to denote the random variables that define our representations, where \(\z _{i} \in \bR ^{d}\) is the representation of the corresponding token \(\x _i \in \bR ^{D}\).
Remark 5.3. In transformers, each input sample is typically converted into a sequence of tokens. A token is a basic unit of information derived from the raw input: in natural language processing, tokens are typically words or subwords; in computer vision, they correspond to image patches; and in other modalities, they may represent time steps, spatial locations, or other domain-specific units. A token embedding is a continuous vector representation of a token that serves as the input to a transformer. It maps each token to a point in a high-dimensional space, enabling the model to process symbolic inputs using numerical computation. A token representation is a vector that encodes the semantic or structural information of a token, typically produced by the intermediate or final layers of a transformer. These representations are designed to capture meaningful features of the input that are useful for downstream tasks such as classification, generation, or regression. Please refer to Section 8.2 for more details about this vocabulary and how it relates to implementations.
Objective for learning a structured and compact representation. Inspired by the previous discussion of rate reduction (Section 5.1), we propose that the objective of representation learning in this token-centric model for our data is to find a feature mapping \(f \colon \X \in \bR ^{D \times N} \to \Z \in \bR ^{d \times N}\) which transforms input tokens \(\{\bm x_i\}_{i=1}^N \subset \R ^D\) with a potentially nonlinear and multi-modal distribution into (piecewise) linearized and compact token representations \(\{\bm z_i\}_{i=1}^N \subset \bR ^{d}\). While the joint distribution of the token representations \(\{\z _{i}\}_{i = 1}^{N}\) may be sophisticated (and task-specific), we further contend that it is reasonable and practical to require that the target marginal distribution of individual token representations be highly compressed and structured, amenable to compact coding. In particular, we require the distribution to be a mixture of low-dimensional (say \(K\)) Gaussian distributions, such that the \(k\)-th Gaussian has mean \(\Zero \in \bR ^{d}\), covariance \(\vSigma _{k} \succeq \Zero \in \bR ^{d \times d}\), and support spanned by the orthonormal basis \(\vU _{k} \in \bR ^{d \times p}\). We denote by \(\vU _{[K]} = \{\vU _k\}_{k=1}^K\) the set of bases of all Gaussians.
As usual, in order to learn good representations, we wish to maximize the information gain for the final token representations, or more specifically to maximize an appropriate notion of the rate reduction (see Section 4.2.3), i.e.,
Here, the first term \(R_{\epsilon }\) is an estimate of the lossy coding rate for the whole set of token representations. More specifically, if we view the token representations \(\{\bm z_i\}_{i=1}^N\) as i.i.d. samples from a single zero-mean Gaussian, their lossy coding rate subject to a quantization precision \(\epsilon > 0\) is given as
The second term \(R_{\epsilon }^c\) is an estimate of the lossy coding rate under the codebook \(\bm U_{[K]}\), which is given as
Remark 5.4. The expression (5.2.3) for the coding rate can be viewed as a generalization of the coding rate \(R_\epsilon ^{c}\) used in the original rate reduction objective (4.2.16). In particular, the original objective is defined with respect to a set of known membership labels \(\{\vPi _{k}\}\) specific to the particular data realization \(\vX \). In contrast, the current objective is defined with respect to subspaces \(\vU _{[K]}\), which are independent of any particular realization but are assumed to support the distribution of token representations. Suppose that a token representation \(\bm z_i\) belongs to a subspace \(\bm U_k\) and these subspaces are approximately orthogonal to each other, i.e., \(\bm U_k^\top \bm U_l \approx \bm 0\) for all \(k \neq l\). Then, one can verify that the projections \(\bm U_k\bm U_k^\top \bm z_i = \bm z_i\) and \(\bm U_l\bm U_l^\top \bm z_i \approx \bm 0\) for all \(l \neq k\). These orthogonal projections effectively serve as implicit membership labels, identifying the subspace to which each token representation belongs.
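As a concrete reference, the following sketch evaluates the two coding rates used above. The scaling constants (\(d/(N\epsilon^2)\) for the whole set, \(p/(N\epsilon^2)\) per subspace) are assumed to take the usual MCR\(^2\) form; the exact constants should be read off from (5.2.2) and (5.2.3).

```python
import numpy as np

def R(Z, eps=0.1):
    """Lossy coding rate of token representations Z (d x N), treated as
    samples from a single zero-mean Gaussian (a sketch of (5.2.2))."""
    d, N = Z.shape
    alpha = d / (N * eps**2)
    return 0.5 * np.linalg.slogdet(np.eye(N) + alpha * Z.T @ Z)[1]

def Rc(Z, U_list, eps=0.1):
    """Coding rate w.r.t. the codebook U_[K]: sum of the rates of the
    projections of Z onto each subspace (a sketch of (5.2.3))."""
    d, N = Z.shape
    total = 0.0
    for U in U_list:                      # U: d x p orthonormal basis
        P = U.T @ Z                       # p x N projected tokens
        p = U.shape[1]
        alpha = p / (N * eps**2)
        total += 0.5 * np.linalg.slogdet(np.eye(N) + alpha * P.T @ P)[1]
    return total
```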
Remark 5.5. Notably, this set of desiderata for representation learning aligns well with two well-established empirical hypotheses about the structure of token representations in trained models which use the ubiquitous transformer neural network architecture: the “linear representation hypothesis” [JRR+24; PCV24] and the “superposition hypothesis” [EHO+22a; YCO+21]. The linear representation hypothesis posits that token representations in transformer models lie in low-dimensional linear subspaces that encode semantic features. Similarly, the superposition hypothesis suggests that these representations can be approximately expressed as a sparse linear combination of these feature vectors. As we will see in Figure 5.17 later in this Chapter, the subspace basis \(\bm U_k\) may be interpreted as a set of semantic features, where each feature corresponds to a specific aspect of the token’s meaning. Token representations are then approximately expressed as sparse linear combinations of these subspace bases, capturing the essential semantic components of the token while ignoring irrelevant dimensions.
Sparse rate reduction. Note that the rate reduction objective (5.2.1) is invariant to arbitrary joint rotations of the representations and subspaces. In particular, optimizing the rate reduction objective may not naturally lead to axis-aligned (i.e., sparse) representations. For instance, consider the three sets of learned representations in Figure 5.12. The coding rate reduction increases from (a) to (b), but because it is invariant under rotations, remains the same from (b) to (c). Therefore, we would like to transform the representations (and their supporting subspaces) so that the representations \(\vZ \) eventually become sparse16 with respect to the standard coordinates of the resulting representation space as in Figure 5.12(c). Computationally, we may combine the above two goals into a unified objective for optimization:
where \(\mathcal {F}\) denotes a general function class and the \(\ell _0\) norm \(\|\Z \|_0\) promotes the sparsity of the final token representations \(\Z = f(\X )\).
In practice, the \(\ell _0\) norm is often relaxed to the \(\ell _1\) norm to improve computational tractability and enable convex optimization techniques [WM22]. Motivated by this, we relax Problem (5.2.4) accordingly, leading to a formulation that remains faithful to the original sparsity objective while being more amenable to efficient algorithms, as follows:
With a slight abuse of terminology, we often refer to this objective function also as the sparse rate reduction.
White-box network architecture via unrolled optimization. Although easy to state, each term in the above objective is computationally challenging to optimize [WM22]. Hence it is natural to adopt an approximation approach that realizes the global transformation \(f\) to optimize (5.2.4) through a concatenation of multiple, say \(L\), simple incremental and local operations \(f^\ell \) that push the representation distribution towards the desired parsimonious model distribution:
where \(f^{\mathrm {pre}}: \bR ^{D} \rightarrow \bR ^{d}\) is the pre-processing mapping that transforms each input token \(\x _{i} \in \bR ^{D}\) to the initial token representations \(\z _{i}^{1} \in \bR ^{d}\). Each incremental forward mapping \(\Z ^{\ell + 1} = f^\ell (\Z ^\ell )\), or a “layer”, transforms the token distribution to optimize the above sparse rate reduction objective (5.2.4), conditioned on the distribution of its input \(\Z ^\ell \).
Remark 5.6. In contrast to other unrolled optimization approaches such as the ReduNet (see Section 5.1), we explicitly model the distribution of \(\Z ^\ell \) at each layer, say as a mixture of linear subspaces or sparsely generated from a dictionary. The model parameters are learned from data (say via backward propagation with end-to-end training). This separation between forward “optimization” and backward “learning” clarifies the mathematical role of each layer as an operator that transforms the distribution of its input, whereas the input distribution is in turn modeled (and subsequently learned) by the parameters of the layer.
Now, we show how to derive these incremental and local operations through an unrolled optimization perspective to solve Problem (5.2.5). Once we decide on using an incremental approach to optimizing Problem (5.2.5), there are a variety of possible choices to achieve the optimization. Given a model for \(\Z ^\ell \), say a mixture of subspaces \(\vU _{[K]}\), we opt for a two-step alternating minimization method with a strong conceptual basis. First, we compress the tokens \(\vZ ^{\ell }\) via a gradient descent to minimize the coding rate term \(R^c_\epsilon (\vZ \mid \vU _{[K]}^\ell )\). Specifically, we take a gradient step on \(R^c_\epsilon \) with a learning rate \(\kappa \) as follows:
Next, we sparsify the compressed tokens, generating \(\vZ ^{\ell + 1}\) via a suitably-relaxed proximal gradient step to minimize the remaining term \(\lambda \norm {\vZ }_{1} - R_{\epsilon }(\vZ )\). As we will argue in detail later, we can find such a \(\bm Z^{\ell +1}\) by solving a sparse representation problem with respect to an orthogonal complete dictionary \(\bm D^\ell \):
In the following, we provide technical details for each of the two steps above and derive efficient updates for their implementation.
Self-attention as gradient descent on coding rate of token representations. For the first step (5.2.7), the gradient of the coding rate \(\nabla _{\bm Z} R^c_\epsilon \) is costly to compute, as it involves \(K\) separate matrix inverses, one for each of the \(K\) subspaces with basis \(\vU _{k}^{\ell }\):
Now, we demonstrate that this gradient can be naturally approximated using a so-called multi-head subspace self-attention (MSSA) operator, which has a similar functional form to the multi-head self-attention operator [VSP+17b] with \(K\) heads (i.e., one for each subspace, coming from each matrix inverse). Here, we approximate the gradient (5.2.9) using the first-order Neumann series (see Exercise 5.2):17
In the above approximation, we compute the similarity between projected token representations \(\{\bm U_k^\top \bm z_i\}_{i = 1}^{n}\) through an auto-correlation among the projected features as \((\vU _{k}^\top \vZ )^\top (\vU _{k}^\top \vZ )\). Even if each token feature within \(\vZ \) is normalized, the norm of this matrix can grow arbitrarily18 as the sequence length increases, making our Neumann approximation increasingly poor (as the true gradient does not grow arbitrarily in size with sequence length). To fix this practical issue and generalize our formulation, we can instead use an arbitrary kernel to measure similarity. One particularly fitting kernel that aligns with our motivation of only auto-regressing against tokens which belong to the same subspace is the local (Nadaraya-Watson) kernel which involves membership estimation via softmax [CMP+21], namely computing (relative) similarity as \(\softmax ((\vU _{k}^\top \vZ )^\top (\vU _{k}^\top \vZ ))\). Now suppose that the subspaces expand “as much as possible”, that is, all subspaces’ bases \(\bm U_{[K]}\) together span the whole space. Then, we have \(\sum _{k = 1}^{K} \vU _{k}\vU _{k}^\top = \bm I\). Hence, (5.2.10) becomes
where MSSA is defined through an SSA operator as follows:
Substituting (5.2.13) into (5.2.7) shows that the latter can naturally be approximated by
Remark 5.7. The SSA operator in (5.2.14) resembles the attention operator in a typical transformer [VSP+17b], except that here the linear operators of value, key, and query are all set to be the same as the subspace basis, i.e., \(\vV _{k} = \vK _{k} = \bm {Q}_{k} = \vU _k^*\). Hence, we name \(\mathrm {SSA}({\spcdot \mid \vU _k}): \bR ^{d\times n} \rightarrow \bR ^{p\times n}\) the Subspace Self-Attention (SSA) operator. Then, the whole MSSA operator in (5.2.15), formally defined as \(\mathrm {MSSA}({\spcdot \mid \vU _{[K]}}) \colon \bR ^{d \times n} \to \bR ^{d \times n}\) and called the Multi-Head Subspace Self-Attention (MSSA) operator, aggregates the attention head outputs by averaging using model-dependent weights, similar in concept to the popular multi-head self-attention operator in existing transformer networks. The overall gradient step (5.2.16) resembles the multi-head self-attention implemented with a skip connection in transformers.
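The following sketch implements the SSA and MSSA operators and the resulting compression step. The scaling constant \(\beta\) and the exact weighting of the skip connection are schematic here; the precise forms are given by (5.2.14)–(5.2.16).

```python
import numpy as np

def softmax_cols(A):
    """Column-wise softmax."""
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def ssa(Z, U):
    """Subspace Self-Attention for one head: attend among tokens using
    similarities of their projections onto the subspace spanned by U."""
    P = U.T @ Z                            # p x N projected tokens
    return P @ softmax_cols(P.T @ P)       # p x N

def mssa(Z, U_list, eps=0.1):
    """Multi-head Subspace Self-Attention: aggregate the per-subspace heads
    back into the ambient space (schematic scaling constant beta)."""
    d, N = Z.shape
    p = U_list[0].shape[1]
    beta = p / (N * eps**2)
    return beta * sum(U @ ssa(Z, U) for U in U_list)

def compression_step(Z, U_list, kappa=1.0, eps=0.1):
    """One approximate gradient step on the compression term: a multi-head
    subspace attention block with a skip connection (simplified residual
    form; the exact combination of the two terms follows (5.2.16))."""
    return Z + kappa * mssa(Z, U_list, eps)
```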
MLP as proximal gradient descent for sparse coding of token representations. For the second step of alternating minimization, we need to minimize \(\lambda \norm {\vZ }_{1} - R_{\epsilon }(\vZ )\). Note that the gradient \(\nabla R_{\epsilon }(\vZ )\) involves a matrix inverse, and thus naive proximal gradient (see Section A.1.3) for this problem becomes intractable at large scale. We therefore take a different approach to trading off between representational diversity and sparsification: we posit a (complete) incoherent or orthogonal dictionary \(\vD ^{\ell } \in \bR ^{d \times d}\) and ask to sparsify the intermediate iterates \(\vZ ^{\ell + 1/2}\) with respect to \(\vD ^{\ell }\). That is, \(\vZ ^{\ell + 1/2} \approx \vD ^{\ell } \vZ ^{\ell + 1}\), where \(\vZ ^{\ell + 1}\) is more sparse: a sparse encoding of \(\vZ ^{\ell + 1/2}\). The dictionary \(\vD ^{\ell }\) is used to sparsify all tokens simultaneously. By the incoherence assumption, we have \((\vD ^{\ell })^\top (\vD ^{\ell }) \approx \vI \). Thus from (5.2.2) we have
To minimize \(\lambda \norm {\vZ }_{1} - R_{\epsilon }(\vZ )\), we thus solve the following problem
The above sparse representation program is usually solved by relaxing it to an unconstrained convex program, known as LASSO [WM22]:
In our implementation, we also add a non-negative constraint to \(\vZ ^{\ell + 1}\), and solve the corresponding non-negative LASSO:
Then, we incrementally optimize the above non-negative LASSO by performing an unrolled proximal gradient descent step, known as an ISTA step [BT09], to give the update:
We now design a white-box transformer architecture, named the Coding RATE Transformer (CRATE), by unrolling the above updates. By combining the above two steps (5.2.16) and (5.2.20):
we can get the following rate-reduction-based transformer layer, illustrated in Figure 5.13,
Composing multiple such layers following the incremental construction of our representation in (5.2.6), we obtain a white-box transformer architecture that transforms the data tokens towards a compact and sparse union of incoherent subspaces, where \(f^{\pre }: \bR ^{D \times N} \rightarrow \bR ^{d \times N}\) is the pre-processing mapping that transforms the input tokens \(\vX \in \bR ^{D \times N}\) to first-layer representations \(\vZ ^{1} \in \bR ^{d \times N}\). An overall flow of this architecture is shown in Figure 5.14.
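To make the second half of a CRATE layer concrete, here is a minimal sketch of the unrolled (non-negative) ISTA step (5.2.20): one proximal gradient step on the LASSO objective, initialized at the compressed tokens \(\Z^{\ell+1/2}\), followed by a ReLU. The step size and sparsity weight are illustrative.

```python
import numpy as np

def ista_step(Z_half, D, eta=0.1, lam=0.1):
    """One unrolled non-negative ISTA step for sparse coding against the
    dictionary D, initialized at the compressed tokens Z_half."""
    grad = D.T @ (D @ Z_half - Z_half)   # gradient of 0.5 ||D Z - Z_half||^2 at Z = Z_half
    return np.maximum(Z_half - eta * grad - eta * lam, 0.0)
```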
Remark 5.8 (The roles of the forward pass and backward propagation). In contrast to other unrolled optimization approaches such as the ReduNet [CYY+22], we explicitly model the distribution of each \(\vZ ^\ell \) and \(\vZ ^{\ell + 1/2}\) at each layer, either by a mixture of linear subspaces or sparsely generated from a dictionary. We introduced the interpretation that at each layer \(\ell \), the learned bases for the subspaces \(\vU _{[K]}^{\ell }\) and the learned dictionaries \(\vD ^{\ell }\) together serve as a codebook or analysis filter that encodes and transforms the intermediate representations at each layer \(\ell \). Since the input distribution to layer \(\ell \) is first modeled by \(\vU _{[K]}^{\ell }\) then transformed by \(\vD ^{\ell }\), the input distribution to each layer is different, and so we require a separate codebook at each layer to obtain the most parsimonious encoding. Parameters of these codebooks (i.e., the subspace bases and the dictionaries), heretofore assumed as being perfectly known, are actually learned from data (say via backward propagation within end-to-end training).
Hence, our methodology features a clear conceptual separation between forward “optimization” and backward “learning” for the so-derived white-box deep neural network. Namely, in its forward pass, we interpret each layer as an operator which, conditioned on a learned model (i.e., a codebook) for the distribution of its input, transforms this distribution towards a more parsimonious representation. In its backward propagation, the codebook of this model, for the distribution of the input to each layer, is updated to better fit a certain (supervised) input-output relationship, as illustrated in Figure 5.15. This conceptual interpretation implies a certain agnosticism of the model representations towards the particular task and loss; in particular, many types of tasks and losses will ensure that the models at each layer are fit, which ensures that the model produces parsimonious representations.
There are three natural questions we can ask about CRATE:
We contend that the answer to each question is “yes”. First, Figure 5.16 examines how the network optimizes each term of the sparse rate reduction; each layer consistently decreases the features’ compression and sparsity measures during the forward pass. Second, in Table 5.1 we demonstrate that CRATE achieves top-1 classification accuracy comparable to that of the widely used Vision Transformer (ViT) [DBK+21] architecture under transfer-learning setups at similar parameter counts, with promising scaling behavior: we obtain steady and consistent improvements by increasing the model size. Overall, CRATE achieves promising performance on real-world large-scale datasets by directly implementing our principled architecture. We provide more details of the implementation and analysis of the experimental results in Section 8.4.
| Model | CRATE-T | CRATE-S | CRATE-B | CRATE-L | ViT-T | ViT-S |
| # parameters | 6.09M | 13.12M | 22.80M | 77.64M | 5.72M | 22.05M |
| ImageNet-1K | 66.7 | 69.2 | 70.8 | 71.3 | 71.5 | 72.4 |
| ImageNet-1K ReaL | 74.0 | 76.0 | 76.5 | 77.4 | 78.3 | 78.4 |
| CIFAR10 | 95.5 | 96.0 | 96.8 | 97.2 | 96.6 | 97.2 |
| CIFAR100 | 78.9 | 81.0 | 82.7 | 83.6 | 81.8 | 83.2 |
| Oxford Flowers-102 | 84.6 | 87.1 | 88.7 | 88.3 | 85.1 | 88.5 |
| Oxford-IIIT-Pets | 81.4 | 84.9 | 85.3 | 87.4 | 88.5 | 88.6 |
We hope that, by now, we have provided compelling evidence that the role of (popular) deep networks is to realize certain optimization algorithms for minimizing the coding rate (or maximizing the information gain) of the learned representations. However, readers who are familiar with optimization methods might have noticed that the above architectures (the ReduNet or CRATE) correspond to rather basic optimization techniques, and thus leave plenty of room for improvement in efficiency or effectiveness. Moreover, if we believe the proposed theoretical framework for interpreting deep networks is correct, it should not only help explain existing architectures but also guide us in developing more efficient and effective ones. In this section, we show that this is the case: the resulting new architectures are not only fully interpretable but also come with guaranteed correctness and improved efficiency.
In this subsection, we ask and answer the question: “what is the simplest possible neural network that scales reasonably well and provably compresses the representation?” From our previous theory, the answer should be a stack of MSSA layers, which were motivated as lossy compression operators. This network architecture is simple enough that its compression, or denoising, performance can be analyzed rigorously and systematically. Towards this end, let us formalize our subspace model: the initial token representations \(\bm Z^{1}\) are sampled from a noisy mixture of low-rank Gaussians, supported on subspaces spanned by \(\vU _{k} \in \R ^{d \times p}\), as follows. We partition the \(N\) token indices into known subsets \(C_{1}, \dots , C_{K}\), such that:
where \(\bm {a}_i \overset {i.i.d.}{\sim } \mathcal {N}(\bm {0},\bm {I}_{p_k})\) and \(\bm {e}_{i,j} \overset {i.i.d.}{\sim } \mathcal {N}(\bm {0},\delta ^2\bm {I}_{p_j})\) for all \(i \in C_k\) and \(k \in [K]\), and \(\{\bm {a}_i\}\) and \(\{\bm {e}_{i,j}\}\) are all mutually independent.
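For concreteness, here is a small NumPy sketch that draws token representations from this model. It reflects our reading of (5.3.1), under assumptions of ours for illustration: all subspaces have the same dimension \(p\), the subsets \(C_k\) are contiguous blocks of indices, the signal of a token in \(C_k\) lies in the \(k\)-th subspace, and the noise is spread over the remaining subspaces; the exact conventions in (5.3.1) may differ slightly.

\begin{verbatim}
import numpy as np

def sample_tokens(U_list, n_per_class, delta, rng):
    """Sample tokens from a noisy mixture of low-rank Gaussians (sketch of
    (5.3.1)): token i in class k is U_k a_i + sum_{j != k} U_j e_{ij}, with
    a_i ~ N(0, I_p) and e_{ij} ~ N(0, delta^2 I_p), all independent."""
    K = len(U_list)
    d, p = U_list[0].shape
    cols, labels = [], []
    for k in range(K):
        for _ in range(n_per_class):
            z = U_list[k] @ rng.standard_normal(p)           # signal
            for j in range(K):
                if j != k:                                    # noise in other subspaces
                    z += U_list[j] @ (delta * rng.standard_normal(p))
            cols.append(z)
            labels.append(k)
    return np.stack(cols, axis=1), np.array(labels)           # (d, N), (N,)

rng = np.random.default_rng(0)
d, p, K = 64, 4, 3
# Orthogonal bases for K disjoint subspaces (the simplifying assumption below).
Q = np.linalg.qr(rng.standard_normal((d, K * p)))[0]
U_list = [Q[:, k * p:(k + 1) * p] for k in range(K)]
Z1, labels = sample_tokens(U_list, n_per_class=32, delta=0.1, rng=rng)
print(Z1.shape)   # (64, 96)
\end{verbatim}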
Denoising operator for token representations. Now, we show that the MSSA operator (see (5.2.15)) can incrementally denoise token representations generated from the above model. Specifically, we consider a generalized MSSA operator: for each \(\ell \in [L]\),
where \(\eta > 0\) is the step size, and \(\varphi \) is a function operating on each column of the input matrix, such as the softmax, an element-wise ReLU, etc. To simplify our theory, we assume that the subspaces in Equation (5.3.1) are orthogonal to each other, i.e., \(\bm U_k^\top \bm U_j = \bm 0\) for all \(k \neq j\). Note that this assumption is not restrictive: in high-dimensional spaces, random low-dimensional subspaces are mutually incoherent with high probability, i.e., \(\bm U_k^\top \bm U_j \approx \bm 0\) [WM22].
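This incoherence phenomenon is easy to verify numerically. The following short sketch draws two independent random \(p\)-dimensional subspaces of \(\R ^d\) and reports the largest singular value of \(\bm U_k^\top \bm U_j\) (the cosine of the smallest principal angle), which shrinks towards zero as the ambient dimension \(d\) grows.

\begin{verbatim}
import numpy as np

def max_coherence(d, p, rng):
    """Largest singular value of U1^T U2 for two independent random
    p-dimensional subspaces of R^d (1 = aligned, 0 = orthogonal)."""
    U1 = np.linalg.qr(rng.standard_normal((d, p)))[0]
    U2 = np.linalg.qr(rng.standard_normal((d, p)))[0]
    return np.linalg.svd(U1.T @ U2, compute_uv=False)[0]

rng = np.random.default_rng(0)
for d in (32, 256, 2048):
    vals = [max_coherence(d, 8, rng) for _ in range(20)]
    print(d, round(float(np.mean(vals)), 3))
# Typical behavior: the coherence decreases toward 0 as d increases,
# i.e., U_k^T U_j ~ 0 with high probability in high dimensions.
\end{verbatim}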
Now, let the columns of \(\bm Z_k^{\ell }\) denote the token representations from the \(k\)-th subspace at the \(\ell \)-th layer. To quantify the denoising capability, we define the signal-to-noise ratio (SNR) for each block of the token representations at the \(\ell \)-th layer as follows:
To simplify our analysis, we assume that \(\abs {C_{1}} = \cdots = \abs {C_{K}} =N/K\), and
With the above setup, we can now characterize the denoising performance of a slightly modified MSSA operator, which thresholds entries of the softmax output that are too small.
Theorem 5.1. Let \(\bm Z^{1}\) be defined as in Equation (5.3.1) and let \(\varphi (\cdot )\) in (5.3.2) be \(\varphi (\bm x) = h\left (\sigma (\bm x)\right )\), where \(\sigma :\R ^N \to \R ^N\) is the softmax function and \(h:\R ^N \to \R ^N\) is an element-wise thresholding function with \(h(\bm x)_i = \tau \) if \(x_i > \tau \) and \(h(\bm x)_i = 0\) if \(x_i \leq \tau \) for each \(i \in [N]\). Suppose that \(p \gtrsim \log N\), \(\delta \lesssim \sqrt {\log N}/\sqrt {p}\), and
For sufficiently large \(N\), it holds with probability at least \(1-KLN^{-\Omega (1)}\) that for each \(\ell \in [L]\),
This theorem shows that when the initial token representations are sampled from a mixture of low-rank Gaussian distributions with a noise level \(O(\sqrt {\log N}/\sqrt {p})\), each application of the modified MSSA operator denoises the token representations at a linear rate. Notably, our theoretical results are well supported by experimental observations using the actual MSSA operator in Figure 5.18. This theorem thus provides a theoretical foundation for the practical denoising capability of the transformer architecture derived by unrolling (5.3.2).
Attention-only transformers. To verify that this denoising capability translates to reasonable performance on real data and tasks, we formally propose an attention-only transformer architecture, which we abbreviate as AoT. Specifically, by unrolling the iterative optimization steps (5.3.2) as layers of a deep network, we construct the transformer architecture shown in Figure 5.19. Each layer of the proposed architecture consists only of the MSSA operator and a skip connection. For language tasks, we additionally incorporate LayerNorm before the MSSA operator to improve performance. The complete architecture is built by stacking such layers, along with essential task-specific pre-processing and post-processing steps, such as positional encoding, token embedding, and a final task-specific head to adapt to different applications. Table 5.2 reports the performance of this minimalist transformer on vision classification; the performance is reasonable, despite the minimalist design. In the sequel, we will discuss a conceptual improvement to the CRATE formulation which also improves performance and efficiency.
| Models | Accuracy | # of Parameters |
| AoT | 71.7% | 22M |
| CRATE | 79.5% | 39M |
| ViT | 72.4% | 22M |
In this subsection, we use the coding rate reduction objective to derive a new transformer attention operator whose computational complexity scales linearly with the number of tokens. Specifically, we derive a novel variational form of the MCR\(^2\) objective and show that unrolled gradient descent on this variational objective leads to a new attention module called Token Statistics Self-Attention (TSSA). TSSA has linear computational and memory complexity and radically departs from the typical attention architecture that computes pairwise similarities between tokens. Replacing a traditional attention operator in a transformer with TSSA yields a new neural network architecture called the Token Statistics Transformer (ToST), which is highly efficient and has promising empirical performance. As a preliminary, recall from (4.2.17) that \(\bm \Pi = [\bm \pi _1, \ldots , \bm \pi _K] \in \R ^{N \times K}\) denotes a stochastic “group assignment” matrix (i.e., \(\bm \Pi \bm 1 = \bm 1\) and \(\Pi _{ik} \geq 0, \ \forall (i,k) \in [N] \times [K]\)), where \(\Pi _{ik}\) denotes the probability of assigning the \(i\)-th token to the \(k\)-th group.
A new variational form for coding rates. To begin, we consider a general form of MCR\(^2\)-like objectives based on concave functions of the spectrum of a matrix. Namely, for a given PSD matrix \(\bm M \in \PSD (d)\) and any scalar \(c \geq 0\) we have that \(\log \det (\I + c \bm M) = \sum _{i=1}^{d} \log ( 1 + c \lambda _i(\bm M))\), where \(\lambda _i (\bm M)\) is the \(i\)-th largest eigenvalue of \(\bm M\). Further, note that \(\log (1 + c \sigma )\) is a concave non-decreasing function of \(\sigma \). Thus, we describe our results in terms of a more general form of MCR\(^2\) based on general spectral functions of PSD matrices of the form \(F(\bm M) = \sum _{i=1}^{d} f(\lambda _i(\bm M))\), where \(f\) is concave and non-decreasing. In particular, recall from earlier in this Chapter that the attention mechanism arises from unrolling the compression component of MCR\(^2\), so we consider a more general MCR\(^2\)-style compression function:
For the above objective, we now note the following result:
Theorem 5.2. Let \(f \colon [0, \infty ) \to \R \) be non-decreasing, concave, and obey \(f(0) = 0\), and let \(F \colon \PSD (d) \to \R \) have the form \(F(\bm M) = \sum _{i = 1}^{d}f(\lambda _{i}(\bm M))\). Then for each \(\bm M \in \PSD (d)\) and \(\bm Q \in \O (d)\), we have
Using the above result, we can replace (5.3.6) with an equivalent variational objective of the form
where the equivalence is in the sense that for an optimal choice of matrices \(\{ \bm U_k \in \O (d)\}_{k=1}^K\) as described in Theorem 5.2 (i.e., orthogonal matrices which diagonalize each \(\Z \mathrm {Diag}(\bm \pi _k) \Z ^\top \)) the bound is tight, with \( R^{\rm var}_{c,f} (\Z ,\bm \Pi \mid \bm U_{[K]}) = R_{c,f} (\Z ,\bm \Pi )\). Note that in general, achieving this bound would require selecting, for each sampled instance of \(\Z \), a new optimal set of \(\bm U_{k}\) parameter matrices which diagonalize each \(\Z \mathrm {Diag}(\bm \pi _{k})\Z ^{\top }\), which is clearly impractical for a network architecture. As an alternative viewpoint, rather than treating the data \(\Z \) as fixed and optimizing the \(\bm U_k\) parameters to achieve the tight variational bound, we can take the algorithmic unrolling design principle described above and design an operator that perturbs \(\Z \) to incrementally minimize \(R_{c, f}^{\rm var}(\cdot \mid \bm U_{[K]})\). To make this point explicit, each variational bound becomes tight when the eigenspaces of \(\Z \mathrm {Diag}(\bm \pi _k) \Z ^\top \) align with the columns of \(\bm U_k\); so, by rotating the appropriate columns of \(\Z \) (namely, those corresponding to large entries of \(\bm \pi _k\)) to align with \(\bm U_k\), we can approach a tight variational bound. That is, instead of rotating \(\bm U_k\) to align with the data for each instance of \(\Z \), we rotate the token features in each \(\Z \) to align with \(\bm U_k\).
Following this approach, we compute a gradient descent step on \(R_{c,f}^{\rm var}\) w.r.t. \(\Z \). To begin this computation, first let \(\bm \pi \in \Re ^N\) be any element-wise non-negative vector. Then we have
where \(\nabla f\) is the gradient of \(f\), and (recall) \(\nabla f[\cdot ]\) applies \(\nabla f\) to each element of the vector in the bracket. In particular, for \( f(x) = \log (1 + (d/\epsilon ^{2}) x)\), \(\nabla f(x) = (d / \epsilon ^{2}) (1+ (d / \epsilon ^{2}) x)^{-1}\) is simply a non-linear activation. Also, (recall) \(N_{k} = \langle \bm \pi _{k}, \bm 1\rangle \). Thus, the gradient of \(R^{\rm var}_{c,f}\) w.r.t. \(\Z \) is:
(Note that the \(1/N\) constant arises from a \((N_{k}/N)\cdot (1/N_{k}) = 1/N\) constant in each term of the sum.) If we now consider a gradient step w.r.t. the \(j\)-th token \(\z _j\), we arrive at our proposed incremental compression operator, i.e., our surrogate for a self-attention + residual operator:
for each \(j \in [n]\), where \(\tau > 0\) is a step size parameter for the incremental optimization.
Model interpretation. Given the proposed attention operator in (5.3.12), first recall that the rows of \(\bm \Pi \) are non-negative and sum to 1, so our operator takes a weighted average of \(K\) “attention head”-esque operators and then adds a residual connection. Using that \(\sum _{k = 1}^{K}\Pi _{jk} = 1\), we can rewrite (5.3.12) as:
That is, we can view each attention head as first projecting the token features onto the basis \(\bm U_{k}\) via multiplying by \(\bm U_k^\top \), multiplying by the diagonal matrix \(\bm D(\Z , \bm \pi _{k} \mid \bm U_{k})\) (abbreviated as \(\bm D_{k}\)), projecting back into the standard basis via multiplying by \(\bm U_{k}\), and subtracting this from the original token features via the residual connection. The core aspect of our attention layer is the computation of \(\bm D_{k}\). Namely, \(\Pi _{jk} \geq 0\), so \(\bm \pi _k / \langle \bm \pi _{k}, \bm 1\rangle \in \Re ^N\) forms a probability distribution over which tokens belong to the \(k^\text {th}\) group. As a result, \((\bm U^\top _k \Z )^{\hada 2} \bm \pi _k / \langle \bm \pi _{k}, \bm 1\rangle \) estimates the second moment of \(\bm U_k^\top \Z \) under the distribution given by \(\bm \pi _k / \langle \bm \pi _{k}, \bm 1\rangle \). Further, since \(f\) is a concave non-decreasing function, \(\nabla f(x)\) monotonically decreases towards \(0\) as \(x\) increases, so the entries of \(\bm D_{k}\) (which have form \(\nabla f(x)\)) achieve their maximum at \(x=0\) and decay monotonically to \(0\) as \(x\) increases.
From this, we arrive at the core interpretation of our attention head + residual operators \([\I - (\tau /n)\bm U_{k}\bm D_{k}\bm U_{k}^{\top }]\). Namely, this operator does an approximate low-rank data-dependent projection, where directions which have a large amount of “power” after the projection \(\bm U_k^\top \Z \) (i.e., directions which have a large second moment \((\bm U_{k}^{\top }\Z )^{\hada 2}\bm \pi _k / \langle \bm \pi _{k}, \bm 1\rangle \)) are preserved, while directions which do not are suppressed. To see this, recall that the entries of \(\bm D_k\) decrease monotonically to 0 as the second moment increases, so for directions with large second moments the attention + residual operator acts largely as the identity operator. Conversely, for directions with a small second moment, our operator subtracts a projection of the tokens along those directions, resulting in those directions being suppressed. Compared to the standard self-attention operator, our method clearly does not compute any pairwise similarities between tokens. Rather, the interactions between the tokens in \(\Z \) impact the operator solely through their contribution to the second moment statistic used to construct the \(\bm D_{k}\)’s. Nevertheless, similar to the standard self-attention operator, our method still has a clear interpretation as performing a form of compression towards a data-dependent low-rank structure, in the sense that it performs an approximate low-rank projection, where the specific directions that are suppressed are those which are not strongly aligned with other tokens in the group.
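As a brief aside, the spectral identity \(\log \det (\I + c \bm M) = \sum _{i=1}^{d} \log (1 + c\,\lambda _i(\bm M))\) that motivated this family of objectives is also easy to check numerically; a minimal sketch:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d, c = 8, 2.5
A = rng.standard_normal((d, d))
M = A @ A.T                                     # random PSD matrix
lhs = np.linalg.slogdet(np.eye(d) + c * M)[1]   # log det(I + c M)
lam = np.linalg.eigvalsh(M)                     # eigenvalues of M
rhs = np.sum(np.log(1.0 + c * lam))             # sum_i log(1 + c lambda_i)
print(np.isclose(lhs, rhs))                     # True
\end{verbatim}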
Computational considerations. Having introduced our proposed attention operator, we now discuss how it can be computed practically. First, until this point in the presentation, we have avoided discussion of how tokens are “grouped” into various attention heads via the \(\bm \Pi \) matrix, but clearly a means of constructing \(\bm \Pi \) is needed to implement our method. Additionally, our variational form in Theorem 5.2 requires the \(\bm U\) matrices to be square and orthogonal, but one would ideally like to use smaller matrices (i.e., reduce the number of columns in \(\bm U\)) for efficiency as well as drop the orthogonality constraints.
In practice, we do not enforce the orthogonality constraints. To reduce the number of columns in the \(\bm U\) matrices, we note that, similar to CRATE [YBP+23], if the features \(\bm Z\) within group \(k\) are (approximately) clustered around a low-dimensional subspace, say of dimension \(p\), then the within-group-\(k\) covariance \(\Z \mathrm {Diag}(\bm \pi _{k})\Z ^{\top }\) is low-rank; recall that [YCY+20] shows the optimal geometry of \(\Z \) is for each group to span a low-rank subspace orthogonal to the other groups. We can thus explicitly find a low-dimensional orthonormal basis for the image of this covariance, i.e., the linear span of the data in group \(k\). If the dimension is \(p \leq d\), this basis can be represented by a \(d\times p\) orthogonal matrix \(\bm U_k \in \O (d, p)\). In this case, we can more efficiently upper-bound \(R_{c,f}\) using these low-rank orthogonal basis matrices. To show this, we use a more general version of Theorem 5.2, which yields the following corollary.
Corollary 5.1. Let \(f \colon [0, \infty ) \to \R \) be non-decreasing, concave, and obey \(f(0) = 0\), and let \(F \colon \PSD (p) \to \R \) have the form \(F(\bm M) = \sum _{i = 1}^{p}f(\lambda _{i}(\bm M))\). Let \(\Z \), \(\bm \Pi \) be fixed. Then, for all \(\bm U_{1}, \dots , \bm U_{K} \in \O (d, p)\) such that \(\mathrm {image}(\Z \diag (\bm \pi _{k})\Z ^{\top }) \subset \mathrm {image}(\bm U_{k})\) for all \(k \in [K]\), we have
The final step to define our attention operator is to estimate the group membership \(\bm \Pi \). For this we posit a simple model of how each feature \(\z _{j}\) deviates from its supporting subspace and then find the optimal subspace assignment. [YBP+23] show that if we independently model each \(\z _{j}\) as belonging to a low-dimensional Gaussian mixture model, where each Gaussian has a covariance matrix with identical spectrum and is supported on a subspace with orthonormal basis \(\bm U_{k}\), plus independent Gaussian noise with covariance \(\eta \I \), then the posterior probability that each token \(\z _{j}\) belongs to each subspace is given by the assignment matrix \(\bm \Pi = \bm \Pi (\bm Z \mid \bm U_{[K]})\) as follows:
where \(\eta \) becomes a learnable temperature parameter. Thus, given an input feature \(\Z \), we estimate \(\bm \Pi \) using (5.3.16) and then compute the attention operator. Combining the construction of \(\bm \Pi \) in (5.3.16) with (5.3.12), we obtain the Token Statistics Self-Attention operator:
where \(\bm \pi _{k}\) are the columns of \(\bm \Pi = \bm \Pi (\Z \mid \bm U_{[K]})\) defined in (5.3.16) and \(\bm D\) is defined in (5.3.10).
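To make the operator concrete, below is a minimal NumPy sketch of TSSA as we read (5.3.10), (5.3.12), and (5.3.16). It is an illustration only: the membership \(\bm \Pi \) is computed as a softmax over the per-subspace projection energies \(\|\bm U_k^\top \z _j\|_2^2\) with a temperature \(\eta \), we take \(f(x) = \log (1 + (d/\epsilon ^2)x)\), and the learned per-head scalings and other practical details of the ToST implementation are omitted.

\begin{verbatim}
import numpy as np

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=axis, keepdims=True)

def tssa(Z, U_list, tau=1.0, eps=1.0, eta=1.0):
    """Token Statistics Self-Attention (sketch). Z: (d, N) tokens,
    U_list: K bases of shape (d, p). Cost is linear in N: only second-moment
    statistics of the projected tokens are formed, never an N x N similarity."""
    d, N = Z.shape
    K = len(U_list)
    c = d / eps**2
    grad_f = lambda x: c / (1.0 + c * x)          # gradient of f(x)=log(1+c x)

    proj = [U.T @ Z for U in U_list]              # K projections, each (p, N)
    # Membership Pi: softmax over per-subspace projection energies
    # (our reading of (5.3.16)); eta plays the role of a temperature.
    energy = np.stack([(P**2).sum(axis=0) for P in proj], axis=1)   # (N, K)
    Pi = softmax(energy / (2.0 * eta), axis=1)

    Z_new = Z.copy()
    for k in range(K):
        pi_k = Pi[:, k]                           # (N,)
        P = proj[k]                               # (p, N)
        # Second moment of U_k^T Z under the distribution pi_k / <pi_k, 1>.
        m2 = (P**2) @ pi_k / pi_k.sum()           # (p,)
        D_k = grad_f(m2)                          # diagonal of D_k
        # Residual update: z_j <- z_j - (tau/N) * Pi_jk * U_k D_k U_k^T z_j.
        Z_new -= (tau / N) * U_list[k] @ (D_k[:, None] * P) * pi_k[None, :]
    return Z_new

# Smoke test with random bases and tokens.
rng = np.random.default_rng(0)
d, p, K, N = 32, 4, 4, 10
U_list = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(K)]
Z = rng.standard_normal((d, N))
print(tssa(Z, U_list).shape)   # (32, 10)
\end{verbatim}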
By starting with a standard Transformer architecture and replacing the attention operator with TSSA, we obtain an architecture called the Token Statistics Transformer (ToST), visualized in Figure 5.20. We re-emphasize that the TSSA operator reduces the time and memory complexity of the attention operator from quadratic in the number of tokens to linear, which makes ToST much more efficient than its empirically designed counterpart. We now briefly summarize the empirical performance and efficiency of the ToST architecture.
| Datasets | ToST-T(iny) | ToST-S(mall) | ToST-M(edium) | ViT-S | ViT-B(ase) |
| # parameters | 5.8M | 22.6M | 68.1M | 22.1M | 86.6M |
| ImageNet | 67.3 | 77.9 | 80.3 | 79.8 | 81.8 |
| ImageNet ReaL | 72.2 | 84.1 | 85.6 | 85.6 | 86.7 |
| CIFAR10 | 95.5 | 96.5 | 97.5 | 98.6 | 98.8 |
| CIFAR100 | 78.3 | 82.7 | 84.5 | 88.8 | 89.3 |
| Oxford Flowers-102 | 88.6 | 92.8 | 94.2 | 94.0 | 95.7 |
| Oxford-IIIT-Pets | 85.6 | 91.1 | 92.8 | 92.8 | 94.1 |
| Model | ListOps | Text | Retrieval | Image | Pathfinder | Avg |
| Reformer | 37.27 | 56.10 | 53.40 | 38.07 | 68.50 | 50.56 |
| BigBird | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | 54.17 |
| LinFormer | 16.13 | 65.90 | 53.09 | 42.34 | 75.30 | 50.46 |
| Performer | 18.01 | 65.40 | 53.82 | 42.77 | 77.05 | 51.18 |
| Transformer | 37.11 | 65.21 | 79.14 | 42.94 | 71.83 | 59.24 |
| ToST | 37.25 | 66.75 | 79.46 | 46.62 | 69.41 | 59.90 |
Some of the most popular applications of deep networks are in the regime of sequence data where, unlike the previous example of imagery, there is a natural order to the data. For example, unlike in imagery (where one tends to look at the whole image at once), language is processed one token at a time (causally). This simple fact motivates the development of large language models (LLMs), which are trained to predict (the probability distribution of) the next token in a sequence given only the history of previous tokens. Then, to sample from an LLM, one simply iteratively samples from the predicted distribution for the next token and then appends the sampled token to the history. In order to train LLMs, and other models such as video generation models, it is necessary to develop architectures which can efficiently perform this causal computation.
With this in mind, and the understanding of causal autoregressive processes from Chapter 1, we say that an encoder \(f\) is causal if for every input \(\vX = [\vx _{1}, \dots , \vx _{N}]\), it holds
where \(\vZ _{1:N-1} = [\vz _{1}, \dots , \vz _{N-1}]\), etc. In English, this means that the first \(N-1\) features equal the output of the encoder applied to the first \(N-1\) token embeddings of the input; even more simply, it means that we can compute the features of each token one at a time and obtain the same result as computing the features for the entire input at once. This causality is necessary for the previously mentioned applications (e.g., LLMs). Note, however, that the previously presented implementations of CRATE and ToST are not causal (the proof is left as an exercise).
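This property is easy to test empirically. The following sketch checks, for a given encoder \(f\) mapping a \(d \times N\) matrix of token embeddings to a \(d \times N\) matrix of features, whether the features of a prefix agree with the corresponding prefix of the features of the full input; the two toy encoders are our own illustrations, not architectures from this chapter.

\begin{verbatim}
import numpy as np

def is_causal(f, X, atol=1e-8):
    """Check the causality definition: running f on the first N-1 tokens must
    reproduce the first N-1 columns of f applied to all N tokens."""
    Z_full = f(X)
    Z_prefix = f(X[:, :-1])
    return np.allclose(Z_full[:, :-1], Z_prefix, atol=atol)

# Toy encoders (illustration only).
def cumulative_mean_encoder(X):
    # z_i depends only on x_1, ..., x_i  ->  causal.
    return np.cumsum(X, axis=1) / np.arange(1, X.shape[1] + 1)

def global_mean_encoder(X):
    # every z_i depends on all tokens  ->  not causal.
    return X - X.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
print(is_causal(cumulative_mean_encoder, X))  # True
print(is_causal(global_mean_encoder, X))      # False
\end{verbatim}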
We can construct causal white-box architectures by a variety of methods. Here, we will showcase a simple method which builds on our previous unrolled optimization framework. Specifically, we compute the features \(\vz _{1}, \dots , \vz _{N}\) corresponding to the first \(N\) tokens of the input one-at-a-time to optimize the representation learning objective, such as the sparse rate reduction (5.2.5):
If we follow through with the two-step unrolling procedure that yielded CRATE, we can obtain the iteration:
or, using the same conversion from these two steps to network operators that we used for CRATE,
Let us investigate this iteration in slightly more detail. First, note that, by construction, this sequence of features corresponds to a causal encoder. Next, suppose that we have computed \(\vZ _{1:i-1}^{\ell + 1} = [\vz _{1}^{\ell + 1}, \dots , \vz _{i-1}^{\ell + 1}]\), having computed the quantities
along the way, and now want to compute \(\vz _{i}^{\ell + 1}\) (as is relevant, for instance, during LLM inference). In this case, note that the update rule for \(\vz _{i}^{\ell + 1/2}\) can be simplified as:
This step becomes highly efficient if we cache \(\vU _{k}^{\top }\vZ _{1:i-1}\) from previous steps and add a single column to it each time we compute the projection \(\vU _{k}^{\top }\vz _{i}\). Namely, we greatly reduce the number of large matrix-matrix products and replace them by cache loads and matrix-vector products, which is much cheaper overall in terms of time complexity. This caching is the reason why causal generative models such as LLMs can efficiently sample thousands of tokens per second, even if each training step takes a few seconds by itself. This cache of the subspace projections of the features is the analogue of the so-called “KV cache” in conventional transformers.
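A sketch of this caching pattern is given below; the class and method names are hypothetical, and we cache only the per-head projections \(\vU _{k}^{\top }\vZ _{1:i-1}\), which play the role that keys and values play in a conventional KV cache.

\begin{verbatim}
import numpy as np

class ProjectionCache:
    """Per-head cache of U_k^T Z_{1:i-1}: appending one token costs a few
    matrix-vector products instead of recomputing U_k^T Z_{1:i} from scratch."""
    def __init__(self, U_list):
        self.U_list = U_list
        self.proj = [np.zeros((U.shape[1], 0)) for U in U_list]  # (p, 0) each

    def append(self, z_new):
        """Add one new token's projections to the cache and return them."""
        cols = []
        for k, U in enumerate(self.U_list):
            col = U.T @ z_new                                 # (p,) matrix-vector
            self.proj[k] = np.concatenate([self.proj[k], col[:, None]], axis=1)
            cols.append(col)
        return cols

rng = np.random.default_rng(0)
d, p, K = 32, 4, 3
U_list = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(K)]
cache = ProjectionCache(U_list)
for _ in range(5):                     # ingest/generate tokens one at a time
    cache.append(rng.standard_normal(d))
print([P.shape for P in cache.proj])   # [(4, 5), (4, 5), (4, 5)]
\end{verbatim}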
Finally, let us consider the case where we wish to train a causal CRATE model and want the most efficient way to do so given a full input sequence \(\vX = [\vx _{1}, \dots , \vx _{N}]\) at once. The ISTA step acts on each token independently, so it parallelizes across tokens into the regular ISTA step of the non-causal CRATE, i.e.,
therefore implying that the ISTA step remains the same as in the non-causal CRATE. The MSSA step is more interesting, since it changes in a meaningful way. To see how, note that the causal MSSA operator has a special structure. Namely, if we define the causal MSSA operator as the block matrix:
then, working our way through the softmax algebra (proof is again left as an exercise), we can show that
Here the matrix \(\vM _{N} \in \R ^{N \times N}\) encodes the fact that no feature produced by CMSSA, i.e., the causal MSSA, depends on a future feature: the relevant entries of the argument to the softmax are \(-\infty \), hence are set to \(0\) by the exponential inside the softmax and are effectively ignored. This matrix \(\vM _{N}\) is sometimes called a (causal) attention mask.
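For illustration, the following sketch constructs such a mask and applies it inside a softmax. We use the standard row-query convention (row \(j\) holds the weights that token \(j\) places on all tokens); depending on whether tokens index rows or columns of the attention matrix, the mask in the text may be the transpose of the one below.

\begin{verbatim}
import numpy as np

def causal_mask(N):
    """(N, N) attention mask: 0 on and below the diagonal, -inf strictly above.
    The -inf entries forbid each token from attending to future tokens."""
    M = np.zeros((N, N))
    M[np.triu_indices(N, k=1)] = -np.inf
    return M

def masked_softmax_rows(S, M):
    """Row-wise softmax of S + M; the -inf entries exponentiate to exact zeros."""
    A = S + M
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 6))                   # U_k^T Z, shape (p, N)
A = masked_softmax_rows(P.T @ P, causal_mask(6))  # (N, N) causal attention
print(np.allclose(np.triu(A, k=1), 0.0))          # True: no weight on the future
\end{verbatim}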
The upshot of this is that when we have the full sequence \(\vZ ^{\ell }\), we can compute \(\vZ ^{\ell + 1}\) in time similar to or less than that of the usual CRATE layer (since \(\vM _{N}\) is hard-coded, input-independent, and enables us to ignore many entries for the softmax). Thus, we can define and train the full sequence-to-sequence model in the same way as the regular architecture.
When we apply this to language modeling (details to be provided in Chapter 8), we find that we can obtain reasonable results compared to similar-sized empirically designed language models, as shown in Table 5.5.
| Model | # parameters | OWT | LAMBADA | WikiText | PTB | Avg |
| GPT2-Base | 124M | 2.85 | 4.12 | 3.89 | 4.63 | 3.87 |
| GPT2-Small | 64M | 3.04 | 4.49 | 4.31 | 5.15 | 4.25 |
| Causal-CRATE-Base | 60M | 3.37 | 4.91 | 4.61 | 5.53 | 4.61 |
Despite the theoretical advantages of the causal CRATE architecture, the gap on language modeling (and more generally sequence modeling) tasks still remains. Below, we discuss two strategies for potentially closing this gap; they are each the subject of active lines of research.
Example 5.3 (Better Objectives and Unrolling Strategies). One possibility is that the sparse rate reduction objective is not the best objective for causal sequence data. For example, we may want to ensure that the features have some kind of temporal relationship or structure, e.g., being generated by a linear dynamical system with Gaussian (mixture) initialization and Gaussian step noise. There is a recent but explosively popular line of work on efficiently and explicitly modeling dynamics in the features using so-called state-space models (SSMs, whose name overlaps with, but should not be confused with, the classical notion of “state space models” discussed in Chapter 1) [GD24; YKH24; YWZ+24]. Such models are empirically promising because they avoid the need for all-pairs attention, dodging the computational and memory complexity of the MSSA operator and its conventional cousin, multi-head attention. Despite this efficiency gain, they are highly performant and have even been used in some large-scale language models [TZL+25]. By designing more sophisticated objectives and unrolling strategies, we may be able to build a “white-box” SSM architecture, or an SSM-transformer hybrid, and thus close the gap on causal tasks. \(\blacksquare \)
Example 5.4 (Inference-Time Computation Strategies). Another path to improve causal models is to use more so-called inference-time compute strategies: during inference (such as sampling from an LLM) find a way to spend more computational resources in order to obtain a better result. If we are able to do this, we can solve harder problems by allocating more compute to the task at hand.
One naive approach is to loop the model: feed the final features back into the initial sequence-to-sequence layers so that they go through more rounds of iterative denoising. This methodology is known in the literature as “looped transformers” (for the case in which the backbone model is an empirically designed transformer); it has been shown to help on some particularly difficult tasks [GRS+23; SDL+25; YLN+23]. However, from our perspective it is more reasonable to loop the iterate through the same layer multiple times, taking multiple local denoising or compression steps against the same model at each layer before moving on to the next.
Still another potential methodology relies on our understanding of the low-dimensional structure in the features and model parameters. For example, since the \(\vU _{[K]}\) matrices are supposed to be bases for the subspaces spanned by the features, we could update them at inference time using an online algorithm such as online (G)PCA. Similar steps could adapt the dictionary at inference time. This would allow us to improve and specialize the local denoising or compression operators at each layer for the features at hand, potentially yielding better performance.
There have been many empirical attempts to improve inference-time compute methods. It is still a very active area of ongoing research. \(\blacksquare \)
The materials presented in this chapter are based on a series of recent works on this topic, including [CYY+22; WLP+24; WLY+25; WDL+25; YBP+23]. These contributions encompass both theoretical advances and practical methodologies for constructing interpretable deep networks through unrolled optimization. Many of the key results and proofs discussed in this chapter are derived directly from, or inspired by, these foundational works.
The idea of unrolling an optimization algorithm to construct a neural network traces back to the seminal work [GL10]. In this work, the authors demonstrated that sparse coding algorithms—such as the Iterative Shrinkage-Thresholding Algorithm (ISTA)—can be unrolled to form multilayer perceptrons (MLPs), effectively bridging iterative optimization and neural network design. Notably, [MLE19] demonstrated that such unrolled networks are more interpretable, efficient, and effective compared to generic networks.
In this chapter, we build on this perspective to develop principled, white-box deep network architectures by unrolling optimization algorithms that are designed to minimize well-motivated objectives—such as the (sparse) rate reduction objective introduced earlier. This approach not only clarifies the role of each layer in the network but also offers theoretical grounding for architectural choices, moving beyond empirical trial-and-error toward interpretable and goal-driven design. In the following, we compare conventional DNNs, which are typically constructed through empirical design and heuristic tuning, with our mathematically grounded ReduNet architectures:
| | Conventional DNNs | ReduNets |
| Objectives | input/output fitting | information gain |
| Deep architectures | trial & error | iterative optimization |
| Layer operators | empirical | projected gradient |
| Shift invariance | CNNs+augmentation | invariant ReduNets |
| Initializations | random/pre-design | forward unrolled |
| Training/fine-tuning | back prop | forward/back prop |
| Interpretability | black box | white box |
| Representations | hidden/latent | incoherent subspaces |
Exercise 5.1. Let \(\bm Z = [\bm Z_1,\dots ,\bm Z_K] \in \R ^{d\times N}\) with \(\bm Z_k \in \R ^{d\times N_k}\) for each \(k \in [K]\). For some \(\alpha > 0\), let
where \(\bm X \doteq \bm I + \alpha \bm Z\bm Z^\top \). Hint: Note that
where the equality holds if and only if \(\bm Z_k^\top \bm Z_l = \bm 0\) for all \(k \neq l \in [K]\).
Hint: Let \(r_k=\mathrm {rank}(\bm Z_k)\). Consider the following singular value decomposition of \(\bm Z_k\):
Exercise 5.2 (Neumann series for matrix inverse). Let \(\bm A \in \mathbb {R}^{n\times n}\). If \(\|\bm A\| < 1\), please show
Hint: The proof consists of two steps.
Exercise 5.3. In this exercise we will consider a coding rate measure \(R_{\epsilon }^{c, \mu }\) which uses the means of the distributions, i.e., a coding rate for tokens drawn (non-i.i.d.) from the low-rank Gaussian mixture model
Exercise 5.5. Please show Corollary 5.1 when \(Kp \le d\).