Chapter 5 Consistent and Self-Consistent Representations

"Everything should be made as simple as possible, but not any simpler."

  – Albert Einstein

In the past chapters, we have established a basic fact: the fundamental goal of learning is to learn a data distribution with low-dimensional support and transform it into a compact and structured representation. Such a representation reveals the intrinsic low-dimensional structures of the data distribution and facilitates subsequent tasks such as classification and generation.

A fundamental approach to learning a good representation of such a distribution is through compression. To make the goal of compression measurable and computable, we can pursue it explicitly by learning a coding scheme that minimizes the coding rate (entropy) or maximizes the information gain (coding rate reduction). In this context, the fundamental role of a deep neural network is to realize a certain iterative optimization algorithm that incrementally optimizes the learned representations in terms of those measures:

f\colon \bm{X} \xrightarrow{\;f^{0}\;} \bm{Z}^{0} \rightarrow \cdots \rightarrow \bm{Z}^{\ell} \xrightarrow{\;f^{\ell}\;} \bm{Z}^{\ell+1} \rightarrow \cdots \rightarrow \bm{Z}^{L} = \bm{Z}. \qquad (5.0.1)

In the preceding chapter, we have shown that the main architectural characteristics of almost all popular deep networks (ResNets, CNNs, and Transformers) can be derived and interpreted from this perspective.

However, when we try to achieve a certain objective through optimization, there is no guarantee that the solution $\bm{Z}$ found in the end by incremental optimization is the correct solution. In fact, even if the optimization process manages to find the globally optimal solution $\bm{Z}^{*}$, there is no guarantee that this solution corresponds to a complete representation of the data distribution.\footnote{This could be due to many reasons: for example, the data available for learning the distribution might not be sufficient, or the formulation of the optimization program fails to consider some additional constraints or conditions.} Therefore, an outstanding question is: how can we ensure that the learned representation of the data distribution is correct or good enough?

Of course, the only way to verify this is to see whether there exists a decoding map, say $g$, that can decode the learned representation to reproduce the original data (distribution) well enough:

\bm{X} \xrightarrow{\;\mathcal{E}=f\;} \bm{Z} \xrightarrow{\;\mathcal{D}=g\;} \hat{\bm{X}} \qquad (5.0.2)

in terms of some measure of similarity:

d(\bm{X}, \hat{\bm{X}}). \qquad (5.0.3)

This leads to the concept of a consistent representation. As we have briefly alluded to in Chapter 1 (see Figure 1.23), autoencoding, which integrates the encoding and decoding processes, is a natural framework for learning such a representation. We have studied some important special cases in Chapter 2. In this chapter, in Section 5.1, we will study how to extend autoencoding to more general classes of distributions by enforcing the consistency between $\bm{X}$ and $\hat{\bm{X}}$.

In many practical and natural learning scenarios, it can be difficult or even impossible to compare the distributions of the data $\bm{X}$ and $\hat{\bm{X}}$. We are then left with the only option of comparing the learned feature $\bm{Z}$ with its image $\hat{\bm{Z}}$ under the encoder $f$:

\bm{X} \xrightarrow{\;\mathcal{E}=f\;} \bm{Z} \xrightarrow{\;\mathcal{D}=g\;} \hat{\bm{X}} \xrightarrow{\;\mathcal{E}=f\;} \hat{\bm{Z}}\,? \qquad (5.0.4)

This leads to the notion of a self-consistent representation. In Section 5.2, we will study when and how we can learn a consistent representation by enforcing only the self-consistency between $\bm{Z}$ and $\hat{\bm{Z}}$, through a closed-loop transcription framework.

Furthermore, in many practical and natural learning scenarios, we normally do not have sufficient samples of the data distribution all available at once. For example, animals and humans develop their visual memory by continuously taking in increments of observations throughout their lives. In Section 5.3, we will study how to extend the closed-loop transcription framework to learn a self-consistent representation in a continuous learning setting.

Of course, a fundamental motivation for why we ever want to identify the low-dimensional structures in a data distribution and find a good representation is to make it easy to use the data for various tasks of intelligence, such as classification, completion, and prediction. Therefore, the resulting joint representation $(\bm{x},\bm{z})$ must be structured in a way that is best suited for these tasks. In the next chapter, we will see how the learned representation can be structured to facilitate conditioned completion or generation tasks.

5.1 Learning Consistent Representations

Here we give a formal definition of consistent representations, which are closely related to the concept of autoencoding.

Definition 5.1 (Consistent Representations).

Given data $\bm{X}$, a consistent representation is a pair of functions $(f\colon\mathcal{X}\to\mathcal{Z},\; g\colon\mathcal{Z}\to\mathcal{X})$, such that the features $\bm{Z}=f(\bm{X})$ are compact and structured, and the autoencoding

\hat{\bm{X}} \doteq g(\bm{Z}) = g(f(\bm{X})) \qquad (5.1.1)

is close to $\bm{X}$ according to either of the following two measures:

1. We say that the representation is sample-wise consistent if $\bm{X}\approx\hat{\bm{X}}$ in a certain norm with high probability.

2. We say that the representation is distributionally consistent if $\operatorname{Law}(\bm{X})\approx\operatorname{Law}(\hat{\bm{X}})$.

Astute readers may have noticed that if we do not impose certain requirements on the representation $\bm{Z}$ sought, the above problem has a trivial solution: one may simply choose the functions $f$ and $g$ to be the identity map! Hence, the true purpose of seeking an autoencoding is to ensure that the $\bm{Z}$ so obtained is both more compact and more structured than $\bm{X}$. First, for compactness, $\bm{Z}$ should better reveal the intrinsic low-dimensionality of $\bm{X}$. Therefore, the representation should maximize a certain information gain, say, measured by the rate reduction

\Delta R_{\epsilon}(\bm{Z}) \qquad (5.1.2)

introduced in Section 3.4.2. Second, the main purpose of learning a good representation of the data distribution is to facilitate tasks that exploit the low-dimensionality of its distribution. Hence, the distribution of $\bm{Z}$ should be better structured. For example, the distribution of $\bm{Z}$ is piecewise linear or Gaussian, and its components are largely incoherent or independent. These independent components can represent different clusters or classes and can also easily be used as conditions for decoding the corresponding data $\bm{x}$.

From the definition of a consistent representation, it is required that the representation $\bm{Z}$ be sufficient to recover the original data $\bm{X}$ to some degree of accuracy. For sample-wise consistency, a typical choice is to minimize the expected reconstruction error:

d(\bm{X},\hat{\bm{X}}) = \mathbb{E}\big[\|\bm{X}-\hat{\bm{X}}\|_{2}^{2}\big]. \qquad (5.1.3)

For consistency in distribution, a typical choice is to minimize a certain distributional distance such as the KL divergence:\footnote{Note that for distributions without common support, which is typical for degenerate distributions, the KL divergence may not even be well-defined. In fact, much of the distribution learning literature tries to address this technical difficulty by replacing or approximating it with something well-defined and efficiently computable.}

d(\bm{X},\hat{\bm{X}}) = \mathcal{D}_{KL}(\bm{X}\,\|\,\hat{\bm{X}}). \qquad (5.1.4)

Hence, computation aside, when we seek a good autoencoding for a data distribution $\bm{X}$, conceptually we try to find an encoder $f$ and a decoder $g$ such that

\min_{f,g}\big[-\Delta R_{\epsilon}(\bm{Z}) + d(\bm{X},\hat{\bm{X}})\big]. \qquad (5.1.5)
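To make this objective concrete, here is a minimal sketch of how one might evaluate a surrogate of (5.1.5) numerically. It assumes the multi-class rate reduction $\Delta R_{\epsilon}$ from Section 3.4.2, computed with respect to a given partition of the features (encoded by `labels`), and the trade-off weight `lam` is an illustrative hyperparameter, not something prescribed by the text.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    # R_eps(Z): rate needed to code the columns of Z (d x N) up to precision eps.
    d, N = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (N * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    # Delta R_eps(Z): rate of the whole feature set minus the average rate of its parts.
    d, N = Z.shape
    Rc = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]
        Nc = Zc.shape[1]
        _, logdet = np.linalg.slogdet(np.eye(d) + (d / (Nc * eps**2)) * Zc @ Zc.T)
        Rc += (Nc / (2.0 * N)) * logdet
    return coding_rate(Z, eps) - Rc

def autoencoding_objective(X, X_hat, Z, labels, eps=0.5, lam=1.0):
    # Objective (5.1.5): reconstruction error minus (weighted) rate reduction, to be minimized.
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=0))
    return recon - lam * rate_reduction(Z, labels, eps)
```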

For the rest of this chapter, we will study how to solve such autoencoding problems under different conditions, from simple and ideal cases to increasingly more challenging and realistic conditions.

5.1.1 Linear Autoencoding via PCA

According to [Bal11], the phrase “autoencoder” was first introduced by Hinton and Rumelhart [RHW86] so that a deep representation could be learned via back propagation (BP) in a self-supervised fashion—reconstructing the original data is the self-supervising task. However, the very same concept of seeking a compact and consistent representation has its roots in many classic studies. As we have already seen in Chapter 2, the classical PCA, ICA, and sparse dictionary learning all share a similar goal. The only difference is that when the underlying data distribution is simple (linear and independent), the encoding or decoding mappings become easy to represent and learn: they do not need to be deep and often can be computed in closed form or with an explicit algorithm.

It is instructive to see how the notion of consistency we have defined plays out in the simple case of PCA: here, the consistent encoding and decoding mappings are given by a single-layer linear transform:

\bm{X} \xrightarrow{\;\mathcal{E}=\bm{U}^{\top}\;} \bm{Z} \xrightarrow{\;\mathcal{D}=\bm{U}\;} \hat{\bm{X}}, \qquad (5.1.6)

where $\bm{U}\in\mathbb{R}^{D\times d}$, typically with $d\ll D$. Hence $\bm{U}^{\top}$ represents a projection from a higher-dimensional space $\mathbb{R}^{D}$ to a lower-dimensional one $\mathbb{R}^{d}$, as illustrated in Figure 5.1.

Figure 5.1: Illustration of a typical autoencoder such as PCA, seeking a low-dimensional representation $\bm{z}$ of high-dimensional data $\bm{x}$.

As we saw in Chapter 2, when the distribution of $\bm{x}$ is indeed supported on a low-dimensional subspace $\bm{U}_{o}$, the compactness of the representation $\bm{z}$ produced by $\mathcal{E}$ is a direct consequence of correctly estimating (and enforcing) the dimension of this subspace. Finally, recall that gradient descent on the reconstruction criterion exactly yields these sample-wise consistent mappings: indeed, the optimal solution to the problem

\min_{\bm{U}^{\top}\bm{U}=\bm{I}}\; \mathbb{E}_{\bm{x}}\left[\left\|\bm{x} - \bm{U}\bm{U}^{\top}\bm{x}\right\|_{2}^{2}\right] \qquad (5.1.7)

precisely coincides with $\bm{U}_{o}$ when the dimension of the representation is sufficiently large. In this case, we obtain sample-wise consistency for free, since this guarantees that $\bm{U}_{o}\bm{U}_{o}^{\top}\bm{x}=\bm{x}$. Notice that in the case of PCA, the rate reduction term in (5.1.5) becomes void, as the regularization on the representation $\bm{Z}$ sought is explicit: it spans the entire subspace of lower dimension.\footnote{One may view $\Delta R = 0$ in this case.}
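The PCA autoencoder can be computed in closed form. The following is a minimal sketch (the data generation and dimensions are illustrative): the encoder/decoder pair is read off from the top singular vectors of the data matrix, and on data exactly supported on a $d$-dimensional subspace the reconstruction is exact, i.e., sample-wise consistent.

```python
import numpy as np

# Data supported on a d-dimensional subspace of R^D (illustrative setup).
D, d, N = 20, 3, 1000
U_o = np.linalg.qr(np.random.randn(D, d))[0]   # a ground-truth orthonormal basis
X = U_o @ np.random.randn(d, N)                # samples lying on span(U_o)

# PCA encoder/decoder: top-d left singular vectors of the data matrix.
U, _, _ = np.linalg.svd(X, full_matrices=False)
U = U[:, :d]                                   # encoder is U.T, decoder is U

Z = U.T @ X                                    # encode: project to R^d
X_hat = U @ Z                                  # decode: lift back to R^D
print(np.allclose(X, X_hat))                   # True (up to numerical precision)
```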

Online PCA.

Notice that in the above construction, the linear transform $\bm{U}$ used for encoding and decoding is computed "offline" from all the input data beforehand. One question is whether this transform can instead be learned "online" as the input data arrives in sequence. This question was answered by the work of Oja in 1982 [Oja82].

Example 5.1 (Normalized Hebbian learning scheme for PCA).

Consider a sequence of i.i.d. random vectors $\bm{x}_{1},\ldots,\bm{x}_{i},\ldots\in\mathbb{R}^{n}$ with covariance $\bm{\Sigma}\in\mathbb{R}^{n\times n}$. Let $\bm{u}_{0}\in\mathbb{R}^{n}$, and define the response of an input vector $\bm{x}_{i}$ against a weight vector $\bm{u}_{i}$ to be their inner product:

\eta_{i} = \bm{u}_{i}^{\top}\bm{x}_{i} \qquad (5.1.8)

and we update the weight vector according to the following scheme:

\bm{u}_{i+1} = \frac{\bm{u}_{i} + \gamma\,\eta_{i}\,\bm{x}_{i}}{\|\bm{u}_{i} + \gamma\,\eta_{i}\,\bm{x}_{i}\|_{2}} \qquad (5.1.9)

for some small gain $\gamma>0$. This update scheme can be viewed as a normalized Hebbian scheme, in which the weights of connections between neurons become stronger if (the product of) both the input $\bm{x}$ and the output $\eta$ are strong. One may view the vector of weights $\bm{u}$ as "learned" based on a form of feedback from the output $\eta$. Then, under reasonable assumptions, Oja [Oja82] has shown that $\bm{u}_{i}$ converges to the eigenvector associated with the largest eigenvalue of $\bm{\Sigma}$. $\blacksquare$

The normalized Hebbian scheme (5.1.9) can be interpreted as a first-order approximation to a stochastic projected gradient descent scheme on the objective of the problem (5.1.7) (with batch size 1, and with the number of columns of $\bm{U}$ equal to 1), as long as $\|\bm{u}\|_{2}=1$, which is maintained by the projection operation in (5.1.9). It is worth keeping its existence in the back of one's mind, both as a proof of correctness for the use of stochastic gradient methods for optimizing reconstruction costs such as (5.1.7), and for its suggestion that simpler algorithms than (end-to-end) back propagation can succeed in learning consistent autoencoders.
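Here is a minimal numerical sketch of the update (5.1.9); the covariance, gain, and number of iterations are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

n, T, gamma = 10, 20000, 0.01
A = np.random.randn(n, n)
Sigma = A @ A.T                                   # a random covariance; its top eigenvector is the target

u = np.random.randn(n)
u /= np.linalg.norm(u)
for _ in range(T):
    x = np.random.multivariate_normal(np.zeros(n), Sigma)
    eta = u @ x                                   # response (5.1.8)
    u = u + gamma * eta * x                       # Hebbian update
    u /= np.linalg.norm(u)                        # normalization (projection back to the sphere)

# Compare with the top eigenvector of Sigma (up to sign); the overlap should approach 1.
top = np.linalg.eigh(Sigma)[1][:, -1]
print(abs(u @ top))
```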

Figure 5.2: A depiction of interpolation through manifold flattening ($\bm{x}\xrightarrow{\;f\;}\bm{z}\xrightarrow{\;g\;}\hat{\bm{x}}$) on a manifold in $\mathbb{R}^{3}$ of dimension $d=2$. To interpolate two points on the data manifold, map them through the flattening map $f$ to the flattened space, take their convex interpolants, and then map them back to the data manifold through the reconstruction map $g$.

5.1.2 Nonlinear PCA and Autoencoding

Of course, one should expect that things will no longer be so simple when we deal with more complex distributions whose underlying low-dimensional structures are nonlinear.

Data on a Nonlinear Submanifold.

So, to move beyond the linear structure addressed by PCA, we may assume that the data distribution lies on a (smooth) submanifold $\mathcal{M}$. The intrinsic dimension of the submanifold, say $d$, is typically much lower than the dimension of the ambient space $\mathbb{R}^{D}$. From this geometric perspective, we typically want to find a nonlinear mapping $f$ such that the resulting manifold $f(\mathcal{M})$ is flattened, as illustrated by the example shown in Figure 5.2. The resulting feature $\bm{z}$-space is typically more compact (of lower dimension) than the $\bm{x}$-space, and the manifold is flat. From the statistical perspective, which is complementary to the geometric perspective but distinct in general, we may also want to ensure that the data distribution on $\mathcal{M}$ is mapped to a sufficiently regular distribution, say a Gaussian or uniform distribution (with very low-dimensional support), in the $\bm{z}$-space. These two properties ensure that sampling and interpolation in the $\bm{z}$-space are as easy as possible, and they are mathematical formalizations of the desirable notions of compact and structured features in the low-dimensional manifold model for the data distribution. In general, the problem of learning such an autoencoding mapping for this class of data distributions is known as nonlinear principal component analysis (NLPCA).

A Classical Attempt via a Two-Layer Network.

As we have seen above, in the case of PCA a single-layer linear neural network is sufficient. That is no longer the case for NLPCA. In 1991, Kramer [Kra91] proposed to solve NLPCA by using a two-layer neural network to represent the encoder mapping $f$ (or its inverse $g$), based on the universal approximation property of two-layer networks with sigmoid activation:

\bm{z} = \bm{W}_{2}\,\sigma(\bm{W}_{1}\bm{x} + \bm{b}), \qquad (5.1.10)

where $\sigma(\,\cdot\,)$ is the sigmoid function:

\sigma(x) = \frac{1}{1+e^{-x}}. \qquad (5.1.11)

Cybenko [Cyb89] showed that functions of the above form (with enough hidden nodes) can approximate any smooth nonlinear function, say the encoder $f(\,\cdot\,)$, to arbitrary precision. In particular, they can represent the flattening and reconstruction maps for data distributions supported on unions of low-dimensional manifolds, as in Figure 5.2. The overall architecture of the original networks proposed by Kramer is illustrated in Figure 5.3.

Figure 5.3: Nonlinear PCA by autoassociative neural networks of depth two for both the encoding and decoding mappings, as suggested by Kramer [Kra91].

Unfortunately, unlike the case of PCA above, there is in general no closed-form learning scheme for the parameters $\bm{\theta}=(\bm{W},\bm{b})$ of these networks. Hence it was proposed to train the network via back propagation with the supervision of the reconstruction error:

\min_{\bm{\theta}}\; \mathbb{E}\big[\|\bm{x} - \hat{\bm{x}}(\bm{\theta})\|_{2}^{2}\big]. \qquad (5.1.12)

Compared to the simple case of PCA, we utilize the same reconstruction objective for learning, but a far more complex nonlinear class of models for parameterizing and learning the encoder and decoder. Although universal approximation properties such as Cybenko's suggest that learning consistent autoencoders via this framework is possible in principle (because for any random sample of data, given enough parameters, such autoencoding pairs exist), one often finds it highly nontrivial to find them using gradient descent. Moreover, to obtain a sufficiently informative reconstruction objective and model for the distribution of high-dimensional real-world data such as images, the required number of samples and hidden nodes can be huge. In addition, as a measure of the compactness of the learned representation, the (lower) dimension of $\bm{z}$ for the bottleneck layer is often chosen heuristically.\footnote{In the later work [HZ93], Hinton et al. suggested using the minimum description length (MDL) principle to promote the compactness of the learned coding scheme, in a spirit very similar to the rate distortion measure introduced in this book.}
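To make this concrete, below is a minimal sketch of a Kramer-style autoassociative network in the spirit of (5.1.10)-(5.1.12), trained by back propagation on the reconstruction error. The layer sizes, optimizer, and synthetic data are illustrative assumptions, not choices made in the text.

```python
import torch
import torch.nn as nn

D, h, d = 20, 64, 3                                   # ambient, hidden, and bottleneck dimensions
encoder = nn.Sequential(nn.Linear(D, h), nn.Sigmoid(), nn.Linear(h, d))
decoder = nn.Sequential(nn.Linear(d, h), nn.Sigmoid(), nn.Linear(h, D))

opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
X = torch.randn(1024, D)                              # stand-in for training samples

for step in range(1000):
    Z = encoder(X)                                    # nonlinear encoding (5.1.10)
    X_hat = decoder(Z)
    loss = ((X - X_hat) ** 2).sum(dim=1).mean()       # reconstruction objective (5.1.12)
    opt.zero_grad()
    loss.backward()
    opt.step()
```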

Manifold Flattening via a Deeper Network.

Based on the modern practice of deep networks, such classical shallow and wide network architectures are known to be rather difficult to train effectively and efficiently via back propagation (BP), partly due to the vanishing gradient of the sigmoid function. Hence, the modern practice normally suggests further decomposing the nonlinear transform $f$ (or $g$) into a composition of many more layers of simpler transforms, resulting in a deeper network architecture [HS06], as illustrated in Figure 5.4. In the modern context, further elaborations over the basic reconstruction cost (5.1.12) also prove necessary to achieve good performance on complex real-world data distributions such as images.

Figure 5.4: A depiction of the construction process of the flattening and reconstruction pair $(f,g)$, where the encoder $f = f_{L}\circ f_{L-1}\circ\cdots\circ f_{1}$ is constructed by composing flattening layers, and the decoder $g = g_{1}\circ g_{2}\circ\cdots\circ g_{L}$ is composed of inversions of each $f_{\ell}$.

In light of universal approximation theorems such as Cybenko's, one may initially wonder why, conceptually, deeper autoencoders should be preferred over shallow ones. From the perspective of expressivity alone, we can understand this phenomenon through a geometric angle related to the task of flattening the nonlinear manifold on which our hypothesized data distribution is supported. A purely constructive approach to flattening the manifold proceeds incrementally, in parallel to what we have seen in Chapters 3 and 4 with the interaction between diffusion, denoising, and compression. In the geometric setting, the incremental flattening step corresponding to $f_{\ell}$ transforms a neighborhood of one point of the manifold to be closer to a flat manifold (i.e., a subspace), while enforcing local consistency with the rest of the data samples; the corresponding incremental operation in the decoder, $g_{\ell}$, undoes this transformation. This procedure precisely incorporates curvature information about the underlying manifold, which is estimated from data samples. Given enough samples from the manifold and a careful instantiation of this conceptual process, it is possible to implement it as a computational procedure that verifiably flattens nonlinear manifolds in a white-box fashion [PPR+24]. However, the approach is limited in its applicability to high-dimensional data distributions such as images due to unfavorable scalability, motivating the development of more flexible methods for incremental autoencoding.

5.1.3 Sparse Autoencoding

In the above autoencoding schemes, the dimension $d$ of the feature space is typically chosen to be much lower than the dimension $D$ of the original data space, in order to explicitly enforce or promote the learned representation to be low-dimensional. However, in practice, we normally do not know the intrinsic dimension of the data distribution. Hence, the choice of the feature space dimension for autoencoding is often done empirically. In more general situations, the data distribution can be a mixture of a few low-dimensional subspaces or submanifolds. In these cases, it is no longer feasible to enforce a single low-dimensional space for all features together.

The sparse autoencoder is meant to resolve some of these limitations. In particular, the dimension of the feature space can be equal to or even higher than that of the data space, as illustrated in Figure 5.5. However, the features are required to be highly sparse in the feature space. So if we impose sparsity as a measure of parsimony in addition to rate reduction in the objective (5.1.5), we obtain a new objective for sparse autoencoding:

\min_{f,g}\big[\|\bm{Z}\|_{0} - \Delta R_{\epsilon}(\bm{Z}) + d(\bm{X},\hat{\bm{X}})\big], \qquad (5.1.13)

where the $\ell^{0}$-"norm" $\|\,\cdot\,\|_{0}$ is known to promote sparsity. This is very similar to the sparse rate reduction objective (4.2.4), which we used in Chapter 4 to derive the white-box CRATE architecture.
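A minimal sketch of a sparse autoencoding loss in the spirit of (5.1.13) is given below. It makes a few simplifying assumptions not in the text: the $\ell^{0}$ penalty is replaced by its standard convex surrogate, the $\ell^{1}$ norm; the rate reduction term is omitted; and the layer sizes and weight `lam` are illustrative.

```python
import torch
import torch.nn as nn

D, d_feat = 784, 4096                                   # overcomplete feature space (d_feat > D)
encoder = nn.Sequential(nn.Linear(D, d_feat), nn.ReLU())
decoder = nn.Linear(d_feat, D)

def sparse_ae_loss(x, lam=1e-3):
    z = encoder(x)                                      # high-dimensional but (ideally) sparse features
    x_hat = decoder(z)
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()        # d(X, X_hat)
    sparsity = z.abs().sum(dim=1).mean()                # l1 surrogate for ||Z||_0
    return recon + lam * sparsity

loss = sparse_ae_loss(torch.randn(32, D))               # example usage on a random batch
```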

Figure 5.5: Illustration of a sparse autoencoder (SAE), compared to that of a typical autoencoder (AE) in Figure 5.1.

As a method for learning autoencoding pairs in an end-to-end fashion, sparse autoencoding has been practiced in the past [RPC+06, LRM+12], but nearly all modern autoencoding frameworks are instead based on a different, probabilistic autoencoding framework, which we will study now.

5.1.4 Variational Autoencoding

In the classical conception of autoencoding, following Hinton and Rumelhart [RHW86], the data distribution plays a very minor role in the formulation, despite its centrality to the representation we ultimately learn. Indeed, in the naive framework, one hopes that by training a deep network to reconstruct samples from the data distribution with a suitably configured bottleneck for the representation $\bm{z}$, the learned encoder $f$ and decoder $g$ will naturally end up corresponding to a compact and structured feature representation for the data. This is rarely the case in practice. An improved, more modern methodology for autoencoding that still finds significant application to this day is variational autoencoding [KW13, KW19]. We will see how this framework, which trains a variational autoencoder (VAE) derived through probabilistic modeling considerations, both generalizes classical autoencoder training (via minimization of the reconstruction loss) and stabilizes it with appropriate regularization. Later, we will see how to go further and move beyond the black-box nature of the deep networks used to represent the encoding and decoding mappings $f$ and $g$.

Probabilistic Perspective on Autoencoding

In the manifold model for the data distribution, the key mathematical objects are the support of the data distribution, namely the manifold $\mathcal{M}$, and the density of the data on the support, say $p$. When we formulate autoencoding from the probabilistic perspective, we often think of the high-dimensional input $\bm{x}$ as having a density $p$ with support on $\mathbb{R}^{D}$; one can think of adding a very small amount of noise to the (degenerate) distribution supported on the manifold $\mathcal{M}$ to obtain this density $p$, in line with our denoising-diffusion constructions in Chapter 3. Then the goal of generative probabilistic modeling is to learn the density $p$ from samples $\bm{x}$, say from a class of models $p(\bm{x};\,\bm{\theta})$ parameterized by $\bm{\theta}$. As we have recalled in Chapter 3, a classical approach to achieving this is via maximum likelihood estimation:

\max_{\bm{\theta}}\; \mathbb{E}_{\bm{x}}\big[\log p(\bm{x};\,\bm{\theta})\big].

For certain representative classes of data distributions $p$ and sufficiently expressive classes of models $p(\bm{x};\,\bm{\theta})$, even simpler learning problems than maximum likelihood estimation are known to be statistically hard [YB99]. Hence it is desirable to exploit the knowledge that $\bm{x}$ has low-dimensional structure by seeking to factor the distribution $p(\bm{x};\,\bm{\theta})$ according to a low-dimensional "latent" variable $\bm{z}$. Indeed, by conditioning we may write $p(\bm{x},\bm{z};\,\bm{\theta}) = p(\bm{z};\,\bm{\theta})\,p(\bm{x}\mid\bm{z};\,\bm{\theta})$, and

p(\bm{x};\,\bm{\theta}) = \int p(\bm{z};\,\bm{\theta})\,p(\bm{x}\mid\bm{z};\,\bm{\theta})\,\mathrm{d}\bm{z} = \mathbb{E}_{\bm{z}\sim p(\,\cdot\,;\,\bm{\theta})}\big[p(\bm{x}\mid\bm{z};\,\bm{\theta})\big].

Classically, our model for the data distribution $p(\bm{x};\,\bm{\theta})$ corresponds to a choice of the latent distribution $p(\bm{z};\,\bm{\theta})$ and the conditional distribution $p(\bm{x}\mid\bm{z};\,\bm{\theta})$. Even so, computing the model for the data distribution from these latent distributions is intractable except in special cases, analogous to those we have studied in Chapter 2. By the same token, computing the posterior $p(\bm{z}\mid\bm{x};\,\bm{\theta})$ from data, allowing us to encode samples to their corresponding latents, is generally intractable. There is hence a trade-off between the expressivity of our generative model, its computational tractability, and the accuracy of any approximations we make to the underlying probabilistic framework for the sake of computational tractability. In navigating this trade-off, one also needs a flexible computational approach for learning the model parameters $\bm{\theta}$ from data, analogous to the maximum likelihood objective.

In the variational autoencoding framework, we navigate this trade-off through three key insights:

1. We posit simple distributions for $\bm{z}$ and for $\bm{x}$ conditioned on $\bm{z}$, but make their parameters depend in a highly flexible way on the input data $\bm{x}$ (where relevant) using deep networks.

2. We replace the posterior $p(\bm{z}\mid\bm{x};\,\bm{\theta})$, which is used for encoding and whose form is implied (by Bayes' rule) by our modeling choices for $\bm{z}$, with a tractable approximation $q(\bm{z}\mid\bm{x};\,\bm{\eta})$, which has its own parameters $\bm{\eta}$.

3. We jointly learn $\bm{\theta}$ and $\bm{\eta}$ by maximizing a tractable lower bound on the likelihood $p(\bm{x};\,\bm{\theta})$, known as the evidence lower bound (ELBO).

We will focus on the most useful instantiation of the VAE framework in practice, in which the prior $p(\bm{z};\,\bm{\theta})$ and the conditional $p(\bm{x}\mid\bm{z};\,\bm{\theta})$ are both Gaussian. Specifically, we use the following distributions:

\bm{z} \sim \mathcal{N}(\mathbf{0},\,\bm{I}),
\bm{x}\mid\bm{z} \sim \mathcal{N}\big(g_{1}(\bm{z}),\,\operatorname{diag}(e^{g_{2}(\bm{z})})\,\bm{I}\big),

where $g=(g_{1},g_{2})$ are deep networks with parameters $\bm{\theta}$, which correspond to the decoder in the autoencoder. Similarly, for the approximate posterior $q(\bm{z}\mid\bm{x};\,\bm{\eta})$, we use a special Gaussian distribution with parameters given by an encoder MLP $f=(f_{1},f_{2})$ with parameters $\bm{\eta}$:

\bm{z}\mid\bm{x} \sim \mathcal{N}\big(f_{1}(\bm{x}),\,\operatorname{diag}(e^{f_{2}(\bm{x})})\,\bm{I}\big).

This makes probabilistic encoding and decoding simple: we simply map our data $\bm{x}$ to the mean and variance parameters of a Gaussian distribution for encoding, and vice versa for decoding. For learning the encoder and decoder, we start from the maximum likelihood objective above and derive a convenient lower bound known as the evidence lower bound, or ELBO. Starting from the simple algebraic manipulations

\log p(\bm{x};\,\bm{\theta}) = \log \frac{p(\bm{x},\bm{z};\,\bm{\theta})}{p(\bm{z}\mid\bm{x};\,\bm{\theta})}
= \log \frac{p(\bm{x},\bm{z};\,\bm{\theta})}{q(\bm{z}\mid\bm{x};\,\bm{\eta})} + \log \frac{q(\bm{z}\mid\bm{x};\,\bm{\eta})}{p(\bm{z}\mid\bm{x};\,\bm{\theta})},

we take expectations with respect to $\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})$ and use the Gibbs inequality (Theorem 3.1) to get

\log p(\bm{x};\,\bm{\theta}) \;\geq\; \mathbb{E}_{\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})}\left[\log\frac{p(\bm{x},\bm{z};\,\bm{\theta})}{q(\bm{z}\mid\bm{x};\,\bm{\eta})}\right].

The right-hand side of this bound is the ELBO; as a lower bound for the pointwise log-likelihood $\log p(\bm{x};\,\bm{\theta})$, its maximization offers a principled compromise between the maximum likelihood objective and computational tractability. Interestingly, the derivation above shows that its tightness depends on the KL divergence between the approximate posterior $q(\bm{z}\mid\bm{x};\,\bm{\eta})$ and the true posterior $p(\bm{z}\mid\bm{x};\,\bm{\theta})$: the more accurate our approximate posterior is, the more closely maximizing the ELBO corresponds to maximizing the underlying objective of interest, the likelihood of the data. Thus, the VAE objective is:

\max_{\bm{\theta},\bm{\eta}}\; \mathbb{E}_{\bm{x}}\,\mathbb{E}_{\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})}\left[\log\frac{p(\bm{x},\bm{z};\,\bm{\theta})}{q(\bm{z}\mid\bm{x};\,\bm{\eta})}\right]. \qquad (5.1.14)

By our derivation, maximization of this objective corresponds to a tradeoff between maximizing the likelihood function of the data and minimizing the KL divergence between the approximate and true posterior, which is a highly sensible objective given the VAE modeling assumptions.
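For later reference, under the Gaussian choices above the KL divergence between the approximate posterior and the prior has a simple closed form (a standard Gaussian identity, stated here for convenience); it is exactly this term that produces the regularization terms appearing in the derivation of the next subsection:

\mathcal{D}_{KL}\big(q(\bm{z}\mid\bm{x};\,\bm{\eta})\,\|\,p(\bm{z})\big) = \mathcal{D}_{KL}\Big(\mathcal{N}\big(f_{1}(\bm{x}),\operatorname{diag}(e^{f_{2}(\bm{x})})\big)\,\Big\|\,\mathcal{N}(\mathbf{0},\bm{I})\Big) = \frac{1}{2}\Big(\big\langle e^{f_{2}(\bm{x})} - f_{2}(\bm{x}) - \mathbf{1},\,\mathbf{1}\big\rangle + \|f_{1}(\bm{x})\|_{2}^{2}\Big).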

VAE Training as Probabilistic Autoencoding

There is a general methodology for maximizing the ELBO objective in Equation 5.1.14 using stochastic gradient descent and various tractable Monte Carlo estimators for the associated gradients. However, the task is simpler under the Gaussian assumptions we have made above. In this case, the ELBO reads

\max_{\bm{\theta},\bm{\eta}}\; \mathbb{E}_{\bm{x}}\,\mathbb{E}_{\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})}\left[\log\frac{p(\bm{x},\bm{z};\,\bm{\theta})}{q(\bm{z}\mid\bm{x};\,\bm{\eta})}\right]
= \max_{\bm{\theta},\bm{\eta}}\;\left\{\mathbb{E}_{\bm{x}}\,\mathbb{E}_{\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})}\big[\log p(\bm{x},\bm{z};\,\bm{\theta})\big] + \frac{d\log(2\pi e)}{2} + \frac{1}{2}\big\langle\mathbb{E}_{\bm{x}}[f_{2}(\bm{x})],\,\mathbf{1}\big\rangle\right\}
\equiv \max_{\bm{\theta},\bm{\eta}}\;\left\{\mathbb{E}_{\bm{x}}\,\mathbb{E}_{\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})}\big[\log p(\bm{x}\mid\bm{z};\,\bm{\theta}) + \log p(\bm{z})\big] + \frac{1}{2}\big\langle\mathbb{E}_{\bm{x}}[f_{2}(\bm{x})],\,\mathbf{1}\big\rangle\right\}
\equiv \max_{\bm{\theta},\bm{\eta}}\;\left\{2\,\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})}\big[\log p(\bm{x}\mid\bm{z};\,\bm{\theta})\big] + \big\langle f_{2}(\bm{x}) - e^{f_{2}(\bm{x})},\,\mathbf{1}\big\rangle - \|f_{1}(\bm{x})\|_{2}^{2}\right]\right\}, \qquad (5.1.15)

following Theorem B.1 and the Gaussian entropy calculation done in Appendix B, where $\equiv$ denotes equivalence of optimization objectives (in each case, we remove some additive constants that do not change the optimization problem's solutions). The remaining term corresponds to the "autoencoding" part of the ELBO objective: intuitively, it seeks to maximize the likelihood of data generated by the decoder when the latents $\bm{z}$ are distributed according to the approximate posterior (generated by the encoder applied to a data sample $\bm{x}$), which is a probabilistic form of autoencoding. To see this more directly, consider the special case where the approximate posterior concentrates on its mean $f_{1}(\bm{x})$ for every $\bm{x}$: this is the limit where its log-variance $f_{2,i}(\bm{x})\to-\infty$ for each coordinate $i=1,\dots,d$. For simplicity, assume moreover that $g_{2}(\bm{z})=\mathbf{1}$, giving the decoder a constant variance on each coordinate. Then the loss term in question converges to

\mathbb{E}_{\bm{x}}\,\mathbb{E}_{\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})}\big[\log p(\bm{x}\mid\bm{z};\,\bm{\theta})\big] \;\to\; \mathbb{E}_{\bm{x}}\big[\log p(\bm{x}\mid f_{1}(\bm{x});\,\bm{\theta})\big]
\equiv -\frac{1}{2}\,\mathbb{E}_{\bm{x}}\big[\|\bm{x} - g_{1}\circ f_{1}(\bm{x})\|_{2}^{2}\big].

So with a highly confident encoder, which deterministically maps each sample $\bm{x}$ to a single point $f_{1}(\bm{x})$ in $\bm{z}$-space, and an isotropic decoder, the "autoencoding" part of the ELBO maximization problem indeed becomes a classical autoencoding objective!\footnote{In the general case where $g_{2}$ is not fixed as $\mathbf{1}$, the reader can verify that the autoencoding term in the ELBO objective converges to a regularized classical autoencoding objective.} At the same time, note that this special case is actually excluded by the extra terms in Equation 5.1.15; these effectively correspond to regularization terms that discourage the encoder from collapsing. The general ELBO loss (Equations 5.1.14 and 5.1.15) is therefore a strict generalization of the classical autoencoding reconstruction objective (Equation 5.1.12), both in terms of its data fidelity term and in terms of the inclusion of regularization terms.

Training a VAE

VAEs are typically trained by alternating stochastic gradient ascent on the ELBO objective (Equation 5.1.15), given individual samples $\bm{x}$ from the true data distribution and samples $\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})$ from the approximate posterior. In particular, it is standard to collect and train on many independently generated samples $\bm{z}^{i}$ for each sample $\bm{x}$. To take gradients of Equation 5.1.15 with respect to the encoder parameters $\bm{\eta}$, one makes use of the so-called reparameterization trick by writing $\bm{z}\sim q(\,\cdot\,\mid\bm{x};\,\bm{\eta})$ as

\bm{z} \,=_{\mathrm{d}}\, f_{1}(\bm{x}) + \operatorname{diag}\big(e^{\frac{1}{2}f_{2}(\bm{x})}\big)\,\bm{g}, \qquad \bm{g}\sim\mathcal{N}(\mathbf{0},\bm{I}),

where $=_{\mathrm{d}}$ denotes equality in distribution. Then one can simply take many independent standard Gaussian samples $\bm{g}^{i}$ for each data sample $\bm{x}$, generate the corresponding samples $\bm{z}^{i}$ from the approximate posterior, and compute gradients with respect to $\bm{\eta}$ using automatic differentiation without any issues.
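The following is a minimal sketch of one VAE training step under the Gaussian assumptions above, using the reparameterization trick and the closed-form KL term. The network sizes and optimizer are illustrative assumptions, and the decoder variance is fixed ($g_{2}\equiv\mathbf{0}$, i.e., unit variance) for simplicity.

```python
import torch
import torch.nn as nn

D, d = 784, 16
f1 = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, d))    # posterior mean
f2 = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, d))    # posterior log-variance
g1 = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, D))    # decoder mean
params = list(f1.parameters()) + list(f2.parameters()) + list(g1.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def vae_step(x):
    mu, logvar = f1(x), f2(x)
    g = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * g                              # reparameterization trick
    x_hat = g1(z)
    recon = 0.5 * ((x - x_hat) ** 2).sum(dim=1)                       # -log p(x|z) up to constants
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1)   # KL(q(z|x) || N(0, I))
    loss = (recon + kl).mean()                                        # negative ELBO (up to constants)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = vae_step(torch.randn(64, D))                                   # example step on a random batch
```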

5.2 Learning Self-Consistent Representations

In earlier chapters, we have studied methods that allow us to learn a low-dimensional distribution via (lossy) compression. As we have mentioned in Chapter 1 and demonstrated in the previous chapters, the progress made in machine intelligence largely relies on finding computationally feasible and efficient solutions to realize the desired compression, not only computable or tractable in theory, but also scalable in practice:

\text{computable} \;\Longrightarrow\; \text{tractable} \;\Longrightarrow\; \text{scalable}. \qquad (5.2.1)

It is even fair to say that the tremendous advancement in machine intelligence made in the past decade or so is largely attributed to the development of scalable models and methods, say by training deep networks via back propagation. Large models with billions of parameters trained with trillions of data points on tens of thousands of powerful GPUs have demonstrated super-human capabilities in memorizing existing knowledge. This has led many to believe that the “intelligence” of such models will continue to improve as their scale continues to go up.

While we celebrate the engineering marvels of such large man-made machine learning systems, we also must admit that, compared to intelligence in nature, this approach to improving machine intelligence is unnecessarily resource-demanding. Natural intelligent beings, including animals and humans, simply cannot afford such a brute-force solution for learning because they must operate with a very limited budget in energy, space, and time, subject to many strict physical constraints.

First, there is strong scientific evidence that our brain does not conduct global end-to-end back propagation to improve or correct its predictions. Instead, it has long been known in neuroscience that our brain corrects errors with local closed-loop feedback, such as predictive coding. This was the scientific basis that inspired Norbert Wiener to develop the theory of feedback control and the Cybernetics program back in the 1940s.

Second, we saw in the previous sections that in order to learn a consistent representation, one needs to learn a bidirectional autoencoding:

\bm{X} \xrightarrow{\;\mathcal{E}=f\;} \bm{Z} \xrightarrow{\;\mathcal{D}=g\;} \hat{\bm{X}}. \qquad (5.2.2)

It requires enforcing the observed input data $\bm{X}$ and the decoded $\hat{\bm{X}}$ to be close under some measure of similarity $d(\bm{X},\hat{\bm{X}})$. In nature, animals or humans rarely have direct access to the ground truth $\bm{X}$. For example, we never have direct access to the true 3D shape, distance, or dynamics of objects in a scene. Yet, we have all learned to estimate and predict them very accurately and efficiently. Hence, an outstanding question is: how can we learn the true distribution of $\bm{X}$, even though we cannot directly compare our estimate $\hat{\bm{x}}$ with the ground truth $\bm{x}$? As we will see in this chapter, the answer also lies in closed-loop feedback, as well as the low-dimensionality of the data distribution.

As we know from the previous sections, in order to ensure that a representation is consistent, we need to compare the generated $\hat{\bm{X}}\sim p(\hat{\bm{x}})$ and the original $\bm{X}\sim p(\bm{x})$, at least in distribution. Even when we do have access to $\bm{X}$ and $\hat{\bm{X}}$, technically, computing and minimizing the distance between two distributions can be problematic, especially when the support of the distributions is low-dimensional. The KL divergence introduced in Chapter 3 is not even well-defined between two distributions that do not have overlapping supports.

As an early attempt to alleviate the above difficulty of computing and minimizing the distance between two (low-dimensional) distributions, it was suggested to learn the generator/decoder $g$ via discriminative approaches [Tu07]. This line of thought led to the idea of Generative Adversarial Nets (GAN) [GPM+14a], which introduces a discriminator $d$, usually modeled by a deep network, to discern differences between the generated samples $\hat{\bm{X}}$ and the real ones $\bm{X}$:

\bm{Z} \xrightarrow{\;g(\bm{z},\eta)\;} \hat{\bm{X}}, \qquad \bm{X} \xrightarrow{\;d(\bm{x},\theta)\;} \{0,1\}. \qquad (5.2.3)

One may then attempt to align the distributions of $\hat{\bm{x}}$ and $\bm{x}$ via a Stackelberg game between the generator $g$ and the discriminator $d$:

\max_{\theta}\min_{\eta}\; \mathbb{E}_{p(\bm{x})}\big[\log d(\bm{x},\theta)\big] + \mathbb{E}_{p(\bm{z})}\Big[\log\big(1 - d(\underbrace{g(\bm{z},\eta)}_{\hat{\bm{x}}\,\sim\,p_g},\theta)\big)\Big]. \qquad (5.2.4)

That is, the discriminator $d$ tries to minimize the cross entropy between the true samples $\bm{X}$ and the generated ones $\hat{\bm{X}}$, while the generator $g$ tries to do the opposite.

One may show that finding an equilibrium of the above Stackelberg game is equivalent to minimizing the Jensen-Shannon divergence:

\mathcal{D}_{JS}\big(p(\bm{x}),\,p_g(\hat{\bm{x}})\big) = \mathcal{D}_{KL}\big(p \,\|\, (p+p_g)/2\big) + \mathcal{D}_{KL}\big(p_g \,\|\, (p+p_g)/2\big). \qquad (5.2.5)

Note that, compared to the KL-divergence, the JS-divergence is well-defined even if the supports of the two distributions do not overlap. However, the JS-divergence does not admit a closed-form expression even between two Gaussians, whereas the KL-divergence does. In addition, since the data distributions are low-dimensional, the JS-divergence can be highly ill-conditioned to optimize, as shown in [ACB17]. This may explain why many additional heuristics are typically needed in subsequent variants of GAN.
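To make the contrast concrete, the following minimal sketch (in Python, assuming only numpy; the function names are ours) compares the KL- and JS-divergences between two discrete distributions with disjoint supports: the KL-divergence is infinite, while the JS-divergence stays finite.

```python
import numpy as np

def kl(p, q):
    # KL(p || q); infinite when q has zero mass where p does not
    mask = p > 0
    with np.errstate(divide="ignore"):
        logq = np.log(q[mask])
    return float(np.sum(p[mask] * (np.log(p[mask]) - logq)))

def js(p, q):
    # JS-divergence via the mixture m = (p + q)/2, unscaled as in (5.2.5)
    m = 0.5 * (p + q)
    return kl(p, m) + kl(q, m)

# Two distributions with non-overlapping supports
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])
print(kl(p, q))  # inf: the KL-divergence is uninformative here
print(js(p, q))  # 2*log(2): finite and well-defined
```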

So, instead, it has also been suggested to replace the JS-divergence with the earth mover's (EM) distance, also known as the Wasserstein distance. Roughly speaking, for distributions with potentially non-overlapping low-dimensional supports, the JS-divergence behaves like the $\ell^0$ norm, whereas the EM-distance behaves like the $\ell^1$ norm. However, both the JS-divergence and the Wasserstein distance can only be approximately computed between two general distributions; for instance, the Wasserstein distance requires computing the maximal difference between expectations of the two distributions over all 1-Lipschitz functions. Furthermore, neither the JS-divergence nor the ($\ell^1$-norm) Wasserstein distance has a closed-form formula, even for Gaussian distributions. The ($\ell^1$-norm) Wasserstein distance can be bounded by the ($\ell^2$-norm) Wasserstein distance, which does have a closed form [DDS22]; however, as is well known in high-dimensional geometry, the $\ell^1$ and $\ell^2$ norms deviate significantly in their geometric and statistical properties as the dimension grows [WM21], so the bound can become very loose.
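For reference, the ($\ell^2$-norm) Wasserstein distance between two Gaussians mentioned above has the closed form $W_2^2 = \|\mu_1-\mu_2\|^2 + \operatorname{tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\big)$. The sketch below (assuming numpy and scipy) evaluates it; it is purely illustrative and not part of the closed-loop formulation developed next.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, cov1, mu2, cov2):
    """Squared 2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2)."""
    s2 = sqrtm(cov2)
    cross = sqrtm(s2 @ cov1 @ s2)
    # sqrtm may return tiny imaginary parts due to numerical error
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2 * np.real(cross)))

mu1, cov1 = np.zeros(2), np.eye(2)
mu2, cov2 = np.ones(2), 2 * np.eye(2)
print(w2_gaussian(mu1, cov1, mu2, cov2))  # 2 + 2*(3 - 2*sqrt(2)) ≈ 2.34
```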

If it is difficult to compare the distributions of the data $\bm{X}$ and $\hat{\bm{X}}$, would it be possible instead to compare the learned feature $\bm{Z}$ with its image $\hat{\bm{Z}}$ under the encoder $f$:

\bm{X} \xrightarrow{\;\mathcal{E}=f\;} \bm{Z} \xrightarrow{\;\mathcal{D}=g\;} \hat{\bm{X}} \xrightarrow{\;\mathcal{E}=f\;} \hat{\bm{Z}}\,? \qquad (5.2.6)

This leads to the notion of self-consistent representation.

Definition 5.2 (Self-Consistent Representation).

Given data $\bm{X}$, we call a pair of functions $(f\colon\mathcal{X}\to\mathcal{Z},\; g\colon\mathcal{Z}\to\mathcal{X})$ a self-consistent representation if the features $\bm{Z}=f(\bm{X})$ are compact and structured, and the autoencoded features $\hat{\bm{Z}}\doteq f\circ g(\bm{Z})$ are close to $\bm{Z}$.

  1. We say that it is sample-wise self-consistent if $\bm{Z}\approx\hat{\bm{Z}}$ in a certain norm with high probability.

  2. We say that the representation is distributionally self-consistent if $\operatorname{Law}(\bm{Z})\approx\operatorname{Law}(\hat{\bm{Z}})$.

5.2.1 Closed-Loop Transcription via Stackelberg Games

How can we ensure that a learned representation is self-consistent? As usual, let us assume $\bm{X}=\cup_{k=1}^{K}\bm{X}_k$, with each subset of samples $\bm{X}_k$ belonging to a low-dimensional submanifold: $\bm{X}_k\subset\mathcal{M}_k$, $k=1,\ldots,K$. Following the notation of Chapter 3, we use diagonal matrices $\{\bm{\Pi}_k\}_{k=1}^{K}$ to encode class membership: $\bm{\Pi}_k(i,i)=1$ if sample $i$ belongs to class $k$, and $0$ otherwise. One seeks a continuous mapping $f(\cdot,\bm{\theta})\colon\bm{x}\mapsto\bm{z}$ from $\bm{X}$ to an optimal representation $\bm{Z}=[\bm{z}_1,\ldots,\bm{z}_N]\subset\mathbb{R}^{d\times N}$:

\bm{X} \xrightarrow{\;f(\bm{x},\bm{\theta})\;} \bm{Z}, \qquad (5.2.7)

which maximizes the following coding rate-reduction (MCR2) objective:

\max_{\bm{Z}}\; \Delta R_{\epsilon}(\bm{Z}\mid\bm{\Pi}) \doteq \underbrace{\tfrac{1}{2}\log\det\big(\bm{I}+\alpha\,\bm{Z}\bm{Z}^{\top}\big)}_{R_{\epsilon}(\bm{Z})} \;-\; \underbrace{\sum_{k=1}^{K}\tfrac{\gamma_k}{2}\log\det\big(\bm{I}+\alpha_k\,\bm{Z}\bm{\Pi}_k\bm{Z}^{\top}\big)}_{R^{c}_{\epsilon}(\bm{Z}\mid\bm{\Pi})}, \qquad (5.2.8)

where $\alpha = d/(N\epsilon^2)$, $\alpha_k = d/(\operatorname{tr}(\bm{\Pi}_k)\epsilon^2)$, and $\gamma_k = \operatorname{tr}(\bm{\Pi}_k)/N$ for each $k=1,\dots,K$.
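As a concrete reference, the rate-reduction objective (5.2.8) can be evaluated with a few log-determinants. The sketch below (Python with numpy; the function and variable names are our own, not taken from any particular codebase) builds the membership matrices $\bm{\Pi}_k$ from integer labels and computes $\Delta R_\epsilon(\bm{Z}\mid\bm{\Pi})$.

```python
import numpy as np

def rate(Z, eps):
    """R_eps(Z) = 1/2 logdet(I + d/(N eps^2) Z Z^T) for Z of shape (d, N)."""
    d, N = Z.shape
    alpha = d / (N * eps ** 2)
    return 0.5 * np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)[1]

def rate_reduction(Z, labels, eps=0.5):
    """Delta R_eps(Z | Pi) as in (5.2.8), with Pi_k built from integer labels."""
    d, N = Z.shape
    total = rate(Z, eps)
    compressed = 0.0
    for k in np.unique(labels):
        Pi_k = np.diag((labels == k).astype(float))   # diagonal membership matrix
        n_k = Pi_k.trace()
        alpha_k = d / (n_k * eps ** 2)
        gamma_k = n_k / N
        compressed += 0.5 * gamma_k * np.linalg.slogdet(
            np.eye(d) + alpha_k * Z @ Pi_k @ Z.T)[1]
    return total - compressed

# Toy example: two classes lying on two orthogonal 1-D subspaces in R^4
rng = np.random.default_rng(0)
Z1 = np.outer(np.array([1.0, 0, 0, 0]), rng.standard_normal(50))
Z2 = np.outer(np.array([0, 1.0, 0, 0]), rng.standard_normal(50))
Z = np.concatenate([Z1, Z2], axis=1)
labels = np.array([0] * 50 + [1] * 50)
print(rate_reduction(Z, labels))  # positive: the two classes are well separated
```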

One issue with learning such a one-sided mapping (5.2.7) by maximizing the objective (5.2.8) is that it tends to expand the dimension of the learned subspace for the features of each class, as we have seen in Section 3.4 of Chapter 3. (If the dimension $d$ of the feature space is too high, maximizing the rate reduction may over-estimate the dimension of each class; hence, to learn a good representation, one needs to pre-select a proper dimension for the feature space, as done in the experiments of [YCY+20]. In fact, the same model-selection problem persists even in the simplest single-subspace case, namely classic PCA [Jol86]; to the best of our knowledge, selecting the correct number of principal components in a heterogeneous noisy setting remains an active research topic [HSD20].) To verify whether the learned features are consistent, i.e., neither over-estimating nor under-estimating the intrinsic data structure, we may learn a decoder $g(\cdot,\bm{\eta})\colon\bm{z}\mapsto\bm{x}$ that maps the representation $\bm{Z}=f(\bm{X},\bm{\theta})$ back to the data space, $\hat{\bm{X}}=g(\bm{Z},\bm{\eta})$:

\bm{X} \xrightarrow{\;f(\bm{x},\bm{\theta})\;} \bm{Z} \xrightarrow{\;g(\bm{z},\bm{\eta})\;} \hat{\bm{X}}, \qquad (5.2.9)

and check how close $\bm{X}$ and $\hat{\bm{X}}$ are. This forms an autoencoding of the kind we have studied extensively earlier in this chapter.

Measuring distance in the feature space.

However, as discussed above, if we do not have the option of computing the distance between $\bm{X}$ and $\hat{\bm{X}}$ directly, we are left with the option of comparing their corresponding features $\bm{Z}$ and $\hat{\bm{Z}}=f(\hat{\bm{X}},\bm{\theta})$. Notice that under the MCR$^2$ objective, the distributions of the resulting $\bm{Z}$ and $\hat{\bm{Z}}$ tend to be piecewise linear, so their distance can be computed more easily. More precisely, since the features of each class, $\bm{Z}_k$ and $\hat{\bm{Z}}_k$, are each close to a low-dimensional subspace (or low-rank Gaussian), their “distance” can be measured by the rate reduction (5.2.8) restricted to two sets of equal size:

\Delta R_{\epsilon}\big(\bm{Z}_k,\hat{\bm{Z}}_k\big) \doteq R_{\epsilon}\big(\bm{Z}_k\cup\hat{\bm{Z}}_k\big) - \tfrac{1}{2}\big(R_{\epsilon}(\bm{Z}_k)+R_{\epsilon}(\hat{\bm{Z}}_k)\big). \qquad (5.2.10)

The overall distance between $\bm{Z}$ and $\hat{\bm{Z}}$ is then given by:

d(\bm{Z},\hat{\bm{Z}}) \doteq \sum_{k=1}^{K}\Delta R_{\epsilon}\big(\bm{Z}_k,\hat{\bm{Z}}_k\big) = \sum_{k=1}^{K}\Delta R_{\epsilon}\big(\bm{Z}_k,\; f(g(\bm{Z}_k,\bm{\eta}),\bm{\theta})\big). \qquad (5.2.11)
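A minimal sketch of this feature-space distance, reusing the `rate` function and the numpy import from the previous snippet (so this is an illustration under our assumptions, not a reference implementation):

```python
def rr_distance(Zk, Zk_hat, eps=0.5):
    """Delta R_eps(Z_k, Z_hat_k) as in (5.2.10): rate of the union minus
    the average rate of the two parts (both of shape (d, n))."""
    union = np.concatenate([Zk, Zk_hat], axis=1)
    return rate(union, eps) - 0.5 * (rate(Zk, eps) + rate(Zk_hat, eps))

def total_distance(Z_by_class, Z_hat_by_class, eps=0.5):
    """d(Z, Z_hat) as in (5.2.11): sum of per-class distances."""
    return sum(rr_distance(Zk, Zk_hat, eps)
               for Zk, Zk_hat in zip(Z_by_class, Z_hat_by_class))
```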

If we are interested in discerning any differences between the distributions of the original data $\bm{X}$ and the decoded $\hat{\bm{X}}$, we may view the feature encoder $f(\cdot,\bm{\theta})$ as a “discriminator” that magnifies any difference between each pair $\bm{X}_k$ and $\hat{\bm{X}}_k$, simply by maximizing the feature-space distance $d(\bm{Z},\hat{\bm{Z}})$:

\max_{f}\, d(\bm{Z},\hat{\bm{Z}}) = \max_{\bm{\theta}} \sum_{k=1}^{K}\Delta R_{\epsilon}\big(\bm{Z}_k,\hat{\bm{Z}}_k\big) = \max_{\bm{\theta}} \sum_{k=1}^{K}\Delta R_{\epsilon}\big(f(\bm{X}_k,\bm{\theta}),\; f(\hat{\bm{X}}_k,\bm{\theta})\big). \qquad (5.2.12)

That is, a “distance” between $\bm{X}$ and $\hat{\bm{X}}$ can be measured as the maximally achievable rate reduction between all pairs of classes in the two sets. In a way, this measures how well (or badly) the decoded $\hat{\bm{X}}$ aligns with the original data $\bm{X}$, and hence the “injectivity” of the encoder $f$. Notice that such a discriminative measure is consistent with the idea of GAN (5.2.3), which tries to separate $\bm{X}$ and $\hat{\bm{X}}$ into two classes, as measured by the cross entropy.

Nevertheless, here the discriminative encoder $f$ naturally generalizes to the case where the data distribution is multi-class and multi-modal, and the discriminativeness is measured with a more refined quantity, the rate reduction, instead of the typical two-class loss (e.g., cross entropy) used in GANs. That is, we may view the encoder $f$ as a generalized discriminator that replaces the binary classifier $d$ in (5.2.3):

\bm{Z} \xrightarrow{\;g(\bm{z},\bm{\eta})\;} \hat{\bm{X}}, \qquad \bm{X} \xrightarrow{\;f(\bm{x},\bm{\theta})\;} \{\hat{\bm{Z}},\bm{Z}\}. \qquad (5.2.13)

The overall pipeline can be illustrated by the following “closed-loop” diagram:

\bm{X} \xrightarrow{\;f(\bm{x},\bm{\theta})\;} \bm{Z} \xrightarrow{\;g(\bm{z},\bm{\eta})\;} \hat{\bm{X}} \xrightarrow{\;f(\bm{x},\bm{\theta})\;} \hat{\bm{Z}}, \qquad (5.2.14)

where the overall model has parameters $\bm{\Theta}=\{\bm{\theta},\bm{\eta}\}$. Figure 5.6 shows the overall process.

Figure 5.6: A closed-loop transcription. The encoder $f$ has dual roles: it learns a representation $\bm{z}$ for the data $\bm{x}$ by maximizing the rate reduction of $\bm{z}$, and it also acts as a “feedback sensor” for any discrepancy between the data $\bm{x}$ and the decoded $\hat{\bm{x}}$. The decoder $g$ also has dual roles: it is a “controller” that corrects the discrepancy between $\bm{x}$ and $\hat{\bm{x}}$, and it also aims to minimize the overall coding rate of the learned representation.
Encoder and decoder as a two-player game.

Obviously, to ensure the learned autoencoding is self-consistent, the main goal of the decoder $g(\cdot,\bm{\eta})$ is to minimize the distance between $\bm{Z}$ and $\hat{\bm{Z}}$. That is, to learn $g$, we minimize the distance $d(\bm{Z},\hat{\bm{Z}})$:

\min_{g}\, d(\bm{Z},\hat{\bm{Z}}) \doteq \min_{\bm{\eta}} \sum_{k=1}^{K}\Delta R_{\epsilon}\big(\bm{Z}_k,\hat{\bm{Z}}_k\big) = \min_{\bm{\eta}} \sum_{k=1}^{K}\Delta R_{\epsilon}\big(\bm{Z}_k,\; f(g(\bm{Z}_k,\bm{\eta}),\bm{\theta})\big), \qquad (5.2.15)

where $\bm{Z}_k=f(\bm{X}_k,\bm{\theta})$ and $\hat{\bm{Z}}_k=f(\hat{\bm{X}}_k,\bm{\theta})$.

Example 5.2.

One may wonder why we need the mapping $f(\cdot,\bm{\theta})$ to act as a discriminator between $\bm{X}$ and $\hat{\bm{X}}$ by maximizing $\Delta R_{\epsilon}\big(f(\bm{X},\bm{\theta}),f(\hat{\bm{X}},\bm{\theta})\big)$ over $\bm{\theta}$. Figure 5.7 gives a simple illustration: there may be many decoders $g$ such that $f\circ g$ is the identity (Id) map on the feature subspace, i.e., $f\circ g(\bm{z})=\bm{z}$ for all $\bm{z}$ in the subspace $S_{\bm{z}}$ of the feature space. However, $g\circ f$ is not necessarily an autoencoding map for $\bm{x}$ in the original distribution $S_{\bm{x}}$ (drawn here as a subspace for simplicity). That is, $g\circ f(S_{\bm{x}})\not\subset S_{\bm{x}}$, let alone $g\circ f(S_{\bm{x}})=S_{\bm{x}}$ or $g\circ f(\bm{x})=\bm{x}$. One should expect that, without careful control of the image of $g$, this will be the case with high probability, especially when the support of the distribution of $\bm{x}$ is extremely low-dimensional in the original high-dimensional data space. $\blacksquare$

Figure 5.7: Embeddings of low-dimensional submanifolds in a high-dimensional space. $S_{\bm{x}}$ (blue) is the submanifold of the original data $\bm{x}$; $S_{\bm{z}}$ (red) is the image of $S_{\bm{x}}$ under the mapping $f$, representing the learned feature $\bm{z}$; and the green curve is the image of the feature $\bm{z}$ under the decoding map $g$.

Comparing the contractive nature of (5.2.15) with the contrastive nature of (5.2.12), defined on the same distance measure, we see that the encoder $f(\cdot,\bm{\theta})$ and the decoder $g(\cdot,\bm{\eta})$ naturally play a two-player game: the encoder $f$ tries to magnify the difference between the original data and their transcription, while the decoder $g$ aims to minimize it. For convenience, let us now define the “closed-loop encoding” function:

h(\bm{x},\bm{\theta},\bm{\eta}) \doteq f\big(g\big(f(\bm{x},\bm{\theta}),\bm{\eta}\big),\bm{\theta}\big)\colon\; \bm{x}\mapsto\hat{\bm{z}}. \qquad (5.2.16)

Ideally, we want this function to be very close to $f(\bm{x},\bm{\theta})$, or at least the distributions of their images should be close. With this notation, combining (5.2.15) and (5.2.12), a closed-loop notion of “distance” between $\bm{X}$ and $\hat{\bm{X}}$ can be computed as an equilibrium of the following Stackelberg game (cf. Section A.3) with the same rate-reduction utility:

\mathcal{D}(\bm{X},\hat{\bm{X}}) \doteq \max_{\bm{\theta}}\min_{\bm{\eta}} \sum_{k=1}^{K}\Delta R_{\epsilon}\big(f(\bm{X}_k,\bm{\theta}),\; h(\bm{X}_k,\bm{\theta},\bm{\eta})\big). \qquad (5.2.17)

Notice that this only measures the difference between (features of) the original data and its transcribed version. It does not measure how good the representation $\bm{Z}$ (or $\hat{\bm{Z}}$) is for the multiple classes within $\bm{X}$ (or $\hat{\bm{X}}$). To this end, we may combine the above distance with the original MCR$^2$-type objectives (5.2.8), namely the rate reductions $\Delta R_{\epsilon}(\bm{Z})$ and $\Delta R_{\epsilon}(\hat{\bm{Z}})$ of the learned LDR $\bm{Z}$ for $\bm{X}$ and of $\hat{\bm{Z}}$ for the decoded $\hat{\bm{X}}$. Notice that while the encoder $f$ tries to maximize the multi-class rate reduction of the features $\bm{Z}$ of the data $\bm{X}$, the decoder $g$ should minimize the rate reduction of the multi-class features $\hat{\bm{Z}}$ of the decoded $\hat{\bm{X}}$. That is, the decoder $g$ tries to use the minimal coding rate needed to achieve good decoding quality.

Hence, the overall “multi-class” Stackelberg game for learning the closed-loop transcription, named CTRL-Multi, is

\max_{\bm{\theta}}\min_{\bm{\eta}}\; \mathcal{T}_{\bm{X}}(\bm{\theta},\bm{\eta}) \qquad (5.2.18)
\doteq \underbrace{\Delta R_{\epsilon}\big(f(\bm{X},\bm{\theta})\big)}_{\text{Expansive encode}} + \underbrace{\Delta R_{\epsilon}\big(h(\bm{X},\bm{\theta},\bm{\eta})\big)}_{\text{Compressive decode}} + \sum_{k=1}^{K}\underbrace{\Delta R_{\epsilon}\big(f(\bm{X}_k,\bm{\theta}),\; h(\bm{X}_k,\bm{\theta},\bm{\eta})\big)}_{\text{Contrastive encode \& Contractive decode}} \qquad (5.2.19)
= \Delta R_{\epsilon}\big(\bm{Z}(\bm{\theta})\big) + \Delta R_{\epsilon}\big(\hat{\bm{Z}}(\bm{\theta},\bm{\eta})\big) + \sum_{k=1}^{K}\Delta R_{\epsilon}\big(\bm{Z}_k(\bm{\theta}),\hat{\bm{Z}}_k(\bm{\theta},\bm{\eta})\big), \qquad (5.2.20)

subject to certain constraints (upper or lower bounds) on the first term and the second term.

Notice that, without the terms associated with the generative part $h$ (or with all such terms held constant), the above objective is precisely the original MCR$^2$ objective introduced in Chapter 3. In an unsupervised setting, if we view each sample (and its augmentations) as its own class, the above formulation remains exactly the same; the number of classes $K$ is simply the number of independent samples. In addition, notice that the above game's objective depends only on (features of) the data $\bm{X}$; hence one can learn the encoder and decoder parameters without having to sample or match any additional distribution (as is typically needed in GANs or VAEs).
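A minimal sketch of how such a Stackelberg game might be optimized in practice with alternating gradient steps, written in PyTorch. The network architectures, learning rates, the differentiable rate-reduction helpers, and the `sample_batch` data loader are all illustrative assumptions, not the reference implementation of CTRL-Multi.

```python
import torch
import torch.nn as nn

def rate_t(Z, eps=0.5):
    # R_eps(Z) = 1/2 logdet(I + d/(N eps^2) Z Z^T), Z of shape (d, N), differentiable
    d, N = Z.shape
    alpha = d / (N * eps ** 2)
    return 0.5 * torch.logdet(torch.eye(d) + alpha * Z @ Z.T)

def rr_dist_t(Zk, Zk_hat, eps=0.5):
    # Delta R_eps(Z_k, Z_hat_k) as in (5.2.10)
    union = torch.cat([Zk, Zk_hat], dim=1)
    return rate_t(union, eps) - 0.5 * (rate_t(Zk, eps) + rate_t(Zk_hat, eps))

def rate_reduction_t(Z, labels, eps=0.5):
    # Delta R_eps(Z | Pi) as in (5.2.8); the per-class rate already uses alpha_k
    d, N = Z.shape
    out = rate_t(Z, eps)
    for k in labels.unique():
        Zk = Z[:, labels == k]
        out = out - (Zk.shape[1] / N) * rate_t(Zk, eps)
    return out

# Toy encoder/decoder; the architectures and dimensions are placeholders
f = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))   # x -> z
g = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))   # z -> x_hat
opt_f = torch.optim.Adam(f.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(g.parameters(), lr=1e-4)

def utility(X, labels):
    # T_X(theta, eta) of (5.2.18)-(5.2.20); X has shape (N, 32)
    Z = f(X).T                       # features, shape (d, N)
    Z_hat = f(g(Z.T)).T              # h(X) = f(g(f(X)))
    pair = sum(rr_dist_t(Z[:, labels == k], Z_hat[:, labels == k])
               for k in labels.unique())
    return rate_reduction_t(Z, labels) + rate_reduction_t(Z_hat, labels) + pair

for step in range(1000):
    X, labels = sample_batch()       # assumed data loader: (N, 32) floats, (N,) ints
    # the encoder ascends the shared utility (the "leader") ...
    opt_f.zero_grad(); (-utility(X, labels)).backward(); opt_f.step()
    # ... while the decoder descends the same utility (the "follower")
    opt_g.zero_grad(); utility(X, labels).backward(); opt_g.step()
```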

As a special case, if $\bm{X}$ has only one class, the Stackelberg game reduces to a special “two-class” or “binary” form, named CTRL-Binary (the first two rate-reduction terms automatically become zero),

\max_{\bm{\theta}}\min_{\bm{\eta}}\; \mathcal{T}^{b}_{\bm{X}}(\bm{\theta},\bm{\eta}) \doteq \Delta R_{\epsilon}\big(f(\bm{X},\bm{\theta}),\; h(\bm{X},\bm{\theta},\bm{\eta})\big) = \Delta R_{\epsilon}\big(\bm{Z}(\bm{\theta}),\hat{\bm{Z}}(\bm{\theta},\bm{\eta})\big), \qquad (5.2.21)

between $\bm{X}$ and the decoded $\hat{\bm{X}}$, obtained by viewing $\bm{X}$ and $\hat{\bm{X}}$ as two classes $\{\bm{0},\bm{1}\}$. Notice that this binary case resembles the formulation of the original GAN (5.2.3); instead of cross entropy, however, our formulation adopts the more refined rate-reduction measure, which was shown in Chapter 3 to promote diversity in the learned representation.

Sometimes, even when $\bm{X}$ contains multiple classes or modes, one can still view all classes together as a single class. The above binary objective then aligns the union distribution of all classes with that of the decoded $\hat{\bm{X}}$. This is typically a simpler task than the multi-class one (5.2.20), since it does not require learning a more refined multi-class CTRL for the data, as we will see later in experiments. Notice that one appealing characteristic of the above formulation is that all quantities in the objectives are measured in terms of rate reduction of the learned features (assuming the features eventually become subspace Gaussians).

One may notice that the above learning framework draws inspiration from the closed-loop error correction widely practiced in feedback control systems. The closed-loop mechanism forms an overall feedback system between the encoding and decoding networks for correcting any “error” between the distributions of the data $\bm{x}$ and the decoded $\hat{\bm{x}}$. In the terminology of control theory, one may view the encoding network $f$ as a “sensor” for error feedback and the decoding network $g$ as a “controller” for error correction. Notice, however, that the “target” here is neither a scalar nor a finite-dimensional vector but a continuous distribution: the distribution of $\hat{\bm{x}}$ should match that of the data $\bm{x}$. This is in general a control problem in an infinite-dimensional space; the space of possible diffeomorphisms of submanifolds that $f$ tries to model is infinite-dimensional [Lee02]. Ideally, we hope that when the sensor $f$ and the controller $g$ are optimal, the distribution of $\bm{x}$ becomes a “fixed point” of the closed loop, while the distribution of $\bm{z}$ reaches a compact, linear, and discriminative representation. Hence, the minimax programs (5.2.20) and (5.2.21) can also be interpreted as games between an error-feedback sensor and an error-reducing controller.

The remaining question is whether the above framework can indeed learn a good (autoencoding) representation of a given dataset. Before giving a formal theoretical justification (in the next subsection), we present some empirical results.

Visualizing the correlation between features $\bm{Z}$ and decoded features $\hat{\bm{Z}}$.

We visualize the cosine similarity between $\bm{Z}$ and $\hat{\bm{Z}}$ learned with the multi-class objective (5.2.20) on MNIST, CIFAR-10, and ImageNet (10 classes), which indicates how close $\hat{\bm{z}}=f\circ g(\bm{z})$ is to $\bm{z}$. The results in Figure 5.8 show that $\bm{Z}$ and $\hat{\bm{Z}}$ are aligned very well within each class. The block-diagonal patterns for MNIST are sharper than those for CIFAR-10 and ImageNet, since images in CIFAR-10 and ImageNet have more diverse visual appearances.

(a) MNIST  (b) CIFAR-10  (c) ImageNet
Figure 5.8: Visualizing the alignment between $\bm{Z}$ and $\hat{\bm{Z}}$ in the feature space, $|\bm{Z}^{\top}\hat{\bm{Z}}|$, for (a) MNIST, (b) CIFAR-10, and (c) ImageNet-10-Class.
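For completeness, a heatmap like Figure 5.8 can be produced from column-normalized features. The sketch below (numpy and matplotlib, with hypothetical inputs `Z` and `Z_hat` whose columns are assumed to be sorted by class) is one way to do it.

```python
import numpy as np
import matplotlib.pyplot as plt

def similarity_heatmap(Z, Z_hat):
    """Plot |Z^T Z_hat| for column-normalized features (columns sorted by class)."""
    Zn = Z / np.linalg.norm(Z, axis=0, keepdims=True)
    Zh = Z_hat / np.linalg.norm(Z_hat, axis=0, keepdims=True)
    plt.imshow(np.abs(Zn.T @ Zh), cmap="viridis")
    plt.colorbar()
    plt.show()
```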
Visualizing the autoencoding of the data $\bm{X}$ and the decoded $\hat{\bm{X}}$.

We compare representative samples $\bm{X}$ and $\hat{\bm{X}}$ on MNIST, CIFAR-10, and ImageNet (10 classes) to verify how close $\hat{\bm{x}}=g\circ f(\bm{x})$ is to $\bm{x}$. The results are shown in Figure 5.9; the visualizations are created from training samples. Visually, the autoencoded $\hat{\bm{x}}$ faithfully captures the major visual features of its respective training sample $\bm{x}$, especially the pose, shape, and layout. For a simpler dataset such as MNIST, the autoencoded images are almost identical to the originals.

(a) MNIST $\bm{X}$  (b) CIFAR-10 $\bm{X}$  (c) ImageNet $\bm{X}$
(d) MNIST $\hat{\bm{X}}$  (e) CIFAR-10 $\hat{\bm{X}}$  (f) ImageNet $\hat{\bm{X}}$
Figure 5.9: Visualizing the autoencoding property of the learned closed-loop transcription ($\bm{x}\approx\hat{\bm{x}}=g\circ f(\bm{x})$) on MNIST, CIFAR-10, and ImageNet (zoom in for better visualization).

5.2.2 A Mixture of Low-Dimensional Gaussians

In the above, we have argued that the problem of learning a data distribution can be formulated as a closed-loop autoencoding problem, and we saw empirically that such a scheme seems to work. The remaining question is when and why it should work. This is difficult to answer for the most general case of arbitrary data distributions. Nevertheless, as usual, let us see whether we can arrive at a rigorous justification in the ideal case where the data distribution is a mixture of low-dimensional subspaces or low-rank Gaussians (a family that well-approximates most distributions with low-dimensional structures). A clear characterization and understanding of this important special case sheds light on the more general ones.

To this end, let us first suppose that $\bm{X}$ is distributed according to a mixture of low-dimensional Gaussians, and that the label (i.e., subspace assignment) for $\bm{X}$ is given by $\bm{y}$. We then set up a minimax optimization problem to learn the data distribution, through an encoding of $\bm{X}$ into representations $\bm{Z}$ supported on a mixture of orthogonal subspaces and a decoding of $\bm{Z}$ back into $\bm{X}$. The desired representation maximizes the earlier-discussed version of the information gain, $\Delta R_{\epsilon}(\bm{Z}) = R_{\epsilon}(\bm{Z}) - \sum_{k=1}^{K}R_{\epsilon}(\bm{Z}_k)$, which enforces that the representation $\bm{Z}_k$ of each class $k$ spans a subspace orthogonal to the supporting subspaces of the other classes. The consistency of the decoding is measured, as before, by $\sum_{k=1}^{K}\Delta R_{\epsilon}(\bm{Z}_k,\hat{\bm{Z}}_k)$, which enforces that the representation $\bm{Z}_k$ and its autoencoding $\hat{\bm{Z}}_k$ span the same subspace for each class $k$. Thus, we can set up a simplified Stackelberg game:

\max_{\bm{\theta}}\min_{\bm{\eta}}\; \Big\{ \Delta R_{\epsilon}\big(\bm{Z}(\bm{\theta})\big) + \sum_{k=1}^{K}\Delta R_{\epsilon}\big(\bm{Z}_k(\bm{\theta}),\hat{\bm{Z}}_k(\bm{\theta},\bm{\eta})\big) \Big\}. \qquad (5.2.22)

Notice that this is a simpler setup than what is used in practice: there is no $\Delta R_{\epsilon}(\hat{\bm{Z}})$ term, for instance, and we work in the supervised setting with class labels (although the techniques used to prove the following result extend readily to unsupervised formulations). Also, the consistency of the representations is only measured in a distributional sense via $\Delta R_{\epsilon}$ (though this may be substituted with a sample-wise distance such as the $\ell^2$ norm, and equivalent conclusions may be drawn, mutatis mutandis).

Under mild conditions, in order to realize the desired encoder and decoder that produce such a $\bm{Z}$ from a data source $\bm{X}$ already distributed according to a mixture of correlated low-dimensional Gaussians, a linear encoder and decoder suffice to disentangle and whiten the Gaussians. We therefore study this setting in the case where $\bm{\theta}$ and $\bm{\eta}$ parameterize matrices whose operator norms are constrained.

We want to understand what kinds of optima are learned in this setting. One suitable solution concept for this kind of game, where the decoder's optimal solution is defined solely with respect to the encoder (and not with respect to some other intrinsic property of the decoder), is a Stackelberg equilibrium in which the decoder follows the encoder. Namely, at such an equilibrium, the decoder optimizes its objective; meanwhile, the encoder optimizes its objective knowing that, whatever it picks, the decoder will pick an optimal response (which may affect the encoder's objective). In game-theoretic terms, the decoder moves second: it chooses its weights after the encoder, and both attempt to optimize their objectives in light of this. Such sequential equilibria can be learned tractably with gradient methods via alternating optimization, where the two sides use different learning rates. A detailed exposition of sequential equilibria is beyond the scope of this book; we provide more technical details in Section A.3. In this setting, we have the following result:

Theorem 5.1 ([PPC+23], Abridged).

Suppose that $\bm{X}$ is distributed on a mixture of subspaces. Under certain realistic yet technical conditions, all sequential equilibria of (5.2.22) obey:

  • The $\bm{Z}_k$ lie on orthogonal subspaces and are isotropic on those subspaces, i.e., they maximize the information gain.

  • The autoencoding is self-consistent, i.e., the subspaces spanned by $\bm{Z}_k$ and $\hat{\bm{Z}}_k$ are the same for all $k$.

This notion of self-consistency is the most one can expect when only geometric assumptions are placed on the data, i.e., no statistical assumptions. If we further assume that the columns of $\bm{X}$ are drawn from a low-rank Gaussian mixture model, then analogous versions of this theorem certify, for instance, that the $\bm{Z}_k$ are also low-rank Gaussians with isotropic covariance. Essentially, this result validates, via the simple case of Gaussian mixtures on subspaces, that minimax games optimizing the information gain and self-consistency can achieve optimal solutions.

5.3 Continuous Learning Self-Consistent Representations

5.3.1 Class-wise Incremental Learning

As we have seen, deep neural networks have demonstrated a great ability to learn representations for hundreds or even thousands of classes of objects, in both discriminative and generative contexts. However, networks typically must be trained offline, with uniformly sampled data from all classes simultaneously. It has been known that when an (open-loop) network is updated to learn new classes without data from the old ones, previously learned knowledge will fall victim to the problem of catastrophic forgetting [MC89]. This is known in neuroscience as the stability-plasticity dilemma: the challenge of ensuring that a neural system can learn from a new environment while retaining essential knowledge from previous ones [Gro87].

In contrast, natural neural systems (e.g., animal brains) do not seem to suffer from such catastrophic forgetting at all. They are capable of developing new memory of new objects while retaining memory of previously learned objects. This ability, for either natural or artificial neural systems, is often referred to as incremental learning, continual learning, sequential learning, or lifelong learning [AR20].

While many recent works have highlighted how artificial neural systems can be trained in more flexible ways, the strongest existing efforts toward answering the stability-plasticity dilemma for artificial neural networks typically require either storing raw exemplars [RKS+17, CRE+19] or providing external mechanisms [KPR+17]. Raw exemplars, particularly in the case of high-dimensional inputs like images, are costly and difficult to scale, while external mechanisms—which typically include secondary networks and representation spaces for generative replay, incremental allocation of network resources, network duplication, or explicit isolation of used and unused parts of the network—require heuristics and incur hidden costs.

Here we are interested in an incremental learning setting that is similar to nature. It counters these existing practices with two key qualities.

  1. The first is that it should be memory-based. When learning new classes, no raw exemplars of old classes are available to train the network together with new data. This implies that one has to rely on a compact and thus structured “memory” learned for old classes, such as incrementally learned generative representations of the old classes, as well as the associated encoding and decoding mappings [KK18].

  2. The second is that it should be self-contained. Incremental learning takes place in a single neural system with a fixed capacity and a common representation space. The ability to minimize forgetting is implied by optimizing an overall learning objective, without external networks, architectural modifications, or resource-allocation mechanisms.

The incoherent linear structures for features of different classes closely resemble how objects are encoded in different areas of the inferotemporal cortex of animal brains [CT17, BSM+20]. The closed-loop transcription $\bm{X}\rightarrow\bm{Z}\rightarrow\hat{\bm{X}}\rightarrow\hat{\bm{Z}}$ also resembles popularly hypothesized mechanisms for memory formation [VST+20, JT20]. This leads to a question: since memory in the brain is formed incrementally, can the above closed-loop transcription framework also support incremental learning?

LDR memory sampling and replay.

The simple linear structures of an LDR make it uniquely suited for incremental learning: the distribution of the features $\bm{Z}_j$ of each previously learned class can be explicitly and concisely represented by a principal subspace $\mathcal{S}_j$ in the feature space. To preserve the memory of an old class $j$, we only need to preserve this subspace while learning new classes. To this end, we sample $m$ representative prototype features on the subspace along its top $r$ principal components and denote them by $\bm{Z}_{j,old}$. Because of the simple linear structure of the LDR, it suffices to store the mean and covariance of the features of class $j$ after it has been learned and to sample $\bm{Z}_{j,old}$ from them as needed; the storage required is therefore extremely small. Suppose a total of $t$ old classes have been learned so far. If the prototype features $\bm{Z}_{old}\doteq[\bm{Z}^{1}_{old},\ldots,\bm{Z}^{t}_{old}]$ of all these classes are preserved when learning new classes, the subspaces $\{\mathcal{S}_j\}_{j=1}^{t}$ representing past memory will be preserved as well. Details about sampling and computing the means and covariances can be found in the work of [TDW+23].
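A minimal sketch of this memory mechanism under our reading of the text: store only the per-class mean and covariance of the learned features, then resample prototype features along the top principal directions when needed. The function names and the Gaussian resampling are illustrative assumptions, not the procedure of [TDW+23].

```python
import numpy as np

def summarize_class(Z_j):
    """Compact memory of class j: feature mean and covariance.
    Z_j has shape (d, n_j); columns are features of class j."""
    mu = Z_j.mean(axis=1)
    cov = np.cov(Z_j)          # rows of Z_j are treated as variables
    return mu, cov

def sample_prototypes(mu, cov, m, r, seed=0):
    """Sample m prototype features along the top-r principal directions."""
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    top = eigvecs[:, -r:] * np.sqrt(np.clip(eigvals[-r:], 0, None))
    coeffs = np.random.default_rng(seed).standard_normal((r, m))
    return mu[:, None] + top @ coeffs               # shape (d, m)
```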

Figure 5.10: Overall framework of our closed-loop transcription-based incremental learning of a structured LDR memory. Only a single, entirely self-contained, encoding-decoding network is needed: for a new data class $\bm{X}_{new}$, a new LDR memory $\bm{Z}_{new}$ is incrementally learned via a minimax game between the encoder and decoder, subject to the constraint that the old memory of past classes $\bm{Z}_{old}$ remains intact through the closed-loop transcription (or replay): $\bm{Z}_{old}\approx\hat{\bm{Z}}_{old}=f(g(\bm{Z}_{old}))$.
Incremental learning of an LDR with an old-memory constraint.

Notice that, with the learned autoencoding (5.2.9), one can replay and use the images, say $\hat{\bm{X}}_{old}=g(\bm{Z}_{old},\bm{\eta})$, associated with the memory features to avoid forgetting while learning new classes. This is typically how generative models have been used in prior incremental learning methods. With the closed-loop framework, however, explicitly replaying images from the features is not necessary: past memory can be effectively preserved by optimizing over the features themselves.

Consider the task of incrementally learning a new class of objects.131313Of course, one may also consider the more general setting where the task contains a small batch of new classes, without any serious modification. We denote the corresponding new sample set as $\bm{X}_{new}$ and its features as $\bm{Z}_{new}(\bm{\theta})=f(\bm{X}_{new},\bm{\theta})$. We concatenate these with the prototype features of the old classes $\bm{Z}_{old}$ to form $\bm{Z}=[\bm{Z}_{new}(\bm{\theta}),\bm{Z}_{old}]$. We denote the replayed images from all features as $\hat{\bm{X}}=[\hat{\bm{X}}_{new}(\bm{\theta},\bm{\eta}),\hat{\bm{X}}_{old}(\bm{\eta})]$, although we do not actually need to compute or use them explicitly. We only need the features of the replayed images, denoted $\hat{\bm{Z}}=f(\hat{\bm{X}},\bm{\theta})=[\hat{\bm{Z}}_{new}(\bm{\theta},\bm{\eta}),\hat{\bm{Z}}_{old}(\bm{\theta},\bm{\eta})]$.

Mirroring the motivation for the multi-class CTRL objective (5.2.20), we would like the features of the new class 𝒁new\bm{Z}_{new}bold_italic_Z start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT to be incoherent to all of the old ones 𝒁old\bm{Z}_{old}bold_italic_Z start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT. As 𝒁new\bm{Z}_{new}bold_italic_Z start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT is the only new class whose features need to be learned, the objective (5.2.20) reduces to the case where k=1k=1italic_k = 1:

min𝜼max𝜽ΔRϵ(𝒁)+ΔRϵ(𝒁^)+ΔRϵ(𝒁new,𝒁^new).\min_{\bm{\eta}}\max_{\bm{\theta}}\Delta{R_{\epsilon}(\bm{Z})}+\Delta{R_{\epsilon}(\hat{\bm{Z}})}+\Delta{R_{\epsilon}(\bm{Z}_{new},\hat{\bm{Z}}_{new})}.roman_min start_POSTSUBSCRIPT bold_italic_η end_POSTSUBSCRIPT roman_max start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z ) + roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( over^ start_ARG bold_italic_Z end_ARG ) + roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT , over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ) . (5.3.1)
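For reference, here is a minimal NumPy sketch of the coding rate $R_{\epsilon}$ and the rate reduction $\Delta R_{\epsilon}$ between two feature sets, following the standard definitions of the rate-reduction framework from earlier chapters; the exact normalization and value of $\epsilon$ used in practice may differ, so treat the numbers as illustrative:

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R_eps(Z): coding rate of a feature matrix Z (n x d) up to distortion eps."""
    n, d = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z.T @ Z)[1]

def delta_R(Z1, Z2, eps=0.5):
    """Delta R_eps(Z1, Z2): rate of the union minus the average rate of the parts,
    a distance-like measure between two feature sets."""
    Z = np.concatenate([Z1, Z2], axis=0)
    return coding_rate(Z, eps) - 0.5 * (coding_rate(Z1, eps) + coding_rate(Z2, eps))

Z = np.random.randn(256, 64)
print(delta_R(Z, Z))                          # identical sets: exactly 0
print(delta_R(Z, np.random.randn(256, 64)))   # dissimilar sets: > 0
```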

However, when we update the network parameters (𝜽,𝜼)(\bm{\theta},\bm{\eta})( bold_italic_θ , bold_italic_η ) to optimize the features for the new class, the updated mappings ffitalic_f and ggitalic_g will change features of the old classes too. Hence, to minimize the distortion of the old class representations, we can try to enforce Cov(𝒁j,old)=Cov(𝒁^j,old)\mbox{Cov}(\bm{Z}_{j,old})=\mbox{Cov}(\hat{\bm{Z}}_{j,old})Cov ( bold_italic_Z start_POSTSUBSCRIPT italic_j , italic_o italic_l italic_d end_POSTSUBSCRIPT ) = Cov ( over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_j , italic_o italic_l italic_d end_POSTSUBSCRIPT ). In other words, while learning new classes, we enforce that the memory of old classes remains “self-consistent” through the transcription loop:

𝒁oldg(𝒛,𝜼)𝑿^oldf(𝒙,𝜽)𝒁^old.\bm{Z}_{old}\xrightarrow{\hskip 5.69054ptg(\bm{z},\bm{\eta})\hskip 5.69054pt}\hat{\bm{X}}_{old}\xrightarrow{\hskip 5.69054ptf(\bm{x},\bm{\theta})\hskip 5.69054pt}\ \hat{\bm{Z}}_{old}.bold_italic_Z start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT start_ARROW start_OVERACCENT italic_g ( bold_italic_z , bold_italic_η ) end_OVERACCENT → end_ARROW over^ start_ARG bold_italic_X end_ARG start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT start_ARROW start_OVERACCENT italic_f ( bold_italic_x , bold_italic_θ ) end_OVERACCENT → end_ARROW over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT . (5.3.2)

Mathematically, this is equivalent to setting

ΔRϵ(𝒁old,𝒁^old)j=1tΔRϵ(𝒁j,old,𝒁^j,old)=0.\Delta R_{\epsilon}(\bm{Z}_{old},\hat{\bm{Z}}_{old})\doteq\sum_{j=1}^{t}\Delta R_{\epsilon}(\bm{Z}_{j,old},\hat{\bm{Z}}_{j,old})=0.roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT , over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT ) ≐ ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z start_POSTSUBSCRIPT italic_j , italic_o italic_l italic_d end_POSTSUBSCRIPT , over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_j , italic_o italic_l italic_d end_POSTSUBSCRIPT ) = 0 .

Hence, the above minimax program (5.3.1) is revised as a constrained minimax game, which we refer to as incremental closed-loop transcription (i-CTRL). The objective of this game is identical to the standard multi-class CTRL objective (5.2.20), but includes just one additional constraint:

min𝜼max𝜽\displaystyle\min_{\bm{\eta}}\max_{\bm{\theta}}roman_min start_POSTSUBSCRIPT bold_italic_η end_POSTSUBSCRIPT roman_max start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ΔRϵ(𝒁)+ΔRϵ(𝒁^)+ΔRϵ(𝒁new,𝒁^new)\displaystyle\Delta{R_{\epsilon}(\bm{Z})}+\Delta{R_{\epsilon}(\hat{\bm{Z}})}+\Delta{R_{\epsilon}(\bm{Z}_{new},\hat{\bm{Z}}_{new})}roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z ) + roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( over^ start_ARG bold_italic_Z end_ARG ) + roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT , over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ) (5.3.3)
subject toΔR(𝒁old,𝒁^old)=0.\displaystyle\mbox{subject to}\quad\Delta R(\bm{Z}_{old},\hat{\bm{Z}}_{old})=0.subject to roman_Δ italic_R ( bold_italic_Z start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT , over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT ) = 0 .

In practice, the constrained minimax program can be solved by alternating minimization and maximization between the encoder f(,𝜽)f(\cdot,\bm{\theta})italic_f ( ⋅ , bold_italic_θ ) and decoder g(,𝜼)g(\cdot,\bm{\eta})italic_g ( ⋅ , bold_italic_η ) as follows:

max𝜽\displaystyle\max_{\bm{\theta}}roman_max start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ΔRϵ(𝒁)+ΔRϵ(𝒁^)+λΔRϵ(𝒁new,𝒁^new)γΔRϵ(𝒁old,𝒁^old),\displaystyle\Delta{R_{\epsilon}(\bm{Z})}\!+\!\Delta{R_{\epsilon}(\hat{\bm{Z}})}\!+\!\lambda\cdot\Delta{R_{\epsilon}(\bm{Z}_{new},\hat{\bm{Z}}_{new})}-\gamma\cdot\Delta{R_{\epsilon}(\bm{Z}_{old},\hat{\bm{Z}}_{old})},roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z ) + roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( over^ start_ARG bold_italic_Z end_ARG ) + italic_λ ⋅ roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT , over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ) - italic_γ ⋅ roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT , over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT ) , (5.3.4)
min𝜼\displaystyle\min_{\bm{\eta}}roman_min start_POSTSUBSCRIPT bold_italic_η end_POSTSUBSCRIPT ΔRϵ(𝒁)+ΔRϵ(𝒁^)+λΔRϵ(𝒁new,𝒁^new)+γΔRϵ(𝒁old,𝒁^old);\displaystyle\Delta{R_{\epsilon}(\bm{Z})}\!+\!\Delta{R_{\epsilon}(\hat{\bm{Z}})}\!+\!\lambda\cdot\Delta{R_{\epsilon}(\bm{Z}_{new},\hat{\bm{Z}}_{new})}+\gamma\cdot\Delta{R_{\epsilon}(\bm{Z}_{old},\hat{\bm{Z}}_{old})};roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z ) + roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( over^ start_ARG bold_italic_Z end_ARG ) + italic_λ ⋅ roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT , over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ) + italic_γ ⋅ roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT , over^ start_ARG bold_italic_Z end_ARG start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT ) ; (5.3.5)

where the constraint $\Delta R_{\epsilon}(\bm{Z}_{old},\hat{\bm{Z}}_{old})=0$ in (5.3.3) has been converted (and relaxed) to a Lagrangian term with coefficient $\gamma$, whose sign differs between the maximization and minimization steps. We also introduce a coefficient $\lambda$ to weight the rate reduction term associated with the new data.
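A schematic PyTorch version of the alternating updates (5.3.4)-(5.3.5) is sketched below on toy MLPs and random data. The multi-class terms $\Delta R_{\epsilon}(\bm{Z})$ and $\Delta R_{\epsilon}(\hat{\bm{Z}})$ are approximated here by a two-group (new versus old) rate reduction for brevity, and all architectures, data, and hyperparameters are placeholders rather than the settings of [TDW+23]:

```python
import torch
import torch.nn as nn

def coding_rate(Z, eps=0.5):
    n, d = Z.shape
    return 0.5 * torch.logdet(torch.eye(d) + (d / (n * eps**2)) * Z.T @ Z)

def delta_R(Z1, Z2, eps=0.5):
    Z = torch.cat([Z1, Z2], dim=0)
    return coding_rate(Z, eps) - 0.5 * (coding_rate(Z1, eps) + coding_rate(Z2, eps))

# Toy encoder f(., theta) and decoder g(., eta) on flattened "images" of dimension D.
D, d = 256, 32
f = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, d))
g = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, D))
opt_theta = torch.optim.Adam(f.parameters(), lr=1e-4)
opt_eta = torch.optim.Adam(g.parameters(), lr=1e-4)
lam, gam = 1.0, 1.0        # lambda and gamma in (5.3.4)-(5.3.5); placeholder values

def rate_terms(X_new, Z_old):
    """Rate-reduction terms shared by the max- and min-steps.
    The multi-class terms are approximated by a new-vs-old split for brevity."""
    Z_new = f(X_new)
    Z_hat = f(g(torch.cat([Z_new, Z_old], dim=0)))      # features of replayed images
    Z_hat_new, Z_hat_old = Z_hat[:len(Z_new)], Z_hat[len(Z_new):]
    expand = delta_R(Z_new, Z_old) + delta_R(Z_hat_new, Z_hat_old)
    new_term = delta_R(Z_new, Z_hat_new)                # consistency for the new class
    old_term = delta_R(Z_old, Z_hat_old)                # self-consistency of old memory
    return expand, new_term, old_term

X_new = torch.randn(64, D)      # a batch from the new class
Z_old = torch.randn(64, d)      # stored prototype features of old classes

# One max-step over the encoder parameters theta, cf. (5.3.4).
expand, new_term, old_term = rate_terms(X_new, Z_old)
(-(expand + lam * new_term - gam * old_term)).backward()
opt_theta.step(); opt_theta.zero_grad(); opt_eta.zero_grad()

# One min-step over the decoder parameters eta, cf. (5.3.5).
expand, new_term, old_term = rate_terms(X_new, Z_old)
(expand + lam * new_term + gam * old_term).backward()
opt_eta.step(); opt_eta.zero_grad(); opt_theta.zero_grad()
```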

Jointly optimal memory via incremental reviewing.

As we will see, the above constrained minimax program can already achieve state-of-the-art performance for incremental learning. Nevertheless, developing an optimal memory for all classes cannot rely on graceful forgetting alone. Even for humans, if an object class is learned only once, we should expect the learned memory to fade as we continue to learn new ones, unless the memory can be consolidated by reviewing old object classes.

To emulate this phase of memory forming, after incrementally learning a whole dataset, we may go back and review all classes again, one class at a time. We refer to going through all classes once as one reviewing “cycle”.141414To distinguish it from the term “epoch” used in the conventional joint learning setting. If needed, multiple reviewing cycles can be conducted. It is to be expected that reviewing can improve the learned (LDR) memory. Somewhat surprisingly, however, the closed-loop framework allows us to review even in a “class-unsupervised” manner: when reviewing data of an old class, say $\bm{X}_{j}$, the system does not need the class label and can simply treat $\bm{X}_{j}$ as a new class $\bm{X}_{new}$. That is, the system optimizes the same constrained minimax program (5.3.3) without any modification; after the system is optimized, one can identify the newly learned subspace spanned by $\bm{Z}_{new}$, and use it to replace or merge with the old subspace $\mathcal{S}_{j}$. As our experiments show, such a class-unsupervised incremental review process can gradually improve both the discriminative and generative performance of the LDR memory, eventually converging to that of a jointly learned memory.

Experimental verification.

We show some experimental results on the MNIST [LBB+98a] and CIFAR-10 [KNH14] datasets. All experiments are conducted in the more challenging class-incremental learning (class-IL) setting. For both MNIST and CIFAR-10, the 10 classes are split into 5 tasks with 2 classes each or 10 tasks with 1 class each. For the encoder $f$ and decoder $g$, we adopt a very simple network architecture modified from DCGAN [RMC16], which is merely a four-layer convolutional network. Here we only show some qualitative visual results; more experiments and analysis can be found in the work of [TDW+23].
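For readers who want a concrete picture of such an architecture, the sketch below shows what a four-layer DCGAN-style encoder/decoder pair for $32\times 32$ RGB inputs might look like in PyTorch; the channel widths, normalization layers, and the choice to normalize features onto the unit sphere are illustrative assumptions, not the exact configuration used in the experiments:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Four-layer convolutional encoder f: 3x32x32 image -> d-dim feature."""
    def __init__(self, d=128, nc=3, nf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(nc, nf, 4, 2, 1), nn.LeakyReLU(0.2),                               # 32 -> 16
            nn.Conv2d(nf, nf * 2, 4, 2, 1), nn.BatchNorm2d(nf * 2), nn.LeakyReLU(0.2),   # 16 -> 8
            nn.Conv2d(nf * 2, nf * 4, 4, 2, 1), nn.BatchNorm2d(nf * 4), nn.LeakyReLU(0.2),  # 8 -> 4
            nn.Conv2d(nf * 4, d, 4, 1, 0),                                               # 4 -> 1
        )
    def forward(self, x):
        z = self.net(x).flatten(1)
        return nn.functional.normalize(z, dim=1)   # place features on the sphere (a common choice)

class Decoder(nn.Module):
    """Four-layer transposed-convolutional decoder g: d-dim feature -> 3x32x32 image."""
    def __init__(self, d=128, nc=3, nf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(d, nf * 4, 4, 1, 0), nn.BatchNorm2d(nf * 4), nn.ReLU(),       # 1 -> 4
            nn.ConvTranspose2d(nf * 4, nf * 2, 4, 2, 1), nn.BatchNorm2d(nf * 2), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(nf * 2, nf, 4, 2, 1), nn.BatchNorm2d(nf), nn.ReLU(),          # 8 -> 16
            nn.ConvTranspose2d(nf, nc, 4, 2, 1), nn.Tanh(),                                  # 16 -> 32
        )
    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

x = torch.randn(8, 3, 32, 32)
f, g = Encoder(), Decoder()
x_hat = g(f(x))      # same shape as x
```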

Visualizing auto-encoding properties.

We begin by qualitatively visualizing some representative images $\bm{X}$ and the corresponding replayed $\hat{\bm{X}}$ on MNIST and CIFAR-10. The model is learned incrementally with each dataset split into 5 tasks. Results are shown in Figure 5.11, where we observe that the reconstructed $\hat{\bm{X}}$ preserves the main visual characteristics of $\bm{X}$, including shapes and textures. For a simpler dataset like MNIST, the replayed $\hat{\bm{X}}$ are almost identical to the input $\bm{X}$! This is rather remarkable given that: (1) our method does not explicitly enforce $\hat{\bm{x}}\approx\bm{x}$ for individual samples, as most autoencoding methods do; and (2) after having incrementally learned all classes, the generator has not forgotten how to generate digits learned earlier, such as 0, 1, and 2. For a more complex dataset like CIFAR-10, the replayed images also exhibit good visual quality, faithfully capturing the essence of each image.

(a) MNIST $\bm{X}$
(b) MNIST $\hat{\bm{X}}$
(c) CIFAR-10 $\bm{X}$
(d) CIFAR-10 $\hat{\bm{X}}$
Figure 5.11: Visualizing the auto-encoding property of the learned autoencoding ($\hat{\bm{X}}=g\circ f(\bm{X})$).
Principal subspaces of the learned features.

Most generative memory-based methods utilize autoencoders, VAEs, or GANs for replay purposes, and the structure or distribution of the learned features $\bm{Z}_{j}$ of each class in the feature space is unclear. The features $\bm{Z}_{j}$ of the LDR memory, on the other hand, have a clear linear structure. Figure 5.12 visualizes the correlations among all learned features, $|\bm{Z}^{\top}\bm{Z}|$, in which we observe clear block-diagonal patterns for both datasets.151515Notice that these patterns closely resemble the similarity matrix of response profiles of object categories from different areas of the inferotemporal cortex, as shown in Extended Data Fig. 3 of [BSM+20]. This indicates that the features $\bm{Z}_{j}$ of different classes indeed lie on subspaces that are incoherent from one another. Hence, the features of each class can be well modeled as a principal subspace in the feature space.
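The block-diagonal pattern itself is straightforward to reproduce for any set of class-labeled features: normalize each feature vector, sort the samples by class, and plot the absolute Gram matrix. A small self-contained sketch, with synthetic features standing in for the learned $\bm{Z}$:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in: features of 10 "classes", each near its own random subspace.
d, per_class, r = 128, 50, 8
rng = np.random.default_rng(0)
Z, labels = [], []
for j in range(10):
    U = np.linalg.qr(rng.standard_normal((d, r)))[0]          # basis of class-j subspace
    Z.append((U @ rng.standard_normal((r, per_class))).T)
    labels += [j] * per_class
Z = np.concatenate(Z, axis=0)
Z /= np.linalg.norm(Z, axis=1, keepdims=True)                 # unit-norm features

order = np.argsort(labels)                                    # sort samples by class
G = np.abs(Z[order] @ Z[order].T)                             # sample-by-sample |Z^T Z|
plt.imshow(G, cmap="viridis"); plt.title(r"$|Z^\top Z|$"); plt.colorbar(); plt.show()
```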

Figure 5.12: Block diagonal structure of |𝒁𝒁||\bm{Z}^{\top}\bm{Z}|| bold_italic_Z start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_Z | in the feature space for MNIST (left) and CIFAR-10 (right).
Replayed images of samples along principal components.

Since the features of each class can be modeled as a principal subspace, we further visualize the individual principal components within each of these subspaces. Figure 5.13 shows images replayed from features sampled along the top-4 principal components of different classes, on MNIST and CIFAR-10 respectively. Each row shows samples along one principal component; they clearly share similar visual characteristics while being distinctly different from those in other rows. We see that the model remembers different poses of ‘4’ after having learned all remaining classes. For CIFAR-10, the incrementally learned memory remembers representative poses and shapes of horses and ships.
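One simple way to produce such a visualization is sketched below, with a toy decoder standing in for the learned $g$ and random features standing in for a learned class: compute the principal directions of the class features, select the stored features closest in angle to each direction, and decode them.

```python
import numpy as np
import torch
import torch.nn as nn

def nearest_to_principal_dirs(Z_j, num_dirs=4, per_dir=5):
    """For class features Z_j (n x d), return indices of the per_dir features
    closest in angle to each of the top num_dirs principal directions."""
    Zc = Z_j - Z_j.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)          # rows of Vt: principal directions
    Zn = Z_j / np.linalg.norm(Z_j, axis=1, keepdims=True)
    picks = []
    for v in Vt[:num_dirs]:
        cos = np.abs(Zn @ v)                                    # angular closeness to the direction
        picks.append(np.argsort(-cos)[:per_dir])
    return picks

# Toy stand-ins for learned class features and the decoder g.
Z_j = np.random.randn(500, 128).astype(np.float32)
g = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32))

for k, idx in enumerate(nearest_to_principal_dirs(Z_j)):
    x_hat = g(torch.from_numpy(Z_j[idx]))        # one row of replayed images per direction
    print(f"principal direction {k}: decoded batch of shape {tuple(x_hat.shape)}")
```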

(a) sampled $\hat{\bm{x}}_{old}$ of ‘4’
(b) sampled $\hat{\bm{x}}_{old}$ of ‘7’
(c) sampled $\hat{\bm{x}}_{old}$ of ‘horse’
(d) sampled $\hat{\bm{x}}_{old}$ of ‘ship’
Figure 5.13: Visualization of 5 reconstructed 𝒙^=g(𝒛)\hat{\bm{x}}=g(\bm{z})over^ start_ARG bold_italic_x end_ARG = italic_g ( bold_italic_z ) from 𝒛\bm{z}bold_italic_z’s with the closest distance to (top-4) principal components of learned features for MNIST (class ‘4’ and class ‘7’) and CIFAR-10 (class ‘horse’ and ‘ship’).
Effectiveness of incremental reviewing.

We now verify how the incrementally learned LDR memory can be further consolidated by the unsupervised incremental reviewing phase described above. Experiments are conducted on CIFAR-10 with 10 incremental steps (one class per step). Figure 5.14 (left) shows replayed images of the first class, ‘airplane’, at the end of incremental learning of all ten classes, sampled along the top-3 principal components; every two rows (16 images) correspond to one principal direction. Their visual quality remains decent: we observe almost no forgetting. The right figure shows replayed images after reviewing the first class once. We notice a significant improvement in visual quality after reviewing, and the principal components of the features in the subspace start to correspond to distinctly different visual attributes within the same class.

Figure 5.14: Visualization of replayed images 𝒙^old\hat{\bm{x}}_{old}over^ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT of class 1-‘airplane’ in CIFAR-10, before (left) and after (right) one reviewing cycle.

5.3.2 Sample-wise Continuous Unsupervised Learning

As we know, the closed-loop CTRL formulation can already learn a decent autoencoding, even without class information, with the CTRL-Binary program:

max𝜽min𝜼ΔRϵ(𝒁,𝒁^)\displaystyle\max_{\bm{\theta}}\min_{\bm{\eta}}\quad\Delta R_{\epsilon}(\bm{Z},\hat{\bm{Z}})roman_max start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_min start_POSTSUBSCRIPT bold_italic_η end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z , over^ start_ARG bold_italic_Z end_ARG ) (5.3.6)

However, note that (5.3.6) is practically limited because it only aligns the dataset 𝑿\bm{X}bold_italic_X and the regenerated 𝑿^\hat{\bm{X}}over^ start_ARG bold_italic_X end_ARG at the distribution level. There is no guarantee that each sample 𝒙\bm{x}bold_italic_x would be close to the decoded 𝒙^=g(f(𝒙))\hat{\bm{x}}=g(f(\bm{x}))over^ start_ARG bold_italic_x end_ARG = italic_g ( italic_f ( bold_italic_x ) ).

Sample-wise constraints for unsupervised transcription.

To improve discriminative and generative properties of representations learned in the unsupervised setting, we propose two additional mechanisms for the above CTRL-Binary maximin game (5.3.6). For simplicity and uniformity, here these will be formulated as equality constraints over rate reduction measures, but in practice they can be enforced softly during optimization.

Sample-wise self-consistency via closed-loop transcription.

First, to address the issue that CTRL-Binary does not learn a sample-wise consistent autoencoding, we need to promote $\hat{\bm{x}}=g(f(\bm{x}))$ being close to $\bm{x}$ for each sample. In the CTRL framework, this can be achieved by enforcing that the corresponding features $\bm{z}=f(\bm{x})$ and $\hat{\bm{z}}=f(\hat{\bm{x}})$ are close: we want the distance between $\bm{z}$ and $\hat{\bm{z}}$ to be zero or small for all $N$ samples. This distance can be measured by the rate reduction:

iNΔRϵ(𝒛i,𝒛^i)=0.\displaystyle\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\hat{\bm{z}}^{i})=0.\vspace{-2mm}∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , over^ start_ARG bold_italic_z end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) = 0 . (5.3.7)

Note that this again avoids measuring differences in the image space.

Self-supervision via compressing augmented samples.

Since we do not have any class label information about the samples in the unsupervised setting, the best we can do is to view every sample and its augmentations (say via translation, rotation, occlusion, etc.) as one “class”, a basic idea behind almost all self-supervised learning methods. In the rate reduction framework, it is natural to compress the features of each sample together with those of its augmentations. Here we adopt the standard transformations used in SimCLR [CKN+20] and denote such a transformation as $\tau$. We denote each augmented sample as $\bm{x}_{a}=\tau(\bm{x})$ and its corresponding feature as $\bm{z}_{a}=f(\bm{x}_{a},\bm{\theta})$. For discriminative purposes, we would like the classifier to be invariant to such transformations. Hence it is natural to enforce that the features $\bm{z}_{a}$ of all augmentations are the same as the feature $\bm{z}$ of the original sample $\bm{x}$. This is equivalent to requiring the distance between $\bm{z}$ and $\bm{z}_{a}$, again measured in terms of rate reduction, to be zero (or small) for all $N$ samples:

iNΔRϵ(𝒛i,𝒛ai)=0.\displaystyle\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\bm{z}_{a}^{i})=0.\vspace{-3mm}∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , bold_italic_z start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) = 0 . (5.3.8)
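For concreteness, the transformation $\tau$ can be instantiated with standard SimCLR-style image augmentations via torchvision (assuming a recent version that applies transforms directly to tensors); the particular parameter values below are typical choices, not necessarily those used in the referenced experiments:

```python
import torch
from torchvision import transforms

# SimCLR-style augmentation tau for 32x32 images (typical parameter values).
tau = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
])

x = torch.rand(3, 32, 32)       # a tensor image with values in [0, 1]
x_a = tau(x)                    # an augmented view; z_a = f(x_a) in the text
```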
Unsupervised representation learning via closed-loop transcription.

So far, we have seen that the CTRL-Binary objective $\Delta R_{\epsilon}(\bm{Z},\hat{\bm{Z}})$ in (5.3.6) helps align the two distributions, while sample-wise self-consistency (5.3.7) and sample-wise augmentation (5.3.8) help align and compress the features associated with each sample. Besides consistency, we also want the learned representations to be maximally discriminative for different samples (here viewed as different “classes”). Notice that the rate distortion term $R_{\epsilon}(\bm{Z})$ measures the coding rate (hence the volume) of all features; maximizing it expands the features so that those of different samples spread out and become discriminative from one another.

Unsupervised CTRL.

Putting these elements together, we propose to learn a representation via the following constrained maximin program, which we refer to as unsupervised CTRL (u-CTRL):

maxθminη\displaystyle\max_{\theta}\min_{\eta}\quadroman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_min start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT Rϵ(𝒁)+ΔRϵ(𝒁,𝒁^)\displaystyle R_{\epsilon}(\bm{Z})+\Delta R_{\epsilon}(\bm{Z},\hat{\bm{Z}})italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z ) + roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z , over^ start_ARG bold_italic_Z end_ARG ) (5.3.9)
subject to iNΔRϵ(𝒛i,𝒛^i)=0,andiNΔRϵ(𝒛i,𝒛ai)=0.\displaystyle\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\hat{\bm{z}}^{i})=0,\;\;\mbox{and}\;\;\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\bm{z}_{a}^{i})=0.∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , over^ start_ARG bold_italic_z end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) = 0 , and ∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , bold_italic_z start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) = 0 .

Figure 5.15 illustrates the overall architecture of the closed-loop system associated with this program.

Figure 5.15: Overall framework of closed-loop transcription for unsupervised learning. Two additional constraints are imposed on the CTRL-Binary program (5.3.6): 1) self-consistency between the sample-wise features $\bm{z}^{i}$ and $\hat{\bm{z}}^{i}$, i.e., $\bm{z}^{i}\approx\hat{\bm{z}}^{i}$; and 2) invariance/similarity between the features of a sample and its augmentations, i.e., $\bm{z}^{i}\approx\bm{z}_{a}^{i}=f(\tau(\bm{x}^{i}),\bm{\theta})$, where $\bm{x}^{i}_{a}=\tau(\bm{x}^{i})$ is an augmentation of sample $\bm{x}^{i}$ via some transformation $\tau(\cdot)$.

In practice, the above program can be optimized by alternating maximization and minimization between the encoder f(,𝜽)f(\cdot,\bm{\theta})italic_f ( ⋅ , bold_italic_θ ) and the decoder g(,𝜼)g(\cdot,\bm{\eta})italic_g ( ⋅ , bold_italic_η ). We adopt the following optimization strategy that works well in practice, which is used for all subsequent experiments on real image datasets:

maxθRϵ(𝒁)+ΔRϵ(𝒁,𝒁^)λ1iNΔRϵ(𝒛i,𝒛ai)λ2iNΔRϵ(𝒛i,𝒛^i);\displaystyle\max_{\theta}\;R_{\epsilon}(\bm{Z})+\Delta{R_{\epsilon}(\bm{Z},\hat{\bm{Z}})-\lambda_{1}\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\bm{z}_{a}^{i})}-\lambda_{2}\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\hat{\bm{z}}^{i});roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z ) + roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z , over^ start_ARG bold_italic_Z end_ARG ) - italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , bold_italic_z start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) - italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , over^ start_ARG bold_italic_z end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) ; (5.3.10)
minηRϵ(𝒁)+ΔRϵ(𝒁,𝒁^)+λ1iNΔRϵ(𝒛i,𝒛ai)+λ2iNΔRϵ(𝒛i,𝒛^i),\displaystyle\min_{\eta}\;R_{\epsilon}(\bm{Z})+\Delta{R_{\epsilon}(\bm{Z},\hat{\bm{Z}})+\lambda_{1}\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\bm{z}_{a}^{i})+\lambda_{2}\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\hat{\bm{z}}^{i})},roman_min start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z ) + roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z , over^ start_ARG bold_italic_Z end_ARG ) + italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , bold_italic_z start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) + italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , over^ start_ARG bold_italic_z end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) , (5.3.11)

where the constraints iNΔRϵ(𝒛i,𝒛^i)=0\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\hat{\bm{z}}^{i})=0∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , over^ start_ARG bold_italic_z end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) = 0 and iNΔRϵ(𝒛i,𝒛ai)=0\sum_{i\in N}\Delta R_{\epsilon}(\bm{z}^{i},\bm{z}_{a}^{i})=0∑ start_POSTSUBSCRIPT italic_i ∈ italic_N end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , bold_italic_z start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) = 0 in (5.3.9) have been converted (and relaxed) to Lagrangian terms with corresponding coefficients λ1\lambda_{1}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and λ2\lambda_{2}italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.161616Notice that computing the rate reduction terms ΔR\Delta Rroman_Δ italic_R for all samples or a batch of samples requires computing the expensive logdet\log\detroman_log roman_det of large matrices. In practice, from the geometric meaning of ΔR\Delta Rroman_Δ italic_R for two vectors, ΔR\Delta Rroman_Δ italic_R can be approximated with an 2\ell^{2}roman_ℓ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT norm or the cosine distance between two vectors.
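Following the remark in the footnote, the per-sample terms can be replaced by a cheap cosine-distance surrogate. The sketch below assembles the encoder-side loss of (5.3.10) under this approximation; the surrogate, the $\lambda$ values, and the random inputs are all placeholders for illustration:

```python
import torch
import torch.nn.functional as F

def coding_rate(Z, eps=0.5):
    n, d = Z.shape
    return 0.5 * torch.logdet(torch.eye(d) + (d / (n * eps**2)) * Z.T @ Z)

def delta_R(Z1, Z2, eps=0.5):
    Z = torch.cat([Z1, Z2], dim=0)
    return coding_rate(Z, eps) - 0.5 * (coding_rate(Z1, eps) + coding_rate(Z2, eps))

def samplewise_term(Z1, Z2):
    """Cheap surrogate for the per-sample Delta R terms: mean cosine distance."""
    return (1.0 - F.cosine_similarity(Z1, Z2, dim=1)).mean()

def u_ctrl_encoder_loss(Z, Z_hat, Z_a, lam1=1.0, lam2=1.0, eps=0.5):
    """Negative of the encoder-side objective (5.3.10), to be minimized over theta."""
    obj = coding_rate(Z, eps) + delta_R(Z, Z_hat, eps) \
          - lam1 * samplewise_term(Z, Z_a) - lam2 * samplewise_term(Z, Z_hat)
    return -obj

# Toy check on random features standing in for Z, Z_hat = f(g(Z)), and Z_a = f(tau(X)).
Z = torch.randn(128, 64, requires_grad=True)
loss = u_ctrl_encoder_loss(Z, Z + 0.1 * torch.randn_like(Z), Z + 0.1 * torch.randn_like(Z))
loss.backward()
```

The decoder-side loss (5.3.11) flips the sign of the two surrogate terms and is minimized over $\bm{\eta}$ in the same alternating fashion as before.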

The above representation is learned without any class information. For it to facilitate discriminative or generative tasks, it must be highly structured. It has been verified experimentally that this is indeed the case, and that u-CTRL demonstrates significant advantages over other incremental or unsupervised learning methods [TDC+24]. Here we only illustrate some qualitative results from the experiment on the CIFAR-10 dataset [KNH14], with standard augmentations for self-supervised learning [CKN+20]. One may refer to [TDC+24] for experiments on more and larger datasets and their quantitative evaluations.

As one can see from the experiments, specific and unique structure indeed emerges naturally in the representations learned using u-CTRL: globally, features of images in the same class tend to be clustered well together and separated from other classes, as shown in Figure 5.16; locally, features around individual samples exhibit approximately piecewise linear low-dimensional structures, as shown in Figure 5.17.

Figure 5.16: Emergence of block-diagonal structures of |𝒁𝒁||\bm{Z}^{\top}\bm{Z}|| bold_italic_Z start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_Z | in the feature space for CIFAR-10.
(a) u-CTRL
(b) MoCoV2
Figure 5.17: t-SNE visualizations of learned features of CIFAR-10 with different models.
Unsupervised conditional image generation via rate reduction.

The highly-structured feature distribution also suggests that the learned representation can be very useful for generative purposes. For example, we can organize the sample features into meaningful clusters, and model them with low-dimensional (Gaussian) distributions or subspaces. By sampling from these compact models, we can conditionally regenerate meaningful samples from computed clusters. This is known as unsupervised conditional image generation [HKJ+21].

To cluster the features, we exploit the fact that the rate reduction framework (3.4.12) is inspired by unsupervised clustering via compression [MDH+07a], which provides a principled way to find the membership $\bm{\Pi}$. Concretely, we maximize the same rate reduction objective (3.4.12) over $\bm{\Pi}$ while keeping the learned representation $\bm{Z}$ fixed. We view the membership $\bm{\Pi}$ as a nonlinear function of the features $\bm{Z}$, say $h_{\bm{\pi}}(\cdot,\xi)\colon\bm{Z}\mapsto\bm{\Pi}$ with parameters $\xi$. In practice, we model this function with a simple neural network, such as an MLP head applied to the output feature $\bm{z}$. To estimate a “pseudo” membership $\hat{\bm{\Pi}}$ of the samples, we solve the following optimization problem over the parameters $\xi$:

𝚷^=argmaxξΔRϵ(𝒁|𝚷(ξ)).\displaystyle\hat{\bm{\Pi}}=\arg\max_{\xi}\Delta R_{\epsilon}(\bm{Z}|\bm{\Pi}(\xi)).over^ start_ARG bold_Π end_ARG = roman_arg roman_max start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( bold_italic_Z | bold_Π ( italic_ξ ) ) . (5.3.12)
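A schematic implementation of this clustering step is sketched below: an MLP head produces a soft membership $\bm{\Pi}(\xi)$ via a softmax, and its parameters are trained to maximize the rate reduction of the fixed features $\bm{Z}$ conditioned on $\bm{\Pi}$. The conditional coding rate follows the standard rate-reduction definition; details such as the plain softmax head and the hyperparameters are illustrative assumptions, not the exact recipe of the referenced work.

```python
import torch
import torch.nn as nn

def coding_rate(gram, n_eff, d, eps=0.5):
    return 0.5 * torch.logdet(torch.eye(d) + (d / (n_eff * eps**2)) * gram)

def delta_R_given_Pi(Z, Pi, eps=0.5):
    """Rate reduction of features Z (n x d) under soft membership Pi (n x k):
    rate of all features minus the membership-weighted sum of per-cluster rates."""
    n, d = Z.shape
    total = coding_rate(Z.T @ Z, n, d, eps)
    compressed = 0.0
    for j in range(Pi.shape[1]):
        w = Pi[:, j]                                  # soft weights of cluster j
        nj = w.sum() + 1e-8
        gram_j = Z.T @ (w.unsqueeze(1) * Z)           # Z^T diag(w) Z
        compressed = compressed + (nj / n) * coding_rate(gram_j, nj, d, eps)
    return total - compressed

d, k = 64, 10
head = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, k))   # h_pi(., xi)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

Z = torch.randn(512, d)                               # fixed, already-learned features
for _ in range(100):
    Pi = torch.softmax(head(Z), dim=1)                # soft pseudo-membership Pi(xi)
    loss = -delta_R_given_Pi(Z, Pi)                   # maximize Delta R over xi
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    pseudo_labels = torch.softmax(head(Z), dim=1).argmax(dim=1)
```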

In Figure 5.18, we visualize images generated from the ten unsupervised clusters found by (5.3.12). Each block represents one cluster, and each row corresponds to one principal component of that cluster. Despite being trained without labels, the model not only organizes samples into correct clusters, but also preserves the statistical diversity within each cluster/class. We can easily recover this diversity by computing the different principal components and then sampling and generating accordingly. While the experiments presented here are somewhat limited in scale, we will explore more direct and powerful methods that utilize the learned data distributions and representations for conditional generation and estimation in the next chapter.

Figure 5.18: Unsupervised conditional image generation from each cluster of CIFAR-10, using u-CTRL. Images in different rows are generated from different principal components of each cluster.

5.4 Summary and Notes

Historically, autoencoding has been one of the important drivers of research innovation in neural networks for learning, although the most practically impressive demonstrations of deep learning have probably been in other domains (such as discriminative classification, with AlexNet [KSH12], or generative modeling with GPT architectures [BMR+20]). Works we have featured throughout the chapter, especially the work of [HS06], served as catalysts of research interest in neural networks during times when they were otherwise not prominent in the machine learning research landscape. In modern practice, autoencoders remain core components of many large-scale systems for generating highly structured data such as visual data, speech data, and molecular data, especially the VQ-VAE approach [OVK17], which builds on the variational autoencoder methodology we discussed in Section 5.1.4. The core problem of autoencoding remains of paramount intellectual importance due to its close connection with representation learning, and we anticipate that it will reappear on the radar of practical researchers in the future as efficiency in training and deploying large models continues to become more important.

Materials presented in the second half of this chapter are based on a series of recent works on the topic of closed-loop transcription: [DTL+22], [PPC+23], [TDW+23], and [TDC+24]. In particular, Section 5.2.1 is based on the pioneering work of [DTL+22]. Subsequently, the work of [PPC+23] provided strong theoretical justifications for the closed-loop framework, at least in an idealized case. Section 5.3.1 and Section 5.3.2 are based on the works of [TDW+23] and [TDC+24], respectively. They demonstrate that the closed-loop framework naturally supports incremental and continuous learning, in either a class-wise or a sample-wise setting. The reader may refer to these papers for more technical and experimental details.

Shallow vs. deep neural networks, for autoencoding and more.

In Section 5.1.2, we discussed Cybenko’s universal approximation theorem, which states that in principle, a neural network with a single hidden layer (and suitable elementwise nonlinearities) is sufficient to approximate any suitably regular target function. In practice, however, the major architectural reason for the dominance of neural networks has been the refinement of techniques for training deeper networks. Why is depth necessary? From a fundamental point of view, depth separations, namely constructions of settings in which a deeper neural network can approximate a given class of target functions with exponentially superior efficiency relative to a shallow network, have been studied at great length in the theoretical literature; examples include [Tel16, BN20, VJO+21]. The ease of training deeper networks in practice has not received as satisfying an answer from theory. ResNets [HZR+16a] represent the pioneering empirical work on making deeper networks more easily trainable, and are used in nearly all modern architectures in some form. Theoretical studies have focused heavily on the trainability of very deep networks, quantified via the initial neural tangent kernel [BGW21, MBD+21], but these studies have not given significant insight into the trainability benefits of deeper networks in the middle and late stages of training (but see [YH21] for the root of a line of research attempting to address this).

5.5 Exercises and Extensions

Exercise 5.1 (Conceptual Understanding of Manifold Flattening).

Consider data lying on a curved manifold $\mathcal{M}$ embedded in $\mathbb{R}^{D}$ (like a curved surface in 3-dimensional space), as discussed in the manifold flattening subsection of Section 5.1.2. In this exercise, we will describe the basic ingredients of the manifold flattening algorithm from [PPR+24]. A manifold is called flat if it is an open set in Euclidean space (or more generally, an open set in a subspace).

  1.

    Suppose the manifold \mathcal{M}caligraphic_M is a graph: this means that D1×D2\mathcal{M}\subset\mathbb{R}^{D_{1}}\times\mathbb{R}^{D_{2}}caligraphic_M ⊂ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (say), and that there is a function F:D1D2F:\mathbb{R}^{D_{1}}\to\mathbb{R}^{D_{2}}italic_F : blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT such that

    \mathcal{M}=\{(\bm{x},F(\bm{x}))\mid\bm{x}\in\mathbb{R}^{D_{1}}\}. \quad (5.5.1)

    Give an integer D3D_{3}italic_D start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT and a map f:D1×D2D3f:\mathbb{R}^{D_{1}}\times\mathbb{R}^{D_{2}}\to\mathbb{R}^{D_{3}}italic_f : blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT that flattens \mathcal{M}caligraphic_M, and describe the corresponding (lossless) reconstruction procedure from the flattened representation.

  2.

    Now suppose that \mathcal{M}caligraphic_M is a general smooth manifold. Smooth manifolds have the property that they are locally well-approximated by subspaces near each point. Describe in intuitive terms how to flatten the manifold \mathcal{M}caligraphic_M locally at each point, by relating it to a graph. (The algorithm of [PPR+24] performs a refined version of this local flattening process in a way that allows them to be glued together to form a global flattening.)

Exercise 5.2 (Reproduce Closed-Loop Transcription).

Implement a closed-loop transcription pipeline for representation learning on the CIFAR-10 dataset, following the methodology in Section 5.2.1. Refer to [DTL+22] for useful hyperparameters and architecture settings. Reproduce the result in Figure 5.8.