“What I cannot create, I do not understand.”
— Richard Feynman
In previous chapters, we have shown how to identify low-dimensional structures in high-dimensional spaces, mainly focusing on linear structures. For example, we introduced principal component analysis (PCA) to learn the linear denoiser when the observed data follow the statistical model . In this setting, the learned representations are linearly transformed input data . Under the linear model assumption, one can learn the low-dimensional linear structure with efficient optimization algorithms and strong theoretical guarantees. Moreover, the linear model assumption covers a wide range of applications and problems, including face recognition, magnetic resonance image recovery, and structure texture recovery [WM22].
On the other hand, the linear model can be limited when dealing with real-world applications, especially when the input data are complex, such as speech and natural language, images and videos, and robotic motions. The low-dimensional distributions of such data are typically nonlinear. How to deal with nonlinearity has a long history across different disciplines such as control theory, signal processing, and pattern recognition. There have been considerable efforts to extend methods and solutions for linear models to handle nonlinearity, including early efforts to extend PCA to nonlinear PCA (as we will study in more detail in Chapter 5). In most cases, the methods are designed based on certain assumptions about the data distributions and tailored to specific problems.
More recently, deep neural networks have achieved remarkable success across a wide range of data and applications. A neural network
(4.0.1) |
can learn effective features/representations for downstream applications. For example, a trained deep neural network can be applied to map images to feature vectors, that is, , while a linear classifier can be learned on top of such representations . One notable breakthrough is AlexNet [KSH12], a deep convolutional neural network trained with more than a million natural images, outperforming all previous approaches that were based on hand-crafted features. One of the key differences between AlexNet and previous approaches is that the former learns parameters of the nonlinear transformation from massive amounts of data trained with back-propagation (BP) [RHW86a], as detailed in Section A.2.3 of Appendix A.
Subsequent popular practice models the mapping with other empirically designed artificial deep neural networks and learns the parameters from random initialization via BP. Starting with AlexNet [KSH12], the architectures of modern deep networks have continued to be empirically revised and improved. Network architectures such as VGG [SZ14], ResNet [HZR+16a], DenseNet [HLV+17], CNNs, RNNs or LSTMs [HS97], Transformers [VSP+17], and mixtures of experts (MoE) [SMM+17, FZS22] have continued to push the performance envelope. As part of the effort to improve the performance of deep networks, almost every component of the networks has been empirically scrutinized, and various revisions and improvements have been proposed. These include, but are not limited to, nonlinear activation functions [MHN13, KUM+17, XWC+15, NIG+18], skip connections [RFB15, HZR+16a], normalizations [IS15, BKH16, UVL16, WH18, MKK+18], up/down sampling or pooling [SMB10], and convolutions [LBB+98a, KSH12]. However, almost all such modifications have been developed through years of empirical trial and error or ablation studies. Some recent practice even takes this to the extreme by searching for effective network structures and training strategies through extensive random search techniques, such as Neural Architecture Search [ZL17, BGN+17], AutoML [HKV19], and Learning to Learn [ADG+16].
Despite the wide application of deep neural networks, it is not clear what the underlying design principles of such a constructed network are. In particular, it is not clear what mathematical function each layer of the network performs. In this chapter, based on the results from previous chapters, we develop a principled framework that will provide a fully rigorous mathematical interpretation of the role of a deep network, including its individual layers and the network as a whole.
To understand deep networks and how they should be better designed, we must start with the objective of representation learning. In previous chapters, we have argued that the objective is to identify the intrinsically low-dimensional data distribution and then transform it to a compact and structured (say piecewise linear) representation. As we have seen in the previous chapter, the general approach to identifying a low-dimensional data distribution is through a compression process that progressively minimizes the entropy or coding rate of the distribution. However, up to this point, we have been using empirically designed deep networks to model or approximate the operations that aim to optimize these objectives, such as the score function for denoising (in Section 1.3.1) or the transformation that maximizes the rate reduction (in Section 3.4.3).
As we have argued in the previous chapter, Section 3.4 in particular, one can measure the goodness of the resulting representation by the information of the representation gained relative to a “lazy” representation which models all data as one big Gaussian (a model we have seen in the previous chapter as one particular choice of interpretation of the sampled dataset). In particular, if we use a mixture of Gaussians (subspaces), which we have studied thoroughly in the previous chapter, as prototypical distributions to approximate the nonlinear distribution of interest, then we can efficiently measure the coding rate of such a representation using the (sum of) rate distortion functions of the associated Gaussians. The amount of information gained, or (relative) entropy reduced, by such a modeling can then be measured by the difference between the coding rate for the lazy representation and that for the more refined representation. The objective of representation learning is then to maximize this information gain, also known as the rate reduction objective.
As we will see in this chapter, once the objective of representation learning is clear, the role of a deep neural network is precisely to help optimize the objective iteratively. Each layer of a deep neural network can be naturally derived as an iterative optimization step to incrementally maximize the information gain, including the popular architectures of ResNet, CNN, and Transformer, and other more advanced variants. In particular, this chapter aims to answer the following questions about deep networks:
Section 4.1 — given a measure of goodness for a learned representation, how to construct the nonlinear mapping from the data to the optimal representation via unrolled optimization for the objective?
Section 4.2 — can the above unrolling approach provide a principled interpretation of the popular transformer architectures, and if so, what are the associated objective and optimization mechanisms?
Section 4.3 — how would this framework guide us to design more efficient or more parsimonious deep architectures?
Now, if we agree that maximizing the rate reduction or information gain leads to the desired representation as discussed in Section 3.4, the remaining question is how to construct and learn a (nonlinear) mapping from the data to the optimal representation . This involves designing a network architecture and learning algorithm that can effectively capture the underlying structures in the data and faithfully realize the optimal representation.
In the previous chapter, we presented the rate reduction objective (3.4.12) as a principled objective for learning linear discriminative representations of the data. We have, however, not specified the architecture of the feature mapping for extracting such representations from input data . A straightforward choice is to use a conventional deep network, such as ResNet, for implementing . As we have seen in the example of Figure 3.24, such a choice often leads to decent performance empirically. Nonetheless, there remain several unanswered questions with adopting an arbitrary deep network. Although the learned feature representation is now more interpretable, the network itself is still not. It is unclear why any chosen “black-box” network is able to optimize the desired MCR2 objective at all. The good empirical results (say with a ResNet) do not necessarily justify the particular choice of architectures and operators of the network: Why is a deep layered model even necessary; what do additional layers try to improve or simplify; how wide and deep is adequate; and is there any rigorous justification for the convolutions (in a popular multi-channel form) and nonlinear operators (e.g., ReLU or softmax) used?
In this chapter, we show that using gradient ascent to maximize the rate reduction as defined in (3.4.12) naturally leads to a “white-box” deep network that realizes the desired mapping. All network layers, linear/nonlinear operators, and parameters are explicitly constructed in a purely forward propagation fashion. Moreover, such network architectures resemble existing empirically-designed deep networks, providing principled justifications for their design.
From the previous chapter, we see that to seek a linear discriminative representation (LDR), mathematically, we are essentially seeking a continuous mapping from the data (or from initial features extracted from the data; we will see the necessity of such a feature extraction in the next section) to an optimal representation that maximizes the following coding rate reduction objective:
(4.1.1) |
where is a prescribed quantization error and, for simplicity, we adopt slightly simplified notation compared to Chapter 3.
The question really boils down to: is there a constructive way of finding such a continuous mapping from to ? To this end, let us consider incrementally maximizing the objective as a function of . Although there might be many optimization schemes to choose from, for simplicity we first consider the arguably simplest projected gradient ascent (PGA) scheme (note that we use a subscript on to indicate features in the -th class and a superscript on to indicate all features at the -th iteration or layer):
(4.1.2) |
for some step size , where the iteration starts from the given data . (Again, for simplicity, we here first assume the initial features are the data themselves; here denotes the number of iterations. Hence, the data and the features have the same dimension . This need not be the case, though: as we will see in the next section, the initial features can be some (lifted) features of the data to begin with and could in principle have a different, much higher, dimension; all subsequent iterates then share that dimension.) This scheme can be interpreted as prescribing how one should incrementally adjust the locations of the current features , initialized as the input data , so that the resulting improve the rate reduction , as illustrated in Figure 4.1.
Simple calculation shows that the gradient entails evaluating the following derivatives of the two terms in :
(4.1.3) |
(4.1.4) |
Notice that in the above, the matrix only depends on and it aims to expand all the features to increase the overall coding rate; the matrix depends on features from the -class and aims to compress them to reduce the coding rate of each class. Then the complete gradient is of the form:
(4.1.5) |
For any ,
(4.1.6) |
Notice that is exactly the solution to the ridge regression by all the data points concerned. Therefore, (similarly for ) is approximately (i.e., when is large enough) the projection onto the orthogonal complement of the subspace spanned by columns of . Another way to interpret the matrix is through eigenvalue decomposition of the covariance matrix . Assuming that where , we have
(4.1.7) |
Therefore, the matrix operates on a vector by scaling it so that directions of large variance are shrunk while directions of vanishing variance are preserved. These are exactly the directions (4.1.3) in which we move the features so that the overall volume expands and the coding rate increases, hence the positive sign. To the opposite effect, the directions associated with (4.1.4) are “residuals” of features of each class that deviate from the subspace to which they are supposed to belong. These are exactly the directions in which the features need to be compressed back onto their respective subspaces, hence the negative sign (see Figure 4.2).
Essentially, the linear operations and in the gradient ascent for rate reduction are determined by “auto-regressions” conducted on the training data. The recent renewed understanding of ridge regression in an over-parameterized setting [YYY+20, WX20] indicates that using seemingly redundantly sampled data (from each subspace) as regressors does not lead to overfitting.
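To make these operators concrete, the following minimal numpy sketch computes the expansion operator and the class-wise compression operators from a matrix of features (stored as columns) and their class labels. The parameterization below (a quantization precision and scalings proportional to the ambient dimension and the class sizes) follows the standard ReduNet convention and is meant only as an illustration of the formulas above; the variable names and default values are ours.

```python
import numpy as np

def expansion_op(Z, eps=0.1):
    """Expansion operator E = alpha * (I + alpha * Z Z^T)^{-1} with
    alpha = d / (m * eps^2), where Z is d x m with features as columns.
    E pushes features apart to expand the overall coding rate."""
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    return alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T)

def compression_ops(Z, labels, eps=0.1):
    """Class-wise compression operators C_j and weights gamma_j = m_j / m.
    Each C_j is built from the features of class j only and compresses
    them toward their own subspace."""
    d, m = Z.shape
    Cs, gammas = [], []
    for j in np.unique(labels):
        Zj = Z[:, labels == j]
        mj = Zj.shape[1]
        alpha_j = d / (mj * eps ** 2)
        Cs.append(alpha_j * np.linalg.inv(np.eye(d) + alpha_j * Zj @ Zj.T))
        gammas.append(mj / m)
    return Cs, gammas
```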
Notice that in the above, the gradient ascent considers all the features as free variables. The increment does not yet give a transformation on the entire feature domain . According to equation (4.1.5), the gradient cannot be evaluated at a point whose membership is not known, as illustrated in Figure 4.1. Hence, in order to find the optimal explicitly, we may consider constructing a small increment transform on the -th layer feature to emulate the above (projected) gradient scheme:
(4.1.8) |
such that That is, we need to approximate the gradient flow that locally deforms all (training) features with a continuous mapping defined on the entire feature space . Notice that one may interpret the increment (4.1.8) as a discretized version of a continuous differential equation:
(4.1.9) |
Hence the (deep) network so constructed can be interpreted as a certain neural ODE [CRB+18]. Nevertheless, unlike neural ODEs, where the flow is chosen to have some generic structure, here our flow is designed to emulate the gradient flow of the rate reduction on the feature set (as shown in Figure 4.1):
and its structure is entirely derived and fully determined from this objective, without any other priors or heuristics.
Inspecting the structure of the gradient (4.1.5) suggests that a natural candidate for the increment transform is of the form:
(4.1.10) |
where indicates the probability of belonging to the -th class. The parameters of the increment map depend on two things: first, a set of linear maps represented by and that depend only on statistics of the training features ; and second, the membership of any feature . Notice that on the training samples , for which the memberships are known, the so-defined gives exactly the values of the gradient .
Since we only have the membership for the training samples, the function defined in (4.1.10) can only be evaluated on the training set. To extrapolate to the entire feature space, we need to estimate in its second term. In conventional deep learning, this map is typically modeled as a deep network and learned from the training data, say via back propagation. Nevertheless, our goal here is not yet to learn a precise classifier. Instead, we only need a good enough estimate of the class information in order for to approximate the gradient well.
From the geometric interpretation of the linear maps and given by Remark 4.1, the term can be viewed as (approximately) the projection of onto the orthogonal complement of each class . Therefore, is small if is in class and large otherwise. This motivates us to estimate its membership based on the following softmax function:
(4.1.11) |
Hence, the second term of (4.1.10) can be approximated by this estimated membership:
(4.1.12) |
which we denote as a nonlinear operator on the outputs of the feature through groups of filters: . Notice that the nonlinearity arises due to a “soft” assignment of class membership based on the feature responses to those filters.
Overall, combining (4.1.8), (4.1.10), and (4.1.12), the increment feature transform from to now becomes
(4.1.13) | ||||
with the nonlinear function defined above and collecting all the layer-wise parameters, that is, . Note that features at each layer are always “normalized” by projecting onto the unit sphere , denoted as . The form of the increment in (4.1.13) is illustrated by the diagram in Figure 4.3(a).
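Building on the helpers from the previous sketch, one forward layer of this construction can be sketched as follows: estimate the soft memberships with a softmax as in (4.1.11), take the gradient-style step of (4.1.13), and project each updated feature back onto the unit sphere. The step size and softmax temperature below are illustrative choices, not the values used in the text.

```python
import numpy as np

def membership(z, Cs, lam=500.0):
    """Soft class assignment pi_j(z) proportional to exp(-lam * ||C_j z||),
    in the spirit of (4.1.11)."""
    scores = np.array([-lam * np.linalg.norm(C @ z) for C in Cs])
    scores -= scores.max()                       # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def redunet_layer(Z, labels, eps=0.1, eta=0.5, lam=500.0):
    """One forward layer built from the current (labeled) features Z:
    expand via E, softly compress via the C_j weighted by the estimated
    memberships, then re-normalize each feature onto the sphere."""
    E = expansion_op(Z, eps)                     # from the previous sketch
    Cs, gammas = compression_ops(Z, labels, eps)
    Z_next = np.empty_like(Z)
    for i in range(Z.shape[1]):
        z = Z[:, i]
        pi = membership(z, Cs, lam)
        grad = E @ z - sum(g * p * (C @ z) for g, p, C in zip(gammas, pi, Cs))
        z_new = z + eta * grad
        Z_next[:, i] = z_new / np.linalg.norm(z_new)
    return Z_next
```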
Notice that the increment is constructed to emulate the gradient ascent for the rate reduction . Hence by transforming the features iteratively via the above process, we expect the rate reduction to increase, as we will see in the experimental section. This iterative process, once converged say after iterations, gives the desired feature map on the input , precisely in the form of a deep network, in which each layer has the structure shown in Figure 4.3 left:
(4.1.14) | ||||
As this deep network is derived from maximizing the rate reduction, we call it the ReduNet. Comparing the architecture of the ReduNet with those of popular empirically designed networks such as ResNet and ResNeXt, shown in Figure 4.3, the similarity is somewhat uncanny. Conceptually, the ReduNet could also be used to justify the popular mixture of experts (MoE) architecture [SMM+17], as each parallel channel, , can be viewed as an “expert” trained for each class of objects.
We summarize the training and evaluation of ReduNet in Algorithm 4.1 and Algorithm 4.2, respectively. Notice that all parameters of the network are explicitly constructed layer by layer in a forward propagation fashion. The construction does not need any back propagation! The so-learned features can be directly used for classification, say via a nearest subspace classifier.
To provide some intuition on how ReduNet transforms the features, we provide a simple example with mixed 3D Gaussians and visualize how the features are transformed in Figure 4.5. Consider a mixture of three Gaussian distributions in that is projected onto . We first generate data points for 3 classes: for , , , and . We set , and . Then we project all the data points onto , i.e., . To construct the network (computing for the -th layer), we set the number of iterations/layers , step size , and precision . We do this only to demonstrate that our framework leads to stable deep networks even with thousands of layers. In practice, thousands of layers may not be necessary and one can stop whenever adding new layers gives diminishing returns. For this example, a couple of hundred layers is sufficient. Hence, the clear optimization objective gives a natural criterion for the depth of the network needed.
As shown in Figure 4.5, we observe that after the mapping , samples from the same class are highly compressed and converge to a single cluster, and the angle between two different clusters is approximately , which is well aligned with the optimal solution of the MCR2 loss in . The MCR2 loss of the features at different layers can be found in Figure 4.5(c). Empirically, we find that the constructed ReduNet is able to maximize the MCR2 loss and converges stably: samples from the same class converge to one cluster, and different clusters are orthogonal to each other. Moreover, when sampling new data points from the same distributions, we find that new samples from the same class consistently converge to the same cluster center as the training samples.
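For readers who wish to reproduce a toy version of this experiment, the snippet below generates three noisy clusters on the sphere and pushes them through a stack of the layers sketched above; the cluster means, noise level, and number of layers are illustrative stand-ins for the exact values used in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class, sigma = 3, 100, 0.1
means = np.eye(d)                                   # three well-separated means
X = np.hstack([means[:, [j]] + sigma * rng.standard_normal((d, n_per_class))
               for j in range(3)])
X /= np.linalg.norm(X, axis=0, keepdims=True)       # project data onto the sphere
labels = np.repeat(np.arange(3), n_per_class)

Z = X.copy()
for _ in range(200):                                # a few hundred layers suffice here
    Z = redunet_layer(Z, labels)                    # layer from the sketch above
```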
In the previous section, we derived the layer-wise architecture of a deep network, the ReduNet, using unrolled optimization for the rate reduction objective. Specifically, the compression term in (4.1.1) is designed to compress representations from the same class. However, this formulation does not account for possible domain transformation or deformation of the input data. For instance, shifting an object slightly to the right does not change the semantic label of an image. In this section, we will demonstrate how convolutional layers can be derived by maximizing a rate reduction objective that is invariant to certain domain deformations, such as image rotations and translations.
For many clustering or classification tasks (such as object detection in images), we consider two samples as equivalent if they differ by certain classes of domain deformations or augmentations . Hence, we are only interested in low-dimensional structures that are invariant to such deformations (i.e., iff for all ), which are known to have sophisticated geometric and topological structures and can be difficult to learn precisely in practice, even with rigorously designed CNNs [CW16]. In this framework, this can be formulated in a very natural way: all equivalent instances are to be embedded into the same subspace, so that the subspace itself is invariant to the transformations under consideration.
In many applications, such as serial data or imagery data, the semantic meaning (labels) of the data is invariant to certain transformations (for some group ) [CW16b, ZKR+17]. For example, the meaning of an audio signal is invariant to shifts in time, and the identity of an object in an image is invariant to translation in the image plane. Hence, we prefer that the feature mapping be rigorously invariant to such transformations:
(4.1.15) |
where “” indicates two features belonging to the same equivalence class. Although convolutional operators have become common practice in deep networks for ensuring invariance or equivariance [CW16b], it remains challenging in practice to train an (empirically designed) convolutional network from scratch that can guarantee invariance even to simple transformations such as translation and rotation [AW18, ETT+17]. An alternative approach is to carefully design the convolution filters of each layer so as to ensure translational invariance for a wide range of signals, say using wavelets as in the ScatteringNet [BM13] and follow-up works [WB18]. However, in order to ensure invariance for generic signals, the number of convolutions needed usually grows exponentially with network depth. That is the reason why this type of network cannot be made very deep and usually has only several layers.
Now, we show that the MCR2 principle is compatible with invariance in a natural and precise way: we only need to assign all transformed versions into the same class as the data and map their features all to the same subspace . Hence, all group equivariant information is encoded only inside the subspace, and any classifier defined on the resulting set of subspaces will be automatically invariant to such group transformations. See Figure 4.6 for an illustration of the examples of 1D rotation and 2D translation. Next, we will rigorously show that when the group is circular 1D shifting, the resulting deep network naturally becomes a multi-channel convolution network. Because the so-constructed network only needs to ensure invariance for the given data or their features , the number of convolutions needed actually remains constant through a very deep network, as opposed to the ScatteringNet.
To classify one-dimensional data invariant under shifting, we take to be the group of all circular shifts. Each observation generates a family of shifted copies, which are the columns of the circulant matrix given by
(4.1.16) |
We refer the reader to [KS12] for properties of circulant matrices. For simplicity, let . (Again, to simplify the discussion, we assume for now that the initial features are the data themselves and hence have the same dimension , i.e., . This does not need to be the case, as we will soon see that we need to lift to a higher dimension.) Then what happens if we construct the ReduNet from their circulant families ? That is, we want to compress and map all these into the same subspace by the ReduNet.
Notice that now the data covariance matrix:
(4.1.17) | |||||
associated with this family of samples is automatically a (symmetric) circulant matrix. Moreover, because the circulant property is preserved under sums, inverses, and products, the matrices and are also automatically circulant matrices, whose application to a feature vector can be implemented using circular convolution “”. Specifically, we have the following proposition.
The matrix
(4.1.18) |
is a circulant matrix and represents a circular convolution:
where is the first column vector of and “” is circular convolution defined as
Similarly, the matrices associated with any subsets of are also circular convolutions.
Not only do the first-layer parameters and of the ReduNet represent circular convolutions, but the next-layer features also remain circulant matrices. That is, the incremental feature transform in (4.1.13) applied to all shifted versions of a , given by
(4.1.19) |
is a circulant matrix. This implies that there is no need to construct circulant families from the second layer features as we did for the first layer. By denoting
(4.1.20) |
the features at the next level can be written as
Continuing inductively, we see that all matrices and based on such are circulant, and so are all features. By virtue of the properties of the data, ReduNet has taken the form of a convolutional network, with no need to explicitly choose this structure!
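The following short numerical check illustrates the statement: build the circulant family of a random signal, form the corresponding expansion operator, and verify that it is itself circulant and that applying it amounts to a circular convolution with its first column. The quantization precision is an arbitrary illustrative value.

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
d, eps = 8, 0.1
x = rng.standard_normal(d)
Zx = circulant(x)                                   # all circular shifts of x as columns

alpha = d / (Zx.shape[1] * eps ** 2)
E = alpha * np.linalg.inv(np.eye(d) + alpha * Zx @ Zx.T)

z = rng.standard_normal(d)
e = E[:, 0]                                         # first column of E
conv = np.real(np.fft.ifft(np.fft.fft(e) * np.fft.fft(z)))  # e circularly convolved with z

assert np.allclose(E, circulant(e))                 # E is itself circulant
assert np.allclose(E @ z, conv)                     # applying E = circular convolution
```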
There is one problem though: In general, the set of all circular permutations of a vector gives a full-rank matrix. That is, the “augmented” features associated with each sample (hence each class) typically already span the entire space . For instance, all shifted versions of a delta function can generate any other signal as their (dense) weighted superposition. The MCR2 objective (3.4.12) will not be able to distinguish classes as different subspaces.
One natural remedy is to improve the separability of the data by “lifting” the original signal to a higher-dimensional space, e.g., by taking their responses to multiple filters :
(4.1.21) |
The filters can be pre-designed invariance-promoting filters (for 1D signals like audio, one may consider the conventional short-time Fourier transform (STFT); for 2D images, one may consider 2D wavelets as in the ScatteringNet [BM13]); they can be adaptively learned from the data (say as the principal components of samples, as in the PCANet [CJG+15], or from convolutional dictionary learning [LB19, QLZ19]); or they can be randomly selected, as we do in our experiments. This operation lifts each original signal to a -channel feature, denoted as . Then, we may construct the ReduNet on vector representations of , denoted as . The associated circulant version and its data covariance matrix, denoted as , for all its shifted versions are given as:
(4.1.22) |
where with is the circulant version of the -th channel of the feature . Then the columns of will only span at most a -dimensional proper subspace in . However, this simple lifting operation (if linear) is not sufficient to render the classes separable yet—features associated with other classes will span the same -dimensional subspace. This reflects a fundamental conflict between invariance and linear (subspace) modeling: one cannot hope for arbitrarily shifted and superposed signals to belong to the same class.
One way of resolving this conflict is to leverage additional structure within each class, in the form of sparsity: signals within each class are not generated as arbitrary linear superpositions of some base atoms (or motifs), but only as sparse combinations of them and their shifted versions, as shown in Figure 4.7. More precisely, let denote a matrix with a collection of atoms associated with class , also known as a dictionary; then each signal in this class is sparsely generated as:
(4.1.23) |
for some sparse vector . Signals in different classes are then generated by different dictionaries whose atoms (or motifs) are incoherent from one another. Due to incoherence, signals in one class are unlikely to be sparsely represented by atoms in any other class. Hence all signals can be represented as
(4.1.24) |
where is sparse. (Notice that similar sparse representation models have long been proposed and used for classification purposes in applications such as face recognition, demonstrating excellent effectiveness [WYG+09, WWG+12]. More recently, the convolutional sparse coding model has been proposed by [PRE17] as a framework for interpreting the structures of deep convolutional networks.) There is a vast literature on how to learn the most compact and optimal sparsifying dictionaries from sample data, e.g., [LB19, QLZ19], and subsequently solve the inverse problem to compute the associated sparse code or . Recent studies [QLZ20, QZL+20] even show that under broad conditions the convolutional dictionary learning problem can be solved effectively and efficiently.
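A minimal sketch of this sparse generative model is given below: each signal of a class is produced by circularly convolving the class's atoms with sparse spike trains, i.e., a sparse combination of the atoms and their shifts. The atom length, sparsity level, and random atoms are illustrative assumptions, not choices prescribed by the text.

```python
import numpy as np

def sample_class_signal(atoms, d, sparsity=3, rng=None):
    """Draw one signal as a sum over the class atoms, each circularly convolved
    with a sparse spike train (a sparse combination of the atom and its shifts)."""
    rng = rng or np.random.default_rng()
    x = np.zeros(d)
    for a in atoms:
        spikes = np.zeros(d)
        idx = rng.choice(d, size=sparsity, replace=False)
        spikes[idx] = rng.standard_normal(sparsity)   # sparse coefficients
        a_pad = np.zeros(d)
        a_pad[: a.size] = a                           # zero-pad the short atom
        x += np.real(np.fft.ifft(np.fft.fft(a_pad) * np.fft.fft(spikes)))
    return x

# Example: two classes, each with its own pair of short random atoms.
rng = np.random.default_rng(0)
atoms_class0 = [rng.standard_normal(5) for _ in range(2)]
atoms_class1 = [rng.standard_normal(5) for _ in range(2)]
x0 = sample_class_signal(atoms_class0, d=64, rng=rng)
x1 = sample_class_signal(atoms_class1, d=64, rng=rng)
```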
Nevertheless, for tasks such as classification, we are not necessarily interested in the precise optimal dictionary nor the precise sparse code for each individual signal. We are mainly interested in whether, collectively, the sets of sparse codes for different classes are adequately separable from one another. Under the assumption of the sparse generative model, if the convolution kernels match well with the “transpose” or “inverse” of the above sparsifying dictionaries , also known as the analysis filters [NDE+13, RE14], signals in one class will only have high responses to a small subset of those filters and low responses to the others (due to the incoherence assumption). In practice, though, a sufficiently large number of, say , random filters often suffices to ensure that the extracted -channel features
(4.1.25) |
for different classes have different response patterns to different filters and hence make the different classes separable [CJG+15].
Therefore, in our framework, to a large extent the number of channels (or the width of the network) truly plays the role of the statistical resource, whereas the number of layers (the depth of the network) plays the role of the computational resource. The theory of compressive sensing precisely characterizes how many measurements are needed in order to preserve the intrinsic low-dimensional structures (including separability) of the data [WM21].
The multi-channel responses should be sparse. Hence, to approximate the sparse code , we may apply an entry-wise sparsity-promoting nonlinear thresholding, say , to the above filter outputs by setting low (say, absolute value below ) or negative responses to zero:
(4.1.26) |
Figure 4.8 illustrates the basic ideas. One may refer to [RE14] for a more systematic study of the design of the sparsifying thresholding operator. Nevertheless, here we are not so interested in obtaining the best sparse codes, as long as the codes are sufficiently separable. Hence the nonlinear operator can simply be chosen to be a soft thresholding or a ReLU. These presumably sparse features can be assumed to lie on a lower-dimensional (nonlinear) submanifold of , which can be linearized and separated from the other classes by subsequent ReduNet layers, as illustrated later in Figure 4.9.
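A minimal sketch of this lifting-plus-thresholding step is shown below: circularly convolve the signal with a bank of short random filters and keep only sufficiently strong responses via a shifted ReLU. The number of filters, kernel length, and threshold are illustrative.

```python
import numpy as np

def lift_and_sparsify(x, num_filters=16, ksize=5, tau=0.1, rng=None):
    """Lift a 1-D signal to a multi-channel feature by circular convolution with
    random filters, then apply a sparsity-promoting threshold (a shifted ReLU)."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    channels = []
    for _ in range(num_filters):
        k = np.zeros(d)
        k[:ksize] = rng.standard_normal(ksize)                   # short random kernel
        response = np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(x)))  # circular conv
        channels.append(np.maximum(response - tau, 0.0))          # keep strong responses
    return np.stack(channels)                                     # shape (num_filters, d)
```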
The ReduNet constructed from circulant version of these multi-channel features , i.e., , retains the good invariance properties described above: the linear operators, now denoted as and , remain block circulant, and represent multi-channel 1D circular convolutions. Specifically, we have the following result.
The matrix
(4.1.27) |
is block circulant, i.e.,
where each is a circulant matrix. Moreover, represents a multi-channel circular convolution, i.e., for any multi-channel signal we have
In above, is a multi-channel convolutional kernel with being the first column vector of , and is the multi-channel circular convolution defined as
Similarly, the matrices associated with any subsets of are also multi-channel circular convolutions.
From Proposition 4.2, the shift-invariant ReduNet is by construction a deep convolutional network for multi-channel 1D signals. Notice that even if the initial lifting kernels are separated (4.1.26), the matrix inverse in (4.1.27) for computing (similarly for ) introduces “cross talk” among all channels. Hence, these multi-channel convolutions are in general not depth-wise separable, unlike the Xception nets [Cho17] that were once suggested to simplify multi-channel convolutional neural networks. (It remains open what additional structures on the data would lead to depth-wise separable convolutions.)
The calculation of in (4.1.27) requires inverting a matrix of size , which in general has complexity . Nevertheless, by using the fact that a circulant matrix can be diagonalized by the Discrete Fourier Transform (DFT) matrix, the complexity can be significantly reduced. As shown in [CYY+22], to compute and , we only need to invert blocks in the frequency domain times; hence the overall complexity becomes .
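The single-channel case already illustrates the saving: because the covariance of a circulant family is diagonalized by the DFT, the expansion operator can be obtained directly from the spectrum of the signal, avoiding any dense matrix inverse. The sketch below checks this against the dense construction; the precision value is illustrative.

```python
import numpy as np
from scipy.linalg import circulant

def expansion_kernel_fft(x, eps=0.1):
    """First column of E = alpha*(I + alpha*Z Z^T)^{-1} for Z = circulant(x),
    computed via the DFT in O(d log d) rather than by an O(d^3) matrix inverse."""
    d = x.shape[0]
    alpha = d / (d * eps ** 2)                 # m = d shifted samples
    spectrum = np.abs(np.fft.fft(x)) ** 2      # eigenvalues of Z Z^T
    return np.real(np.fft.ifft(alpha / (1.0 + alpha * spectrum)))

# Sanity check against the dense construction.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
d, eps = x.size, 0.1
alpha = d / (d * eps ** 2)
Z = circulant(x)
E_dense = alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T)
assert np.allclose(circulant(expansion_kernel_fft(x, eps)), E_dense, atol=1e-6)
```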
Following the above derivation, we see that, in order to find a linear discriminative representation (LDR) for multiple classes of signals/images that is invariant to translation, sparse coding, a multi-layer architecture with multi-channel convolutions, nonlinear activations, and frequency-domain (spectral) computation all become necessary components for achieving the objective effectively and efficiently. Figure 4.9 illustrates the overall process of learning such a representation via invariant rate reduction on the input sparse codes.
We next evaluate the empirical performance of the ReduNet on learning rotation-invariant features on the real 10-class MNIST dataset. We impose a polar grid on the image , with its geometric center being the center of the 2D polar grid (as illustrated in Figure 4.10). For each radius , , we can sample pixels with respect to each angle with . Then, given a sample image from the dataset, we represent the image in the (sampled) polar coordinates as a multi-channel signal . The goal here is to learn a rotation-invariant representation, i.e., we expect to learn such that lie in the same subspace, where is the cyclic shift in polar angle. We use training samples ( from each class) and set , for polar sampling. By performing the above sampling in polar coordinates, we obtain the data matrix . For the ReduNet, we set the number of layers/iterations , precision , and step size . Before the first layer, we lift the input by 1D circular convolution with 20 random Gaussian kernels of size 5.
To evaluate the learned representation, each training sample is augmented by 20 of its rotated versions, each shifted with stride 10. We compute the cosine similarities among the augmented training inputs, and the results are shown in Figure 4.11(a). We compare the cosine similarities among the learned features of all the augmented versions, i.e., , and summarize the results in Figure 4.11(b). As we see, the so-constructed rotation-invariant ReduNet is able to map the training data (as well as all its rotated versions) from the 10 different classes into 10 nearly orthogonal subspaces. That is, the learned subspaces are truly invariant to shift transformations in polar angle. Next, we randomly draw another test samples and apply the same augmentation procedure. In Figure 4.11(c), we visualize the MCR2 loss on the -th layer representation of the ReduNet on the training and test datasets. From these results, we find that the constructed ReduNet is indeed able to maximize the MCR2 loss as well as generalize to the test data.
As we have seen in the previous section, we used the problem of classification to provide a rigorous interpretation for the main architectural characteristics of popular deep networks such as the ResNet and the CNN: each layer of such networks can be viewed as imitating a gradient step that increases the rate reduction (or information gain) objective. This perspective also leads to a somewhat surprising fact: the parameters and operators of the layers of such a deep network, the ReduNet, can be computed in a purely forward fashion.
Despite the theoretical and conceptual importance of the ReduNet, several factors limit it from being very practical. First, as we have discussed above, the computational cost of computing the matrix operators of each layer in a forward fashion can be very high. Second, the so-computed operators may not be so effective in optimizing the objective, and it might take thousands of iterations (hence layers). As we have seen in Section 2.3.3 for LISTA, these two issues can be addressed by making those operators learnable and optimizing them via back-propagation (or, perhaps, by a mixture of both forward and backward optimization).
The supervised classification setting in which the ReduNet was derived is also somewhat limiting. In practice, an image might not belong to a single class, as it may contain multiple objects. Hence it is more general to assume that different regions of the image belong to different low-dimensional models (say a Gaussian or a subspace). As we will see, such a generalization leads to an architecture that is both simple and general, unifying the rate reduction and denoising operations that we have seen in the previous chapter. Moreover, the so-obtained architecture resembles the popular Transformer architecture.
We consider a general learning setup associated with real-world signals. Let denote random variables representing our data source. In vision tasks, each is interpreted as a token, typically corresponding to an image patch. In language tasks, each is interpreted as a token embedding, i.e., a continuous vector representation of a discrete token such as a word or subword. (With a slight abuse of terminology, we refer to both the discrete tokens and their associated embeddings simply as tokens throughout this chapter for convenience.) The ’s may have arbitrary correlation structures. We use to denote the random variables that define our representations, where is the representation of the corresponding token .
In transformers, each input sample is typically converted into a sequence of tokens. A token is a basic unit of information derived from the raw input: in natural language processing, tokens are typically words or subwords; in computer vision, they correspond to image patches; and in other modalities, they may represent time steps, spatial locations, or other domain-specific units. A token embedding is a continuous vector representation of a token that serves as the input to a transformer. It maps each token to a point in a high-dimensional space, enabling the model to process symbolic inputs using numerical computation. A token representation is a vector that encodes the semantic or structural information of a token, typically produced by the intermediate or final layers of a transformer. These representations are designed to capture meaningful features of the input that are useful for downstream tasks such as classification, generation, or regression. Please refer to Section 7.2 for more details about these concepts in implementations.
Following the rate reduction framework of Section 4.1, we contend that the goal of representation learning is to find a feature mapping which transforms input tokens with a potentially nonlinear and multi-modal distribution to (piecewise) linearized and compact token representations . While the joint distribution of token representations may be sophisticated (and task-specific), we further contend that it is reasonable and practical to require that the target marginal distribution of individual token representations be highly compressed and structured, amenable to compact coding. In particular, we require the distribution to be a mixture of low-dimensional (say ) Gaussian distributions, such that the -th Gaussian has mean , covariance , and support spanned by the orthonormal basis . We denote by the set of bases of all Gaussians. Hence, to maximize the information gain [MTS22] for the final token representations, we wish to maximize their rate reduction (see Section 3.4.2), i.e.,
(4.2.1) |
Here, the first term is an estimate of the lossy coding rate for the whole set of token representations. More specifically, if we view the token representations as i.i.d. samples from a single zero-mean Gaussian, their lossy coding rate subject to a quantization precision is given as
(4.2.2) |
The second term is an estimate of the lossy coding rate under the codebook , which is given as
(4.2.3) |
The expression (4.2.3) for the coding rate can be viewed as a generalization of the coding rate used in the original rate reduction objective (3.4.13). In particular, the original objective is defined with respect to a set of known membership labels specific to the particular data realization . In contrast, the current objective is defined with respect to subspaces , which are independent of any particular realization but are assumed to support the distribution of token representations. Suppose that a token representation belongs to a subspace and these subspaces are approximately orthogonal to each other, i.e., for all . Then, one can verify that the projections and for all . These orthogonal projections effectively serve as implicit membership labels, identifying the subspace to which each token representation belongs.
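Both coding-rate terms can be written down in a few lines of numpy, as sketched below. The scaling constants inside the log-determinants follow the usual rate-distortion form and may differ slightly from the exact constants in (4.2.2) and (4.2.3); the sparse rate reduction of (4.2.5) is then the difference of the two terms minus an l1 penalty.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z): lossy coding rate of token representations Z (d x n), treating the
    columns as i.i.d. samples from a single zero-mean Gaussian."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z @ Z.T)[1]

def coding_rate_wrt_bases(Z, Us, eps=0.5):
    """R^c(Z; U): sum of coding rates of the projections U_k^T Z onto each
    subspace basis U_k (d x p); the projections act as implicit memberships."""
    _, n = Z.shape
    total = 0.0
    for U in Us:
        P = U.T @ Z                                   # p x n projected tokens
        p = U.shape[1]
        total += 0.5 * np.linalg.slogdet(
            np.eye(p) + (p / (n * eps ** 2)) * P @ P.T)[1]
    return total

def sparse_rate_reduction(Z, Us, lam=0.1, eps=0.5):
    """Relaxed sparse rate reduction: expand globally, compress onto the
    subspaces, and penalize the l1 norm to promote axis-aligned sparsity."""
    return coding_rate(Z, eps) - coding_rate_wrt_bases(Z, Us, eps) - lam * np.abs(Z).sum()
```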
Note that the rate reduction objective (4.2.1) is invariant to arbitrary joint rotations of the representations and subspaces. In particular, optimizing the rate reduction objective may not naturally lead to axis-aligned (i.e., sparse) representations. For instance, consider the three sets of learned representations in Figure 4.12. The coding rate reduction increases from (a) to (b) but, because it is invariant under rotations, remains the same from (b) to (c). Therefore, to ensure the final representations are amenable to more compact coding, we would like to transform the representations (and their supporting subspaces) so that they eventually become sparse (that is, have few nonzero entries) with respect to the standard coordinates of the resulting representation space, as in Figure 4.12(c). Computationally, we may combine the above two goals into a unified objective for optimization:
(4.2.4) |
where denotes a general function class and the norm promotes the sparsity of the final token representations .
In practice, the norm is often relaxed to the norm to improve computational tractability and enable convex optimization techniques [WM22]. Motivated by this, we relax Problem (4.2.4) accordingly, leading to a formulation that remains faithful to the original sparsity objective while being more amenable to efficient algorithms, as follows:
(4.2.5) |
With a slight abuse of terminology, we often refer to this objective function also as the sparse rate reduction.
Although easy to state, each term in the above objective is computationally challenging to optimize [WM22]. Hence it is natural to adopt an approximation approach that realizes the global transformation to optimize (4.2.4) through a concatenation of multiple, say , simple incremental and local operations that push the representation distribution towards the desired parsimonious model distribution:
(4.2.6) |
where is the pre-processing mapping that transforms each input token to the initial token representations . Each incremental forward mapping , or a “layer”, transforms the token distribution to optimize the above sparse rate reduction objective (4.2.4), conditioned on the distribution of its input .
In contrast to other unrolled optimization approaches such as the ReduNet (see Section 4.1), we explicitly model the distribution of at each layer, say as a mixture of linear subspaces or sparsely generated from a dictionary. The model parameters are learned from data (say via backward propagation with end-to-end training). This separation between forward “optimization” and backward “learning” clarifies the mathematical role of each layer as an operator that transforms the distribution of its input, whereas the input distribution is in turn modeled (and subsequently learned) by the parameters of the layer.
Now, we show how to derive these incremental and local operations through an unrolled optimization perspective to solve Problem (4.2.5). Once we decide on using an incremental approach to optimizing Problem (4.2.5), there are a variety of possible choices to achieve the optimization. Given a model for , say a mixture of subspaces , we opt for a two-step alternating minimization method with a strong conceptual basis. First, we compress the tokens via a gradient descent to minimize the coding rate term . Specifically, we take a gradient step on with a learning rate as follows:
(4.2.7) |
Next, we sparsify the compressed tokens, generating via a suitably relaxed proximal gradient step to minimize the remaining term . As we will argue in detail later, we can find such a by solving a sparse representation problem with respect to a dictionary :
(4.2.8) |
In the following, we provide technical details for each of the two steps above and derive efficient updates for their implementation.
For the first step (4.2.7), the gradient of the coding rate is costly to compute, as it involves separate matrix inverses, one for each of the subspaces with basis :
(4.2.9) |
Now, we demonstrate that this gradient can be naturally approximated using a so-called multi-head subspace self-attention (MSSA) operator, which has a similar functional form to the multi-head self-attention operator [VSP+17] with heads (i.e., one for each subspace, coming from each matrix inverse). Here, we approximate the gradient (4.2.9) using the first-order Neumann series (see Exercise 4.2):
(4.2.10) |
In this approximation, we compute the similarity between projected token representations through an auto-correlation among the projected features as and convert it to a distribution of membership with a softmax, namely . Suppose that the union of the subspaces spans the whole space. Then, we have . Hence, (4.2.10) becomes
(4.2.11) |
where MSSA is defined through an SSA operator as follows:
(4.2.12) | |||
(4.2.13) |
Substituting (4.2.11) into (4.2.7) shows that the compression step can be naturally approximated by
(4.2.14) |
The SSA operator in (4.2.12) resembles the attention operator in a typical transformer [VSP+17], except that here the linear operators for the value, key, and query are all set to be the same as the subspace basis, i.e., . Hence, we name it the Subspace Self-Attention (SSA) operator. Then, the whole operator in (4.2.13), formally defined as and called the Multi-Head Subspace Self-Attention (MSSA) operator, aggregates the attention head outputs by averaging using model-dependent weights, similar in concept to the popular multi-head self-attention operator in existing transformer networks. The overall gradient step (4.2.14) resembles multi-head self-attention implemented with a skip connection in transformers.
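The SSA and MSSA operators, together with the resulting compression step, can be sketched as follows. The softmax normalization, the scaling by the subspace dimension over the number of tokens times the squared precision, and the step size are illustrative approximations of (4.2.12) through (4.2.14), not the exact expressions.

```python
import numpy as np

def softmax(A, axis=0):
    A = A - A.max(axis=axis, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=axis, keepdims=True)

def ssa(Z, U):
    """Subspace self-attention for one head: query = key = value = U^T Z."""
    P = U.T @ Z                      # p x n projections onto the subspace
    A = softmax(P.T @ P, axis=0)     # n x n attention among projected tokens
    return P @ A                     # p x n

def mssa(Z, Us, eps=0.5):
    """Multi-head subspace self-attention: lift each head back with U_k and sum."""
    _, n = Z.shape
    p = Us[0].shape[1]
    return (p / (n * eps ** 2)) * sum(U @ ssa(Z, U) for U in Us)

def compression_step(Z, Us, kappa=1.0, eps=0.5):
    """Gradient-style compression update with a skip connection, cf. (4.2.14)."""
    return Z + kappa * mssa(Z, Us, eps)
```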
For the second step of the alternating minimization, we need to minimize . Note that the gradient involves a matrix inverse, and thus a naive proximal gradient scheme (see Section A.1.3) for this problem becomes intractable at large scale. We therefore take a different, simplifying approach to trading off between representational diversity and sparsification: we posit a (complete) incoherent or orthogonal dictionary and seek to sparsify the intermediate iterates with respect to . That is, , where is a sparse encoding of . The dictionary is used to sparsify all tokens simultaneously. By the incoherence assumption, we have . Thus, from (4.2.2) we have
(4.2.15) |
To solve , we optimize the following problem
The above sparse representation program is usually solved by relaxing it to an unconstrained convex program, known as LASSO [WM22]:
(4.2.16) |
In our implementation, we also add a non-negative constraint to , and solve the corresponding non-negative LASSO:
(4.2.17) |
Then, we incrementally optimize Equation 4.2.17 by performing an unrolled proximal gradient descent step, known as an ISTA step [BT09], to give the update:
(4.2.18) | ||||
(4.2.19) |
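The sparsification step is a single proximal gradient (ISTA) update of the non-negative LASSO, initialized at the compressed tokens; in the sketch below the shifted ReLU plays the role of the non-negative soft-thresholding operator, and the step size and sparsity weight are illustrative.

```python
import numpy as np

def ista_step(Z_half, D, step=0.1, lam=0.1):
    """One non-negative ISTA step, cf. (4.2.18): gradient step on the data-fit
    term 0.5 * ||Z_half - D Z||_F^2 at Z = Z_half, then shifted-ReLU threshold."""
    grad = D.T @ (D @ Z_half - Z_half)
    return np.maximum(Z_half - step * grad - step * lam, 0.0)
```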
We now design a white-box transformer architecture, named the Coding RATE Transformer (crate), by unrolling the above updates. By combining the above two steps (4.2.14) and (4.2.18):
Local compression of tokens within a sample towards a mixture-of-subspace structure, leading to the multi-head subspace self-attention block – MSSA;
Global sparsification of token sets across all samples through sparse coding, leading to the sparsification block – ISTA;
we can get the following rate-reduction-based transformer layer, illustrated in Figure 4.13,
(4.2.20) |
Composing multiple such layers following the incremental construction of our representation in (4.2.6), we obtain a white-box transformer architecture that transforms the data tokens towards a compact and sparse union of incoherent subspaces, where is the pre-processing mapping that transforms the input tokens to the first-layer representations . The overall flow of this architecture is shown in Figure 4.14.
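Putting the two sketched steps together gives one such layer, and stacking layers (each with its own codebook, learned in practice by back-propagation) gives the forward map. The sketch below reuses the compression and ISTA helpers from the earlier sketches; all hyperparameters are illustrative.

```python
def crate_layer(Z, Us, D, kappa=1.0, eps=0.5, step=0.1, lam=0.1):
    """One CRATE-style layer: MSSA compression followed by ISTA sparsification."""
    Z_half = compression_step(Z, Us, kappa, eps)   # multi-head subspace attention
    return ista_step(Z_half, D, step, lam)         # sparse coding w.r.t. dictionary D

def crate_forward(Z0, layer_params):
    """Compose layers; layer_params is a list of per-layer (U_[K], D) codebooks."""
    Z = Z0
    for Us, D in layer_params:
        Z = crate_layer(Z, Us, D)
    return Z
```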
In contrast to other unrolled optimization approaches such as the ReduNet [CYY+22], we explicitly model the distribution of each and at each layer, either as a mixture of linear subspaces or as sparsely generated from a dictionary. We introduce the interpretation that, at each layer , the learned bases for the subspaces and the learned dictionaries together serve as a codebook or analysis filter that encodes and transforms the intermediate representations at that layer. Since the input distribution to layer is first modeled by and then transformed by , the input distribution to each layer is different, and so we require a separate codebook at each layer to obtain the most parsimonious encoding. Parameters of these codebooks (i.e., the subspace bases and the dictionaries), heretofore assumed to be perfectly known, are actually learned from data (say, via backward propagation within end-to-end training).
Hence, our methodology features a clear conceptual separation between forward “optimization” and backward “learning” for the so-derived white-box deep neural network. Namely, in its forward pass, we interpret each layer as an operator which, conditioned on a learned model (i.e., a codebook) for the distribution of its input, transforms this distribution towards a more parsimonious representation. In its backward propagation, the codebook of this model, for the distribution of the input to each layer, is updated to better fit a certain (supervised) input-output relationship, as illustrated in Figure 4.15. This conceptual interpretation implies a certain agnosticism of the model representations towards the particular task and loss; in particular, many types of tasks and losses will lead to the models at each layer being well fit, which in turn ensures that the network produces parsimonious representations.
We now present the empirical performance of the proposed crate networks by measuring their top-1 classification accuracy on ImageNet-1K as well as their transfer learning performance on several widely used downstream datasets. We summarize the results in Table 4.1. The transfer learning methodology is to fine-tune with the cross-entropy loss, initializing from the pre-trained networks. As the designed white-box transformer architecture leverages parameter sharing in both the attention block (MSSA) and the nonlinearity block (ISTA), the crate-Base model (22.80 million parameters) has a similar number of parameters to the ViT-Small (22.05 million) [DBK+21], and fewer than 30% of the parameters of an identically configured ViT-Base (86.54 million). From Table 4.1, we find that with a similar number of model parameters, our proposed network achieves similar ImageNet-1K and transfer learning performance to ViT, while having a simple and principled design. Moreover, with the same set of training hyperparameters, we observe promising scaling behavior in crate: we consistently improve the performance by scaling up the model size. To summarize, crate achieves promising performance on real-world large-scale datasets by directly implementing our principled architecture. We will provide more details of the implementation and analysis of the experimental results on image classification in the final application chapter, Chapter 7.
Model | crate-T | crate-S | crate-B | crate-L | ViT-T | ViT-S |
---|---|---|---|---|---|---|
# parameters | 6.09M | 13.12M | 22.80M | 77.64M | 5.72M | 22.05M |
ImageNet-1K | 66.7 | 69.2 | 70.8 | 71.3 | 71.5 | 72.4 |
ImageNet-1K ReaL | 74.0 | 76.0 | 76.5 | 77.4 | 78.3 | 78.4 |
CIFAR10 | 95.5 | 96.0 | 96.8 | 97.2 | 96.6 | 97.2 |
CIFAR100 | 78.9 | 81.0 | 82.7 | 83.6 | 81.8 | 83.2 |
Oxford Flowers-102 | 84.6 | 87.1 | 88.7 | 88.3 | 85.1 | 88.5 |
Oxford-IIIT-Pets | 81.4 | 84.9 | 85.3 | 87.4 | 88.5 | 88.6 |
So far, we hope that we have provided compelling evidence that the role of (popular) deep networks is to realize certain optimization algorithms for minimizing the coding rate (or maximizing the information gain) of the learned representations. However, readers who are familiar with optimization methods might have noticed that the above architectures (the ReduNet or CRATE) correspond to rather basic optimization techniques. They may have plenty of room for improvement in efficiency or effectiveness. Moreover, if we believe the proposed theoretical framework for interpreting deep networks is correct, it should not only help explain existing architectures but also guide us in developing more efficient and effective architectures. In this section, we show that this can be the case: the resulting new architectures are not only fully interpretable but also come with guaranteed correctness and improved efficiency.
In this subsection, we propose a minimalistic transformer architecture consisting of interpretable layers based on the MSSA operator. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. Here, we assume that the initial token representations are sampled from a mixture of low-rank Gaussians perturbed by noise as follows:
Let be a partition of the index set and denote the orthonormal basis of the -th subspace for each . We say that the token representations are sampled from a mixture of noisy low-rank Gaussian distributions if for each ,
(4.3.1) |
where and for all and , and are respectively mutually independent, and is independent of .
This model serves as an idealized framework for approximating token representations in real-world pretrained LLMs. It assumes that the token representations are sampled from a mixture of multiple low-rank Gaussian distributions with noise. Under this model, the goal of representation learning is to compress a set of noisy initial token representations onto their corresponding subspaces. In addition, this model aligns well with two well-established hypotheses about the structure of token representations in pretrained large language models: the “linear representation hypothesis” [JRR+24, PCV24] and the “superposition hypothesis” [EHO+22, YCO+21].
The linear representation hypothesis posits that token representations in LLMs lie in low-dimensional linear subspaces that encode semantic features. Similarly, the superposition hypothesis suggests that these representations can be approximately expressed as a sparse linear combination of these feature vectors. In Definition 4.1, each basis of the subspaces can be interpreted as a set of semantic features, where each feature corresponds to a specific aspect of the token’s meaning. Token representations are then approximately expressed as sparse linear combinations of these subspace bases, capturing the essential semantic components of the token while ignoring irrelevant dimensions.
Now, we show that the MSSA operator (see (4.2.13)) can incrementally denoise token representations generated from the above model. Specifically, we consider for each ,
(4.3.2) |
where is defined in Definition 4.1, is the step size, and is an element-wise operator, such as softmax, ReLU, or other functions. To simplify our development, we assume that the subspaces in Definition 4.1 are orthogonal to each other, i.e., for all . Note that this assumption is not restrictive, as in high-dimensional spaces random low-dimensional subspaces are incoherent to each other with high probability, i.e., [WM21]. (One may straightforwardly generalize our results to non-orthogonal subspaces, with a slightly more sophisticated analysis.)
Now, let the columns of denote the token representations from the -th subspace at the -th layer. To quantify the denoising capability, we define the signal-to-noise ratio (SNR) for each block of the token representations at the -th layer as follows:
(4.3.3) |
To simplify our analysis, we assume that , , and
(4.3.4) |
With the above setup, we now characterize the denoising performance of the MSSA operator.
Let be defined in Definition 4.1 and in (4.3.2) be , where is the softmax function and is an element-wise thresholding function with for each . Suppose that , , and
For sufficiently large , it holds with probability at least that for each ,
(4.3.5) |
This theorem shows that when the initial token representations are sampled from a mixture of low-rank Gaussian distributions with a noise level , each layer of the proposed transformer denoises the token representations at a linear rate. This indicates the MSSA operator’s efficiency in reducing noise across layers. Notably, our theoretical results are well-supported by experimental observations in Figure 4.16. This theorem provides a theoretical foundation for the practical denoising capability of the transformer architecture derived by unrolling (4.3.2).
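The following NumPy sketch illustrates the qualitative behavior described by this theorem under the model of Definition 4.1. It does not implement the exact MSSA operator of (4.2.13); instead, each layer softly assigns tokens to subspaces via a softmax over projection energies and then takes a step that amplifies each token's component within its assigned subspace, leaving the orthogonal noise untouched. The SNR proxy below (in-subspace versus out-of-subspace energy per group) is a stand-in for (4.3.3), whose exact notation is not reproduced here; it should grow roughly geometrically with depth, mirroring the linear-rate denoising.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, p, n_k, sigma, eta, tau = 64, 4, 4, 100, 0.2, 0.5, 0.1
U = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(K)]

# Tokens drawn from the mixture-of-noisy-low-rank-Gaussians model.
Z = np.concatenate(
    [U[k] @ rng.standard_normal((p, n_k)) + sigma * rng.standard_normal((d, n_k))
     for k in range(K)], axis=1)
labels = np.repeat(np.arange(K), n_k)

def snr(Z, labels):
    # SNR proxy per group: energy inside the true subspace vs. energy outside it.
    out = []
    for k in range(K):
        Zk = Z[:, labels == k]
        inside = np.linalg.norm(U[k].T @ Zk)
        outside = np.linalg.norm(Zk - U[k] @ (U[k].T @ Zk))
        out.append(inside / outside)
    return np.array(out)

for layer in range(8):
    # One schematic denoising layer: soft subspace assignment by projection energy,
    # then a step that boosts each token's in-subspace component (skip connection kept).
    logits = np.stack([np.sum((Uk.T @ Z) ** 2, axis=0) for Uk in U]) / tau   # (K, n)
    logits -= logits.max(axis=0, keepdims=True)
    pi = np.exp(logits)
    pi /= pi.sum(axis=0, keepdims=True)                                       # soft memberships
    Z = Z + eta * sum(U[k] @ (U[k].T @ Z) * pi[k] for k in range(K))
    print(layer, np.round(snr(Z, labels), 2))
```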
As discussed above, under this model the goal of representation learning is to compress a set of noisy initial token representations into their corresponding subspaces. In real-world applications, however, where token representations exhibit more complicated structures, the goal becomes finding a compact and structured representation by compressing the token sets.
Now, we formally propose an attention-only transformer architecture. Specifically, by unrolling the iterative optimization steps (4.3.2) as layers of a deep network, we construct the transformer architecture shown in Figure 4.17. Each layer of the proposed architecture consists only of the MSSA operator and a skip connection. For language tasks, we additionally incorporate LayerNorm before the MSSA operator to improve performance. The complete architecture is built by stacking such layers, along with essential task-specific pre-processing and post-processing steps, such as positional encoding, token embedding, and a final task-specific head to adapt to different applications.
Generally speaking, the standard decoder-only transformer architecture is composed of the following key components [VSP+17]: (1) positional encoding, (2) multi-head QKV self-attention mechanisms, (3) feed-forward MLP networks, (4) layer normalization, and (5) residual connections. In contrast, our proposed transformer architecture adopts a streamlined design by incorporating several key simplifications. Specifically, it employs shared-QKV subspace self-attention mechanisms, excludes MLP layers, and reduces the frequency of LayerNorm.
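A minimal PyTorch sketch of one such layer may help fix ideas. The class name SubspaceAttentionLayer and the parameters dim, num_heads, head_dim, and eta are ours for illustration; the learnable bases U_k play the role of the shared-QKV subspace projections, and the block is only a schematic stand-in for the exact MSSA operator of (4.2.13).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceAttentionLayer(nn.Module):
    """One layer in the spirit of Figure 4.17: (optional) LayerNorm, then
    multi-head subspace self-attention with a single shared projection per
    head (shared QKV), then a skip connection."""

    def __init__(self, dim, num_heads, head_dim, eta=1.0, use_layernorm=True):
        super().__init__()
        self.U = nn.Parameter(torch.randn(num_heads, dim, head_dim) / dim ** 0.5)
        self.eta = eta
        self.norm = nn.LayerNorm(dim) if use_layernorm else nn.Identity()

    def forward(self, z):                      # z: (batch, seq_len, dim)
        x = self.norm(z)
        out = 0.0
        for Uk in self.U:                      # one "head" per subspace basis U_k
            q = x @ Uk                         # shared Q = K = V projection: (B, n, head_dim)
            attn = F.softmax(q @ q.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
            out = out + (attn @ q) @ Uk.T      # project back to the ambient space
        return z + self.eta * out              # skip connection

layer = SubspaceAttentionLayer(dim=64, num_heads=4, head_dim=8)
tokens = torch.randn(2, 16, 64)
print(layer(tokens).shape)                     # torch.Size([2, 16, 64])
```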
In this subsection, we propose a new transformer attention operator, derived from the coding rate reduction objective, whose computational complexity scales linearly with the number of tokens. Specifically, we derive a novel variational form of the MCR2 objective and show that unrolling gradient descent on this variational objective leads to a new attention module called Token Statistics Self-Attention (TSSA). TSSA has linear computational and memory complexity and radically departs from the typical attention architecture, which computes pairwise similarities between tokens. Recall from (3.4.2) that denotes a stochastic “group assignment” matrix (i.e., and ), where denotes the probability of assigning the -th token to the -th group.
To begin, we consider a general form of MCR2-like objectives based on concave functions of the spectrum of a matrix. Namely, for a given PSD matrix and any scalar we have that , where is the -th largest eigenvalue of . Further, note that is a concave non-decreasing function of . Thus, we describe our results in terms of a more general form of MCR2 based on general spectral functions of PSD matrices of the form , where is concave and non-decreasing. In particular, recall from our above discussion that the attention mechanism arises from unrolling the compression component of MCR2, so we consider a more general MCR2-style compression function:
(4.3.6) |
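For concreteness, the following NumPy snippet checks the log-det identity behind the spectral form just described with f(x) = log(1 + alpha*x), and evaluates one plausible instantiation of a group-wise compression term of the kind in (4.3.6). The normalization by n and the shape convention for the membership matrix Pi are our assumptions, since the exact expression is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, K, alpha = 16, 50, 3, 0.5

Z = rng.standard_normal((d, n))
Pi = rng.dirichlet(np.ones(K), size=n).T        # assumed membership matrix, shape (K, n)

def f(x):
    # A concave, non-decreasing spectral function with f(0) = 0 (the log-det /
    # coding-rate case used throughout the chapter).
    return np.log(1.0 + alpha * x)

def spectral_sum(M):
    # sum_i f(lambda_i(M)) for a PSD matrix M; for this f it equals log det(I + alpha*M).
    return float(np.sum(f(np.linalg.eigvalsh(M))))

# Sanity check: sum_i log(1 + alpha*lambda_i(M)) = log det(I + alpha*M).
M = Z @ Z.T / n
assert np.isclose(spectral_sum(M), np.linalg.slogdet(np.eye(d) + alpha * M)[1])

# One plausible instantiation of a group-wise compression term: apply the spectral
# function to each group's membership-weighted second moment.
R_c = sum(spectral_sum(Z @ np.diag(Pi[k]) @ Z.T / n) for k in range(K))
print(R_c)
```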
For the above objective, we now note the following result:
Let be non-decreasing, concave, and obey , and let have the form . Then for each and , we have
(4.3.7) |
Further, the inequality in (4.3.7) is achieved with equality for any which diagonalizes , and if is strictly concave then the inequality in (4.3.7) is achieved with equality if and only if diagonalizes .
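Since the precise form of (4.3.7) is not reproduced here, the snippet below checks the bound under the natural reading suggested by the equality condition: for concave, non-decreasing f with f(0) = 0, the eigenvalues of a PSD matrix M majorize the diagonal of U^T M U for any orthogonal U, so the sum of f over the eigenvalues is at most the sum of f over that diagonal, with equality when U diagonalizes M.

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha = 12, 0.5
f = lambda x: np.log(1.0 + alpha * x)           # concave, non-decreasing, f(0) = 0

A = rng.standard_normal((d, d))
M = A @ A.T                                      # a random PSD matrix

eigvals, eigvecs = np.linalg.eigh(M)
U_rand, _ = np.linalg.qr(rng.standard_normal((d, d)))   # an arbitrary orthogonal U

lhs = np.sum(f(eigvals))                                 # spectral value: sum_i f(lambda_i(M))
rhs_rand = np.sum(f(np.diag(U_rand.T @ M @ U_rand)))     # variational value for a generic U
rhs_opt = np.sum(f(np.diag(eigvecs.T @ M @ eigvecs)))    # U chosen to diagonalize M

print(lhs <= rhs_rand + 1e-9)      # the bound holds for any orthogonal U
print(np.isclose(lhs, rhs_opt))    # and is tight when U diagonalizes M
```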
Using the above result, we can replace (4.3.6) with an equivalent variational objective of the form
(4.3.8) |
where the equivalence is in the sense that for an optimal choice of matrices as described in Theorem 4.2 (i.e., orthogonal matrices which diagonalize each ), the bound is tight, with . Note that, in general, achieving this bound would require selecting, for each sampled instance of , a new optimal set of parameter matrices which diagonalize each , which is clearly impractical for a network architecture. As an alternative viewpoint, rather than treating the data () as fixed and optimizing the parameters to achieve the tight variational bound, we can take the algorithmic unrolling design principle described above and design an operator that perturbs to incrementally minimize . To make this point explicit, each variational bound becomes tight when the eigenspaces of align with the columns of , so by rotating the appropriate columns of (namely, those corresponding to large entries in ) to align with , we can approach a tight variational bound. That is, instead of rotating to align with the data for each instance of , we can rotate the token features in each to align with .
Following this approach, we compute a gradient descent step on w.r.t. . To begin this computation, first let be any element-wise non-negative vector. Then we have
(4.3.9) |
where is the gradient of , and (recall) applies to each element of the vector in the bracket. In particular, for , is simply a non-linear activation. Also, (recall) . Thus, the gradient of w.r.t. is:
(4.3.10) |
(Note that the constant arises from a constant in each term of the sum.) If we now consider a gradient step w.r.t. the -th token , we arrive at our proposed incremental compression operator, i.e., our surrogate for a self-attention + residual operator:
(4.3.11) |
for each , where is a step size parameter for the incremental optimization. Then, we can construct a layer of TOST, as shown in Figure 4.18.
Given the proposed attention operator in (4.3.11), first recall that the rows of are non-negative and sum to 1, so our operator takes a weighted average of “attention head”-esque operators and then adds a residual connection. Using that , we can rewrite (4.3.11) as:
(4.3.12) |
That is, we can view each attention head as first projecting the token features onto the basis via multiplying by , multiplying by the diagonal matrix (abbreviated as ), projecting back into the standard basis via multiplying by , and subtracting this from the original token features via the residual connection. The core aspect of our attention layer is the computation of . Namely, , so forms a probability distribution over which tokens belong to the group. As a result, estimates the second moment of under the distribution given by . Further, since is a concave non-decreasing function, monotonically decreases towards as increases, so the entries of (which have form ) achieve their maximum at and decay monotonically to as increases.
From this, we arrive at the core interpretation of our attention head + residual operators . Namely, this operator does an approximate low-rank data-dependent projection, where directions which have a large amount of “power” after the projection (i.e., directions which have a large second moment ) are preserved, while directions which do not are suppressed. To see this, recall that the entries of decrease monotonically to 0 as the second moment increases, so for directions with large second moments the attention + residual operator acts largely as the identity operator. Conversely, for directions with a small second moment, our operator subtracts a projection of the tokens along those directions, resulting in those directions being suppressed. Compared to the standard self-attention operator, our method clearly does not compute any pairwise similarities between tokens. Rather, the interactions between the tokens in impact the operator solely through their contribution to the second moment statistic used to construct the ’s. Nevertheless, similar to the standard self-attention operator, our method still has a clear interpretation as performing a form of compression towards a data-dependent low-rank structure, in the sense that it performs an approximate low-rank projection, where the specific directions that are suppressed are those which are not strongly aligned with other tokens in the group.
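The following NumPy sketch of a single attention head + residual operator illustrates this suppression behavior, using f(x) = log(1 + alpha*x) so that the diagonal entries take the form f'(m) = alpha/(1 + alpha*m). The values of alpha and eta and the membership convention (all tokens fully assigned to this head) are our choices for illustration; directions of the tokens with a large second moment along U_k pass through almost unchanged, while weak directions are suppressed.

```python
import numpy as np

rng = np.random.default_rng(4)
d, p, n, eta, alpha = 32, 4, 500, 1.0, 1.0
U_k = np.linalg.qr(rng.standard_normal((d, p)))[0]       # basis of one attention head

# Tokens with a large second moment along the first basis direction of U_k and a
# tiny second moment along the last one.
scales = np.array([5.0, 1.0, 1.0, 0.05])
Z = U_k @ (scales[:, None] * rng.standard_normal((p, n))) + 0.1 * rng.standard_normal((d, n))
pi_k = np.ones(n)                                         # all tokens fully assigned to this head

proj = U_k.T @ Z                                          # coordinates in the head's basis
second_moment = (proj ** 2) @ (pi_k / pi_k.sum())         # per-direction second moment
D_k = alpha / (1.0 + alpha * second_moment)               # f'(second moment): large moment -> small entry
Z_new = Z - eta * (U_k @ (D_k[:, None] * proj)) * pi_k    # attention head + residual

# Strong directions are (almost) preserved; weak directions are suppressed.
before = np.sqrt((U_k.T @ Z) ** 2 @ np.ones(n) / n)
after = np.sqrt((U_k.T @ Z_new) ** 2 @ np.ones(n) / n)
print(np.round(before, 3))
print(np.round(after, 3))
```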
Having introduced our proposed attention operator, we now discuss further practical considerations. First, until this point in the presentation, we have avoided discussion of how tokens are “grouped” into various attention heads via the matrix, but clearly a means of constructing is needed to implement our method. Additionally, our variational form in Theorem 4.2 requires the matrices to be square and orthogonal, but one would ideally like to use smaller matrices (i.e., reduce the number of columns in ) for efficiency as well as drop the orthogonality constraints.
In practice, we do not enforce the orthogonality constraints. To reduce the number of columns in the matrices, we note that, similar to CRATE [YBP+23], if the features within group are (approximately) clustered around a low-dimensional subspace — say of dimension — then the within-group- covariance is low-rank; recall that [YCY+20] shows that the optimal geometry of is for each group to span a low-dimensional subspace orthogonal to the other groups. We can thus explicitly find a low-dimensional orthonormal basis for the image of this covariance, i.e., the linear span of the data in group . If the dimension is , the basis can be represented by a orthogonal matrix . In this case, we can more efficiently upper-bound using these low-rank orthogonal basis matrices. To show this, we use a more general version of Theorem 4.2 to obtain the following corollary.
Let be non-decreasing, concave, and obey , and let have the form . Let , be fixed. Then, for all such that for all , we have
(4.3.13) |
where is formally defined in (4.3.8). Equality holds if diagonalizes for each , and if is strictly concave then this equality condition becomes an “if and only if.”
The final step to define our attention operator is to estimate the group membership . For this we posit a simple model of how each feature deviates from its supporting subspace and then find the optimal subspace assignment. [YBP+23] show that if we independently model each as belonging to a low-dimensional Gaussian mixture model, where each Gaussian has a covariance matrix with identical spectrum and is supported on a subspace with orthonormal basis , plus independent Gaussian noise with covariance , then the posterior probability that each token belongs to each subspace is given by the assignment matrix as follows:
(4.3.14) |
where becomes a learnable temperature parameter. Thus, given an input feature , we estimate using (4.3.14) and then compute the attention operator. Combining the construction of in (4.3.14) with (4.3.11), we obtain the Token Statistics Self-Attention operator:
(4.3.15) |
where are the columns of defined in (4.3.14) and is defined in (4.3.10).
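Putting the pieces together, here is a hedged NumPy sketch of the full operator: memberships are estimated by a softmax over per-token projection energies, which is one plausible reading of the Gaussian-mixture posterior (4.3.14), and then the incremental compression step of (4.3.11) is applied head by head. The temperature tau, step size eta, and alpha in f are illustrative choices; note that no n-by-n pairwise similarity matrix is ever formed, so the cost is linear in the number of tokens.

```python
import numpy as np

def tssa(Z, U, tau=1.0, eta=0.5, alpha=1.0):
    """Sketch of a Token Statistics Self-Attention step: estimate soft
    memberships Pi from projection energies (a plausible reading of (4.3.14)),
    then apply the incremental compression step of (4.3.11) head by head."""
    K = len(U)
    # Membership: posterior-style softmax over heads of per-token projection energy.
    energies = np.stack([np.sum((Uk.T @ Z) ** 2, axis=0) for Uk in U]) / tau   # (K, n)
    energies -= energies.max(axis=0, keepdims=True)
    Pi = np.exp(energies)
    Pi /= Pi.sum(axis=0, keepdims=True)

    Z_new = Z.copy()
    for k in range(K):
        proj = U[k].T @ Z                                       # (p, n)
        second_moment = (proj ** 2) @ (Pi[k] / Pi[k].sum())     # token statistics, shape (p,)
        D_k = alpha / (1.0 + alpha * second_moment)             # f'(second moment)
        Z_new = Z_new - eta * (U[k] @ (D_k[:, None] * proj)) * Pi[k]
    return Z_new

rng = np.random.default_rng(5)
d, p, n, K = 32, 4, 200, 3
U = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(K)]
Z = rng.standard_normal((d, n))
print(tssa(Z, U).shape)                                          # (d, n), linear cost per head
```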
The materials presented in this chapter are based on a series of recent works on this topic, including [CYY+22, WLP+24, WLY+25, WDL+25, YBP+23]. These contributions encompass both theoretical advances and practical methodologies for constructing interpretable deep networks through unrolled optimization. Many of the key results and proofs discussed in this chapter are derived directly from, or inspired by, these foundational works.
The idea of unrolling an optimization algorithm to construct a neural network traces back to the seminal work [GL10]. In this work, the authors demonstrated that sparse coding algorithms—such as the Iterative Shrinkage-Thresholding Algorithm (ISTA)—can be unrolled to form multilayer perceptrons (MLPs), effectively bridging iterative optimization and neural network design. Notably, [MLE19] demonstrated that such unrolled networks are more interpretable, parameter-efficient, and effective compared to generic networks. In this chapter, we build on this perspective to develop principled, white-box deep network architectures by unrolling optimization algorithms that are designed to minimize well-motivated objectives—such as the (sparse) rate reduction objective introduced earlier. This approach not only clarifies the role of each layer in the network but also offers theoretical grounding for architectural choices, moving beyond empirical trial-and-error toward interpretable and goal-driven design. In the following, we compare conventional DNNs, which are typically constructed through empirical design and heuristic tuning, with our mathematically grounded ReduNet architectures:
| | Conventional DNNs | ReduNets |
|---|---|---|
| Objectives | input/output fitting | information gain |
| Deep architectures | trial & error | iterative optimization |
| Layer operators | empirical | projected gradient |
| Shift invariance | CNNs+augmentation | invariant ReduNets |
| Initializations | random/pre-design | forward unrolled |
| Training/fine-tuning | back prop | forward/back prop |
| Interpretability | black box | white box |
| Representations | hidden/latent | incoherent subspaces |
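As a small illustration of the unrolling idea traced back to [GL10] in the notes above, the sketch below writes ISTA for sparse coding as a fixed-depth network: each layer is an affine map followed by a soft-thresholding nonlinearity. The problem sizes, the regularization weight, and the depth are arbitrary; in LISTA-style training, the matrices W and S below would become learnable parameters.

```python
import numpy as np

def soft_threshold(x, theta):
    # Proximal operator of the l1 norm; the "nonlinear activation" of the unrolled network.
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def unrolled_ista(y, A, lam=0.05, num_layers=100):
    """ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1, written as a fixed-depth
    network: each layer applies an affine map followed by soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the smooth part's gradient
    W = A.T / L                              # input-to-layer weights
    S = np.eye(A.shape[1]) - A.T @ A / L     # layer-to-layer (recurrent) weights
    x = np.zeros(A.shape[1])
    for _ in range(num_layers):
        x = soft_threshold(W @ y + S @ x, lam / L)
    return x

rng = np.random.default_rng(6)
A = rng.standard_normal((30, 100)) / np.sqrt(30)
x_true = np.zeros(100)
x_true[rng.choice(100, 5, replace=False)] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(30)
print(np.round(unrolled_ista(y, A)[x_true > 0], 2))   # support entries approximately recovered
```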
Let with for each . For some , let
1. Given any direction , please show that and
where . Hint: Note that
2. Please show that
where the equality holds if and only if for all .
3. Given some , let for each . Please derive the closed-form expression for the first-order critical point of the following function:
Hint: Let . Consider the following singular value decomposition of :
where with and , with being a diagonal matrix, and with and .
Let . If , please show
(4.5.1) |
Hint: The proof consists of two steps.
(i) Step 1: Please show that the infinite series converges when using .
(ii) Step 2: Compute the matrix product .
Please show Corollary 4.1 when .