“Just as the constant increase of entropy is the basic law of the universe, so it is the basic law of life to be ever more highly structured and to struggle against entropy.”
– Václav Havel
The world in which we live is neither fully random nor completely unpredictable (note that there would be no need for an intelligent being to learn or memorize anything if the world were fully random). Instead, it follows certain orders, patterns, and laws, some deterministic and some probabilistic, that make it largely predictable. The very emergence and existence of life depend on a predictable living environment. Only by learning and memorizing what is predictable in the environment can life survive and thrive, since good decisions and actions depend on reliable predictions. Because there seem to be unlimited things that are predictable about the world, intelligent beings, such as animals and humans, have continuously improved, through evolution, their capability to explore and exploit such predictability for an increasingly better life. To this end, they have developed increasingly acute senses, including vision, hearing, touch, taste, and smell, to perceive what is predictable in the external environment from these high-throughput sensory data. Hence a fundamental task for all intelligent beings is to be able to:
learn and memorize predictable information from massive sensed data.
Before we may begin to understand how this is done, we need to consider a few related questions:
How to model and represent such predictable information in the data mathematically?
How can such information be learned effectively and efficiently from the data computationally?
How should such information be best organized for future prediction and inference?
This book aims to provide some answers to these questions. These answers will help us better understand intelligence, especially the computational principles and mechanisms that enable it. There is reason to believe that all forms of intelligence, from low-level intelligence seen in early primitive life to the highest form of intelligence, the practice of modern science, follow the same set of principles and mechanisms. We will elaborate more below.
A necessary condition for the emergence of life on earth about 4 billion years ago was that the earth's environment is largely predictable. Within this environment, life has developed mechanisms that allow it to learn what is predictable about the environment, encode that information in a certain way, and use it for survival. Generally speaking, we call this ability to learn intelligence. To a large extent, the evolution of life is the mechanism of intelligence at work. In nature, intelligence has mainly developed through two types of learning mechanisms: phylogenetic and ontogenetic [Wie61].
Phylogenetic intelligence refers to learning through the evolution of species. Species inherit and survive mainly based on knowledge encoded in the DNA or genes of their parents. To a large extent, we may call DNA nature's pre-trained large models, because they play a very similar role. The main characteristic of phylogenetic intelligence is that individuals do not have much learning capacity. Learning is carried out with a "trial-and-error" mechanism based on random mutation of the genes, and then species evolve through the process of natural selection, survival of the fittest, as shown in Figure 1.1. This can be viewed as nature's way of implementing what is now known as "reinforcement learning." However, such a "trial-and-error" learning process can be extremely slow, costly, and unpredictable: from the emergence of the first life forms, roughly 4.4 to 3.8 billion years ago, life has relied on this form of evolution (astute readers may have noticed an uncanny similarity between how early life evolved and how large language models evolve today).
Ontogenetic intelligence refers to the learning mechanisms that allow an individual to learn through its own senses, memories, and predictions within its specific living environment, and to improve and adapt its behaviors. Ontogenetic learning became possible after the emergence of the nervous system about 550 to 600 million years ago (in worm-like organisms), shown in Figure 1.2 middle. That is, with a sensory and nervous system, an individual is capable of continuously forming and improving its own knowledge about the world, also known as memory, in addition to what is inherited from its DNA or genes. This capability has significantly enhanced the survival of individuals and contributed to the explosion of life forms in the Cambrian period about 530 million years ago. Compared to phylogenetic learning, ontogenetic learning is significantly more efficient and predictable, and it can be realized within the resource limits of an individual's lifespan.
Notice that both types of learning mechanisms rely on some form of feedback from the external environment, in terms of a penalty (death) or reward (food) for the actions of a species or an individual (gene mutations of the species, or actions taken by the individual). As a result, all intelligent beings, as species or as individuals, rely on a closed-loop feedback mechanism to learn and improve their knowledge about the world. We also notice that from plants, to fish, to birds, and to mammals, more advanced species rely more and more on their ontogenetic learning capabilities. They stay with and learn from their parents longer and longer after birth, because individuals of the same species need to survive in very diverse environments.
Since the emergence of homo sapiens about 315 thousand years ago, a new and higher form of intelligence has emerged which evolves more efficiently and economically. Languages, first spoken (it was believed that Sanskrit was the first spoken language, dating back to 5000 BC) and then written (Sumerian is believed to be one of the oldest written languages in existence, first attested about 3100 BC in southern Mesopotamia), were developed a few thousand years ago. See Figure 1.3. This allows individuals to communicate and share useful information with others. Therefore, a human community or society can behave like a single intelligent organism that can learn much faster and hold more knowledge than any individual. In a way, written languages, or texts, play a role similar to DNA and genes, as they allow human societies to accumulate knowledge of the world and pass it on to the next generations. We may refer to this type of intelligence as societal intelligence, to distinguish it from the phylogenetic intelligence of species and the ontogenetic intelligence of individuals. This type of knowledge accumulation serves as the foundation of ancient civilizations.
Quite miraculously, a few thousand years ago another quantum leap in human intelligence occurred, which allowed philosophers and mathematicians to develop knowledge that seems to go far beyond empirical knowledge. The development of abstract mathematical concepts and symbols, such as numbers, space, and time, as well as mathematical logic, serves as a new, precise language for modern science. In addition, humans developed the ability to generate new hypotheses and verify their correctness based on logical deduction or scientific experimentation. This, for the first time, has enabled humans to proactively develop new knowledge beyond passive empirical means. The ability to conduct these high-level forms of knowledge development is believed to be unique to humans. Emulating this advanced form of intelligence with machines is the goal of what is referred to as "artificial intelligence" (AI), a term coined by John McCarthy at the Dartmouth summer workshop in 1956.
Hence, from what we can learn from nature, from now on, whenever we use the word “intelligence,” we need to be very specific about which level/form of intelligence we mean:
phylogenetic (species) $\rightarrow$ ontogenetic (individuals) $\rightarrow$ societal (communities) $\rightarrow$ scientific (humanity).    (1.1.1)
A clear characterization and distinction are necessary and important because we want to study intelligence as a scientific and mathematical subject. It is highly likely that, even if they all share the common objective of learning useful knowledge of the world, the specific computational mechanisms and physical implementations behind each level/form of intelligence could be different. We believe that the reader will better understand and appreciate these differences after having finished studying this book. Therefore, we will leave further discussion of general intelligence to the final chapter, Chapter 8.
In the 1940s, partly due to the war effort, intelligence in nature inspired scientists to emulate animal intelligence with machines, which led to the "Cybernetics" movement advocated by Norbert Wiener. Wiener studied zoology at Harvard as an undergraduate but later became a mathematician and control theorist. He had a lifelong passion for understanding and developing autonomous systems that could emulate the intelligent behaviors of animals. Today, the Cybernetics program is often narrowly interpreted as being mainly about feedback control systems, for which Wiener indeed made his most significant technical contributions. But the Cybernetics program was much broader and deeper than that. It is more about understanding intelligence as a whole (at least at the level of animals) and actually influenced the work of a whole generation of renowned scientists, including Warren McCulloch, Walter Pitts, Claude Shannon, John von Neumann, and Alan Turing.
Wiener was arguably the first person who studied intelligence as a system, instead of paying attention to only one component or aspect of it. His comprehensive views on intelligence were elaborated in his famous 1948 book “Cybernetics: or Control and Communication in the Animal and the Machine” [Wie48]. In this book and its second edition published in 1961 [Wie61] (see Figure 1.4), he tried to identify several necessary characteristics and mechanisms of intelligent systems, which include (but are not limited to):
How to measure and store information (in the brain) and how to communicate it with others. (Norbert Wiener was the first to point out that "information" is neither matter nor energy, but an independent quantity for study. This led to the formulation of information theory and coding theory by Claude Shannon in 1948.)
How to correct errors in prediction and estimation based on existing information. Norbert Wiener himself helped formalize the theory for control systems based on closed-loop feedback in the 1940s.
How to learn to make better decisions from interacting with a potentially non-cooperative opponent or adversarial environment. This was formalized by John von Neumann as game theory in 1944.
In 1943, very much motivated by Wiener's Cybernetics program, the psychiatrist Warren McCulloch and the logician Walter Pitts together formalized the first computational model for a neuron [MP43], called an artificial neuron, as illustrated later in Figure 1.13. Based on this model, in the 1950s, Frank Rosenblatt built a physical machine, named the Mark I Perceptron, with a network of hundreds of such artificial neurons. The Perceptron was the first physically realized artificial neural network; see Figure 1.15. Notably, John von Neumann's universal computer architecture, proposed in 1945, was also designed to facilitate the goal of building computing machines that can physically realize the mechanisms suggested by the Cybernetics program.
Astute readers have probably noticed that the 1940s was truly a magical era: so many fundamental ideas were invented and influential theories formalized in that era, including the mathematical model of neurons, artificial neural networks, information theory, control theory, game theory, and computing machines. Figure 1.5 shows some of the pioneers of these theories. As we now know, each of the works above grew to become the foundation of a scientific or engineering field for the following decades and has had a tremendous impact on us. All these fundamental theories were inspired and motivated by the goal of trying to develop machines that can emulate intelligence in nature. Based on historical notes, Wiener's Cybernetics movement influenced almost all of these people and works. To a large extent, the Cybernetics program laid out by Wiener can be viewed as the true predecessor to the currently very popular "embodied" intelligence program. In fact, Wiener had described the program in much more concrete terms [Wie61].
Although Wiener had identified in his work many key characteristics and mechanisms of (embodied) intelligence, there is no indication that he knew how to properly integrate all these mechanisms together to build a complete autonomous intelligent system. Judging from today's knowledge, some of his views on these mechanisms were not entirely accurate or complete. In particular, in the last chapter of the second edition of Cybernetics [Wie61], he pointed out that it is crucial to deal with nonlinearity if a machine learning system is designed to emulate typical learning mechanisms in nature. But he did not provide any concrete and effective solutions to this difficult issue. In his defense, though, few people knew how at the time, since even the theory for dealing with linear models and systems was still in its infancy.
Nevertheless, we could not help but marvel at Wiener’s foresight about the importance of nonlinearity. As we will see in this book, the answer was found only recently: nonlinearity can be effectively dealt with through progressive linearization and transformation realized by deep neural networks (see Chapter 4). In addition, we will attempt to show in this book how all these mechanisms listed above can be naturally integrated into a complete system which would exhibit characteristics of an autonomous intelligent system (see Chapter 5).
From the subtitle of Wiener’s Cybernetics book: “Control and Communication in the Animal and the Machine”, one can tell that the studies in the 1940s mainly aimed to emulate intelligence at the level of animals. As we mentioned before, the research agendas about intelligence around the 1940s were very much dominated by Norbert Wiener’s Cybernetics movement.
Alan Turing was one of the first to notice this limitation. In his famous 1950 paper "Computing Machinery and Intelligence" [Tur50], Turing formally posed the question of whether machines can imitate intelligence even at the human level, to the point of being indistinguishable from the intelligent capabilities of humans. This is now known as the Turing test.
Around 1955, a group of young and ambitious scientists tried to break away from the then-dominant Cybernetics movement and its research agendas so that they would have a chance to create their own legacy. They decided to take on Turing's challenge of imitating human intelligence and proposed a workshop to be held at Dartmouth College in the summer of 1956. They made their intention clear with a statement in their proposal:
“The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.”
In essence, they wanted to formalize and study the higher-level intelligence that differentiates humans from animals. The topics they considered ranged over abstraction, symbolic methods, natural languages, and deductive methods (including causal inference, logical deduction, etc.). The organizer of the workshop, John McCarthy, then a young assistant professor of Mathematics at Dartmouth College, coined the now famous term "Artificial Intelligence" (AI) to encapsulate the set of characteristics or mechanisms that are believed to be unique to human intelligence.
As the reader may know, in the past decade or so, machine intelligence has undergone explosive development, powered mainly by the practice of deep artificial neural networks, triggered by the work of Geoffrey Hinton and his students in 2012 [KSH12]. This era has also been hailed as the "Renaissance" of Artificial Intelligence (AI). However, in terms of the tasks that people have actually tried to tackle (recognition, generation, and prediction) and the techniques that people have developed and implemented so far (reinforcing, imitating, encoding, decoding, denoising, and compression), we are very much just emulating the mechanisms that are common to the intelligence of early life and animals. Even in that regard, as we will try to clarify in this book, current "AI" models and systems have not correctly implemented all the necessary mechanisms for intelligence at these levels, which were already known to the Cybernetics movement in the 1940s.
Hence, strictly speaking, the advancement of machine intelligence in the past decade does not align well with the "Artificial Intelligence" program laid out at the 1956 Dartmouth workshop. Instead, what has been predominantly accomplished so far is more closely related to the objectives of the classic "Cybernetics" program laid out by Norbert Wiener in the 1940s. It is probably more appropriate to call the current era the "Renaissance of Cybernetics" (the recent rise of so-called "Embodied AI" for autonomous intelligent robots shares even more similarity with the goals of the Cybernetics program). Only after we have fully understood, from a scientific and mathematical perspective, what has truly been accomplished can we know what remains to be done and in which direction to go to pursue the true nature of intelligence. This is one of the main purposes of this book.
Data that carry useful information manifest in many different forms. In the most natural form, they can be modeled as sequences that are predictable and computable. The notion and properties of a predictable and computable sequence were at the heart of the theory of computing and largely led to the invention of computers [Tur36]. The role of predictable sequences in (inductive) inference was studied by Ray Solomonoff, Andrey Kolmogorov, and many others in the 1960s [Kol98] as a generalization of Claude Shannon’s classic Information Theory [Sha48]. To understand the concept of predictable sequences, let us first start with some concrete simple examples.
The simplest predictable discrete sequence is arguably the sequence of natural numbers:
$1,\ 2,\ 3,\ 4,\ 5,\ 6,\ \ldots$    (1.2.1)
in which the next number is defined as its previous number plus 1:
$x_{n+1} = x_n + 1, \quad n = 1, 2, 3, \ldots$    (1.2.2)
One may generalize the notion of predictability to any sequence $\{x_n\}$ if the next number $x_{n+1}$ can always be computed from its previous one $x_n$:
$x_{n+1} = f(x_n),$    (1.2.3)
where $f(\cdot)$ is a computable (scalar) function, meaning it can be implemented as a program on a computer. Note that here we emphasize that the function $f$ must be computable: there are many functions that can be defined but are not computable. Alan Turing's seminal work in 1936 [Tur36] gives a rigorous definition of computability. In practice, we often further assume that $f$ is efficiently computable and has nice properties such as being continuous and differentiable. The necessity of these properties will become clear later, once we understand more refined notions of computability and their roles in machine learning and intelligence.
Of course, the value of the next number can also depend on two of its predecessors. For example, the famous Fibonacci sequence is defined as:
$1,\ 1,\ 2,\ 3,\ 5,\ 8,\ 13,\ 21,\ \ldots$    (1.2.4)
where one can easily see:
$x_{n+1} = x_n + x_{n-1}.$    (1.2.5)
Similarly, we may generalize this recursion to
$x_{n+1} = f(x_n, x_{n-1}),$    (1.2.6)
where $f(\cdot,\cdot)$ is any computable function that takes two variables as input. We can further generalize the notion of predictability to a sequence whose next value depends on, say, $k$ of its predecessors:
$x_{n+1} = f(x_n, x_{n-1}, \ldots, x_{n-k+1}).$    (1.2.7)
The number $k$ of predecessors needed for the prediction is called the degree of the recursive prediction. The above expression (1.2.7) is also called an (auto) regression. Such a sequence is also called an auto-regressive sequence. If the function $f$ is a linear function, we call it a linear (auto) regression.
To simplify the notation, we may define a vector $\boldsymbol{x}_n$ that collects the $k$ consecutive values in the sequence:
$\boldsymbol{x}_n \doteq [x_n, x_{n-1}, \ldots, x_{n-k+1}]^\top \in \mathbb{R}^k.$    (1.2.8)
With this notation, the recursive relation (1.2.7) can be conveniently written as
$\boldsymbol{x}_{n+1} = F(\boldsymbol{x}_n),$    (1.2.9)
where the function $F$ is uniquely determined by the function $f$ in (1.2.7) and takes a $k$-dimensional vector as input. In different contexts, such a vector $\boldsymbol{x}_n$ is sometimes referred to as a "state" or a "token". Note that the equation in (1.2.7) denotes a mapping $f\colon \mathbb{R}^k \to \mathbb{R}$, whereas the equation here is a mapping $F\colon \mathbb{R}^k \to \mathbb{R}^k$.
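To make the recursion and its state-space form concrete, here is a minimal Python sketch (not from the text; the Fibonacci choice of $f$ and the two-dimensional state are purely for illustration) that generates the sequence both ways:

```python
import numpy as np

def f(x_prev, x_prev2):
    # Degree-2 predictive function for the Fibonacci sequence, as in (1.2.6).
    return x_prev + x_prev2

def F(state):
    # State-space form (1.2.9): maps the vector [x_n, x_{n-1}] to [x_{n+1}, x_n].
    x_n, x_nm1 = state
    return np.array([f(x_n, x_nm1), x_n])

state = np.array([1.0, 1.0])      # [x_2, x_1]
sequence = [1.0, 1.0]
for _ in range(8):
    state = F(state)
    sequence.append(state[0])
print(sequence)                    # 1, 1, 2, 3, 5, 8, 13, 21, 34, 55
```

The same pattern applies to any computable $f$ of degree $k$: the state vector simply collects the last $k$ values of the sequence.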
We may also define a predictable sequence $\{\boldsymbol{x}_n\}$ that depends on another predictable sequence $\{\boldsymbol{u}_n\}$ as input:
$\boldsymbol{x}_{n+1} = F(\boldsymbol{x}_n, \boldsymbol{u}_n),$    (1.2.10)
where $\{\boldsymbol{u}_n\}$ is a (computable) predictable sequence. In other words, the next vector $\boldsymbol{x}_{n+1}$ depends on both $\boldsymbol{x}_n$ and $\boldsymbol{u}_n$. In the context of control theory, the sequence $\{\boldsymbol{u}_n\}$ is often referred to as the "control input" and $\{\boldsymbol{x}_n\}$ as the "state" or "output" of the system (1.2.10). One classic example is a linear dynamical system:
$\boldsymbol{x}_{n+1} = A\boldsymbol{x}_n + B\boldsymbol{u}_n,$    (1.2.11)
which is widely studied in control theory [CD91].
Very often the control input $\boldsymbol{u}_n$ is given by a computable function $g$ of the state $\boldsymbol{x}_n$ itself:
$\boldsymbol{u}_n = g(\boldsymbol{x}_n).$    (1.2.12)
As a result, the sequence $\{\boldsymbol{x}_n\}$ is given by composing the two computable functions $F$ and $g$ as:
$\boldsymbol{x}_{n+1} = F(\boldsymbol{x}_n, g(\boldsymbol{x}_n)).$    (1.2.13)
In this way, the sequence again becomes an auto-regressive predictable sequence. When the input $\boldsymbol{u}_n$ depends on the output $\boldsymbol{x}_n$, we say the resulting sequence is produced by a "closed-loop" system (1.2.13). As the closed-loop system no longer depends on any external input, we say such a system has become autonomous. It can be viewed as a special case of auto-regression. For instance, if in the above linear system (1.2.11) we choose $\boldsymbol{u}_n = K\boldsymbol{x}_n$ for some matrix $K$, the closed-loop system becomes
$\boldsymbol{x}_{n+1} = (A + BK)\boldsymbol{x}_n,$    (1.2.14)
which is a linear auto-regression.
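As a quick numerical illustration of the closed-loop construction (a sketch with arbitrarily chosen matrices $A$, $B$ and gain $K$, none of which come from the text), the open-loop update (1.2.11) driven by the feedback law $\boldsymbol{u}_n = K\boldsymbol{x}_n$ produces exactly the same trajectory as the autonomous update (1.2.14):

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # illustrative state matrix
B = np.array([[0.0], [0.1]])              # illustrative input matrix
K = np.array([[-2.0, -1.5]])              # illustrative feedback gain

x_open = np.array([1.0, 0.0])
x_closed = np.array([1.0, 0.0])
for _ in range(50):
    u = K @ x_open                        # feedback law (1.2.12): u_n = K x_n
    x_open = A @ x_open + B @ u           # open-loop update (1.2.11) with feedback input
    x_closed = (A + B @ K) @ x_closed     # autonomous closed-loop update (1.2.14)

print(np.allclose(x_open, x_closed))      # True: the two formulations agree
```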
Predictable sequences have natural counterparts in the continuous case. We may refer to them as predictable processes. Similar to the sequence of natural numbers, the simplest predictable continuous process is time itself: $x(t) = t$.
More generally, we say a process, denoted by $\boldsymbol{x}(t)$, is predictable if at any time $t$, the value of the process at $t + \mathrm{d}t$, where $\mathrm{d}t$ is an infinitesimal increment, is determined by its value at $t$. Typically, the change in value is continuous and smooth, so the increment $\boldsymbol{x}(t+\mathrm{d}t) - \boldsymbol{x}(t)$ is infinitesimally small. Predictable processes are typically described by (multivariate) differential equations:
$\dot{\boldsymbol{x}}(t) = f(\boldsymbol{x}(t)).$    (1.2.15)
In the context of systems theory [CD91, Sas99], the above equation is also known as a state-space model. Similar to the discrete case, a controlled process can be given by:
$\dot{\boldsymbol{x}}(t) = f(\boldsymbol{x}(t), \boldsymbol{u}(t)),$    (1.2.16)
where $\boldsymbol{u}(t)$ is a computable input process.
For example, in physics, Newton's second law of motion describes how to predict the trajectory $\boldsymbol{x}(t)$ of a moving object of mass $m$ under a force input $\boldsymbol{f}(t)$:
$m\,\ddot{\boldsymbol{x}}(t) = \boldsymbol{f}(t).$    (1.2.17)
When there is no force, i.e., $\boldsymbol{f}(t) \equiv \boldsymbol{0}$, the above law reduces to a special case, known as Newton's first law: the object maintains a constant speed in a straight line:
$\dot{\boldsymbol{x}}(t) = \boldsymbol{v}, \quad \text{i.e.,} \quad \boldsymbol{x}(t) = \boldsymbol{x}(0) + \boldsymbol{v}\,t,$    (1.2.18)
for some constant velocity vector $\boldsymbol{v}$.
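Although the continuous case is treated analytically in this book, a predictable process can also be simulated by discretizing (1.2.15); the sketch below uses a simple forward-Euler step (the step size and dynamics are illustrative assumptions) to reproduce the constant-velocity motion of Newton's first law:

```python
import numpy as np

def simulate(f, x0, dt=0.01, steps=1000):
    # Forward-Euler discretization of dx/dt = f(x): x_{n+1} = x_n + dt * f(x_n).
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * f(x)
        traj.append(x.copy())
    return np.array(traj)

# Force-free motion (Newton's first law): state = (position, velocity), velocity constant.
free_motion = lambda x: np.array([x[1], 0.0])
traj = simulate(free_motion, x0=[0.0, 2.0])
print(traj[-1])   # position ~ 2.0 * 10.0 = 20.0, velocity stays 2.0
```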
Now suppose you have observed or have been given many sequence segments:
$\boldsymbol{s}^1,\ \boldsymbol{s}^2,\ \ldots,\ \boldsymbol{s}^m,$    (1.2.19)
all from some predictable sequence $\{x_n\}$. Without loss of generality, we may assume the length of each segment is $L$. So each segment is of the form:
$\boldsymbol{s}^i = [x_{t+1}, x_{t+2}, \ldots, x_{t+L}],$    (1.2.20)
for some $t$. Then you are given a new segment and asked to predict its future values.
One difficulty here is that you normally do not know the function $f$ and the degree $k$ from which the sequence is generated:
$x_{n+1} = f(x_n, x_{n-1}, \ldots, x_{n-k+1}).$    (1.2.21)
So the hope is somehow "to learn" $f$ and $k$ from the given sample segments $\{\boldsymbol{s}^i\}$. Hence the central task of learning to predict is:
Given many sampled segments of a predictable sequence, how to effectively and efficiently identify the predictive function $f$.
To identify the predictive function $f$, we may notice a common characteristic of segments of any predictable sequence given by (1.2.21). If we take a long segment, say with a length $L \gg k$, of the sequence and view it as a vector:
$\boldsymbol{x} = [x_{t+1}, x_{t+2}, \ldots, x_{t+L}]^\top \in \mathbb{R}^L.$    (1.2.22)
Then the set of all such vectors is far from random and hence cannot possibly occupy the entire space $\mathbb{R}^L$. Instead, they essentially have at most $k$ degrees of freedom: given the first $k$ entries of any $\boldsymbol{x}$, the values of the rest of the entries are uniquely determined. In other words, all $\boldsymbol{x}$ lie on a $k$-dimensional surface. In mathematics, such a surface is often called a submanifold, denoted as $\mathcal{M}$.
In practice, if we choose the segment length $L$ to be large enough, then all segments sampled from the same predicting function $f$ lie on a surface $\mathcal{M}$ with an intrinsic dimension $k$, significantly lower than that of the ambient space $\mathbb{R}^L$. For example, suppose the sequence is given by the following linear autoregression:
$x_{n+1} = \alpha\, x_n + \beta\, x_{n-1},$    (1.2.23)
for some constants $\alpha, \beta$. If we sample segments of length $L$ from such a sequence, then all samples lie on a two-dimensional plane or subspace in $\mathbb{R}^L$, as illustrated in Figure 1.6. If we can identify this two-dimensional subspace, the constants $\alpha$ and $\beta$ in (1.2.23) can be fully determined.
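This claim is easy to verify numerically. The following sketch (with arbitrary coefficients $\alpha$, $\beta$ and segment length, chosen only for illustration) stacks many length-$L$ segments of the linear autoregression (1.2.23) into a matrix and inspects its singular values: only two are nonzero, confirming that the segments span a two-dimensional subspace of $\mathbb{R}^L$.

```python
import numpy as np

alpha, beta = 0.9, -0.5       # illustrative coefficients of the recursion (1.2.23)
L, num_segments = 8, 200

rng = np.random.default_rng(0)
segments = []
for _ in range(num_segments):
    x = list(rng.standard_normal(2))          # random initial conditions x_1, x_2
    for _ in range(L - 2):
        x.append(alpha * x[-1] + beta * x[-2])
    segments.append(x)

S = np.linalg.svd(np.array(segments), compute_uv=False)
print(np.round(S, 6))   # only the first two singular values are nonzero:
                        # all length-L segments lie on a 2-dimensional subspace of R^L
```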
It is easy to see that if the predicting function $f$ is linear, such as the case with the linear systems given in (1.2.11) and (1.2.14), the long segments always lie on a certain low-dimensional linear subspace. Identifying the predicting function is largely equivalent to identifying this low-dimensional subspace, a problem generally known as principal component analysis. We will discuss such classic models and methods in Chapter 2.
As it turns out, this is largely true for general predictable sequences as well: if one can identify the low-dimensional surface on which all the segment samples lie, then one can identify the associated predictive function $f$ (under mild conditions, there is a one-to-one mapping between the low-dimensional surface and the function $f$). This fact has been exploited in problems such as system identification, which we will discuss later. We cannot over-emphasize the importance of this property of segments from a predictable sequence: All samples of long segments of a predictable sequence lie on a low-dimensional submanifold. As we will see in this book, all modern learning methods essentially exploit this property, implicitly or explicitly.
In real-world scenarios, the data we observe often do not come from a single predictable sequence. Typically they contain observations of multiple predictable sequences. For example, a video sequence can contain multiple moving objects. It is easy to see that in such scenarios, the data lie on a mixture of multiple low-dimensional linear subspaces or nonlinear submanifolds, as illustrated in Figure 1.7.
Of course, temporal correlation in predictable sequences is not the only reason data are low-dimensional. For example, the space of all images is humongous but most of the space consists of images that resemble structureless random images as shown in Figure 1.8 left. Natural images and videos, however, are highly redundant because there is a strong spatial and temporal correlation among all pixel values. This is the reason why we can easily recognize whether an image is noisy or clean, as shown in Figure 1.8 middle and right. Hence the distribution of natural images has a very low intrinsic dimension (relative to the total number of pixels of an image).
Due to the importance and ubiquity of the task of learning low-dimensional structures, the book "High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications" [WM22] opens with the statement: "The problem of identifying the low-dimensional structure of signals or data in high-dimensional spaces is one of the most fundamental problems that, through a long history, interweaves many engineering and mathematical fields such as system theory, signal processing, pattern recognition, machine learning, and statistics."
Note that by enforcing the observed data point $\boldsymbol{x}$ to be on a low-dimensional surface, we have essentially made the entries of $\boldsymbol{x}$ very dependent on each other and, in some sense, have made the entries very "predictable" from the values of other entries. For example, if we know that the data are constrained to a $d$-dimensional surface in $\mathbb{R}^D$, then it allows us to do many useful things with the data besides prediction:
completion: in general, given more than $d$ entries of a typical sample $\boldsymbol{x}$, the rest of its entries can usually be uniquely determined (prediction becomes a special case of this property).
denoising: suppose the entries of a sample $\boldsymbol{x}$ are perturbed by small noise; the noise can be effectively removed by projecting $\boldsymbol{x}$ back onto the surface.
error correction: suppose a small number of unknown entries of $\boldsymbol{x}$ are arbitrarily corrupted; they can be effectively and efficiently corrected.
Figure 1.9 illustrates these properties with a low-dimensional linear structure: a one-dimensional line in a two-dimensional plane.
In fact, under mild conditions, the above properties are generalizable to many other low-dimensional structures in high-dimensional spaces [WM22]. Interestingly, as we will see in this book, these useful properties such as completion and denoising will inspire effective methods to learn such low-dimensional structures.
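For the simplest case of a known linear subspace, these properties reduce to elementary linear algebra. The sketch below (with a randomly generated basis and dimensions, purely for illustration) denoises by orthogonal projection and completes missing entries by least squares over the observed coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 10, 2
U, _ = np.linalg.qr(rng.standard_normal((D, d)))   # basis of a d-dimensional subspace in R^D
x = U @ rng.standard_normal(d)                     # a clean sample lying on the subspace

# Denoising: project a noisy sample back onto the subspace.
x_noisy = x + 0.05 * rng.standard_normal(D)
x_denoised = U @ (U.T @ x_noisy)
print(np.linalg.norm(x_denoised - x) < np.linalg.norm(x_noisy - x))   # True

# Completion: observe only a subset of entries, recover the rest by least squares.
observed = np.array([0, 2, 5, 7])                  # more than d observed coordinates
coeffs, *_ = np.linalg.lstsq(U[observed], x[observed], rcond=None)
x_completed = U @ coeffs
print(np.allclose(x_completed, x))                 # True: missing entries are recovered
```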
In the above, for simplicity, we have only used the deterministic case to introduce the important notions of predictability and low-dimensionality. Hence, the data (or sampled segments) precisely lie on some geometric structures such as subspaces or surfaces. In practice, as we have alluded to before, there is always a certain level of uncertainty or randomness in the data. In this case, we may assume that the data have a probability distribution, with a probability density function $p(\boldsymbol{x})$. We say that a distribution is "low-dimensional" if its density concentrates around a geometric structure that is rather low-dimensional, say a subspace, a surface, or a mixture of them, as shown in Figure 1.7. Notice that, from a practical point of view, such a density function $p(\boldsymbol{x})$, once learned, can serve as a very powerful prior for estimating $\boldsymbol{x}$ based on a partial, noisy, or corrupted observation, say:
$\boldsymbol{y} = g(\boldsymbol{x}) + \boldsymbol{n},$    (1.2.24)
by computing the conditional estimate $\hat{\boldsymbol{x}} = \mathbb{E}[\boldsymbol{x} \mid \boldsymbol{y}]$ or through sampling the conditional distribution $p(\boldsymbol{x} \mid \boldsymbol{y})$ (modern generative AI technologies such as (conditioned) image generation very much rely on this fact, as we will elaborate on in Chapter 6).
Our discussions above have led to the main and only assumption that this book will make for a deductive approach to understanding intelligence, and deep networks in particular:
The Main Assumption: All that any intelligent system or learning method should and could rely on is that the world is predictable; hence the distributions of observed high-dimensional data samples have low-dimensional supports.
The remaining question is how to learn such low-dimensional structures correctly from the high-dimensional data, via effective and efficient computable means. As we will see in this book, parametric models that were well studied in classic analytical approaches and non-parametric models such as deep networks that are popular in modern practice are merely different means to achieve the same goal.
Note that even if a predictive function is tractable to compute, it does not imply that it is tractable or scalable to learn this function from a number of sampled segments. Of course, one classical approach to ensure the problems are tractable or amenable to efficient solutions is to make explicit assumptions about the family of low-dimensional structures we are dealing with. Historically, due to limited computation and data, simple and idealistic analytical models were always the first to be studied as they often offer efficient closed-form or numerical solutions. In addition, they can provide insights into the more general problems and they often already provide useful solutions to important though limited cases. In the old days when computational resources were scarce, analytical models that permitted efficient closed-form or numerical solutions were the only cases that could be implemented. Linear structures became the first classes of models to be thoroughly studied.
For example, arguably the simplest case is to assume the data is distributed around a single low-dimensional subspace in a high-dimensional space. Or somewhat equivalently, one may assume the data is distributed according to an almost degenerate low-dimensional Gaussian. Identifying such a subspace or Gaussian from a finite number of (noisy) samples is then the classical problem of principal component analysis (PCA) and effective algorithms have been developed for this class of models [Jol02]. One can make the family of models increasingly more complex and expressive. For instance, one may assume the data are distributed around a certain mixture of low-dimensional components (subspaces or low-dimensional Gaussians), as in the case of independent component analysis (ICA) [BJC85], dictionary learning (DL), generalized principal component analysis (GPCA) [VMS05], or the even more general class of sparse low-dimensional models that have been studied extensively in recent years in fields such as compressive sensing [WM22].
Around all these analytical model families, the central problem of study is always how to identify the most compact model within each family that best fits the given data. Below, we give a very brief account of these classical analytical models but leave a more systematic study to Chapter 2. In theory, these analytical models have provided us with tremendous insights into the geometric and statistical properties of low-dimensional structures. They often give us closed-form solutions or efficient and scalable algorithms which are very useful for data whose distributions can be well approximated by such models. More importantly, for more general problems, they provide us with a general sense of how easy or difficult the problem of identifying low-dimensional structures can be, and what the basic ideas are to approach such a problem.
As we have discussed before in Section 1.2.1, a main task of intelligence is to learn what is predictable in sequences of observations. Probably the simplest class of predictable sequences, or signals, are generated via a linear time-invariant (LTI) process:
$y(t) = h * x(t) + n(t) = \int h(\tau)\, x(t - \tau)\, \mathrm{d}\tau + n(t),$    (1.3.1)
where $x(t)$ is the input and $h(t)$ is the impulse response function (normally $h$ is assumed to have certain nice structures, say finite length or a banded spectrum). Here $n(t)$ is some additive noise in the observations. The problem is: given the input process $x(t)$ and observations of the output process $y(t)$, how to find the optimal $h$ such that $h * x$ predicts $y$ in an optimal way. In general, we measure the goodness of the prediction by the minimum mean squared error (MMSE):
$\min_h\ \mathbb{E}\big[\,|y(t) - h * x(t)|^2\,\big].$    (1.3.2)
The optimal solution is referred to as a (denoising) filter. Norbert Wiener, the same person who initiated the Cybernetics movement, studied this problem in the 1940s and gave an elegant closed-form solution known as the Wiener filter [Wie42, Wie49]. This became one of the most fundamental results in the field of Signal Processing.
The idea of denoising or filtering a dynamic process was later extended to a linear time-invariant system described by a (finite-dimensional) state-space model by Rudolf Kalman in the 1960s:
$\boldsymbol{x}_{n+1} = A\boldsymbol{x}_n + B\boldsymbol{u}_n.$    (1.3.3)
The problem is how we can estimate the state of the system from noisy observations of the form:
$\boldsymbol{y}_n = C\boldsymbol{x}_n + \boldsymbol{w}_n,$    (1.3.4)
where $\boldsymbol{w}_n$ is some (white) noise. The optimal causal state estimator (causal means that the estimate $\hat{\boldsymbol{x}}_n$ can only use observations up to the current time step $n$; the Kalman filter is always causal, whereas the Wiener filter need not be) that minimizes the MMSE-type prediction error
$\min\ \mathbb{E}\big[\,\|\boldsymbol{x}_n - \hat{\boldsymbol{x}}_n\|^2\,\big]$    (1.3.5)
is given in closed form by the so-called Kalman filter [Kal60]. This is one of the cornerstones of modern Control Theory because it allows us to estimate the state of a dynamical system from its noisy observations. Then one can subsequently introduce a (linear) state feedback, say of the form $\boldsymbol{u}_n = K\hat{\boldsymbol{x}}_n$, and make the closed-loop system fully autonomous, as we saw in equation (1.2.13).
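To give a flavor of the filter without the matrix bookkeeping, here is a minimal scalar sketch of the Kalman recursion (the system parameters and noise levels are illustrative assumptions, not an example from the text):

```python
import numpy as np

# Scalar system: x_{n+1} = a x_n + w_n,  y_n = x_n + v_n  (illustrative parameters).
a, q, r = 0.95, 0.01, 0.25          # dynamics, process-noise var, observation-noise var
rng = np.random.default_rng(0)

x, x_hat, P = 1.0, 0.0, 1.0         # true state, state estimate, estimate variance
errors = []
for _ in range(200):
    # Simulate the system and a noisy observation.
    x = a * x + np.sqrt(q) * rng.standard_normal()
    y = x + np.sqrt(r) * rng.standard_normal()
    # Predict step.
    x_hat, P = a * x_hat, a * a * P + q
    # Update step with the Kalman gain.
    K = P / (P + r)
    x_hat, P = x_hat + K * (y - x_hat), (1 - K) * P
    errors.append((x - x_hat) ** 2)

print(np.mean(errors) < r)          # filtered error is well below the raw observation noise
```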
To derive the Kalman filter, the system parameters $(A, B, C)$ are assumed to be known. If they are not given in advance, we face a more challenging problem known as system identification: how to learn the parameters from (many samples of) the input sequence $\{\boldsymbol{u}_n\}$ and observation sequence $\{\boldsymbol{y}_n\}$. This is a classic problem in systems theory. If the system is linear, it can be shown that the input and output sequences jointly lie on a certain low-dimensional subspace (whose dimension equals the order of the state-space model (1.3.3)). Hence the identification problem is essentially equivalent to identifying this low-dimensional subspace [VM96, LV09, LV10].
Note that the above problems have two things in common: first, the (noise-free) sequences or signals are assumed to be generated by an explicit family of parametric models; second, these models are essentially all linear. So conceptually, let $\boldsymbol{x}$ be a random variable whose "true" distribution is supported on a low-dimensional linear subspace, say $\mathcal{S}$. To a large extent, the Wiener filter and the Kalman filter both try to estimate such an $\boldsymbol{x}$ from its noisy observations:
$\boldsymbol{y} = \boldsymbol{x} + \boldsymbol{n},$    (1.3.6)
where $\boldsymbol{n}$ is typically random Gaussian noise (or a Gaussian noise process). Hence, essentially, their solutions all rely on identifying a low-dimensional linear subspace that best fits the observed (noisy) data. Then by projecting the data onto this subspace, one obtains the optimal denoising operations, all in closed form.
From the above problems in classical signal processing and system identification, we see that the main task behind all these problems is to learn, from noisy observations, a single low-dimensional linear subspace. Mathematically, we may model such a structure as:
$\boldsymbol{x} = U\boldsymbol{z} + \boldsymbol{n} \in \mathbb{R}^D, \quad U \in \mathbb{R}^{D \times d},\ \boldsymbol{z} \in \mathbb{R}^d,$    (1.3.7)
where $\boldsymbol{n}$ is some small random noise. Figure 1.10 illustrates such a distribution with two principal components.
The problem is to find the subspace basis $U$ from many samples of $\boldsymbol{x}$. A typical approach to estimating the subspace is to minimize the variance of the noise, also known as the minimum mean square error (MMSE):
$\min_U\ \mathbb{E}\big[\,\|\boldsymbol{x} - UU^\top\boldsymbol{x}\|^2\,\big].$    (1.3.8)
Notice that this is essentially a denoising task: once the basis $U$ is correctly found, we can denoise the noisy sample $\boldsymbol{x}$ by projecting it onto the low-dimensional subspace spanned by $U$ as
$\hat{\boldsymbol{x}} = UU^\top\boldsymbol{x}.$    (1.3.9)
If the noise is small and we have learned the correct low-dimensional subspace $U$, we should expect $\hat{\boldsymbol{x}} \approx \boldsymbol{x}$. That is, PCA is a special case of autoencoding:
$\boldsymbol{x} \ \xrightarrow{\ \text{encoder}\ }\ \boldsymbol{z} \ \xrightarrow{\ \text{decoder}\ }\ \hat{\boldsymbol{x}} \approx \boldsymbol{x}.$    (1.3.10)
Only here, because of the simple data structure, the encoder and decoder both become simple linear operators (projecting and lifting).
This is a classic problem in statistics known as principal component analysis (PCA). It was first studied by Pearson in 1901 [Pea01] and later independently by Hotelling in 1933 [Hot33]. This topic is systematically summarized in the classic books [Jol86, Jol02]. In addition, one may explicitly assume the data are distributed according to a single low-dimensional Gaussian:
$\boldsymbol{x} \sim \mathcal{N}\big(\boldsymbol{0},\ UU^\top + \sigma^2 I\big),$    (1.3.11)
which is equivalent to assuming that $\boldsymbol{z}$ in the above PCA model (1.3.7) has a standard normal distribution. This is known as probabilistic PCA [TB99].
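In practice, the linear encoder and decoder in (1.3.9)-(1.3.10) can be computed directly from the singular value decomposition of the data matrix. The following sketch (with a synthetic low-dimensional data set whose dimensions are chosen arbitrarily) illustrates this linear autoencoding:

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, N = 20, 3, 500
U_true, _ = np.linalg.qr(rng.standard_normal((D, d)))          # ground-truth subspace basis
X = U_true @ rng.standard_normal((d, N)) + 0.01 * rng.standard_normal((D, N))

# PCA: the top-d left singular vectors of the data matrix span the estimated subspace.
U_hat, S, _ = np.linalg.svd(X, full_matrices=False)
U_hat = U_hat[:, :d]

Z = U_hat.T @ X          # linear encoder: project onto the subspace
X_hat = U_hat @ Z        # linear decoder: lift back to the ambient space
print(np.mean((X - X_hat) ** 2))   # small residual: x_hat agrees with x up to the noise level
```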
In this book, we will revisit PCA in Chapter 2, from the perspective of learning a low-dimensional distribution. Our goal is to use this simple and idealistic model to convey some of the most fundamental ideas for learning a compact representation for a low-dimensional distribution, including the important notion of compression via denoising and autoencoding for a consistent representation.
Independent component analysis (ICA) was originally proposed by [BJC85] as a classic model for learning a good representation. In fact, it was originally proposed as a simple mathematical model for our memory. The ICA model takes a deceptively similar form to the above PCA model (1.3.7), assuming that the observed random variable $\boldsymbol{x}$ is a linear superposition of multiple independent components $z_i$:
$\boldsymbol{x} = A\boldsymbol{z}, \quad \boldsymbol{z} = [z_1, z_2, \ldots, z_d]^\top.$    (1.3.12)
However, here the components $z_i$ are assumed to be independent non-Gaussian variables. For example, a popular choice is
$z_i = b_i\, v_i, \quad b_i \sim \mathrm{Bernoulli}(\rho),$    (1.3.13)
where $b_i$ is a Bernoulli random variable and $v_i$ could be a constant value or another random variable, say Gaussian (even if $v_i$ is Gaussian, $z_i$ is no longer a Gaussian variable!). The ICA problem is to identify both $A$ and $\boldsymbol{z}$ from observed samples of $\boldsymbol{x}$. Figure 1.11 illustrates the difference between ICA and PCA.
Although the (decoding) mapping from $\boldsymbol{z}$ to $\boldsymbol{x}$ seems linear and easy once $A$ and $\boldsymbol{z}$ are learned, the (encoding) mapping from $\boldsymbol{x}$ to $\boldsymbol{z}$ can be complicated and may not be represented by a simple linear mapping. Hence ICA generally gives an autoencoding of the form:
$\boldsymbol{x} \ \xrightarrow{\ \text{(nonlinear) encoder}\ }\ \boldsymbol{z} \ \xrightarrow{\ \text{(linear) decoder}\ }\ \hat{\boldsymbol{x}} \approx \boldsymbol{x}.$    (1.3.14)
Hence, unlike PCA, ICA is a little more difficult to analyze and solve. In the 1990s, researchers like Erkki Oja and Aapo Hyvärinen [HO97, HO00a] made significant theoretical and algorithmic contributions to ICA. In Chapter 2, we will study and give a solution to ICA from which the encoding mapping will become clear.
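For readers who want to experiment before Chapter 2, the sketch below generates Bernoulli-Gaussian sources as in (1.3.13), mixes them linearly as in (1.3.12), and recovers them with scikit-learn's FastICA (one implementation in the Hyvärinen-Oja line of algorithms; all sizes and parameters here are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
N, d = 2000, 3

# Independent non-Gaussian sources: Bernoulli-Gaussian, as in (1.3.13).
Z = (rng.random((N, d)) < 0.3) * rng.standard_normal((N, d))
A = rng.standard_normal((d, d))           # arbitrary mixing matrix
X = Z @ A.T                               # observations x = A z

ica = FastICA(n_components=d, random_state=0)
Z_hat = ica.fit_transform(X)              # the learned (non-trivial) encoding x -> z
# Recovered sources match the true ones up to permutation and scaling.
corr = np.corrcoef(Z.T, Z_hat.T)[:d, d:]
print(np.sort(np.abs(corr).max(axis=1)))  # values close to 1
```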
As one may see, if the Bernoulli parameter $\rho$ in (1.3.13) is very small, the probability that any of the components $z_i$ is non-zero is small. In this case, we say $\boldsymbol{x}$ is sparsely generated and its distribution concentrates on a set of linear subspaces of low dimension. Hence, to some extent, we may extend the above ICA model to a more general family of low-dimensional structures known as sparse models.
A $k$-sparse model is defined as the set of all $k$-sparse vectors:
$\Sigma_k \doteq \{\,\boldsymbol{z} \in \mathbb{R}^d : \|\boldsymbol{z}\|_0 \le k\,\},$    (1.3.15)
where $\|\cdot\|_0$ is the $\ell^0$ "norm" that counts the number of non-zero entries in a vector $\boldsymbol{z}$. That is, $\Sigma_k$ is the union of all $k$-dimensional subspaces that align with the coordinate axes, as illustrated in Figure 1.7 left. One important problem in classic signal processing and statistics is how to recover a sparse vector $\boldsymbol{z}$ from its linear observations:
$\boldsymbol{y} = A\boldsymbol{z} + \boldsymbol{n}, \quad A \in \mathbb{R}^{m \times d},$    (1.3.16)
where $A$ is given but typically $m \ll d$, and $\boldsymbol{n}$ is some small noise. This seemingly benign problem turns out to be NP-hard to solve and even hard to approximate (see the book [WM22] for details).
So despite a very rich and long history of study, which can be dated back as early as the 18th century [Bos50], there was no provably efficient algorithm to solve this class of problems, although many heuristic algorithms were proposed and developed between the 1960s and the 1990s. Some are rather effective in practice but without any rigorous justification. A major breakthrough came in the early 2000s when a few renowned mathematicians, including David Donoho, Emmanuel Candès, and Terence Tao [Don05, CT05a, CT05], established a rigorous theoretical framework that allows us to characterize precise conditions under which the sparse recovery problem can be solved effectively and efficiently, say via a convex $\ell^1$ minimization:
$\min_{\boldsymbol{z}}\ \|\boldsymbol{z}\|_1 \quad \text{subject to} \quad \|\boldsymbol{y} - A\boldsymbol{z}\|_2 \le \epsilon,$    (1.3.17)
where $\|\cdot\|_1$ is the sparsity-promoting $\ell^1$ norm of a vector and $\epsilon$ is some small constant. Any solution to this problem essentially gives a sparse coding mapping:
$\boldsymbol{y} \ \mapsto\ \hat{\boldsymbol{z}}(\boldsymbol{y}).$    (1.3.18)
We will give a brief account of such an algorithm, hence the mapping, in Chapter 2, which will reveal interesting fundamental connections between sparse coding and deep learning (similarities between algorithms for sparse coding and deep networks were noticed as early as 2010 [GL10]).
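One simple and widely known algorithm for a Lagrangian (penalized) form of the $\ell^1$ problem (1.3.17) is iterative soft-thresholding (ISTA); the sketch below (with arbitrary problem sizes, and not necessarily the algorithm presented in Chapter 2) recovers a sparse vector from a few random linear measurements:

```python
import numpy as np

def ista(A, y, lam=0.05, steps=500):
    # Iterative soft-thresholding for  min_z  0.5*||y - A z||^2 + lam*||z||_1.
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # step size from the Lipschitz constant
    z = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ z - y)
        w = z - step * grad
        z = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft-thresholding
    return z

rng = np.random.default_rng(4)
m, d, k = 60, 200, 5
A = rng.standard_normal((m, d)) / np.sqrt(m)
z_true = np.zeros(d)
z_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
y = A @ z_true

z_hat = ista(A, y)
print(np.linalg.norm(z_hat - z_true) / np.linalg.norm(z_true))   # small relative error
```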
As it turns out, the conditions under which $\ell^1$ minimization succeeds are surprisingly general. The minimum number of measurements $m$ required for a successful recovery is only proportional to the intrinsic dimension $k$ of the data. This is now known as the compressive sensing phenomenon [Can06]. Moreover, this phenomenon is not unique to sparse structures. It also applies to very broad families of low-dimensional structures, such as low-rank matrices. These results have fundamentally changed our understanding of recovering low-dimensional structures in high-dimensional spaces. This dramatic reversal of fortune with high-dimensional data analysis was even celebrated as the "blessing of dimensionality" by David Donoho [Don00], as opposed to the typical pessimistic belief in the "curse of dimensionality" for high-dimensional problems. This coherent and complete body of results has been systematically organized in the book [WM22].
From a computational perspective, one cannot overestimate the significance of this new framework. It has fundamentally changed our views about an important class of problems previously believed to be largely intractable. It has enabled us to develop extremely efficient algorithms that scale gracefully with the dimension of the problem, hence turning the problem of sparse recovery from an intractable combinatorial program into a tractable convex one:
$\min_{\boldsymbol{z}}\ \|\boldsymbol{z}\|_0 \ \ \text{s.t.}\ \ \boldsymbol{y} \approx A\boldsymbol{z} \quad \longrightarrow \quad \min_{\boldsymbol{z}}\ \|\boldsymbol{z}\|_1 \ \ \text{s.t.}\ \ \boldsymbol{y} \approx A\boldsymbol{z}.$    (1.3.19)
These algorithms also come with rigorous theoretical guarantees of their correctness given precise requirements in data and computation. The rigorous and precise nature of this approach is almost opposite to that of practicing deep neural networks, which is largely empirical. Yet, despite their seemingly opposite styles and standards, we now know both approaches share a common goal: trying to pursue low-dimensional structures in high-dimensional spaces.
Conceptually, an even harder problem than the sparse coding problem (1.3.16) is when the observation matrix $A$ is not even known in advance and we need to learn it from a set of (possibly noisy) observations, say $\{\boldsymbol{y}_i\}_{i=1}^{N}$:
$\boldsymbol{y}_i = A\boldsymbol{z}_i + \boldsymbol{n}_i, \quad i = 1, \ldots, N.$    (1.3.20)
Here we are given only the $\boldsymbol{y}_i$'s but not the corresponding $\boldsymbol{z}_i$'s nor the noise terms $\boldsymbol{n}_i$, except for knowing that the $\boldsymbol{z}_i$'s are sparse. This is known as the dictionary learning problem, which can be viewed as a generalization of the ICA problem (1.3.12) discussed earlier. In other words, given that the distribution of the data is the image of a standard sparse distribution under a linear transform $A$, we would like to learn $A$ and its "inverse" mapping such that we obtain an autoencoder:
$\boldsymbol{y} \ \xrightarrow{\ \text{(sparse) encoder}\ }\ \boldsymbol{z} \ \xrightarrow{\ \text{decoder } A\ }\ \hat{\boldsymbol{y}} \approx \boldsymbol{y}.$    (1.3.21)
One can see that what PCA, ICA, and dictionary learning have in common is that they all assume the distribution of the data is supported around low-dimensional linear or mixed linear structures. They all require learning a (global or local) basis of the linear structures from possibly noisy samples of the distribution. In Chapter 2, we will study how to identify low-dimensional structures through these classical models. In particular, we will see an interesting and important fact: all these low-dimensional (piecewise) linear models can be learned effectively and efficiently by the same type of fast algorithms, known as power iteration [ZMZ+20]. Although the above linear or mixed linear models are somewhat too simplistic or idealistic for most real-world data, understanding these models is an important first step toward understanding more general low-dimensional distributions.
The distributions of real-world data such as images, videos, and audio are too complex to be modeled by the above, somewhat idealistic, linear models or Gaussian processes. We normally do not know a priori from which family of parametric models they are generated (although historically there have been many attempts to develop analytical models for such data, such as random fields or stochastic processes for imagery data [MG99], as we have discussed in the previous section). In practice, we typically only have many samples from their distributions, i.e., the empirical distributions. Obviously, in such cases, we normally cannot expect to have any closed-form solutions for their low-dimensional structures, nor for the resulting denoising operators (unlike the cases of PCA, the Wiener filter, and the Kalman filter). So we need to develop a more general solution for these empirical distributions, not necessarily in closed form but at least efficiently computable. If we do this correctly, solutions to the aforementioned linear models should become special cases.
In the 1950s, statisticians became interested in the problem of denoising data drawn from an arbitrary distribution. Let $\boldsymbol{x}$ be a random variable with probability density function $p(\boldsymbol{x})$, and suppose we observe a noisy version $\boldsymbol{y}$ of $\boldsymbol{x}$:
$\boldsymbol{y} = \boldsymbol{x} + \sigma\boldsymbol{n}, \quad \boldsymbol{n} \sim \mathcal{N}(\boldsymbol{0}, I),$    (1.3.22)
where $\boldsymbol{n}$ is standard Gaussian noise and $\sigma$ is the noise level of the observation. Let $p_\sigma(\boldsymbol{y})$ be the probability density function of $\boldsymbol{y}$; that is, $p_\sigma$ is the convolution of $p$ with the density of the Gaussian distribution $\mathcal{N}(\boldsymbol{0}, \sigma^2 I)$. Amazingly, the posterior expectation of $\boldsymbol{x}$ given $\boldsymbol{y}$ can be calculated by an elegant formula, known as Tweedie's formula [Rob56] (Herbert Robbins gave the credit for this formula to Maurice Kenneth Tweedie, based on their personal correspondence):
$\mathbb{E}[\boldsymbol{x} \mid \boldsymbol{y}] = \boldsymbol{y} + \sigma^2\, \nabla_{\boldsymbol{y}} \log p_\sigma(\boldsymbol{y}).$    (1.3.23)
As can be seen from the formula, the function $\nabla_{\boldsymbol{y}} \log p_\sigma(\boldsymbol{y})$ plays a very special role in denoising the observation $\boldsymbol{y}$ here. The noise can be explicitly estimated as
$\widehat{\sigma\boldsymbol{n}} = \boldsymbol{y} - \mathbb{E}[\boldsymbol{x} \mid \boldsymbol{y}] = -\sigma^2\, \nabla_{\boldsymbol{y}} \log p_\sigma(\boldsymbol{y}),$    (1.3.24)
for which we only need to know the distribution of $\boldsymbol{y}$, not the ground truth for $\boldsymbol{x}$. An important implication of this result is that if we add Gaussian noise to any distribution, the denoising process can be done easily if we can somehow get hold of the function $\nabla \log p_\sigma(\boldsymbol{y})$.
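Tweedie's formula (1.3.23) is easy to verify numerically for a toy prior. The sketch below uses a two-point prior $x \in \{-1, +1\}$ (an illustrative assumption), for which the posterior mean is known in closed form, and compares it with $y + \sigma^2 \nabla_y \log p_\sigma(y)$ computed by finite differences:

```python
import numpy as np

sigma = 0.7

def p_sigma(y):
    # Density of y = x + sigma*n with a two-point prior x in {-1, +1}, each with prob 1/2.
    g = lambda m: np.exp(-(y - m) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return 0.5 * (g(-1.0) + g(1.0))

y = np.linspace(-3, 3, 101)
eps = 1e-5
score = (np.log(p_sigma(y + eps)) - np.log(p_sigma(y - eps))) / (2 * eps)   # d/dy log p_sigma
tweedie = y + sigma ** 2 * score                 # right-hand side of (1.3.23)
posterior_mean = np.tanh(y / sigma ** 2)         # E[x | y] computed directly for this prior
print(np.max(np.abs(tweedie - posterior_mean)))  # essentially zero, up to finite-difference error
```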
Because this is such an important and useful result, it has been rediscovered and used in many different contexts and areas. For example, after Tweedie's formula [Rob56], it was rediscovered a few years later by [Miy61]. In the early 2000s, the function $\nabla \log p(\cdot)$ was rediscovered again in the context of learning a general distribution and was named the "score function" by Aapo Hyvärinen [Hyv05]. But its connection to (empirical Bayesian) denoising was soon recognized by [Vin11]. Generalizations to other measurement distributions (beyond Gaussian noise) have been made by Eero Simoncelli's group [RS11], and later applied to image denoising [KS21, HJA20].
In fact, this function has a very intuitive information-theoretic and geometric interpretation. Note that in information theory, $-\log p(\boldsymbol{y})$ generally corresponds to the number of bits needed to encode $\boldsymbol{y}$ (at least in the case of a discrete variable, as we will explain in more detail in Chapter 3). The gradient $\nabla \log p(\boldsymbol{y})$ points in a direction in which the density is higher, as shown in Figure 1.12 left. The number of bits needed to encode $\boldsymbol{y}$ decreases if it moves in that direction. Hence, the overall effect of the denoising operator (1.3.23) is to push the distribution to "shrink" towards areas of higher density. Actually, one can formally show that the (differential) entropy of the distribution
$h(\boldsymbol{y}) \doteq -\int p(\boldsymbol{y}) \log p(\boldsymbol{y})\, \mathrm{d}\boldsymbol{y}$    (1.3.25)
indeed decreases under such an operation (see Chapter 3 and Appendix B). Therefore, if we encode it with an optimal code book, the overall coding length/rate of the resulting distribution is reduced, hence more "compressed." Intuitively, one can imagine that, if we repeat such a denoising process indefinitely, the distribution will eventually shrink to one whose mass is concentrated on a support of lower dimension. For example, the distribution shown on the left of Figure 1.12, under the action of the score function $\nabla \log p$, will eventually converge to the distribution shown on the right, strictly speaking a distribution whose density is a generalized function, i.e., a weighted sum of Dirac delta functions:
$p_\infty(\boldsymbol{x}) = \sum_i w_i\, \delta(\boldsymbol{x} - \boldsymbol{x}_i), \quad \text{with } \sum_i w_i = 1.$    (1.3.26)
Strictly speaking, as the distribution converges to such a degenerate distribution (1.3.26), its differential entropy converges to negative infinity. This is due to a technical difference between the definitions of differential entropy for continuous random variables and entropy for discrete random variables. We will see how to resolve this technical difficulty in Chapter 3, using a more uniform measure of rate distortion.
We will discuss later in this chapter and in Chapter 3 how such a seemingly simple concept of denoising and compression leads to a very unifying and powerful method for learning general low-dimensional distributions in a high-dimensional space, including the distribution of natural images.
In practice, for many important real-world data such as images, sounds, and texts, it is difficult to model them with idealistic linear or mixed linear models. For example, there has been a long and rich history in the fields of image processing and computer vision that tries to model the distribution of natural images analytically. David Mumford, a Fields Medalist, spent considerable effort in the 1990s trying to understand and model the statistics of natural images [Mum96]. He and his students, including Song-Chun Zhu, drew inspiration and techniques from statistical physics and proposed many statistical and stochastic models for the distribution of natural images [ZWM97, ZM97a, ZM97, HM99, MG99, LPM03]. However, these analytical models met with limited success in producing samples that closely resemble natural images. Obviously, for real-world data like images, we need to develop more powerful and unifying methods to pursue their more general low-dimensional structures.
Hence, historically, many empirical models have been proposed to model important real-world data, including images and texts. These models often drew inspiration from the characteristics of biological nervous systems, because the brain of an animal or human seems to process such data extremely efficiently and effectively.
Inspired by the nervous system in the brain, the mathematical model of the first artificial neuron (known as the linear threshold unit, or a perceptron) was proposed by Warren McCulloch (a professor of psychiatry at the University of Chicago at the time) and Walter Pitts in 1943 [MP43]. It describes the relationship between the input $\boldsymbol{x}$ and the output $y$ as:
$y = \varphi(\boldsymbol{w}^\top\boldsymbol{x} + b),$    (1.3.27)
where $\varphi(\cdot)$ is some nonlinear activation, normally modeled by a threshold function. This model is illustrated in Figure 1.13. As we can see, this form already shares the main characteristics of a basic unit in modern deep neural networks. The model is derived from observations of how a single neuron works in our nervous system. However, people did not know exactly what functions a collection of such neurons is meant to realize and perform. On a more technical level, neither were they sure what nonlinear activation function $\varphi$ should be used. Hence, historically, many variants have been proposed (the step function, hard or soft thresholding, the rectified linear unit (ReLU), the sigmoid, etc.).
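In modern notation, the McCulloch-Pitts unit (1.3.27) is a one-line computation. The sketch below (with arbitrary weights and input, purely for illustration) evaluates it with a few of the activation functions mentioned above:

```python
import numpy as np

def neuron(x, w, b, activation):
    # A single artificial neuron (1.3.27): y = activation(w^T x + b).
    return activation(w @ x + b)

step = lambda t: (t > 0).astype(float)          # original threshold unit
relu = lambda t: np.maximum(t, 0.0)             # rectified linear unit, discussed below
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = np.array([0.5, -1.2, 2.0])                  # illustrative input
w = np.array([1.0, 0.3, -0.4])                  # illustrative weights
for phi in (step, relu, sigmoid):
    print(neuron(x, w, b=0.1, activation=phi))
```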
In the 1950s, Frank Rosenblatt was the first to build a machine with a network of such artificial neurons, shown in Figure 1.15. The machine, called the Mark I Perceptron, consists of an input layer, an output layer, and a single hidden layer of 512 artificial neurons, as shown in Figure 1.15 left, similar to what is illustrated in Figure 1.14 left. It was designed to classify optical images of letters. However, the capacity of a single-layer network is limited, and it can only learn linearly separable patterns. In the 1969 book Perceptrons: An Introduction to Computational Geometry by Marvin Minsky and Seymour Papert [MP69], it was shown that the single-layer architecture of the Mark I Perceptron cannot learn an XOR function. This result significantly dampened people's interest in artificial neural networks, even though it was later proven that a multi-layer network is able to learn an XOR function [RHW86]. In fact, a (big enough) multi-layer network, as shown in Figure 1.14 right, consisting of such simple neurons can simulate any finite-state machine, even the universal Turing machine (do not confuse what neural networks are capable of doing in principle with whether it is tractable or easy to learn a neural network that realizes certain desired functions). Nevertheless, the study of artificial neural networks subsequently went into its first winter in the 1970s.
The somewhat disappointing early experimentation with artificial neural networks like the Mark I Perceptron in the 1950s and 1960s suggested that it might not be enough to simply connect neurons in a generic fashion as multi-layer perceptrons (MLPs). In order to build effective and efficient networks, we need to understand what purpose or function the neurons in a network are to achieve collectively, so that they can be organized and learned in a certain special way. Once again, the study of machine intelligence turned to drawing inspiration from how animal nervous systems work.
It is known that most of our brain is dedicated to processing visual information. In the 1950s and 1960s, David Hubel and Torsten Wiesel systematically studied the visual cortices of cats. It was discovered that the visual cortex contains different types of cells (known as simple cells and complex cells), which are sensitive to visual stimuli of different orientations and locations [HW59]. Hubel and Wiesel won the 1981 Nobel Prize in Physiology or Medicine for their ground-breaking discovery.
On the artificial neural network side, Hubel and Wiesel’s work had inspired Kunihiko Fukushima who designed the “neocognitron” network in 1980 which consists of artificial neurons that emulate biological neurons in the visual cortices [Fuk80]. This is known as the first convolutional neural network (CNN), and its architecture is illustrated in Figure 1.16. Unlike the perceptron, the neocognitron had more than one hidden layer and could be viewed as a deep network, as compared in Figure 1.14 right.
Also inspired by how neurons work in the cat's visual cortex, Fukushima was the first to introduce the use of the rectified linear unit (ReLU):
$\mathrm{ReLU}(t) \doteq \max(t, 0)$    (1.3.28)
for the activation function in 1969 [Fuk69]. But not until recent years has the ReLU become a widely used activation function in modern deep (convolutional) neural networks. We will learn from this book why this is a good choice once we explain the main operations that deep networks try to implement: compression.
CNN-type networks continued to evolve in the 1980s, and many variants were introduced and studied. However, despite the remarkable capacity of deep networks and the improved architectures inspired by neuroscience, it remained extremely difficult to train such deep networks for a real task such as image classification. Getting a network to work depended on many unexplained heuristics and tricks, which limited the appeal and applicability of neural networks. One major breakthrough came around 1989, when Yann LeCun successfully used back propagation (BP) to learn a deep convolutional neural network for recognizing hand-written digits [LBD+89], later known as the LeNet (see Figure 1.17). His perseverance paid off: after several years of persistent development, the performance of LeNet became good enough for practical use in the late 1990s [LBB+98], and it was used by the US Post Office for recognizing handwritten digits (for zip codes). LeNet is considered the “prototype” network for all modern deep neural networks, such as the AlexNet and ResNet, which we will discuss later. For this work, Yann LeCun was awarded the 2018 Turing Award, together with two other pioneers of deep networks, Yoshua Bengio and Geoffrey Hinton.
Historically, the fate of deep neural networks has been closely tied to whether they could be trained easily and efficiently. Back propagation (BP) was introduced for this reason. A multi-layer perceptron can be expressed as a composition of a sequence of linear mappings and nonlinear activations as follows:
$f(\bm{x}) \;=\; \sigma\big(\bm{W}_L\, \sigma\big(\bm{W}_{L-1} \cdots \sigma\big(\bm{W}_1 \bm{x}\big)\big)\big).$        (1.3.29)
In order to train the network weights to reduce the prediction/classification error with a gradient descent algorithm, we need to evaluate the gradient of the training loss with respect to the weights $\{\bm{W}_l\}$. It is well known, from the chain rule in calculus, that gradients of functions of this compositional form can be computed efficiently; the resulting procedure is now known as back propagation (BP). See Appendix A for a detailed description. The technique of back propagation was already known and practiced in fields such as optimal control and dynamic programming in the 1960s and 1970s; for example, it appeared in the 1974 PhD thesis of Dr. Paul Werbos [Wer74, Wer94]. In 1986, David Rumelhart et al. were the first to apply back propagation to train a multi-layer perceptron (MLP) network [RHW86]. Since then, BP has become increasingly popular because it gives a scalable algorithm for learning large deep neural networks (it can be efficiently implemented on computing platforms that facilitate parallel and distributed computing), and it is now the dominant technique for training deep neural networks. However, nature is not believed to learn by back propagation, because such a mechanism is far too expensive for physical implementation (as we have discussed earlier, nature almost ubiquitously learns to correct error via closed-loop feedback). This obviously leaves tremendous room for improvement in the future, as we will discuss more later.
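To make the chain-rule bookkeeping behind back propagation concrete, here is a minimal NumPy sketch of one forward and one backward pass through a two-layer network of the form (1.3.29) with a squared-error loss; the architecture, data, and step size are illustrative assumptions, not a prescription from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network: f(x) = W2 * ReLU(W1 x), trained with squared error.
x = rng.standard_normal(4)            # input
y = rng.standard_normal(2)            # target
W1 = rng.standard_normal((8, 4)) * 0.1
W2 = rng.standard_normal((2, 8)) * 0.1

# Forward pass (the compositional form of (1.3.29)).
a = W1 @ x                            # pre-activation
h = np.maximum(a, 0.0)                # ReLU
y_hat = W2 @ h
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: apply the chain rule layer by layer (this is back propagation).
g_yhat = y_hat - y                    # dL/dy_hat
g_W2 = np.outer(g_yhat, h)            # dL/dW2
g_h = W2.T @ g_yhat                   # dL/dh
g_a = g_h * (a > 0)                   # dL/da  (derivative of ReLU)
g_W1 = np.outer(g_a, x)               # dL/dW1

# One gradient-descent step on the weights.
lr = 0.1
W1 -= lr * g_W1
W2 -= lr * g_W2
print("loss before the step:", loss)
```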
However, despite the above algorithmic progress and promising practice in the 1980s, training deep neural networks remained extremely finicky and expensive for the computing systems of the 1980s and 1990s. In the late 1990s, support vector machines (SVMs) [CV95] became very popular, as they were viewed as a better alternative to neural networks for tasks such as classification (in fact, similar ideas for solving classification problems can be traced back to the PhD dissertation work of Thomas Cover, which was condensed and published as a paper in 1964 [Cov64]). There were two main reasons: first, the SVM was based on a rigorous statistical learning framework known as Vapnik–Chervonenkis (VC) theory; second, it led to rather efficient algorithms based on convex optimization [BV04]. The rise of the SVM brought a second winter to the study of neural networks around the early 2000s.
In the late 1980s and 1990s, artificial neural networks were already being used to learn low-dimensional representations of high-dimensional data such as images. It had been shown that neural networks can learn PCA from data [Oja82, BH89], instead of relying on the classic methods discussed in Section 1.3.1. It was also argued in the late 1980s that, owing to their capability to model nonlinear transforms, neural networks could learn low-dimensional representations for data with nonlinear distributions. As in the linear PCA case, one can try to simultaneously learn a nonlinear dimension-reducing encoder $f$ and a decoder $g$, each modeled by a deep neural network [RHW86, Kra91]:
$f:\ \bm{x} \in \mathbb{R}^D \;\mapsto\; \bm{z} \in \mathbb{R}^d, \qquad g:\ \bm{z} \in \mathbb{R}^d \;\mapsto\; \hat{\bm{x}} \in \mathbb{R}^D.$        (1.3.30)
By enforcing the decoded data $\hat{\bm{x}} = g(f(\bm{x}))$ to be consistent with the original $\bm{x}$, say by minimizing an MMSE-type reconstruction error (although MMSE-type errors are known to be problematic for imagery data with complex nonlinear structures; as we will soon discuss, much of the recent work on generative methods, including GANs, has been devoted to finding surrogates for better distance functions between the original data and the regenerated $\hat{\bm{x}}$):
$\min_{f,\, g}\;\; \mathbb{E}\big[\,\|\bm{x} - g(f(\bm{x}))\|_2^2\,\big],$        (1.3.31)
an autoencoder can be learned from the data themselves.
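As a minimal instance of the autoencoding objective (1.3.30)–(1.3.31), the following NumPy sketch trains a linear encoder and decoder by gradient descent on the mean squared reconstruction error; the synthetic data, which lie near a low-dimensional subspace, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N points in R^D lying near a 2-dimensional subspace (plus small noise).
D, d, N = 10, 2, 500
U = np.linalg.qr(rng.standard_normal((D, d)))[0]      # basis of the subspace
X = U @ rng.standard_normal((d, N)) + 0.01 * rng.standard_normal((D, N))

# Linear encoder f(x) = W1 x and decoder g(z) = W2 z.
W1 = 0.1 * rng.standard_normal((d, D))
W2 = 0.1 * rng.standard_normal((D, d))

lr = 0.05
for it in range(2000):
    Z = W1 @ X                                # encode
    X_hat = W2 @ Z                            # decode
    E = X_hat - X
    loss = np.mean(np.sum(E ** 2, axis=0))    # MMSE reconstruction error, cf. (1.3.31)
    # Gradients of the loss with respect to W1 and W2 (chain rule).
    g_W2 = (2.0 / N) * E @ Z.T
    g_W1 = (2.0 / N) * W2.T @ E @ X.T
    W2 -= lr * g_W2
    W1 -= lr * g_W1

print("final reconstruction error:", loss)
```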
But how can we guarantee that such an autoencoding indeed captures the true low-dimensional structures in $\bm{x}$ instead of giving a trivial, redundant representation? For example, one could simply choose both $f$ and $g$ to be the identity map, so that $\bm{z} = \bm{x}$. To ensure the autoencoding is worthwhile, one wishes the resulting representation $\bm{z}$ to be compressive, in terms of a certain computable measure of complexity. In 1993, Geoffrey Hinton and colleagues proposed using the coding length as such a measure; the objective of autoencoding then becomes finding a representation that minimizes the coding length [HZ93]. This work also established a fundamental connection between the principle of minimum description length [Ris78] and free (Helmholtz) energy minimization. Later work [HS06] from Hinton’s group showed empirically that such an autoencoding is capable of learning meaningful low-dimensional representations for real-world images. A more comprehensive survey of autoencoders was given by Pierre Baldi in 2011 [Bal11], just before deep networks became popular. We will discuss measures of complexity and autoencoding further in Section 1.4, and give a systematic, more unified study of compressive autoencoding in Chapter 5.
For nearly thirty years, from the 1980s to the 2010s, neural networks were not taken seriously by the mainstream of machine learning and machine intelligence research. Early (deep) neural networks, such as LeNet, had shown promising performance on small-scale classification problems such as recognizing digits. However, the design and practice of these networks were rather empirical, the datasets available at the time were small, and the BP algorithm was a heavy computational burden for the computers of that era. These factors resulted in a lack of interest in neural networks, and progress stagnated, with only a handful of researchers continuing to work on them.
As it turns out, the tremendous potential of deep neural networks could only be unleashed once enough data and computing power became available. Fast forward to the 2010s: much larger datasets such as ImageNet became available, and GPUs became powerful enough to make BP affordable, even for networks much larger than LeNet. Around 2012, a deep convolutional neural network known as the AlexNet drew attention by surpassing existing classification methods by a significant margin on the ImageNet dataset [KSH12] (in fact, deep networks had earlier demonstrated state-of-the-art performance on speech recognition tasks, but they did not receive as much attention until their success in image classification). Figure 1.18 compares the AlexNet with the LeNet. The AlexNet shares many characteristics with the LeNet; it is mainly bigger and adopts ReLU as the nonlinear activation instead of the sigmoid function used in LeNet. Partly due to the influence of this work, Geoffrey Hinton was awarded the 2018 Turing Award.
Over the next few years, this early success inspired the machine intelligence community to explore new variations of and improvements to the network design. In particular, people discovered empirically that the larger and deeper the network, the better its performance on tasks such as image classification. Many deep network architectures have been tried, tested, and popularized; a few notable ones include VGG [SZ15], GoogLeNet [SLJ+14], ResNet [HZR+16], and, more recently, Transformers [VSP+17]. Despite fast progress in empirical performance, there was a lack of theoretical explanation for these empirically discovered architectures, including the relationships among them, if any. One purpose of this book is to reveal what common objective all these networks may serve and why they share certain common characteristics, including multiple layers of linear operators interleaved with nonlinear activations (see Chapter 4).
The early successes of deep networks were mainly on classification tasks in a supervised learning setting, such as speech recognition and image recognition. Deep networks were later adopted by the DeepMind team, led by Demis Hassabis, to learn decision-making or control policies for playing games. In this context, deep networks are used to model the optimal decision/control policy or the associated optimal value function, as shown in Figure 1.19. The network parameters are incrementally optimized (say, by back propagation) based on the reward returned from the success or failure of playing the game with the current policy. This learning method is generally referred to as reinforcement learning [SB18], which originated from the practice of control systems in the late 1960s [WF65, MM70]. Its earlier roots can be traced back to a much longer and richer history of dynamic programming by Richard Bellman [Bel57] and trial-and-error learning by Marvin Minsky [Min54] in the 1950s.
From an implementation perspective, the combination of deep networks and reinforcement learning turned out to be rather powerful: deep networks can approximate the control policy and value function for real-world environments that are difficult to model analytically. This practice eventually led to the AlphaGo system, developed by the company DeepMind, which surprised the world by beating top human player Lee Sedol at the game of Go in 2016 and then world champion Ke Jie in 2017. (In 1997, IBM’s Deep Blue system made history by defeating Russian grandmaster Garry Kasparov in chess. It mainly relied on traditional search techniques, such as game-tree search and pruning, which do not scale well and had not proven successful for more challenging games such as Go for over twenty years.)
The success of AlphaGo came as a big surprise to the computing community, which had generally believed that the game’s state space is far too large to admit any efficient solution, in terms of both computation and sample size. The only reasonable explanation for its success is that the optimal value/policy functions of the game Go must be highly structured: their intrinsic dimension is not so high, so they can be approximated well by a neural network and learned from a not-so-prohibitive number of samples.
One may view the early practice of deep networks in the 2010s as focused mainly on extracting relevant information from the data $\bm{x}$ and encoding it into a task-specific representation $\bm{y}$ (say, $\bm{y}$ represents class labels in classification tasks):
$f:\ \bm{x} \;\mapsto\; \bm{y}.$        (1.3.32)
In this setting, the mapping $f$ to be learned does not need to preserve most of the distributional information about $\bm{x}$, only the sufficient statistics needed for a specific task. For example, a sample $\bm{x}$ could be an image of an apple, which $f$ maps to its class label “apple.” The information bottleneck framework [TZ15] was proposed in 2015 to analyze the role of deep networks in such a context.
However, in many modern settings, such as the so-called large foundation models, one often needs to decode $\bm{y}$ to recover the corresponding $\bm{x}$ to a certain degree of precision:
$g:\ \bm{y} \;\mapsto\; \hat{\bm{x}} \approx \bm{x}.$        (1.3.33)
As $\bm{x}$ typically represents data observed from the external world, a good decoder would allow us to simulate or predict what happens in the world. For example, in a “text to image” or “text to video” task, $\bm{y}$ normally represents the text that describes the content of a desired image $\bm{x}$. The decoder should be able to generate an $\hat{\bm{x}}$ that has the same content as $\bm{x}$. For example, given the object class “apple,” the decoder should generate an image that looks like an apple, although not necessarily identical to the original $\bm{x}$.
In order for the generated images $\hat{\bm{x}}$ to be similar to the true natural images $\bm{x}$, we need to be able to evaluate and minimize some distance between their distributions:
$\min_{g}\;\; d\big(p_{\bm{x}},\, p_{\hat{\bm{x}}}\big).$        (1.3.34)
As it turns out, most theoretically motivated distances are extremely difficult, if not impossible, to compute and optimize for distributions that live in a high-dimensional space but have a low intrinsic dimension. (This is the case even if a parametric family of distributions for $\hat{\bm{x}}$ is given: the distance often becomes ill-conditioned or ill-defined for distributions with low-dimensional supports, and, even worse, the chosen family might not approximate the true distribution of interest well.)
In 2007, Zhuowen Tu, a former student of Song-Chun Zhu, probably disappointed by early analytical attempts to model and generate natural images (discussed earlier), decided to try a drastically different approach. In a paper published at CVPR 2007 [Tu07], he was the first to suggest that one could learn a generative model for images via a discriminative approach. The idea is simple: if it is difficult to evaluate the distance between $p_{\bm{x}}$ and $p_{\hat{\bm{x}}}$, one could instead learn a discriminator $d$ to separate $\hat{\bm{x}}$ from $\bm{x}$:
$d:\ \bm{x} \;\mapsto\; \{0,\, 1\},$        (1.3.35)
where the two labels indicate whether an image is generated or real. Intuitively, the harder it is to separate $\hat{\bm{x}}$ from $\bm{x}$, the closer the two distributions are likely to be.
Tu’s work [Tu07] was the first to demonstrate the feasibility of learning a generative model via a discriminative approach. However, that work adopted traditional methods (such as boosting) to generate images and to discriminate between distributions, and these were slow and hard to implement. After 2012, deep neural networks became very popular for image classification. In 2014, Ian Goodfellow and colleagues again proposed to generate natural images with a discriminative approach [GPM+14]. They suggested using deep neural networks to model the generator $g$ and the discriminator $d$ instead, and proposed to learn them via a minimax game:
$\min_{g}\,\max_{d}\;\; \mathbb{E}_{\bm{x}}\big[\log d(\bm{x})\big] \;+\; \mathbb{E}_{\bm{z}}\big[\log\big(1 - d(g(\bm{z}))\big)\big],$        (1.3.36)
where the two log-likelihood terms play the role of a natural loss function associated with the binary classification. In words, the discriminator $d$ tries to maximize its success in separating $g(\bm{z})$ from $\bm{x}$, whereas the generator $g$ tries to do the opposite. For this reason, the approach is named generative adversarial networks (GANs). It was shown that, once trained on a large dataset, GANs can indeed generate photo-realistic images. Partly due to the influence of this work, Yoshua Bengio was awarded the 2018 Turing Award.
The discriminative approach seems to be a rather clever way to bypass a fundamental difficulty in distribution learning. Rigorously speaking, however, it does not fully resolve this difficulty. It was shown in [GPM+14] that, with a properly chosen loss, the minimax formulation is mathematically equivalent to minimizing the Jensen–Shannon divergence (see [CT91]) between $p_{\bm{x}}$ and $p_{\hat{\bm{x}}}$, which is known to be a hard problem for two low-dimensional distributions in a high-dimensional space. As a result, in practice GANs typically rely on many heuristics and engineering tricks and often suffer from instability issues such as mode collapse. (Nevertheless, the minimax formulation provides a practical approximation of the distance: it simplifies implementation and avoids certain caveats of computing the distance directly.) Overall, there has been a lack of theoretical guidance on how to improve the practice of GANs.
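For concreteness, here is a toy PyTorch sketch of the adversarial objective (1.3.36) on two-dimensional data standing in for images; the network sizes, learning rates, toy data, and the commonly used non-saturating generator update are illustrative choices, and, as noted above, training real GANs requires many more heuristics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "real" data: 2-D Gaussian samples standing in for natural images.
def sample_real(n):
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, 2.0])

# Generator g: latent noise z -> a 2-D sample; discriminator d: sample -> probability of "real".
g = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
d = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(d.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    x_real = sample_real(64)
    z = torch.randn(64, 8)
    x_fake = g(z)

    # Discriminator update (inner max in (1.3.36)): label real as 1, generated as 0.
    d_loss = bce(d(x_real), torch.ones(64, 1)) + bce(d(x_fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update (outer min), in the non-saturating form commonly used in practice:
    # push generated samples to be classified as real.
    g_loss = bce(d(g(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print("mean of generated samples:", g(torch.randn(1000, 8)).mean(dim=0).detach())
```

Note the `detach()` in the discriminator step: each player's update treats the other's output as fixed, which is exactly the alternating optimization used to approximate the minimax game.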
In 2015, shortly after GANs were introduced and became popular, Surya Ganguli and his students realized that an iterative denoising process modeled by a deep network can be used to learn a general distribution, such as that of natural images [SWM+15]. Their method was inspired by properties of special Gaussian and binomial diffusion processes studied by William Feller back in 1949 [Fel49] (again, in the magical era of the 1940s!).
Soon afterwards, denoising operators based on the score function [Hyv05], briefly introduced in Section 1.3.1, were shown to be more general, unifying the denoising and diffusion processes and algorithms [SE19, SSK+21, HJA20]. Figure 1.20 illustrates the process that transforms a generic Gaussian distribution into an (arbitrary) empirical distribution by performing a sequence of iterative denoising (or compressing) operations:
$\bm{z}_T \sim \mathcal{N}(\bm{0}, \bm{I}) \;\longrightarrow\; \bm{z}_{T-1} \;\longrightarrow\; \cdots \;\longrightarrow\; \bm{z}_1 \;\longrightarrow\; \bm{z}_0 \approx \bm{x}.$        (1.3.37)
By now, denoising (and diffusion) methods have replaced GANs as the mainstream approach to learning distributions of images and videos, leading to popular commercial image-generation engines such as Midjourney and Stability.ai. In Chapter 3 we will systematically introduce and study the denoising and diffusion methods for learning a general low-dimensional distribution.
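To give a flavor of the iterative denoising process (1.3.37), the following NumPy sketch transports samples from a generic Gaussian toward a two-component Gaussian mixture using annealed Langevin-type updates with the mixture's analytic score function. In practice the score is unknown and is instead approximated by a deep network trained on data, so the target mixture, noise schedule, and step sizes below are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a 2-D mixture of two Gaussians (a stand-in for a data distribution).
mus = np.array([[-2.0, 0.0], [2.0, 0.0]])
base_std = 0.3
weights = np.array([0.5, 0.5])

def score(x, std):
    """Analytic score grad_x log p_std(x) of the mixture with component std `std`."""
    diffs = mus[None, :, :] - x[:, None, :]                   # (N, 2, 2): mu_k - x
    sq = np.sum(diffs ** 2, axis=-1)                          # squared distances (N, 2)
    logp = -0.5 * sq / std**2 + np.log(weights)[None, :]
    logp -= logp.max(axis=1, keepdims=True)                   # numerical stability
    w = np.exp(logp); w /= w.sum(axis=1, keepdims=True)       # component responsibilities
    return np.sum(w[:, :, None] * diffs, axis=1) / std**2     # (N, 2)

# Start from a generic Gaussian and denoise through decreasing noise levels.
x = rng.standard_normal((2000, 2)) * 3.0
for sigma in np.geomspace(3.0, 0.05, 30):                     # annealing schedule
    std = np.sqrt(base_std**2 + sigma**2)
    eta = 0.1 * std**2                                        # step size tied to noise level
    for _ in range(20):                                       # Langevin steps at this level
        x = x + eta * score(x, std) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)

print("recovered cluster means:", x[x[:, 0] < 0].mean(axis=0), x[x[:, 0] > 0].mean(axis=0))
```

In the learned setting, a deep network trained by denoising score matching replaces the analytic `score` function; the sampling loop is otherwise of the same iterative, coarse-to-fine form.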
So far, we have given a brief account of the main objective and history of machine intelligence and many important ideas and approaches associated with it. In recent years, following the empirical success of deep neural networks, tremendous efforts have been made to develop theoretical frameworks that can help us understand empirically designed deep networks, either their seemingly necessary components (e.g., dropout, normalization, attention) or their overall behaviors (e.g., double descent, neural collapse).
Partly motivated by this, this book aims to achieve several important and challenging goals:
Develop a theoretical framework that allows us to derive rigorous mathematical interpretations of deep neural networks.
Ensure correctness of the learned data distribution and consistency with the learned representation.
Demonstrate that the framework can lead to performant architectures and can guide further improvements in practice.
Within the past few years, mounting evidence has suggested that these goals can indeed be achieved by leveraging the theory of, and solutions to, the classical analytical low-dimensional models discussed briefly earlier (and more thoroughly in Chapter 2), and by integrating fundamental ideas from a few related fields, namely information/coding theory, control/game theory, and optimization. This book aims to give a systematic introduction to this new approach.
One necessary condition for any learning task to be possible is that the sequences of interest must be computable, at least in the sense of Alan Turing [Tur36]; that is, a sequence can be computed via a program on a typical computer (there are indeed well-defined sequences that are not computable). In addition to being computable, we require that the computation be tractable: the computational cost (space and time) for learning and computing the sequence should not grow exponentially in its length or dimension. Furthermore, as we see in nature (and in the modern practice of machine intelligence), for most practical tasks an intelligent system needs to learn what is predictable from massive data in a very high-dimensional space, such as from vision, sound, and touch. Hence, for intelligence, we do not need to consider all computable and tractable sequences or structures. We should focus only on predictable sequences and structures that admit scalable realizations of their learning and computing algorithms:
$\{\text{predictable (scalable)}\} \;\subset\; \{\text{tractable}\} \;\subset\; \{\text{computable}\}.$        (1.4.1)
This is because whatever algorithms intelligent beings use to learn useful information must be scalable. More specifically, the computational complexity of the algorithms should scale gracefully, typically linearly or even sublinearly, in the size and dimension of the data. On a technical level, this requires that the operations the algorithms rely on for learning use only oracle information that can be efficiently computed from the data. When the dimension is high and the scale is large, the only oracles one can afford to compute are either first-order geometric information about the data (such as approximating a nonlinear structure locally with linear subspaces, or computing the gradient of an objective function) or second-order statistical information (such as covariances or correlations of the data or their features). The main goal of this book is to develop a theoretical and computational framework within which we can systematically develop efficient and effective solutions, or algorithms, that use such scalable oracles and operations to learn low-dimensional structures from sampled data and subsequently the predictive function.
From the examples of sequences we gave in Section 1.2.1, it is clear that some sequences are easy to model and compute and others are more difficult. Obviously, the computational cost of a sequence depends on how complex its predicting function is. The higher the order of the regression, the more costly it is to compute. The predicting function can be a simple linear function, or a nonlinear function that is arbitrarily difficult to specify and compute.
It is reasonable to believe that if a sequence is harder to specify and compute, by whatever measure we may choose, then it will also be more difficult to learn from its sampled segments. Nevertheless, for any given predictable sequence there are in fact many, often infinitely many, ways to specify it. For example, a simple arithmetic sequence such as $1, 2, 3, \ldots$ can be generated by a first-order rule, $x_{t+1} = x_t + 1$, or equally by a second-order rule, $x_{t+1} = 2x_t - x_{t-1}$. Hence it would be very useful if we could develop an objective and rigorous notion of “complexity” for any given computable sequence.
Andrey Kolmogorov, a Russian mathematician, was one of the first to give a definition of complexity for any computable sequence. (Many have contributed to this notion of sequence complexity, most notably Ray Solomonoff and Gregory Chaitin; all three are believed to have developed and studied algorithmic information theory independently, Ray Solomonoff in 1960, Andrey Kolmogorov in 1965 [Kol98], and Gregory Chaitin around 1966 [Cha66].) He suggested that, among all programs that can compute the same sequence, we may use the length of the shortest program as a measure of its complexity. This idea is very much in line with the famous “Occam’s Razor” principle of parsimony: one always chooses the simplest among all theories that can explain the same observation. To be more precise, let $p$ represent a computer program that can generate a sequence $\bm{x}$ on a universal computer $\mathcal{U}$. The Kolmogorov complexity of the sequence is defined to be:
$K(\bm{x}) \;\doteq\; \min_{p\,:\,\mathcal{U}(p) = \bm{x}}\ \operatorname{length}(p).$        (1.4.2)
Therefore, the complexity of a sequence is measured by how “parsimonously” we can specify or compute it. This definition of complexity, or parsimony, for sequences is of great conceptual importance and has many interesting properties. Historically, it has inspired many profound studies in the theory of computation, especially in the field of algorithmic information theory.
The length of the shortest program can be viewed as the ultimate compression of the sequence in question, providing a quantitative measure of how much we have gained by learning the correct generative mechanism of the sequence. However, despite its theoretical importance, the Kolmogorov complexity is in general not a computable function [CT91], and it is even intractable to approximate accurately. As a result, this measure of complexity is of little practical use: it can tell us neither how difficult it is to learn a given sequence in advance nor how well we have learned it.
Hence, for practical purposes, we need an efficiently computable measure of complexity for sequences generated by the same predicting function. (Note that in practice we typically care about learning the predicting function itself, rather than any particular sequence it generates.) Part of the reason the Kolmogorov complexity is not computable is that its definition is non-constructive.
To introduce a computable measure of complexity, we may take a more constructive approach, as advocated by Claude Shannon through the framework of information theory [Sha48, CT91], which has successfully guided the engineering practice of the communication industry for over 80 years. In essence, assuming the sequence $\bm{x}$ is drawn from a probability distribution $p(\bm{x})$, the so-called entropy of the distribution (here the differential entropy, since we assume the sequence consists of continuous variables; for discrete variables we would instead consider the entropy $H(\bm{x}) = -\sum_{\bm{x}} p(\bm{x}) \log p(\bm{x})$):
$h(\bm{x}) \;\doteq\; -\int p(\bm{x}) \log p(\bm{x})\, \mathrm{d}\bm{x}$        (1.4.3)
gives a natural measure of its complexity. This measure also has a natural interpretation as the average number of binary bits needed to encode such a sequence, as we will see in Chapter 3.
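As a quick sanity check that entropy behaves like a complexity measure, the following NumPy sketch evaluates the differential entropy of a multivariate Gaussian, $h = \tfrac{1}{2}\log\det(2\pi e\,\Sigma)$, for an isotropic distribution versus one concentrated near a low-dimensional subspace; the dimensions and variances are illustrative choices. The nearly degenerate distribution has much lower entropy, i.e., it should be much cheaper to encode.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of N(0, cov): 0.5 * logdet(2*pi*e*cov)."""
    n = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

n = 10
# Isotropic Gaussian filling all n dimensions.
cov_full = np.eye(n)
# Gaussian concentrated near a 2-dimensional subspace: unit variance in 2 directions,
# tiny variance (think: quantization precision) in the remaining n - 2 directions.
cov_low = np.diag([1.0] * 2 + [1e-4] * (n - 2))

print("entropy, isotropic          :", gaussian_entropy(cov_full))
print("entropy, near 2-D structure :", gaussian_entropy(cov_low))
```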
To illustrate the main ideas of this view, let us take a large number of long sequence segments:
$\bm{X} \;\doteq\; \big\{\bm{x}^1, \bm{x}^2, \ldots, \bm{x}^N\big\} \subset \mathbb{R}^n$        (1.4.4)
generated by the same predicting function. Note that, without loss of generality, we have assumed that all segments have the same length $n$; each segment can therefore be viewed as a vector in $\mathbb{R}^n$. Next, we may introduce a coding scheme (with a code book), denoted $c$, that converts every segment $\bm{x}^i$ to a unique stream of binary bits $c(\bm{x}^i)$. The simplest coding scheme is to fill the space occupied by all segments with $\epsilon$-balls, as shown in Figure 1.21. We then number all the balls, and each segment is encoded as the (binary) number of its closest ball. Hence, each segment can be recovered from its corresponding bit stream, up to precision $\epsilon$, since such an encoding scheme is lossy.
The complexity of the predicting function can then be evaluated as the average coding length of all segments, known as the coding rate (one may make this more precise by taking it to be the expected coding length over all segments of length $n$):
$R_c \;\doteq\; \frac{1}{N} \sum_{i=1}^{N} \operatorname{length}\big(c(\bm{x}^i)\big).$        (1.4.5)
Obviously, the coding rate will change if one uses a different coding scheme (or code book). In practice, the better we know the low-dimensional structure around which the segments are distributed, the more efficient a code book we can design, as in the example shown in Figure 1.21. Astute readers may have recognized that, conceptually, the denoising process illustrated in Figure 1.20 closely resembles going from the coding scheme with the blue balls to that with the green ones.
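The $\epsilon$-ball coding scheme described above can be simulated directly. The sketch below builds a greedy $\epsilon$-net over a set of segments (each treated as a vector), numbers the balls, and reports a coding length of about $\log_2(\#\text{balls})$ bits per segment; comparing data that fill a square against data lying near a one-dimensional structure inside the same square shows why knowing the low-dimensional structure permits a shorter code. The data sets and the value of $\epsilon$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_epsilon_net(X, eps):
    """Pick ball centers greedily so that every point lies within eps of some center."""
    centers = []
    for x in X:
        if not centers or np.min(np.linalg.norm(np.array(centers) - x, axis=1)) > eps:
            centers.append(x)
    return np.array(centers)

def bits_per_segment(X, eps):
    """Encode each segment by the (binary) index of its nearest ball: ~log2(#balls) bits."""
    centers = greedy_epsilon_net(X, eps)
    return int(np.ceil(np.log2(len(centers)))), len(centers)

N, eps = 2000, 0.1
# Segments spread over the whole unit square (no low-dimensional structure).
X_full = rng.uniform(0, 1, size=(N, 2))
# Segments lying near a 1-dimensional line inside the same square.
t = rng.uniform(0, 1, size=(N, 1))
X_line = np.hstack([t, 0.5 + 0.02 * rng.standard_normal((N, 1))])

for name, X in [("2-D spread", X_full), ("near 1-D line", X_line)]:
    bits, n_balls = bits_per_segment(X, eps)
    print(f"{name}: {n_balls} balls of radius {eps} -> about {bits} bits per segment")
```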
Given two coding schemes $c_1$ and $c_2$ for the segments, if the difference in their coding rates is positive:
$R_{c_1} - R_{c_2} \;>\; 0,$        (1.4.6)
we may say the coding scheme $c_2$ is better. This difference can be viewed as a measure of how much more information $c_2$ has than $c_1$ about the distribution of the data. To a large extent, the goal of learning is equivalent to finding the most efficient coding scheme, the one that minimizes the coding rate:
$c^{\star} \;=\; \arg\min_{c}\ R_c.$        (1.4.7)
As we will see in Chapter 3, the achievable minimal rate is closely related to the notion of entropy (1.4.3).
The perspective of measuring data complexity with explicit encoding schemes has motivated several learning objectives proposed to revise the Kolmogorov complexity for better computability [WD99], including the minimum message length (MML) proposed in 1968 [WB68] and the minimum description length (MDL) proposed in 1978 [Ris78, HY01]. These objectives normally count the coding length of the coding scheme itself (including its code book), in addition to that of the data of interest. However, if the goal is to learn a finite-sized code book and apply it to a large number of segments, the amortized cost of the code book can be ignored, since its contribution to the average coding length vanishes as the number of segments $N$ becomes large.
Again, one may view the resulting optimal coding scheme as the one that achieves the best compression of the observed data. In general, the coding length given by any explicit encoding scheme is always at least as large as the Kolmogorov complexity:
$\operatorname{length}\big(c(\bm{x})\big) \;\geq\; K(\bm{x}).$        (1.4.8)
Therefore, minimizing the coding rate/length essentially minimizes an upper bound on the otherwise uncomputable Kolmogorov complexity.
Note that if the goal were simply to compress the given data for its own sake, then in theory the optimal codes approaching the Kolmogorov complexity would be nearly random and structureless [Cha66], because any code with structure can be compressed further. However, our true purpose in learning the predictive function is to use it repeatedly and easily for future predictions. Hence, while compression allows us to identify the low-dimensional distribution in the data, we would like to encode the distribution in a structured and organized way so that the resulting representation is informative and efficient to use (for example, to sample the distribution under different conditions). Figure 1.22 shows an example that explains intuitively why such a transformation is desirable.
As we will show in Chapter 3, these desired structures in the final representation can be precisely promoted by choosing a natural measure of information gain based on the coding rates of the chosen coding schemes. As we will see throughout this book, such an explicit and constructive coding approach provides a powerful computational framework for learning good representations of low-dimensional structures in real-world data, since in many cases of practical importance the coding length function can be computed efficiently or approximated accurately. In some benign cases, such as subspaces and Gaussians, we can even obtain closed-form formulae (see Chapter 3).
In addition, such a computational framework leads to a principled approach that naturally reveals the role deep networks play in this learning process. As we will derive systematically in Chapter 4, the layers of a deep network perform operations that optimize the objective function of interest in an incremental manner. From this perspective, the role of a deep network can be precisely interpreted as emulating a certain iterative optimization algorithm, say gradient descent, to optimize an objective of information gain. The layers of the resulting deep architectures can be endowed with precise statistical and geometric interpretations, namely performing incremental compressive encoding and decoding operations. As a result, the derived deep networks become transparent “white boxes” that are mathematically fully explainable.
To summarize our discussions so far, let us denote the data as:
$\bm{X} \;=\; \big[\bm{x}^1, \bm{x}^2, \ldots, \bm{x}^N\big] \subset \mathbb{R}^D,$        (1.4.9)
and let $\bm{Z} = \big[\bm{z}^1, \bm{z}^2, \ldots, \bm{z}^N\big]$ be the codes of $\bm{X}$ obtained via some encoder $f$:
$f:\ \bm{X} \;\mapsto\; \bm{Z} \subset \mathbb{R}^d.$        (1.4.10)
In the machine learning context, $\bm{Z}$ is often called the “features” or “latent representation.” Note that, without knowing the underlying distribution of $\bm{X}$, we do not know what the encoder $f$ should be in order to retain the most useful information about that distribution. In practice, people often start by trying a compact encoding scheme that serves a specific task well. In particular, they try to learn an encoder $f$ that optimizes a certain (empirical) measure of parsimony for the learned representation:
$f^{\star} \;=\; \arg\min_{f}\; L\big(f(\bm{X})\big).$        (1.4.11)
For example, image classification is such a case: we assign all images in the same class to a single code and images in different classes to different codes, say “one-hot” vectors:
$\bm{y} \;\in\; \big\{\bm{e}_1, \bm{e}_2, \ldots, \bm{e}_K\big\} \subset \mathbb{R}^K.$        (1.4.12)
Now a classifier can be modeled as a function $f$ that predicts the probability of a given $\bm{x}$ belonging to each of the $K$ classes: $\hat{\bm{y}} = f(\bm{x}) \in \mathbb{R}^K$. The “goodness” of a classifier can then be measured by the so-called cross entropy (which can be viewed as a distance between the ground-truth distribution of $\bm{y}$ and that of the prediction $\hat{\bm{y}}$; it can also be viewed as the expected coding length of $\bm{y}$ if we use the optimal code book for $\hat{\bm{y}}$ to encode $\bm{y}$, and it reaches its minimum when $\bm{y}$ and $\hat{\bm{y}}$ have the same distribution):
$\mathrm{CE}(\bm{y}, \hat{\bm{y}}) \;\doteq\; -\,\mathbb{E}\Big[\sum_{j=1}^{K} y_j \log \hat{y}_j\Big],$        (1.4.13)
where $y_j$ denotes the $j$-th entry of the vector $\bm{y}$. As the early practice of deep networks indicated [KSH12], given enough data, such an encoding scheme can often be represented by a deep network and learned in an end-to-end fashion by optimizing the cross entropy.
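Here is a small NumPy sketch that makes the one-hot encoding (1.4.12) and the cross entropy (1.4.13) concrete for a toy softmax classifier; the number of classes, the inputs, and the (untrained) weights are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, N = 4, 5, 8                        # classes, input dimension, batch size
X = rng.standard_normal((N, D))          # a batch of inputs
labels = rng.integers(0, K, size=N)      # ground-truth class indices
Y = np.eye(K)[labels]                    # one-hot codes, cf. (1.4.12)

# A linear "classifier" followed by softmax gives predicted class probabilities y_hat.
W = 0.1 * rng.standard_normal((D, K))
logits = X @ W
logits -= logits.max(axis=1, keepdims=True)          # numerical stability
Y_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Cross entropy (1.4.13): average of -sum_j y_j log(y_hat_j) over the batch.
cross_entropy = -np.mean(np.sum(Y * np.log(Y_hat + 1e-12), axis=1))
print("cross entropy of the untrained classifier:", cross_entropy)
print("entropy of a uniform guess (the value to beat):", np.log(K))
```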
The cross-entropy loss can be viewed as a special measure of parsimony associated with a particular family of encoding schemes suitable for classification. However, such an encoding is obviously very lossy: the learned code $\bm{y}$ contains no information about $\bm{x}$ other than its class. For example, once an image is assigned (a code representing) the class label “apple,” we can no longer tell from the label alone which specific type of apple appears in the original image.
Of course, the other extreme is to require the coding scheme to be lossless; that is, there is a one-to-one mapping between $\bm{x}$ and its code $\bm{z}$. However, as we will see in Chapter 3, lossless coding (or compression) is impractical unless $\bm{x}$ is discrete. For a continuous random variable, we may only consider lossy coding schemes so that the coding length of the data can be finite; that is, we encode the data only up to a certain prescribed precision. As we will elaborate in Chapter 3, lossy coding is not merely a practical choice: it plays a fundamental role in making it possible to learn the underlying continuous distribution from finitely many samples.
For many learning purposes, we want the feature $\bm{z}$, although lossy, to retain more information about $\bm{x}$ than just its class. In this book, we will introduce a more general measure of parsimony based on the coding length/rate associated with a more general family of coding schemes: coding with a mixture of subspaces or Gaussians. This family has the capability to closely approximate arbitrary real-world distributions up to a certain precision. As we will see in Chapter 3 and Chapter 4, such a measure not only preserves most information about the distribution of $\bm{x}$ but also promotes desirable geometric and statistical structures in the learned representation $\bm{z}$.
In a broader learning context, the main goal of a compressive coding scheme is to identify the low-dimensional structures in the data so that they can be used to predict things in the original data space. This requires that the learned encoding scheme admit an efficient decoding scheme, denoted $g$, which maps $\bm{z}$, often known as the latent representation, back to the data space:
$g:\ \bm{z} \;\mapsto\; \hat{\bm{x}} \in \mathbb{R}^D.$        (1.4.14)
In general, we call such an encoding and decoding pair $(f, g)$ an autoencoding. Figure 1.23 illustrates the process of such an autoencoder.
Generally, we would prefer the decoding $g$ to be approximately an “inverse” of the encoding $f$, so that the data (distribution) decoded from $\bm{z}$ is similar, to some extent, to the original data (distribution); we will make precise later what we mean by “similar.” If so, we would be able to recover or predict from $\bm{z}$ what is going on in the original data space. In this case, we say the pair $(f, g)$ gives a consistent autoencoding. For most practical purposes, not only does such a decoding need to exist, it must also be efficiently realizable and physically implementable. For example, if $\bm{z}$ is a real-valued variable, quantization is needed for any decoding scheme to be realizable on a finite-state machine, as we will explain in Chapter 3. Hence, in general, one should expect that realizable encoding and decoding schemes are necessarily lossy. A central question is how to learn a compact (lossy) representation that can nevertheless be used to predict well.
Generally speaking, as we will see, both the encoder $f$ and the decoder $g$ can be modeled and realized by deep networks and learned by solving an optimization problem of the following form:
$\min_{f,\, g}\;\; d\big(\bm{X},\, g(f(\bm{X}))\big) \;+\; \lambda\, L\big(f(\bm{X})\big),$        (1.4.15)
where $d(\cdot, \cdot)$ is a certain distance function that promotes similarity between $\bm{X}$ and the decoded $\hat{\bm{X}} = g(f(\bm{X}))$ (either sample-wise or distribution-wise, depending on the choice of $d$), and $L(\cdot)$ is some measure that promotes parsimony and information richness of the representation $\bm{Z} = f(\bm{X})$. The classic principal component analysis (PCA) [Jol02] is a typical example of a consistent autoencoding, which we will study in great detail in Chapter 2 as a precursor to more general low-dimensional structures. In Chapter 5, we will study how to learn consistent autoencodings for general (say, nonlinear) low-dimensional distributions.
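Since PCA is cited above as the prototypical consistent autoencoding, the following NumPy sketch realizes it explicitly in the spirit of (1.4.15): a linear encoder $f(\bm{x}) = \bm{U}^\top(\bm{x} - \bm{\mu})$ and decoder $g(\bm{z}) = \bm{U}\bm{z} + \bm{\mu}$, with $\bm{U}$ holding the top principal directions. The synthetic data are an illustrative assumption, and the parsimony term is enforced here simply by fixing the code dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data near a 3-dimensional subspace of R^20.
D, d, N = 20, 3, 1000
basis = np.linalg.qr(rng.standard_normal((D, d)))[0]
X = basis @ (2.0 * rng.standard_normal((d, N))) + 0.05 * rng.standard_normal((D, N))

# PCA: top-d principal directions of the centered data.
mu = X.mean(axis=1, keepdims=True)
U, S, _ = np.linalg.svd(X - mu, full_matrices=False)
U_d = U[:, :d]

encode = lambda X: U_d.T @ (X - mu)          # f: R^D -> R^d
decode = lambda Z: U_d @ Z + mu              # g: R^d -> R^D

X_hat = decode(encode(X))
err = np.mean(np.sum((X - X_hat) ** 2, axis=0)) / np.mean(np.sum((X - mu) ** 2, axis=0))
print(f"relative reconstruction error with a {d}-dimensional code: {err:.4f}")
```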
Note that in the above autoencoding objective, one needs to evaluate how close, or consistent, the decoded data $\hat{\bm{X}}$ is to the original $\bm{X}$. This often requires external supervision or knowledge of what similarity measure to use. Computing similarity between $\bm{X}$ and $\hat{\bm{X}}$ can be very expensive, if not impossible or intractable (say, if one wants to minimize a certain distributional distance between the two). Note that in nature, animals are capable of learning entirely on their own, without comparing their estimates against ground truth in the data space; they typically do not even have that option.
How, then, can a system learn autonomously, without external supervision or comparison? How can it know that $\hat{\bm{x}}$ is consistent with $\bm{x}$ without directly comparing them? This leads to the idea of “closing the loop.” As it turns out, under mild conditions that we will make precise in Chapter 5, to ensure that $\bm{x}$ and $\hat{\bm{x}}$ are consistent, one only has to encode $\hat{\bm{x}}$ as $\hat{\bm{z}} = f(\hat{\bm{x}})$ and check whether $\bm{z}$ and $\hat{\bm{z}}$ are consistent. We call this notion of consistency self-consistency, which can be illustrated by the following diagram:
$\bm{x} \;\xrightarrow{\ f\ }\; \bm{z} \;\xrightarrow{\ g\ }\; \hat{\bm{x}} \;\xrightarrow{\ f\ }\; \hat{\bm{z}}.$        (1.4.16)
We refer to this process as a closed-loop transcription (a name inspired by the transcription process between DNA and RNA or other proteins), which is illustrated in Figure 1.24.
It is arguably true that any autonomous intelligent being only needs to learn a self-consistent representation of the observed data $\bm{x}$, because checking consistency in the original data space (often meaning in the external world) is either too expensive or not even physically feasible. The closed-loop formulation allows one to learn an optimal encoding and decoding via a minimax game that depends only on the internal (learned) features $\bm{z}$ and $\hat{\bm{z}}$:
$\min_{g}\,\max_{f}\;\; \ell\big(\bm{z},\, \hat{\bm{z}}\big) \;+\; \lambda\, L\big(\bm{z}\big),$        (1.4.17)
where $\ell$ is a loss function based on the coding rates of the features $\bm{z}$ and $\hat{\bm{z}}$, which, as we will see, can be much easier to compute. Here again, $L$ is some measure that promotes parsimony and information richness of $\bm{z}$. Somewhat surprisingly, as we will see in Chapter 5, under rather mild conditions, such as the data distribution being sufficiently low-dimensional, self-consistency between $\bm{z}$ and $\hat{\bm{z}}$ implies consistency in the original data space! In addition, we will see that a closed-loop system allows us to learn the distribution in a continuous and incremental manner (that is, one class at a time, or even one sample at a time), without suffering from problems such as catastrophic forgetting associated with open-loop models.
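To illustrate the structure of the closed loop (1.4.16) in the simplest possible setting, the sketch below uses linear (projection-style) encoders and decoders and compares the self-consistency gap between the internal features $\bm{z}$ and $\hat{\bm{z}}$ for a well-matched and a mismatched decoder; the check itself is computed entirely from internal features, and the data-space error is printed only for reference, to show that the cheap feature-space check tracks it. The adversarial, coding-rate-based training of (1.4.17) is deferred to Chapter 5; this toy linear setup is our own illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

D, d, N = 20, 3, 500
basis = np.linalg.qr(rng.standard_normal((D, d)))[0]
X = basis @ (2.0 * rng.standard_normal((d, N)))

# Encoder f: projection onto a learned d-dimensional basis U.
U = np.linalg.qr(basis + 0.01 * rng.standard_normal((D, d)))[0]    # nearly correct encoder
f = lambda X: U.T @ X

# Two candidate decoders g: one consistent with f, one imperfectly learned.
g_good = lambda Z: U @ Z
V = np.linalg.qr(basis + 0.3 * rng.standard_normal((D, d)))[0]     # a mismatched decoder
g_bad = lambda Z: V @ Z

for name, g in [("consistent decoder", g_good), ("mismatched decoder", g_bad)]:
    Z = f(X)                  # x -> z
    X_hat = g(Z)              # z -> x_hat
    Z_hat = f(X_hat)          # x_hat -> z_hat : close the loop
    self_gap = np.mean(np.linalg.norm(Z - Z_hat, axis=0))          # internal check only
    data_gap = np.mean(np.linalg.norm(X - X_hat, axis=0))          # for reference
    print(f"{name}: ||z - z_hat|| = {self_gap:.3f}, ||x - x_hat|| = {data_gap:.3f}")
```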
So far, we have introduced three related frameworks for learning a compact and structured representation of a given data distribution:
The open-ended encoding (1.4.10);
The bi-directional autoencoding (1.4.14);
The closed-loop transcription (1.4.16).
In this book, we will systematically study all three frameworks, one after another:
$\text{encoding}\ (1.4.10) \;\longrightarrow\; \text{autoencoding}\ (1.4.14) \;\longrightarrow\; \text{closed-loop transcription}\ (1.4.16),$        (1.5.1)
in Chapter 4, Section 5.1 and Section 5.2 of Chapter 5, respectively.
In the past few years, many theoretical frameworks have been proposed and developed to help understand deep networks. However, many have been unable to provide scalable solutions that match the performance of empirical methods on real-world data and tasks, and many do not provide useful guidance on how to further improve practice. Chapters 6 and 7 will show how the framework presented in this book may help bridge the gap between theory and practice. Chapter 6 will show how to use the learned distribution and its representation to conduct (Bayesian) inference for almost all practical tasks that depend on (conditional) generation, estimation, and prediction. Chapter 7 will provide convincing experimental evidence that networks and systems designed from first principles can achieve competitive, and potentially better, performance on a variety of tasks such as visual representation learning, image classification, image completion, segmentation, and text generation.
As we mentioned at the beginning, a common and fundamental task of any intelligent being is to learn predictable information from its sensed data. Now that we understand a little about the computational nature of this task, one should realize that it is a never-ending process, for the following reasons:
The knowledge learned so far from the data, say by the encoding and decoding schemes, is unlikely to be entirely correct or optimal. Intelligence should have the ability to improve whenever there are still errors in predicting new observations.
The data observed so far do not yet cover all the predictable information. Intelligence should be able to recognize that current knowledge is inadequate and have the capability to learn and acquire new information whenever available.
Hence, intelligence is not about simply collecting all data in advance and training a model to memorize all the predictable information in the data. Rather, it is about being equipped with computational mechanisms that can constantly improve current knowledge and acquire new information whenever it is available and needed. That is, a fundamental characteristic of any intelligent being or system (an animal, a human being, an intelligent robot, the scientific community, even an entire civilization) is the ability to continuously improve or gain information (or knowledge) on its own. Conceptually, we may illustrate the relationship between intelligence and information (or knowledge) symbolically as follows:
$\text{Intelligence} \;\sim\; \frac{\mathrm{d}}{\mathrm{d}t}\,\text{Information (Knowledge)}.$        (1.5.2)
We believe that the closed-loop framework is a universal mechanism that enables self-improvement and self-learning, via feedback (reinforcement can be viewed as a primitive form of feedback, say natural selection in nature) or gaming (scientific inquiry can be viewed as the most advanced form of gaming, through hypothesis formulation and verification). All intelligent beings or systems in nature utilize closed-loop mechanisms for learning, at all levels and on all scales. Its ubiquity inspired early studies that tried to model and emulate intelligence with machines and computers, particularly the Cybernetics movement initiated by Norbert Wiener in the 1940s.
We hope that this book will help people better understand the objectives, principles, and computational mechanisms behind intelligence. It serves as a foundation for the future study of higher-level human intelligence, the true “artificial” intelligence; we lay out several significant open problems in these new directions at the end of the book, in Chapter 8.