Chapter 1 An Informal Introduction

Just as the constant increase of entropy is the basic law of the universe, so it is the basic law of life to be ever more highly structured and to struggle against entropy.

  – Václav Havel

1.1 Intelligence, Cybernetics, and Artificial Intelligence

The world in which we live is neither fully random nor completely unpredictable. (Note that there would be no need for an intelligent being to learn or memorize anything if the world were fully random.) Instead, it follows certain orders, patterns, and laws, some deterministic and some probabilistic, that make it largely predictable. The very emergence and existence of life depend on a predictable living environment. Only by learning and memorizing what is predictable in the environment can life survive and thrive, since good decisions and actions depend on reliable predictions. Because there seem to be unlimited things that are predictable about the world, intelligent beings, such as animals and humans, have continued to improve, through evolution, their capability to explore and exploit such predictability for an increasingly better life. To this end, they have developed increasingly acute senses, including vision, hearing, touch, taste, and smell, to perceive what is predictable in the external environment from these high-throughput sensory data. Hence a fundamental task for all intelligent beings is to be able to:

learn and memorize predictable information from massive sensed data.

Before we can begin to understand how this is done, we need to consider a few related questions:

  • How can such predictable information in the data be modeled and represented mathematically?

  • How can such information be learned effectively and efficiently from the data computationally?

  • How should such information be best organized for future prediction and inference?

This book aims to provide some answers to these questions. These answers will help us better understand intelligence, especially the computational principles and mechanisms that enable it. There is reason to believe that all forms of intelligence, from low-level intelligence seen in early primitive life to the highest form of intelligence, the practice of modern science, follow the same set of principles and mechanisms. We will elaborate more below.

Emergence and evolution of intelligence.

A necessary condition for the emergence of life on earth about 4 billion years ago is that the earth’s environment is largely predictable. Within this environment, life has developed mechanisms that allow it to learn what is predictable about the environment, encode that information in a certain way, and use it for survival. Generally speaking, we call this ability to learn intelligence. To a large extent, the evolution of life is the mechanism of intelligence at work. In nature, intelligence is mainly developed through two types of learning mechanisms: phylogenetic and ontogenetic [Wie61].

Phylogenetic intelligence refers to learning through the evolution of species. Individuals survive mainly based on knowledge inherited through the DNA or genes of their parents. To a large extent, we may call DNA nature’s pre-trained large models, because they play a very similar role. The main characteristic of phylogenetic intelligence is that individuals do not have much learning capacity. Learning is carried out through a “trial-and-error” mechanism based on random mutation of the genes, and species then evolve through natural selection – survival of the fittest, as shown in Figure 1.1. This can be viewed as nature’s way of implementing what is now known as “reinforcement learning.” However, such a “trial-and-error” learning process can be extremely slow, costly, and unpredictable: since the emergence of the first life forms, about 4.4 to 3.8 billion years ago, life has relied on this form of evolution. (Astute readers may have noticed an uncanny similarity between how early life evolved and how large language models evolve today.)

Figure 1.1: Evolution of phylogenetic intelligence: Knowledge of the external world is encoded and passed on via DNA (left), and it is decoded from DNA to RNA and to protein etc. In the early stage of life evolution (right), intelligence develops knowledge at the species level via (random) gene mutation and natural selection – “may the fittest survive,” which can be viewed as a primitive form of reinforcement learning.
Figure 1.2: Evolution of life, from the ancestor of all life today (named LUCA — last universal common ancestor), a single-cell-like organism which lived from 3.5-4.3 billion years ago, to the emergence of the first nervous system in worm-like species (middle), about 550 million years ago, to the explosion of life forms in the Cambrian period (right), about 530 million years ago.

Ontogenetic intelligence refers to the learning mechanisms that allow an individual to learn through its own senses, memories, and predictions within its specific living environment and to improve and adapt its behaviors. Ontogenetic learning became possible after the emergence of the nervous system, about 550 to 600 million years ago (in worm-like organisms), shown in Figure 1.2 (middle). That is, with a sensory and nervous system, an individual is capable of continuously forming and improving its own knowledge about the world, also known as memory, in addition to what is inherited from its DNA or genes. This capability has significantly enhanced the survival of individuals and contributed to the explosion of life forms in the Cambrian period about 530 million years ago. Compared to phylogenetic learning, ontogenetic learning is significantly more efficient and predictable, as it can be realized within the resource limits of an individual during its lifespan.

Notice that both types of learning mechanisms rely on some form of feedback from the external environment – a penalty (death) or a reward (food) for the actions taken (gene mutations of the species or actions made by the individual) – in order to learn. As a result, all intelligent beings, whether species or individuals, rely on a closed-loop feedback mechanism to learn and improve their knowledge about the world. We also notice that, from plants to fish, to birds, and to mammals, more advanced species rely more and more on their ontogenetic learning capabilities. They stay with and learn from their parents longer and longer after birth, because individuals of the same species need to survive in very diverse environments.

Evolution of human intelligence.

Since the emergence of Homo sapiens about 315 thousand years ago, a new and higher form of intelligence has emerged, one that evolves more efficiently and economically. Languages, first spoken (it was believed that Sanskrit was the first spoken language, dating back to 5000 BC) and then written (Sumerian is believed to be one of the oldest written languages in existence, first attested about 3100 BC in southern Mesopotamia), were developed a few thousand years ago; see Figure 1.3. Language allows individuals to communicate and share useful information with others. Therefore, a human community or society can behave like a single intelligent organism that can learn much faster and hold more knowledge than any individual. In a way, written languages, or texts, play a role similar to DNA and genes, as they allow human societies to accumulate and pass knowledge of the world on to the next generations. We may refer to this type of intelligence as societal intelligence, to distinguish it from the phylogenetic intelligence of species and the ontogenetic intelligence of individuals. This type of knowledge accumulation serves as the foundation of ancient civilizations.

Figure 1.3: The development of verbal communication and spoken languages (between 10000-5000 BC), written languages (about 3000 BC), and mathematics (around 500-300 BC) mark three key milestones in the evolution of human intelligence.

Quite miraculously, a few thousand years ago, another quantum leap in human intelligence occurred, which allowed philosophers and mathematicians to develop knowledge that goes far beyond empirical knowledge. The development of abstract mathematical concepts and symbols, such as numbers, space, and time, as well as mathematical logic, provided a new, precise language for modern science. In addition, humans developed the ability to generate new hypotheses and to verify their correctness by logical deduction or scientific experimentation. This, for the first time, enabled humans to proactively develop new knowledge beyond passive empirical means. The ability to conduct these high-level forms of knowledge development is believed to be unique to humans. This advanced form of intelligence is what we refer to as “artificial intelligence” (AI), a term coined by John McCarthy at the Dartmouth summer workshop in 1956.

Hence, from what we can learn from nature, from now on, whenever we use the word “intelligence,” we need to be very specific about which level/form of intelligence we mean:

phylogenetic ⟹ ontogenetic ⟹ societal ⟹ artificial intelligence.    (1.1.1)

A clear characterization and distinction are necessary and important because we want to study intelligence as a scientific and mathematical subject. It is highly likely that, even if they all share the common objective of learning useful knowledge of the world, the specific computational mechanisms and physical implementations behind each level/form of intelligence could be different. We believe the reader will better understand and appreciate these differences after finishing this book. Therefore, we will leave further discussion of general intelligence to the final chapter, Chapter 8.

Origin of machine intelligence – Cybernetics.

In the 1940s, partly due to the war effort, intelligence in nature inspired scientists to try to emulate animal intelligence with machines, which led to the “Cybernetics” movement advocated by Norbert Wiener. Wiener studied zoology at Harvard as an undergraduate but later became a mathematician and control theorist. He had a lifelong passion for understanding and developing autonomous systems that could emulate the intelligent behaviors of animals. Today, the Cybernetics program is often narrowly interpreted as being mainly about feedback control systems, to which Wiener indeed made his most significant technical contributions. But the Cybernetics program was much broader and deeper than that. It was about understanding intelligence as a whole (at least at the level of animals), and it influenced the work of a whole generation of renowned scientists, including Warren McCulloch, Walter Pitts, Claude Shannon, John von Neumann, and Alan Turing.

Figure 1.4: The book “Cybernetics” by Norbert Wiener published in 1948 [Wie48] (left) and the second edition in 1961 [Wie61] (right).

Wiener was arguably the first person who studied intelligence as a system, instead of paying attention to only one component or aspect of it. His comprehensive views on intelligence were elaborated in his famous 1948 book “Cybernetics: or Control and Communication in the Animal and the Machine” [Wie48]. In this book and its second edition published in 1961 [Wie61] (see Figure 1.4), he tried to identify several necessary characteristics and mechanisms of intelligent systems, which include (but are not limited to):

  • How to measure and store information (in the brain) and how to communicate it with others. (Norbert Wiener was the first to point out that “information” is neither matter nor energy, but an independent quantity of study in its own right. This perspective led to the formulation of information theory and coding theory by Claude Shannon in 1948.)

  • How to correct errors in prediction and estimation based on existing information. Norbert Wiener himself helped formalize the theory for control systems based on closed-loop feedback in the 1940s.

  • How to learn to make better decisions from interacting with a potentially non-cooperative opponent or adversarial environment. This was formalized by John von Neumann as game theory in 1944.

In 1943, very much motivated by Wiener’s Cybernetics program, the psychiatrist Warren McCulloch and the logician Walter Pitts together formalized the first computational model of a neuron [MP43], called an artificial neuron, as illustrated later in Figure 1.13. Based on this model, in the 1950s Frank Rosenblatt built a physical machine, named the Mark I Perceptron, with a network of hundreds of such artificial neurons. The Perceptron was the first physically realized artificial neural network; see Figure 1.15. Notably, John von Neumann’s universal computer architecture, proposed in 1945, was also designed to facilitate the goal of building computing machines that could physically realize the mechanisms suggested by the Cybernetics program.

Astute readers have probably noticed that the 1940s was truly a magical era: so many fundamental ideas were invented and influential theories formalized in that period, including the mathematical model of neurons, artificial neural networks, information theory, control theory, game theory, and computing machines. Figure 1.5 shows some of the pioneers of these theories. As we now know, each of these works has grown into the foundation of a scientific or engineering field over the following decades and has had a tremendous impact on us. All of these fundamental theories were inspired and motivated by the goal of developing machines that can emulate intelligence in nature. Based on historical notes, Wiener’s Cybernetics movement influenced almost all of these people and their work. To a large extent, the Cybernetics program laid out by Wiener can be viewed as the true predecessor of the currently very popular “embodied” intelligence program. In fact, Wiener had described the program in much more concrete terms [Wie61].

Figure 1.5: Pioneers of theoretical and computational foundations for intelligence: Norbert Wiener (cybernetics and control theory), Claude Shannon (information theory), John von Neumann (game theory), and Alan Turing (computing theory).

Although Wiener had identified in his work many key characteristics and mechanisms of (embodied) intelligence, there is no indication that he knew how to properly integrate all these mechanisms together to build a complete autonomous intelligent system. Judging from today’s knowledge, some of his views on these mechanisms were not entirely accurate or complete. In particular, in the last chapter of the second edition of Cybernetics [Wie61], he pointed out that it is crucial to deal with nonlinearity if a machine learning system is designed to emulate typical learning mechanisms in nature. But he did not provide any concrete and effective solutions to this difficult issue. In his defense, though, few people at the time knew how, since even the theory for dealing with linear models and systems was still in its infancy.

Nevertheless, we cannot help but marvel at Wiener’s foresight about the importance of nonlinearity. As we will see in this book, the answer was found only recently: nonlinearity can be effectively dealt with through progressive linearization and transformation realized by deep neural networks (see Chapter 4). In addition, we will attempt to show in this book how all the mechanisms listed above can be naturally integrated into a complete system that exhibits the characteristics of an autonomous intelligent system (see Chapter 5).

Origin of Artificial Intelligence.

From the subtitle of Wiener’s Cybernetics book: “Control and Communication in the Animal and the Machine”, one can tell that the studies in the 1940s mainly aimed to emulate intelligence at the level of animals. As we mentioned before, the research agendas about intelligence around the 1940s were very much dominated by Norbert Wiener’s Cybernetics movement.

Alan Turing was one of the first to notice this limitation. In his famous 1950 paper “Computing Machinery and Intelligence” [Tur50], Turing formally posed the question of whether machines can imitate intelligence even at the human level, to the point of being indistinguishable from the intelligent capabilities of humans. This is now known as the Turing test.

Around 1955, a group of young and ambitious scientists tried to break away from the then-dominant Cybernetics movement and its research agendas so that they would have a chance to create their own legacy. They decided to take on Turing’s challenge of imitating human intelligence and proposed a workshop to be held at Dartmouth College in the summer of 1956. They made their intention clear with a statement in their proposal:

“The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.”

In essence, they wanted to formalize and study the higher-level intelligence that differentiates humans from animals. The topics they considered included abstraction, symbolic methods, natural languages, and deductive methods (such as causal inference and logical deduction). The organizer of the workshop, John McCarthy, then a young assistant professor of mathematics at Dartmouth College, coined the now famous term “Artificial Intelligence” (AI) to encapsulate the set of characteristics or mechanisms that are believed to be unique to human intelligence.

The renaissance of “Artificial Intelligence” or “Cybernetics”?

As readers may know, in the past decade or so, machine intelligence has undergone explosive development, powered mainly by the practice of deep artificial neural networks and triggered by the work of Geoffrey Hinton and his students in 2012 [KSH12]. This era is also hailed as the “Renaissance” of Artificial Intelligence (AI). However, in terms of the tasks that people have actually tried to tackle (recognition, generation, and prediction) and the techniques that people have developed and implemented so far (reinforcing, imitating, encoding, decoding, denoising, and compression), we are very much just emulating the mechanisms that are common to the intelligence of early life and animals. Even in that regard, as we will try to clarify in this book, current “AI” models and systems have not correctly implemented all the mechanisms necessary for intelligence at these levels, which were already known to the Cybernetics movement in the 1940s.

Hence, strictly speaking, the advancement of machine intelligence in the past decade does not align well with the “Artificial Intelligence” program laid out at the 1956 Dartmouth workshop. Instead, what has been predominantly accomplished so far is more closely related to the objectives of the classic “Cybernetics” program laid out by Norbert Wiener in the 1940s. It is probably more appropriate to call the current era the “Renaissance of Cybernetics.” (The recent rise of so-called “Embodied AI” for autonomous intelligent robots shares even more similarity with the goals of the Cybernetics program.) Only after we have fully understood what we have truly done, from a scientific and mathematical perspective, can we know what remains to be done and in which direction to go to pursue the true nature of intelligence. This is one of the main purposes of this book.

1.2 What to Learn?

1.2.1 Predictability

Data that carry useful information manifest in many different forms. In the most natural form, they can be modeled as sequences that are predictable and computable. The notion and properties of a predictable and computable sequence were at the heart of the theory of computing and largely led to the invention of computers [Tur36]. The role of predictable sequences in (inductive) inference was studied by Ray Solomonoff, Andrey Kolmogorov, and many others in the 1960s [Kol98] as a generalization of Claude Shannon’s classic Information Theory [Sha48]. To understand the concept of predictable sequences, let us first start with some concrete simple examples.

Scalar Case.

The simplest predictable discrete sequence is arguably the sequence of natural numbers:

S = 1, 2, 3, 4, 5, 6, \ldots, n, n+1, \ldots    (1.2.1)

in which the next number $x_{n+1}$ is defined as its previous number $x_n$ plus 1:

x_{n+1} = x_n + 1.    (1.2.2)

One may generalize the notion of predictability to any sequence $\{x_n\}_{n=1}^{\infty}$ with $x_n \in \mathbb{R}$ if the next number $x_{n+1}$ can always be computed from its previous one $x_n$:

x_{n+1} = f(x_n), \quad x_n \in \mathbb{R}, \; n = 1, 2, 3, \ldots    (1.2.3)

where $f(\cdot)$ is a computable (scalar) function, meaning it can be implemented as a program on a computer. Note that we emphasize that the function $f(\cdot)$ must be computable: there are many functions that can be defined but are not computable. Alan Turing’s seminal work in 1936 [Tur36] gives a rigorous definition of computability. In practice, we often further assume that $f$ is efficiently computable and has nice properties such as being continuous and differentiable. The necessity of these properties will become clear later, once we understand more refined notions of computability and their roles in machine learning and intelligence.
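
As a concrete illustration (a minimal sketch of our own, not from the original text; the function name roll_out is a hypothetical choice), the recursion (1.2.3) can be rolled out by any program that implements a computable $f$:

```python
import numpy as np

def roll_out(f, x1, length):
    """Generate a predictable sequence x_1, x_2, ... by repeatedly applying f."""
    xs = [x1]
    for _ in range(length - 1):
        xs.append(f(xs[-1]))
    return np.array(xs)

# The natural numbers of (1.2.1)-(1.2.2): f(x) = x + 1.
print(roll_out(lambda x: x + 1, 1, 8))    # [1 2 3 4 5 6 7 8]
# Any other computable f defines a predictable sequence as well, e.g. f(x) = 2x.
print(roll_out(lambda x: 2 * x, 1, 8))    # [  1   2   4   8  16  32  64 128]
```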

Multi-Variable Case.

Of course, the value of the next number can also depend on two of its predecessors. For example, the famous Fibonacci sequence is defined as:

S = 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, \ldots    (1.2.4)

where one can easily see:

x_{n+2} = x_{n+1} + x_n, \quad x_n \in \mathbb{R}, \; n = 1, 2, 3, \ldots    (1.2.5)

Similarly, we may generalize this recursion to

x_{n+2} = f(x_{n+1}, x_n), \quad x_n \in \mathbb{R}, \; n = 1, 2, 3, \ldots    (1.2.6)

where $f(\cdot,\cdot)$ is any computable function that takes two variables as input. We can further generalize the notion of predictability to a sequence whose next value depends on, say, $d$ of its predecessors:

x_{n+d} = f(x_{n+d-1}, \ldots, x_n), \quad x_n \in \mathbb{R}, \; n = 1, 2, 3, \ldots    (1.2.7)

The number of predecessors $d$ needed for the prediction is called the degree of the recursive prediction. The expression (1.2.7) is also called an (auto-)regression, and such a sequence is called an auto-regressive sequence. If the function $f$ is a linear function, we call it a linear (auto-)regression.
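
As a small sketch of our own (the helper roll_out_ar below is a hypothetical name, not from the text), a degree-$d$ auto-regressive sequence such as (1.2.7) can be rolled out from its first $d$ values and the predictive function $f$:

```python
import numpy as np

def roll_out_ar(f, init, length):
    """Roll out x_{n+d} = f(x_{n+d-1}, ..., x_n) starting from the first d values in `init`."""
    xs = list(init)
    d = len(init)
    while len(xs) < length:
        xs.append(f(*xs[-d:][::-1]))   # pass the most recent value first, as in (1.2.7)
    return np.array(xs)

# The Fibonacci sequence (1.2.5): degree d = 2 with f(x_{n+1}, x_n) = x_{n+1} + x_n.
print(roll_out_ar(lambda a, b: a + b, [1, 1], 10))   # [ 1  1  2  3  5  8 13 21 34 55]
```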

Vector Case.

To simplify the notation, we may define a vector $\bm{x} \in \mathbb{R}^d$ that collects $d$ consecutive values of the sequence:

\bm{x}_n \doteq [x_{n+d-1}, \ldots, x_n]^\top, \quad \bm{x}_n \in \mathbb{R}^d, \; n = 1, 2, 3, \ldots    (1.2.8)

With this notation, the recursive relation (1.2.7) can be conveniently written as

\bm{x}_{n+1} = g(\bm{x}_n) \in \mathbb{R}^d, \quad n = 1, 2, 3, \ldots    (1.2.9)

where the function $g(\cdot)$ is uniquely defined by the function $f$ in (1.2.7) and takes a $d$-dimensional vector as input. In different contexts, such a vector is sometimes referred to as a “state” or a “token.” Note that the equation (1.2.7) defines a mapping $\mathbb{R}^d \rightarrow \mathbb{R}$, whereas here $g: \mathbb{R}^d \rightarrow \mathbb{R}^d$.
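
The lift from the degree-$d$ recursion (1.2.7) to the first-order vector map (1.2.9) is purely mechanical, as the following sketch of our own illustrates (the name lift is hypothetical): $g$ computes the new leading entry with $f$ and shifts the rest of the window.

```python
import numpy as np

def lift(f, d):
    """Turn a degree-d scalar recursion f into a first-order vector map g: R^d -> R^d."""
    def g(x):
        assert len(x) == d
        # x = [x_{n+d-1}, ..., x_n]; the new leading entry is f applied to x (cf. (1.2.8)).
        return np.concatenate(([f(*x)], x[:-1]))
    return g

g = lift(lambda a, b: a + b, d=2)      # the Fibonacci recursion as a vector map
x = np.array([1.0, 1.0])               # x_1 = [x_2, x_1]
for _ in range(5):
    x = g(x)
print(x)                               # [13.  8.] = [x_7, x_6]
```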

Controlled Prediction.

We may also define a predictable sequence that depends on another predictable sequence as input:

\bm{x}_{n+1} = f(\bm{x}_n, \bm{u}_n) \in \mathbb{R}^d, \quad n = 1, 2, 3, \ldots,    (1.2.10)

where $\{\bm{u}_n\}$ with $\bm{u}_n \in \mathbb{R}^k$ is a (computable) predictable sequence. In other words, the next vector $\bm{x}_{n+1} \in \mathbb{R}^d$ depends on both $\bm{x}_n \in \mathbb{R}^d$ and $\bm{u}_n \in \mathbb{R}^k$. In the context of control theory, the sequence $\{\bm{u}_n\}$ is often referred to as the “control input” and $\bm{x}_n$ as the “state” or “output” of the system (1.2.10). One classic example is a linear dynamical system:

\bm{x}_{n+1} = \bm{A}\bm{x}_n + \bm{B}\bm{u}_n, \quad \bm{A} \in \mathbb{R}^{d \times d}, \; \bm{B} \in \mathbb{R}^{d \times k},    (1.2.11)

which is widely studied in control theory [CD91].

Very often the control input is given by a computable function of the state $\bm{x}_n$ itself:

\bm{u}_n = h(\bm{x}_n), \quad n = 1, 2, 3, \ldots    (1.2.12)

As a result, the sequence $\{\bm{x}_n\}$ is given by composing the two computable functions $f$ and $h$:

\bm{x}_{n+1} = f\big(\bm{x}_n, h(\bm{x}_n)\big), \quad n = 1, 2, 3, \ldots    (1.2.13)

In this way, the sequence $\{\bm{x}_n\}$ again becomes an auto-regressive predictable sequence. When the input $\bm{u}_n$ depends on the output $\bm{x}_n$, we say the resulting sequence is produced by a “closed-loop” system (1.2.13). As the closed-loop system no longer depends on any external input, we say such a system has become autonomous. It can be viewed as a special case of auto-regression. For instance, if in the above linear system (1.2.11) we choose $\bm{u}_n = \bm{F}\bm{x}_n$, the closed-loop system becomes

\bm{x}_{n+1} = \bm{A}\bm{x}_n + \bm{B}\bm{u}_n = \bm{A}\bm{x}_n + \bm{B}\bm{F}\bm{x}_n = (\bm{A} + \bm{B}\bm{F})\bm{x}_n,    (1.2.14)

which is a linear auto-regression.
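
To make the closed-loop idea concrete, here is a small simulation of our own; the matrices A, B, and F below are arbitrary illustrative choices, with F picked so that the closed-loop matrix A + BF in (1.2.14) is stable.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])       # a marginally unstable open-loop system
B = np.array([[0.0],
              [1.0]])
F = np.array([[-0.5, -1.5]])     # state-feedback gain; eig(A + B F) = {0, 0.5}

x = np.array([1.0, 1.0])
for n in range(20):
    u = F @ x                    # u_n = F x_n: the input is computed from the state
    x = A @ x + B @ u            # equivalently, x_{n+1} = (A + B F) x_n
print(np.round(x, 6))            # the autonomous closed-loop state decays toward the origin
```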

Continuous Processes.

Predictable sequences have natural counterparts in the continuous case. We may refer to them as predictable processes. Similar to the sequence of natural numbers, the simplest predictable continuous process is time itself, $x(t) = t$.

More generally, we say a process, denoted by $\bm{x}(t)$, is predictable if at any time $t$ the value of the process at $t + \delta t$, where $\delta t$ is an infinitesimal increment, is determined by its value at $t$. Typically, the change in value is continuous and smooth, so $\delta\bm{x}(t) = \bm{x}(t + \delta t) - \bm{x}(t)$ is infinitesimally small. Predictable processes are typically described by (multivariate) differential equations:

\dot{\bm{x}}(t) = f(\bm{x}(t)), \quad \bm{x} \in \mathbb{R}^d.    (1.2.15)

In the context of systems theory [CD91, Sas99], the above equation is also known as a state-space model. Similar to the discrete case, a controlled process can be given by:

\dot{\bm{x}}(t) = f(\bm{x}(t), \bm{u}(t)), \quad \bm{x} \in \mathbb{R}^d, \; \bm{u} \in \mathbb{R}^k,    (1.2.16)

where $\bm{u}(t)$ is a computable input process.

Example 1.1.

For example, in physics, Newton’s second law of motion describes how to predict the trajectory $\bm{x}(t) \in \mathbb{R}^3$ of a moving object under a force input $\bm{F}(t) \in \mathbb{R}^3$:

m\ddot{\bm{x}}(t) = \bm{F}(t).    (1.2.17)

When there is no force, $\bm{F}(t) \equiv \bm{0}$, the above law reduces to a special case, known as Newton’s first law: the object maintains a constant velocity along a straight line:

\ddot{\bm{x}}(t) = \bm{0} \;\Leftrightarrow\; \dot{\bm{x}}(t) = \bm{v}    (1.2.18)

for some constant velocity vector $\bm{v} \in \mathbb{R}^3$. $\blacksquare$
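
On a computer, a continuous process such as (1.2.17) is typically approximated by discretizing time. The sketch below, our own illustration using a simple forward-Euler discretization and arbitrary example forces, rewrites $m\ddot{\bm{x}} = \bm{F}$ as a first-order state-space model in the state $(\bm{x}, \dot{\bm{x}})$ and rolls it forward.

```python
import numpy as np

def simulate_newton(force, m=1.0, dt=1e-3, steps=1000):
    """Forward-Euler simulation of m * x''(t) = F(t), with state (position, velocity)."""
    x = np.zeros(3)            # position in R^3
    v = np.zeros(3)            # velocity in R^3
    for n in range(steps):
        a = force(n * dt) / m  # acceleration from Newton's second law (1.2.17)
        x = x + dt * v         # x(t + dt) ≈ x(t) + dt * v(t)
        v = v + dt * a         # v(t + dt) ≈ v(t) + dt * a(t)
    return x, v

# Zero force (Newton's first law): starting from v = 0, the object stays where it is.
print(simulate_newton(lambda t: np.zeros(3)))
# Constant force along the first axis: uniform acceleration, x is roughly 0.5 t^2 at t = 1.
print(simulate_newton(lambda t: np.array([1.0, 0.0, 0.0])))
```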

1.2.2 Low Dimensionality

Learn to Predict.

Now suppose you have observed or have been given many sequence segments:

\{S_1, S_2, \ldots, S_i, \ldots, S_N\}    (1.2.19)

all from some predictable sequence $\{x_n\}_{n=1}^{\infty}$. Without loss of generality, we may assume the length of each segment is $D \gg d$, so each segment is of the form:

S_i = [x_{j(i)}, x_{j(i)+1}, \ldots, x_{j(i)+D-1}]^\top \in \mathbb{R}^D    (1.2.20)

for some $j \in \mathbb{N}$. Then you are given a new segment $S_t$ and asked to predict its future values.

One difficulty here is that you normally do not know the function $f$ and the degree $d$ from which the sequence is generated:

x_{n+d} = f(x_{n+d-1}, \ldots, x_n).    (1.2.21)

So the hope is somehow “to learn” $f$ and $d$ from the given sample segments $S_1, S_2, \ldots, S_N$. Hence the central task of learning to predict is:

Given many sampled segments of a predictable sequence, how to effectively and efficiently identify the function $f$.

Predictability and Low-Dimensionality.

To identify the predictive function $f$, we may notice a common characteristic of segments of any predictable sequence given by (1.2.21). Suppose we take a long segment, say of length $D \gg d$, of the sequence and view it as a vector:

\bm{x}_i = [x_i, x_{i+1}, \ldots, x_{i+D-1}]^\top \in \mathbb{R}^D.    (1.2.22)

Then the set of all such vectors $\{\bm{x}_i\}$ is far from random and hence cannot possibly occupy the entire space $\mathbb{R}^D$. Instead, these vectors have at most $d$ degrees of freedom: given the first $d$ entries of any $\bm{x}_i$, the values of the remaining entries are uniquely determined. In other words, all $\{\bm{x}_i\}$ lie on a $d$-dimensional surface. In mathematics, such a surface is often called a submanifold, denoted $\mathcal{S} \subset \mathbb{R}^D$.

Figure 1.6: A two-dimensional subspace in a ten-dimensional ambient space.

In practice, if we choose the segment length $D$ to be large enough, then all segments sampled from the same predictive function lie on a surface whose intrinsic dimension $d$ is significantly lower than the dimension $D$ of the ambient space. For example, suppose the sequence is given by the following linear autoregression:

x_{n+2} = a \cdot x_{n+1} + b \cdot x_n,    (1.2.23)

for some constants $a, b \in \mathbb{R}$. If we sample segments of length $D = 10$ from such a sequence, then all samples lie on a two-dimensional plane, or subspace, in $\mathbb{R}^{10}$, as illustrated in Figure 1.6. If we can identify this two-dimensional subspace, the constants $a$ and $b$ in (1.2.23) can be fully determined.
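
This claim is easy to verify numerically. The following sketch of our own (with arbitrarily chosen constants $a$ and $b$) stacks many length-10 segments of the sequence (1.2.23) into a matrix, checks that its rank is 2, and then recovers $a$ and $b$ by least squares.

```python
import numpy as np

a, b = 0.9, -0.5                     # arbitrary constants for x_{n+2} = a x_{n+1} + b x_n
x = [1.0, 2.0]
for n in range(200):
    x.append(a * x[-1] + b * x[-2])
x = np.array(x)

D = 10
segments = np.stack([x[i:i + D] for i in range(len(x) - D)])   # each row is a segment in R^10
print(np.linalg.matrix_rank(segments, tol=1e-8))               # 2: the rows span a 2-D subspace

# Recover (a, b) by regressing x_{n+2} on (x_{n+1}, x_n).
M = np.stack([x[1:-1], x[:-2]], axis=1)
coef, *_ = np.linalg.lstsq(M, x[2:], rcond=None)
print(coef)                                                    # approximately [ 0.9 -0.5]
```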

It is easy to see that if the predicting function $f$ is linear, as is the case for the linear systems given in (1.2.11) and (1.2.14), the long segments always lie on a certain low-dimensional linear subspace. Identifying the predicting function is largely equivalent to identifying this low-dimensional subspace, a problem generally known as principal component analysis. We will discuss such classic models and methods in Chapter 2.

As it turns out, this is largely true for general predictable sequences as well: if one can identify the low-dimensional surface on which all the segment samples lie, then one can identify the associated predictive function $f$. (Under mild conditions, there is a one-to-one mapping between the low-dimensional surface and the function $f$.) This fact has been exploited in problems such as system identification, which we will discuss later. We cannot over-emphasize the importance of this property of segments from a predictable sequence: all samples of long segments of a predictable sequence lie on a low-dimensional submanifold. As we will see in this book, all modern learning methods essentially exploit this property, implicitly or explicitly.

In real-world scenarios, the data we observe often do not come from a single predictable sequence. Typically they contain observations of multiple predictable sequences. For example, a video sequence can contain multiple moving objects. It is easy to see that in such scenarios, the data lie on a mixture of multiple low-dimensional linear subspaces or nonlinear submanifolds, as illustrated in Figure 1.7.

Figure 1.7: Data distributed on a mixture of (orthogonal) subspaces (left) or submanifolds (right).

Properties of Low-Dimensionality.

Of course, temporal correlation in predictable sequences is not the only reason data are low-dimensional. For example, the space of all images is humongous but most of the space consists of images that resemble structureless random images as shown in Figure 1.8 left. Natural images and videos, however, are highly redundant because there is a strong spatial and temporal correlation among all pixel values. This is the reason why we can easily recognize whether an image is noisy or clean, as shown in Figure 1.8 middle and right. Hence the distribution of natural images has a very low intrinsic dimension (relative to the total number of pixels of an image).

Figure 1.8: An image of random noise versus a noisy image and the original clean image.

Due to the importance and ubiquity of the task of learning low-dimensional structures, the book “High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications” [WM22] opens with the statement: “The problem of identifying the low-dimensional structure of signals or data in high-dimensional spaces is one of the most fundamental problems that, through a long history, interweaves many engineering and mathematical fields such as system theory, signal processing, pattern recognition, machine learning, and statistics.”

Note that by requiring the observed data point $\bm{x}$ to lie on a low-dimensional surface, we have essentially made the entries of $\bm{x}$ highly dependent on one another, and in some sense very “predictable” from the values of the other entries. For example, if we know that the data are constrained to a $d$-dimensional surface in $\mathbb{R}^D$, then this allows us to do many useful things with the data besides prediction:

  • completion: in general, given more than $d$ entries of a typical sample $\bm{x}$, the rest of its entries can usually be uniquely determined. (Prediction becomes a special case of this property.)

  • denoising: if the entries of a sample $\bm{x}$ are perturbed by small noise, the noise can be effectively removed by projecting $\bm{x}$ back onto the surface.

  • error correction: if a small number of unknown entries of $\bm{x}$ are arbitrarily corrupted, they can be effectively and efficiently corrected.

Figure 1.9 illustrates these properties with a low-dimensional linear structure: a one-dimensional line in a two-dimensional plane.

Figure 1.9: Illustration of properties of a low-dimensional (linear) structure: it enables completion (left), denoising (middle), and error correction (right).

In fact, under mild conditions, the above properties are generalizable to many other low-dimensional structures in high-dimensional spaces [WM22]. Interestingly, as we will see in this book, these useful properties such as completion and denoising will inspire effective methods to learn such low-dimensional structures.
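
To illustrate the denoising property above concretely, here is a minimal sketch of our own, using a randomly chosen low-dimensional subspace rather than any structure from the text: projecting noisy samples back onto the (known) subspace removes most of the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10, 2
U, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal basis of a random d-dim subspace

X_clean = U @ rng.standard_normal((d, 100))        # 100 clean samples lying on the subspace
X_noisy = X_clean + 0.1 * rng.standard_normal((D, 100))

X_denoised = U @ (U.T @ X_noisy)                   # orthogonal projection back onto the subspace

print(np.linalg.norm(X_noisy - X_clean))           # error before projection
print(np.linalg.norm(X_denoised - X_clean))        # noticeably smaller error after projection
```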

In the above, for simplicity, we have used only the deterministic case to introduce the important notions of predictability and low-dimensionality; hence the data (or sampled segments) lie precisely on some geometric structure such as a subspace or a surface. In practice, as we have alluded to before, there is always a certain level of uncertainty or randomness in the data. In this case, we may assume that the data follow a probability distribution with a probability density function $p(\bm{x})$. We say that a distribution is “low-dimensional” if its density concentrates around a geometric structure that is rather low-dimensional, say a subspace, a surface, or a mixture of them, as shown in Figure 1.7. Notice that, from a practical point of view, such a density function $p(\bm{x})$, once learned, can serve as a very powerful prior for estimating $\bm{x}$ from a partial, noisy, or corrupted observation, say:

\bm{y} = f(\bm{x}) + \bm{n},    (1.2.24)

by computing the conditional expectation $\hat{\bm{x}}(\bm{y}) = \mathbb{E}[\bm{x} \mid \bm{y}]$ or by sampling the conditional distribution $\hat{\bm{x}}(\bm{y}) \sim p(\bm{x} \mid \bm{y})$. (Modern generative AI technologies, such as conditioned image generation, rely very much on this fact, as we will elaborate in Chapter 6.)
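
As a toy instance of such a conditional estimate (our own sketch, assuming the simplest case $f(\bm{x}) = \bm{x}$ and a Gaussian prior whose covariance concentrates near a low-dimensional subspace), the posterior mean $\mathbb{E}[\bm{x} \mid \bm{y}]$ has a closed form and pulls the observation toward the subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, sigma = 10, 2, 0.1
U, _ = np.linalg.qr(rng.standard_normal((D, d)))

# Prior: x ~ N(0, Sigma), with Sigma concentrated on span(U) plus a tiny isotropic part.
Sigma = U @ U.T + 1e-4 * np.eye(D)
x_true = U @ rng.standard_normal(d)
y = x_true + sigma * rng.standard_normal(D)          # y = x + n with n ~ N(0, sigma^2 I)

# Posterior mean of a linear-Gaussian model: E[x | y] = Sigma (Sigma + sigma^2 I)^{-1} y.
x_hat = Sigma @ np.linalg.solve(Sigma + sigma**2 * np.eye(D), y)

print(np.linalg.norm(y - x_true))        # error of the raw observation
print(np.linalg.norm(x_hat - x_true))    # smaller error of the conditional-mean estimate
```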

Our discussions above lead to the main, and only, assumption that this book will make for a deductive approach to understanding intelligence, and deep networks in particular:

The Main Assumption: All that any intelligent system or learning method should and could rely on is that the world is predictable, and hence the distributions of observed high-dimensional data samples have low-dimensional supports.

The remaining question is how to learn such low-dimensional structures correctly from the high-dimensional data, via effective and efficient computable means. As we will see in this book, parametric models that were well studied in classic analytical approaches and non-parametric models such as deep networks that are popular in modern practice are merely different means to achieve the same goal.

1.3 How to Learn?

1.3.1 Analytical Approaches

Note that even if a predictive function is tractable to compute, it does not follow that this function is tractable or scalable to learn from a number of sampled segments. One classical approach to ensure that the problems are tractable, or amenable to efficient solutions, is to make explicit assumptions about the family of low-dimensional structures we are dealing with. Historically, due to limited computation and data, simple and idealistic analytical models were always the first to be studied, as they often admit efficient closed-form or numerical solutions. In addition, they can provide insights into more general problems, and they often already provide useful solutions to important, though limited, cases. In the old days when computational resources were scarce, analytical models that permitted efficient closed-form or numerical solutions were the only ones that could be implemented. Linear structures thus became the first class of models to be thoroughly studied.

For example, arguably the simplest case is to assume the data is distributed around a single low-dimensional subspace in a high-dimensional space. Or somewhat equivalently, one may assume the data is distributed according to an almost degenerate low-dimensional Gaussian. Identifying such a subspace or Gaussian from a finite number of (noisy) samples is then the classical problem of principal component analysis (PCA) and effective algorithms have been developed for this class of models [Jol02]. One can make the family of models increasingly more complex and expressive. For instance, one may assume the data are distributed around a certain mixture of low-dimensional components (subspaces or low-dimensional Gaussians), as in the case of independent component analysis (ICA) [BJC85], dictionary learning (DL), generalized principal component analysis (GPCA) [VMS05], or the even more general class of sparse low-dimensional models that have been studied extensively in recent years in fields such as compressive sensing [WM22].

Around all these analytical model families, the central problem of study is always how to identify the most compact model within each family that best fits the given data. Below, we give a very brief account of these classical analytical models but leave a more systematic study to Chapter 2. In theory, these analytical models have provided us with tremendous insights into the geometric and statistical properties of low-dimensional structures. They often give us closed-form solutions or efficient and scalable algorithms which are very useful for data whose distributions can be well approximated by such models. More importantly, for more general problems, they provide us with a general sense of how easy or difficult the problem of identifying low-dimensional structures can be, and what the basic ideas are to approach such a problem.

Linear Dynamical Systems

Wiener Filter.

As we have discussed before in Section 1.2.1, a main task of intelligence is to learn what is predictable in sequences of observations. Probably the simplest class of predictable sequences, or signals, is generated via a linear time-invariant (LTI) process:

x[n] = h[n] * z[n] + \epsilon[n],    (1.3.1)

where $z$ is the input and $h$ is the impulse response function. (Normally $h$ is assumed to have certain nice structure, say finite length or a banded spectrum.) Here $\epsilon[n]$ is some additive noise in the observations. The problem is: given the input process $\{z[n]\}$ and observations of the output process $\{x[n]\}$, how to find the optimal $h[n]$ such that $\hat{x}[n] = h[n] * z[n]$ predicts $x[n]$ in an optimal way. In general, we measure the goodness of the prediction by the minimum mean squared error (MMSE):

\min_h \mathbb{E}\big[\epsilon[n]^2\big] = \mathbb{E}\big[\|x[n] - h[n] * z[n]\|_2^2\big].    (1.3.2)

The optimal solution $h[n]$ is referred to as a (denoising) filter. Norbert Wiener, the same person who initiated the Cybernetics movement, studied this problem in the 1940s and gave an elegant closed-form solution, known as the Wiener filter [Wie42, Wie49]. This became one of the most fundamental results in the field of Signal Processing.
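
The Wiener filter itself is derived from the power spectra of the signals, which we do not reproduce here. As an empirical stand-in (a sketch of our own, not Wiener’s derivation), one can estimate a finite-impulse-response filter $h$ directly from samples of $z$ and $x$ by solving the least-squares version of (1.3.2):

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 5000, 4
h_true = np.array([0.5, -0.3, 0.2, 0.1])    # an unknown length-4 impulse response

z = rng.standard_normal(n)                  # input process z[n]
x = np.convolve(z, h_true)[:n] + 0.05 * rng.standard_normal(n)   # noisy output x[n]

# Set up the regression x[k] ~ sum_j h[j] z[k - j] and solve for h in least squares.
Z = np.stack([np.concatenate([np.zeros(j), z[:n - j]]) for j in range(L)], axis=1)
h_hat, *_ = np.linalg.lstsq(Z, x, rcond=None)
print(np.round(h_hat, 3))                   # close to h_true
```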

Kalman Filter.

The idea of denoising or filtering a dynamic process was later extended by Rudolf Kalman in the 1960s to a linear time-invariant system described by a (finite-dimensional) state-space model:

\bm{z}[n] = \bm{A}\,\bm{z}[n-1] + \bm{B}\,\bm{u}[n] + \bm{\epsilon}[n].    (1.3.3)

The problem is how we can estimate the state of the system 𝒛[n] from noisy observations of the form:

\bm{x}[n] = \bm{C}\,\bm{z}[n] + \bm{w}[n],    (1.3.4)

where 𝒘 is some (white) noise. The optimal causal state estimator (causal means that the estimate may only use observations up to the current time step n; the Kalman filter is always causal, whereas the Wiener filter need not be) that minimizes the MMSE-type prediction error

\min \mathbb{E}\big[\|\bm{x}[n] - \bm{C}\bm{z}[n]\|_{2}^{2}\big]    (1.3.5)

is given in closed form by the so-called Kalman filter [Kal60]. This is one of the cornerstones of modern Control Theory because it allows us to estimate the state of a dynamical system from its noisy observations. One can then introduce a (linear) state feedback, say of the form 𝒖[n] = 𝑭𝒛̂[n], and make the closed-loop system fully autonomous, as we saw in equation (1.2.13).
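As a concrete illustration, here is a minimal sketch of the Kalman filter recursion for the model (1.3.3)-(1.3.4), assuming the parameters (𝑨, 𝑩, 𝑪) are known; the noise covariances Q and R and the initial estimate (z0, P0) are hypothetical inputs not specified in the text.

```python
import numpy as np

def kalman_filter(A, B, C, Q, R, u, x_obs, z0, P0):
    """A minimal Kalman filter sketch for the state-space model
        z[n] = A z[n-1] + B u[n] + eps[n],   x[n] = C z[n] + w[n],
    with assumed-known process/measurement noise covariances Q and R.
    Returns the sequence of causal state estimates."""
    z_hat, P = z0, P0
    estimates = []
    for n in range(len(x_obs)):
        # Predict: propagate the previous estimate through the dynamics.
        z_pred = A @ z_hat + B @ u[n]
        P_pred = A @ P @ A.T + Q
        # Update: correct the prediction with the new observation x[n].
        S = C @ P_pred @ C.T + R                # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
        z_hat = z_pred + K @ (x_obs[n] - C @ z_pred)
        P = (np.eye(len(z_hat)) - K @ C) @ P_pred
        estimates.append(z_hat)
    return np.array(estimates)
```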

Identification of Linear Dynamical Systems.

To derive the Kalman filter, the system parameters (𝑨, 𝑩, 𝑪) are assumed to be known. If they are not given in advance, we face a more challenging problem known as system identification: how to learn the parameters (𝑨, 𝑩, 𝑪) from (many samples of) the input sequence {𝒖[n]} and the observation sequence {𝒙[n]}. This is a classic problem in systems theory. If the system is linear, it can be shown that the input and output sequences {𝒖[n], 𝒙[n]} jointly lie on a certain low-dimensional subspace, whose dimension equals the order of the state-space model (1.3.3). Hence the identification problem is essentially equivalent to identifying this low-dimensional subspace [VM96, LV09, LV10].

Note that the above problems have two things in common: first, the (noise-free) sequences or signals are assumed to be generated by an explicit family of parametric models; second, these models are essentially all linear. So, conceptually, let 𝒙_o be a random variable whose "true" distribution is supported on a low-dimensional linear subspace, say S. To a large extent, the Wiener filter and the Kalman filter both try to estimate such an 𝒙_o from its noisy observations:

\bm{x} = \bm{x}_{o} + \bm{\epsilon}, \quad \bm{x}_{o} \sim S,    (1.3.6)

where ϵ is typically random Gaussian noise (or a Gaussian process). Hence, essentially, their solutions all rely on identifying a low-dimensional linear subspace that best fits the observed (noisy) data. Then, by projecting the data onto this subspace, one obtains the optimal denoising operations, all in closed form.

Linear and Mixed Linear Models

Principal Component Analysis.

From the above problems in classical signal processing and system identification, we see that the main task behind all these problems is to learn, from noisy observations, a single low-dimensional linear subspace. Mathematically, we may model such a structure as:

\bm{x} = \bm{u}_{1}z_{1} + \bm{u}_{2}z_{2} + \cdots + \bm{u}_{d}z_{d} + \bm{\epsilon} = \bm{U}\bm{z} + \bm{\epsilon}, \quad \bm{U} \in \mathbb{R}^{D \times d},    (1.3.7)

where ϵ ∈ ℝ^D is some small random noise. Figure 1.10 illustrates such a distribution with two principal components.

Figure 1.10: A distribution with two principal components.

The problem is to find the subspace basis 𝑼 from many samples of 𝒙. A typical approach to estimating the subspace 𝑼 is to minimize the variance of the noise, also known as the minimum mean square error (MMSE):

\min_{\bm{U}} \mathbb{E}\big[\|\bm{\epsilon}\|_{2}^{2}\big] = \mathbb{E}\big[\|\bm{x} - \bm{U}\bm{z}\|_{2}^{2}\big].    (1.3.8)

Notice that this is essentially a denoising task: once the basis 𝑼 is correctly found, we can denoise a noisy sample 𝒙 by projecting it onto the low-dimensional subspace spanned by 𝑼 as

\bm{x} \;\rightarrow\; \hat{\bm{x}} = \bm{U}\bm{U}^{\top}\bm{x}.    (1.3.9)

If the noise is small and we have learned the correct low-dimensional subspace 𝑼, we should expect 𝒙 ≈ 𝒙̂. That is, PCA is a special case of autoencoding:

\bm{x} \xrightarrow{\;\bm{U}^{\top}\;} \bm{z} \xrightarrow{\;\bm{U}\;} \hat{\bm{x}}.    (1.3.10)

Only here, because of the simple data structure, both the encoder ℰ and the decoder 𝒟 become simple linear operators (projecting and lifting, respectively).

This is a classic problem in statistics known as Principal Component Analysis (PCA). It was first studied by Pearson in 1901 [Pea01] and later independently by Hotelling in 1933 [Hot33]; the topic is systematically summarized in the classic book [Jol86, Jol02]. In addition, one may explicitly assume the data 𝒙 are distributed according to a single low-dimensional Gaussian:

\bm{x} \sim \mathcal{N}(\bm{0},\, \bm{U}\bm{U}^{\top} + \sigma^{2}\bm{I}), \quad \bm{U} \in \mathbb{R}^{D \times d},    (1.3.11)

which is equivalent to assuming that 𝒛 in the above PCA model (1.3.7) follows a standard normal distribution. This is known as Probabilistic PCA [TB99].

In this book, we will revisit PCA in Chapter 2, from the perspective of learning a low-dimensional distribution. Our goal is to use this simple and idealistic model to convey some of the most fundamental ideas for learning a compact representation for a low-dimensional distribution, including the important notion of compression via denoising and autoencoding for a consistent representation.
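As a small illustration of the ideas above, the following sketch estimates the basis 𝑼 by a singular value decomposition of the data matrix (one standard way to solve (1.3.8), assuming zero-mean data and a known target dimension d) and then denoises the samples by the projection (1.3.9).

```python
import numpy as np

def pca_denoise(X, d):
    """A minimal PCA sketch for the model x = U z + eps.

    X is a D x N matrix of N noisy samples (assumed zero-mean here);
    d is the assumed subspace dimension. Returns the estimated basis U
    and the denoised samples X_hat = U U^T X, as in (1.3.9)."""
    # The top-d left singular vectors of X span the subspace that
    # minimizes the mean squared residual (1.3.8).
    U_full, _, _ = np.linalg.svd(X, full_matrices=False)
    U = U_full[:, :d]
    X_hat = U @ (U.T @ X)      # encode to z = U^T x, then decode
    return U, X_hat

# Usage on synthetic data drawn near a 2-dimensional subspace in R^10.
rng = np.random.default_rng(0)
U_true = np.linalg.qr(rng.standard_normal((10, 2)))[0]
Z = rng.standard_normal((2, 500))
X = U_true @ Z + 0.05 * rng.standard_normal((10, 500))
U_est, X_hat = pca_denoise(X, d=2)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))   # small residual
```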

Independent Component Analysis.

Independent component analysis (ICA) was originally proposed in [BJC85] as a classic model for learning a good representation; in fact, it was first conceived as a simple mathematical model of memory. The ICA model takes a deceptively similar form to the PCA model (1.3.7) above: it assumes that the observed random variable 𝒙 is a linear superposition of multiple independent components z_i:

\bm{x} = \bm{u}_{1}z_{1} + \bm{u}_{2}z_{2} + \cdots + \bm{u}_{d}z_{d} + \bm{\epsilon} = \bm{U}\bm{z} + \bm{\epsilon}.    (1.3.12)

However, here the components z_i are assumed to be independent non-Gaussian variables. For example, a popular choice is

z_{i} = \sigma_{i} \cdot w_{i}, \quad \sigma_{i} \sim B(1, p),    (1.3.13)

where σ_i is a Bernoulli random variable and w_i could be a constant or another random variable, say a Gaussian (note that even if w_i is Gaussian, σ_i w_i is no longer a Gaussian variable!). The ICA problem is to identify both 𝑼 and 𝒛 from observed samples of 𝒙. Figure 1.11 illustrates the difference between ICA and PCA.

Figure 1.11: PCA (left) versus ICA (right).

Although the (decoding) mapping from 𝒛 to 𝒙 seems linear and easy once 𝑼 and 𝒛 are learned, the (encoding) mapping from 𝒙 to 𝒛 can be complicated and may not be representable by a simple linear mapping. Hence ICA generally gives an autoencoding of the form:

\bm{x} \xrightarrow{\;\mathcal{E}\;} \bm{z} \xrightarrow{\;\bm{U}\;} \hat{\bm{x}}.    (1.3.14)

Hence, unlike PCA, ICA is a little more difficult to analyze and solve. In the 1990s, researchers such as Erkki Oja and Aapo Hyvärinen [HO97, HO00a] made significant theoretical and algorithmic contributions to ICA. In Chapter 2, we will study and give a solution to ICA, from which the encoding mapping ℰ will become clear.
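For readers who want to experiment before Chapter 2, below is a minimal FastICA-style sketch in the spirit of [HO97, HO00a]: whiten the data, then run fixed-point iterations with a tanh contrast function and symmetric orthogonalization. It is only an illustrative stand-in for the solution developed later in the book; the contrast function and iteration count are hypothetical choices.

```python
import numpy as np

def fast_ica(X, d, n_iter=200, seed=0):
    """A minimal FastICA-style sketch. X is D x N with zero-mean columns;
    returns a d x D unmixing matrix W such that W @ X recovers the
    independent components up to permutation and scale."""
    rng = np.random.default_rng(seed)
    # 1. Whiten the data so that its covariance becomes the identity.
    U, s, _ = np.linalg.svd(X @ X.T / X.shape[1])
    K = (U / np.sqrt(s)).T[:d]          # d x D whitening matrix
    Xw = K @ X
    # 2. Fixed-point iterations with symmetric decorrelation.
    W = rng.standard_normal((d, d))
    for _ in range(n_iter):
        G = np.tanh(W @ Xw)             # nonlinearity g applied row-wise
        Gp = 1.0 - G**2                 # derivative g'
        W = G @ Xw.T / Xw.shape[1] - np.diag(Gp.mean(axis=1)) @ W
        # Symmetric decorrelation: W <- (W W^T)^(-1/2) W.
        u, _, vt = np.linalg.svd(W)
        W = u @ vt
    return W @ K                        # unmixing matrix for the raw data
```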

Sparse Structures and Compressive Sensing.

As one may see, if p in (1.3.13) is very small, the probability that any given component is non-zero is small. In this case, we say 𝒙 is sparsely generated, and it concentrates on a set of linear subspaces of dimension k = p·d. Hence, to some extent, we may extend the above ICA model to a more general family of low-dimensional structures known as sparse models.

A k-sparse model is defined as the set of all k-sparse vectors:

\mathcal{Z} = \{\bm{z} \in \mathbb{R}^{n} \mid \|\bm{z}\|_{0} \leq k\},    (1.3.15)

where ‖·‖₀ is the ℓ⁰ norm, which counts the number of non-zero entries in a vector 𝒛. That is, 𝒵 is the union of all k-dimensional subspaces that align with the coordinate axes, as illustrated in Figure 1.7 left. One important problem in classic signal processing and statistics is how to recover a sparse vector 𝒛 from its linear observations:

\bm{x} = \bm{A}\bm{z} + \bm{\epsilon}, \quad \bm{A} \in \mathbb{R}^{m \times n},    (1.3.16)

where 𝑨 is given but typically m < n, and ϵ is some small noise. This seemingly benign problem turns out to be NP-hard, and even hard to approximate (see the book [WM22] for details).

So, despite a long and rich history of study that can be dated back as early as the 18th century [Bos50], there was no provably efficient algorithm for this class of problems, although many heuristic algorithms were proposed and developed between the 1960s and the 1990s. Some are rather effective in practice but lack rigorous justification. A major breakthrough came in the early 2000s, when a few renowned mathematicians, including David Donoho, Emmanuel Candès, and Terence Tao [Don05, CT05a, CT05], established a rigorous theoretical framework that characterizes precise conditions under which the sparse recovery problem can be solved effectively and efficiently, say via convex ℓ¹ minimization:

\min \|\bm{z}\|_{1} \quad \text{subject to} \quad \|\bm{x} - \bm{A}\bm{z}\|_{2} \leq \epsilon,    (1.3.17)

where ‖·‖₁ is the sparsity-promoting ℓ¹ norm of a vector and ϵ is some small constant. Any solution to this problem essentially gives a sparse coding mapping:

\bm{x} \xrightarrow{\;\mathcal{E}\;} \bm{z}.    (1.3.18)

We will give a brief account of such an algorithm, and hence such a mapping, in Chapter 2, which will reveal interesting and fundamental connections between sparse coding and deep learning (similarities between algorithms for sparse coding and deep networks were noticed as early as 2010 [GL10]).
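One simple algorithm of this kind is iterative soft-thresholding (ISTA) applied to the Lagrangian (LASSO) form of (1.3.17); the sketch below is a minimal version, with a hypothetical choice of the penalty λ. Notice that each iteration is a linear map followed by a soft-thresholding nonlinearity, which already resembles one layer of a deep network, the similarity noted in [GL10].

```python
import numpy as np

def ista(A, x, lam=0.1, n_iter=500):
    """A minimal ISTA sketch for the Lagrangian form of (1.3.17):
        min_z  0.5 * ||x - A z||_2^2 + lam * ||z||_1.
    Each iteration takes a gradient step on the quadratic term and then
    applies soft-thresholding, the proximal operator of the l1 norm."""
    z = np.zeros(A.shape[1])
    t = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1/L with L = ||A||_2^2
    for _ in range(n_iter):
        r = A @ z - x                          # residual
        z = z - t * (A.T @ r)                  # gradient step
        z = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)  # shrinkage
    return z

# Usage: recover a 5-sparse vector from m = 60 random measurements in R^200.
rng = np.random.default_rng(0)
m, n, k = 60, 200, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
z_true = np.zeros(n)
z_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x = A @ z_true + 0.01 * rng.standard_normal(m)
z_hat = ista(A, x, lam=0.02)
print(np.linalg.norm(z_hat - z_true))          # typically small
```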

As it turns out, the conditions under which ℓ¹ minimization succeeds are surprisingly general. The minimum number of measurements m required for successful recovery is only proportional to the intrinsic dimension k of the data. This is now known as the compressive sensing phenomenon [Can06]. Moreover, this phenomenon is not unique to sparse structures; it also applies to very broad families of low-dimensional structures, such as low-rank matrices. These results have fundamentally changed our understanding of recovering low-dimensional structures in high-dimensional spaces. This dramatic reversal of fortune in high-dimensional data analysis was even celebrated as the "blessing of dimensionality" by David Donoho [DD00], as opposed to the typical pessimistic belief in the "curse of dimensionality" for high-dimensional problems. This coherent and complete body of results has been systematically organized in the book [WM22].

From a computational perspective, one cannot overestimate the significance of this new framework. It has fundamentally changed our views about an important class of problems previously believed to be largely intractable. It has enabled us to develop extremely efficient algorithms that scale gracefully with the dimension of the problem, hence taking the problem of sparse recovery from:

\text{intractable} \;\Longrightarrow\; \text{tractable} \;\Longrightarrow\; \text{scalable}.    (1.3.19)

These algorithms also come with rigorous theoretical guarantees of correctness, given precise requirements on data and computation. The rigorous and precise nature of this approach is almost the opposite of the practice of deep neural networks, which is largely empirical. Yet, despite their seemingly opposite styles and standards, we now know both approaches share a common goal: pursuing low-dimensional structures in high-dimensional spaces.

Dictionary Learning.

Conceptually, an even harder problem than the sparse coding problem (1.3.16) arises when the observation matrix 𝑨 is not even known in advance and we need to learn 𝑨 from a set of (possibly noisy) observations, say 𝑿 = [𝒙_1, 𝒙_2, …, 𝒙_n]:

\bm{X} = \bm{A}\bm{Z} + \bm{E}, \quad \bm{A} \in \mathbb{R}^{m \times n}.    (1.3.20)

Here we are given only 𝑿, but neither the corresponding 𝒁 = [𝒛_1, 𝒛_2, …, 𝒛_n] nor the noise term 𝑬 = [ϵ_1, ϵ_2, …, ϵ_n], except for knowing that the 𝒛_i's are sparse. This is known as the dictionary learning problem, which can be viewed as a generalization of the ICA problem (1.3.12) discussed earlier. In other words, given that the distribution of the data 𝑿 is the image of a standard sparse distribution 𝒁 under a linear transform 𝑨, we would like to learn 𝑨 and its "inverse" mapping ℰ such that we obtain an autoencoder:

\bm{X} \xrightarrow{\;\mathcal{E}\;} \bm{Z} \xrightarrow{\;\bm{A}\;} \bm{X}?    (1.3.21)
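Before moving on, here is a minimal alternating-minimization sketch for this dictionary learning problem. It is not the algorithm studied in Chapter 2; it simply alternates a sparse-coding step (reusing the hypothetical `ista` routine from the sketch above) with a least-squares dictionary update and column normalization.

```python
import numpy as np

def dictionary_learning(X, n_atoms, lam=0.1, n_outer=20, seed=0):
    """A minimal alternating-minimization sketch for X ≈ A Z with sparse Z.
    Assumes the `ista` function from the previous sketch is in scope."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    A = rng.standard_normal((m, n_atoms))
    A /= np.linalg.norm(A, axis=0)
    Z = np.zeros((n_atoms, N))
    for _ in range(n_outer):
        # Sparse coding: update each column of Z with the dictionary fixed.
        for j in range(N):
            Z[:, j] = ista(A, X[:, j], lam=lam, n_iter=100)
        # Dictionary update: least squares with the codes fixed,
        # followed by normalizing each atom (column) of A.
        A = X @ np.linalg.pinv(Z)
        A /= np.linalg.norm(A, axis=0) + 1e-12
    return A, Z
```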

One can see that what PCA, ICA, and dictionary learning have in common is that they all assume the distribution of the data is supported around low-dimensional linear or mixed linear structures, and they all require learning a (global or local) basis of those linear structures from possibly noisy samples of the distribution. In Chapter 2, we will study how to identify low-dimensional structures through these classical models. In particular, we will see an interesting and important fact: all these low-dimensional (piecewise) linear models can be learned effectively and efficiently by the same type of fast algorithm, known as power iteration [ZMZ+20]. Although the above linear or mixed linear models are somewhat too simplistic or idealistic for most real-world data, understanding them is an important first step toward understanding more general low-dimensional distributions.

General Distributions

The distributions of real-world data such as images, videos, and audio are too complex to be modeled by the above, somewhat idealistic, linear models or Gaussian processes. We normally do not know a priori from which family of parametric models they are generated (although, historically, there have been many attempts to develop analytical models for such data, for example random fields or stochastic processes for imagery data [MG99], as we have discussed in the previous section). In practice, we typically only have many samples from their distributions – the empirical distributions. Obviously, in such cases we cannot expect closed-form solutions for their low-dimensional structures, nor for the resulting denoising operators (unlike the cases of PCA, the Wiener filter, and the Kalman filter). So we need to develop a more general solution for such empirical distributions, not necessarily in closed form but at least efficiently computable. If we do this correctly, solutions to the aforementioned linear models should become special cases.

Denoising.

In the 1950s, statisticians became interested in the problem of denoising data drawn from an arbitrary distribution. Let 𝒙_o be a random variable with probability density function p_o(·), and suppose we observe a noisy version of 𝒙_o:

\bm{x} = \bm{x}_{o} + \sigma\bm{g},    (1.3.22)

where 𝒈 ∼ 𝒩(𝟎, 𝑰) is standard Gaussian noise and σ is the noise level of the observation. Let p(·) be the probability density function of 𝒙; that is, p(𝒙) = ∫ φ_σ(𝒙 − 𝒛) p_o(𝒛) d𝒛, where φ_σ is the density of the Gaussian distribution 𝒩(𝟎, σ²𝑰). Amazingly, the posterior expectation of 𝒙_o given 𝒙 can be calculated by an elegant formula known as Tweedie's formula [Rob56] (Herbert Robbins credited this formula to Maurice Kenneth Tweedie, based on their personal correspondence):

\hat{\bm{x}}_{o} = \mathbb{E}[\bm{x}_{o} \mid \bm{x}] = \bm{x} + \sigma^{2}\nabla\log p(\bm{x}).    (1.3.23)

As can be seen from the formula, the function ∇log p(𝒙) plays a very special role in denoising the observation 𝒙. The noise 𝒈 can be explicitly estimated as

\hat{\bm{g}} = \frac{\bm{x} - \hat{\bm{x}}_{o}}{\sigma} = -\sigma\nabla\log p(\bm{x}),    (1.3.24)

for which we only need to know the distribution p(·) of 𝒙, not the ground truth p_o(·) of 𝒙_o. An important implication of this result is that if we add Gaussian noise to any distribution, denoising can be done easily as long as we can somehow get hold of the function ∇log p(𝒙).
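The formula is easy to verify numerically in a toy case where the score is available in closed form. In the sketch below, 𝒙_o is drawn uniformly from a small, hypothetical set of points, so p is an explicit Gaussian mixture and (1.3.23) reduces to the posterior mean of the mixture.

```python
import numpy as np

def tweedie_denoise(x, mus, sigma):
    """A minimal illustration of Tweedie's formula (1.3.23) when x_o is
    drawn uniformly from the rows of `mus`, so the noisy density p is an
    explicit Gaussian mixture with known score grad log p(x)."""
    # Responsibilities r_i proportional to exp(-||x - mu_i||^2 / (2 sigma^2)).
    d2 = ((x - mus) ** 2).sum(axis=1)
    r = np.exp(-(d2 - d2.min()) / (2 * sigma**2))
    r /= r.sum()
    # Score of the mixture: grad log p(x) = (sum_i r_i mu_i - x) / sigma^2.
    score = (r @ mus - x) / sigma**2
    # Tweedie: E[x_o | x] = x + sigma^2 * grad log p(x) = sum_i r_i mu_i.
    return x + sigma**2 * score

# Usage: denoise a noisy observation of one of two points in R^2.
rng = np.random.default_rng(0)
mus = np.array([[1.0, 0.0], [-1.0, 0.0]])
x = mus[0] + 0.3 * rng.standard_normal(2)
print(tweedie_denoise(x, mus, sigma=0.3))   # pulled back toward [1, 0]
```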

Because this is such an important and useful result, it has been rediscovered and used in many different contexts and areas. For example, after Tweedie's formula [Rob56], it was rediscovered a few years later by [Miy61]. In the early 2000s, the function ∇log p(𝒙) was rediscovered again in the context of learning a general distribution and was named the "score function" by Aapo Hyvärinen [Hyv05], and its connection to (empirical Bayesian) denoising was soon recognized by [Vin11]. Generalizations to other measurement distributions (beyond Gaussian noise) have been made by Eero Simoncelli's group [RS11], and the idea was later applied to image denoising [KS21, HJA20].

Figure 1.12: Geometric interpretation of the score function ∇log p(𝒙) for a distribution with density p(𝒙), shown on the left. The operation generated by the score function pushes the distribution towards areas of higher density. The goal is that, by a certain measure of compactness (e.g., entropy or coding length), the resulting distribution is more "compressed". Eventually, the distribution converges to one that has lower-dimensional support, as p*(𝒙) shown on the right.
Entropy minimization.

In fact, this function has a very intuitive information-theoretic and geometric interpretation. Note that in information theory, −log p(𝒙) generally corresponds to the number of bits needed to encode 𝒙 (at least in the case of a discrete variable, as we will explain in more detail in Chapter 3). The gradient ∇log p(𝒙) points in a direction in which the density is higher, as shown in Figure 1.12 left, and the number of bits needed to encode 𝒙 decreases if it moves in that direction. Hence, the overall effect of the operator ∇log p(𝒙) is to push the distribution to "shrink" towards areas of higher density. Indeed, one can formally show that the (differential) entropy of the distribution

H(\bm{x}) = -\int p(\bm{w}) \log p(\bm{w})\, \mathrm{d}\bm{w}    (1.3.25)

indeed decreases under such an operation (see Chapter 3 and Appendix B). Therefore, if we encode the distribution with an optimal code book, the overall coding length/rate is reduced, hence it is more "compressed." Intuitively, one can imagine that if we repeat such a denoising process indefinitely, the distribution will eventually shrink to one whose mass is concentrated on a support of lower dimension. For example, the distribution p(𝒙) shown on the left of Figure 1.12, under the action of the score function ∇log p(𝒙), will eventually converge to the distribution p*(𝒙) on the right (strictly speaking, p*(𝒙) is a distribution whose density is a generalized function: p*(𝒙) = p*(𝒙_1)δ(𝒙 − 𝒙_1) + p*(𝒙_2)δ(𝒙 − 𝒙_2), with p*(𝒙_1) + p*(𝒙_2) = 1):

H(\bm{x}) = -\int p(\bm{w}) \log p(\bm{w})\, \mathrm{d}\bm{w} \quad \xrightarrow{\;\text{decreasing}\;} \quad H^{*}(\bm{x}) = -\int p^{*}(\bm{w}) \log p^{*}(\bm{w})\, \mathrm{d}\bm{w}.    (1.3.26)

Strictly speaking, as the distribution converges to p*(𝒙), its differential entropy converges to negative infinity. This is due to a technical difference between the definitions of entropy for continuous and discrete random variables. We will see how to resolve this technical difficulty in Chapter 3, using a more uniform measure based on rate distortion.

We will discuss later in this chapter and in Chapter 3 how such a seemingly simple concept of denoising and compression leads to a very unifying and powerful method for learning general low-dimensional distributions in a high-dimensional space, including the distribution of natural images.

1.3.2 Empirical Approaches

In practice, many important kinds of real-world data, such as images, sounds, and texts, are difficult to model with idealistic linear or mixed linear models. For example, there is a long and rich history in the fields of image processing and computer vision of trying to model the distribution of natural images analytically. David Mumford, a Fields Medalist, spent considerable effort in the 1990s trying to understand and model the statistics of natural images [Mum96]. He and his students, including Song-Chun Zhu, drew inspiration and techniques from statistical physics and proposed many statistical and stochastic models for the distribution of natural images [ZWM97, ZM97a, ZM97, HM99, MG99, LPM03]. However, these analytical models met with limited success in producing samples that closely resemble natural images. Obviously, for real-world data like images, we need to develop more powerful and unifying methods to pursue their more general low-dimensional structures.

Hence, historically, many empirical models have been proposed to model important real-world data, including images and texts. These models often drew inspiration from the characteristics of the biological nervous system, because the brain of an animal or human seems to process such data extremely efficiently and effectively.

Classic Artificial Neural Networks

Artificial neuron.
Figure 1.13: The first mathematical model of an artificial neuron (right) that emulates how a neuron (left) processes signals.

Inspired by the nervous system in the brain, the mathematical model of the first artificial neuron (known as the linear threshold unit, or a perceptron) was proposed by Warren McCulloch (at the time a professor of psychiatry at the University of Chicago) and Walter Pitts in 1943 [MP43]. It describes the relationship between the inputs x_i and the output o_j as:

o_{j} = \varphi\Big(\sum_{i} w_{ji} x_{i}\Big),    (1.3.27)

where φ(·) is some nonlinear activation, normally modeled by a threshold function. This model is illustrated in Figure 1.13. As we can see, this form already shares the main characteristics of a basic unit in modern deep neural networks. The model is derived from observations of how a single neuron works in our nervous system. However, people did not know exactly what functions a collection of such neurons should collectively realize and perform. On a more technical level, neither were they sure what nonlinear activation function φ(·) should be used; hence, historically, many variants have been proposed (the step function, hard or soft thresholding, the rectified linear unit (ReLU), the sigmoid, etc.).
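In code, such a unit is only a few lines; the sketch below uses a hard threshold activation with a hypothetical threshold parameter, and shows that a single unit can realize a logical AND (though, as discussed below, not an XOR).

```python
import numpy as np

def threshold_neuron(x, w, theta=0.0):
    """A minimal sketch of the unit in (1.3.27): a weighted sum of the
    inputs passed through a hard threshold activation. The threshold
    `theta` is a hypothetical parameter used for illustration."""
    return 1.0 if np.dot(w, x) > theta else 0.0

# With suitable weights, a single unit realizes a logical AND gate.
w = np.array([1.0, 1.0])
print([threshold_neuron(np.array(p), w, theta=1.5)
       for p in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0.0, 0.0, 0.0, 1.0]
```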

Figure 1.14: A network with one single hidden layer (left) versus a deep network (right).
Artificial neural network.

In the 1950s, Frank Rosenblatt was the first to build a machine with a network of such artificial neurons, shown in Figure 1.15. The machine, called the Mark I Perceptron, consists of an input layer, an output layer, and a single hidden layer of 512 artificial neurons, as shown in Figure 1.15 left, similar to what is illustrated in Figure 1.14 left. It was designed to classify optical images of letters. However, the capacity of a single-layer network is limited: it can only learn linearly separable patterns. In their 1969 book Perceptrons: An Introduction to Computational Geometry, Marvin Minsky and Seymour Papert [MP69] showed that the single-layer architecture of the Mark I Perceptron cannot learn an XOR function. This result significantly damped people's interest in artificial neural networks, even though it was later proven that a multi-layer network is able to learn an XOR function [RHW86]. In fact, a (big enough) multi-layer network of such simple neurons, as shown in Figure 1.14 right, can simulate any finite-state machine, even the universal Turing machine. (Do not confuse what neural networks are capable of doing in principle with whether it is tractable or easy to learn a neural network that realizes certain desired functions.) Nevertheless, the study of artificial neural networks subsequently went into its first winter in the 1970s.

Figure 1.15: The Mark I Perceptron machine developed by Frank Rosenblatt in the late 1950s.
Convolutional neural networks.

The somewhat disappointing early experiments with artificial neural networks like the Mark I Perceptron in the 1950s and 1960s suggested that it might not be enough to simply connect neurons in a generic fashion, as in multi-layer perceptrons (MLPs). In order to build effective and efficient networks, we need to understand what purpose or function the neurons in a network should collectively achieve, so that they can be organized and learned in a suitable way. Once again, the study of machine intelligence turned to drawing inspiration from how animals' nervous systems work.

It is known that most of our brain is dedicated to processing visual information. In the 1950s and 1960s, David Hubel and Torsten Wiesel systematically studied the visual cortices of cats. It was discovered that the visual cortex contains different types of cells (known as simple cells and complex cells), which are sensitive to visual stimuli of different orientations and locations [HW59]. Hubel and Wiesel won the 1981 Nobel Prize in Physiology or Medicine for their ground-breaking discovery.

On the artificial neural network side, Hubel and Wiesel's work inspired Kunihiko Fukushima, who in 1980 designed the "neocognitron" network, which consists of artificial neurons that emulate biological neurons in the visual cortex [Fuk80]. It is known as the first convolutional neural network (CNN), and its architecture is illustrated in Figure 1.16. Unlike the perceptron, the neocognitron had more than one hidden layer and could be viewed as a deep network, as in Figure 1.14 right.

Also inspired by how neurons work in the cat's visual cortex, Fukushima was the first to introduce the rectified linear unit (ReLU):

\varphi(x) = \max\{0, x\} = \begin{cases} x, & \text{if } x > 0, \\ 0, & \text{if } x \leq 0, \end{cases}    (1.3.28)

for the activation function φ()\varphi(\cdot)italic_φ ( ⋅ ) in 1969 [Fuk69]. But not until recent years has the ReLU become a widely used activation function in modern deep (convolutional) neural networks. We will learn from this book why this is a good choice once we explain the main operations that deep networks try to implement: compression.

Figure 1.16: Origin of convolutional neural networks: the Neocognitron by Kunihiko Fukushima in 1980. Notice that the interleaving layers of convolutions and pooling try to emulate the functions of simple cells and complex cells discovered in the visual cortices of cats.

CNN-type networks continued to evolve in the 1980s, and many different variants were introduced and studied. However, despite the remarkable capacity of deep networks and the improved architectures inspired by neuroscience, it remained extremely difficult to train such deep networks for a real task such as image classification. Getting a network to work depended on many unexplained heuristics and tricks, which really limited the appeal and applicability of neural networks. One major breakthrough came around 1989, when Yann LeCun successfully used back propagation (BP) to learn a deep convolutional neural network for recognizing handwritten digits [LBD+89], later known as LeNet (see Figure 1.17). After several years of persistent development, his perseverance paid off: the performance of LeNet eventually became good enough for practical use in the late 1990s [LBB+98], and it was used by the US Post Office for recognizing handwritten digits (for zip codes). LeNet is considered the "prototype" network for all modern deep neural networks, such as AlexNet and ResNet, which we will discuss later. Due to this work, Yann LeCun was awarded the 2018 Turing Award, together with two other pioneers of deep networks, Yoshua Bengio and Geoffrey Hinton.

Figure 1.17: The LeNet-5 convolutional neural network designed by Yann LeCun in 1989.
Backpropagation.

Historically, the fate of deep neural networks has been tied closely to whether they can be trained easily and efficiently. Back propagation (BP) was introduced for exactly this reason. We know that a multi-layer perceptron can be expressed as a composition of a sequence of linear mappings and nonlinear activations as follows:

h(\bm{W}_{1}, \ldots, \bm{W}_{L}) = f^{L}(\bm{W}_{L} f^{L-1}(\bm{W}_{L-1} \cdots f^{2}(\bm{W}_{2} f^{1}(\bm{W}_{1}\bm{x})))).    (1.3.29)

In order to train the network weights {𝑾_l} to reduce the prediction/classification error with a gradient descent algorithm, we need to evaluate the gradients ∂h/∂𝑾_l. It is well known, from the chain rule in calculus, that gradients of this type of function can be computed efficiently; the resulting procedure is known as back propagation (BP). See Appendix A for a detailed description. The technique of back propagation was already known and practiced in fields such as optimal control and dynamic programming in the 1960s and 1970s; for example, it appeared in the 1974 PhD thesis of Dr. Paul Werbos [Wer74, Wer94]. In 1986, David Rumelhart et al. were the first to apply back propagation to train a multi-layer perceptron (MLP) network [RHW86]. Since then, BP has become increasingly popular, since it gives a scalable algorithm for learning large deep neural networks (it can be efficiently implemented on computing platforms that facilitate parallel and distributed computing). Today, it is the dominant technique for training deep neural networks. However, it is believed that nature does not learn by back propagation, because such a mechanism is still far too expensive for physical implementation (as we have discussed earlier, nature almost ubiquitously learns to correct errors via closed-loop feedback). This obviously leaves tremendous room for improvement in the future, as we will discuss later.
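The following sketch spells out the chain rule for a two-layer instance of (1.3.29), with ReLU as the activation and a squared-error loss (both hypothetical choices); it is the hand-derived counterpart of what BP automates for deeper networks.

```python
import numpy as np

def mlp_forward_backward(x, y, W1, W2):
    """A minimal two-layer instance of (1.3.29) with ReLU activation and a
    squared-error loss, showing how the chain rule yields the gradients
    dloss/dW1 and dloss/dW2 that back propagation computes layer by layer."""
    # Forward pass.
    a1 = W1 @ x                      # pre-activation of layer 1
    h1 = np.maximum(a1, 0.0)         # f^1 = ReLU
    y_hat = W2 @ h1                  # linear output layer
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward pass: apply the chain rule from the output back to the input.
    g_out = y_hat - y                        # dloss/dy_hat
    dW2 = np.outer(g_out, h1)                # dloss/dW2
    g_h1 = W2.T @ g_out                      # back through W2
    g_a1 = g_h1 * (a1 > 0)                   # back through the ReLU
    dW1 = np.outer(g_a1, x)                  # dloss/dW1
    return loss, dW1, dW2

# One gradient-descent step on random data.
rng = np.random.default_rng(0)
W1, W2 = 0.1 * rng.standard_normal((8, 4)), 0.1 * rng.standard_normal((2, 8))
x, y = rng.standard_normal(4), rng.standard_normal(2)
loss, dW1, dW2 = mlp_forward_backward(x, y, W1, W2)
W1 -= 0.1 * dW1
W2 -= 0.1 * dW2
```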

However, despite the above algorithmic progress and promising practice, training deep neural networks remained extremely finicky and expensive for the computing systems of the 1980s and 1990s. In the late 1990s, support vector machines (SVMs) [CV95] became very popular, as they were viewed as a better alternative to neural networks for tasks such as classification (in fact, similar ideas for solving classification problems can be traced back to the PhD dissertation work of Thomas Cover, which was condensed and published in a paper in 1964 [Cov64]). There were two main reasons: first, the SVM was based on a rigorous statistical learning framework known as Vapnik–Chervonenkis (VC) theory; second, it leads to rather efficient algorithms based on convex optimization [BV04]. The rise of the SVM brought a second winter to the study of neural networks around the early 2000s.

Compressive autoencoding.

In the late 1980s and 1990s, artificial neural networks were already being used to learn low-dimensional representations of high-dimensional data such as images. It had been shown that neural networks can learn PCA from the data [Oja82, BH89], instead of using the classic methods discussed in Section 1.3.1. It was also argued in the late 1980s that, owing to their ability to model nonlinear transforms, neural networks could learn low-dimensional representations for data with nonlinear distributions. Similar to the linear PCA case, one can try to simultaneously learn a nonlinear dimension-reduction encoder f and a decoder g, each modeled by a deep neural network [RHW86, Kra91]:

\bm{X} \xrightarrow{\;f\;} \bm{Z} \xrightarrow{\;g\;} \hat{\bm{X}}.    (1.3.30)

By enforcing the decoded data 𝑿̂ to be consistent with the original 𝑿, say by minimizing an MMSE-type reconstruction error (although MMSE-type errors are known to be problematic for imagery data with complex nonlinear structures; as we will soon discuss, much of the recent work on generative methods, including GANs, has been about finding surrogates for better distance functions between the original data 𝑿 and the regenerated 𝑿̂):

\min_{f,g} \big\|\bm{X} - \hat{\bm{X}}\big\|_{2}^{2} = \big\|\bm{X} - g(f(\bm{X}))\big\|_{2}^{2},    (1.3.31)

an autoencoder can be learned from the data 𝑿 themselves.
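As a minimal sketch of this training setup (using PyTorch as an assumed framework, with hypothetical dimensions and random data standing in for real samples), one can parameterize f and g as small networks and minimize the reconstruction error (1.3.31) by gradient descent:

```python
import torch
import torch.nn as nn

# A minimal sketch of the autoencoding objective (1.3.31): a nonlinear
# encoder f and decoder g trained to reconstruct X. All sizes here are
# hypothetical, and random data stands in for real observations.
D, d = 100, 10
f = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, d))   # encoder
g = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, D))   # decoder
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

X = torch.randn(1024, D)              # placeholder for real data samples
for _ in range(200):
    X_hat = g(f(X))                   # X -> Z -> X_hat, as in (1.3.30)
    loss = ((X - X_hat) ** 2).mean()  # the MMSE reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```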

But how can we guarantee that such an autoencoding indeed captures the true low-dimensional structures in 𝑿 instead of giving a trivial, redundant representation? For example, we could simply choose f and g to be the identity map and set 𝒁 = 𝑿. To ensure that the autoencoding is worthwhile, one wishes the resulting representation to be compressive, in terms of a certain computable measure of complexity. In 1993, Geoffrey Hinton and colleagues proposed to use the coding length as such a measure, so the objective of autoencoding became to find a representation that minimizes the coding length [HZ93]. This work also established a fundamental connection between the principle of minimum description length [Ris78] and free (Helmholtz) energy minimization. Later work from Hinton's group [HS06] showed empirically that such an autoencoding is capable of learning meaningful low-dimensional representations of real-world images. A more comprehensive survey of autoencoders was done by Pierre Baldi in 2011 [Bal11], just before deep networks became popular. We will discuss measures of complexity and autoencoding further in Section 1.4, and give a systematic study of compressive autoencoding in Chapter 5, with a more unifying view.

Figure 1.18: Architecture of the LeNet [LBD+89] versus that of the AlexNet [KSH12].

Modern Deep Neural Networks

For nearly 30 years, from the 1980s to the 2010s, neural networks were not taken seriously by the mainstream of machine learning and machine intelligence research. Early (deep) neural networks, such as LeNet, had shown promising performance on small-scale classification problems such as digit recognition. However, the design and practice of these networks were rather empirical, the datasets available at the time were small, and the BP algorithm was a huge computational burden for the computers of the day. These factors resulted in a lack of interest in neural networks, and progress was stagnant, with only a handful of researchers working on them.

Classification and recognition.

As it turns out, the tremendous potential of deep neural networks could only be unleashed once there were enough data and computing power. Fast forward to the 2010s: much larger datasets such as ImageNet became available, and GPUs became powerful enough to make BP much more affordable, even for networks much larger than LeNet. Around 2012, a deep convolutional neural network known as AlexNet drew attention as it surpassed existing classification methods by a significant margin on the ImageNet dataset [KSH12]. (In fact, before this, deep networks had already demonstrated state-of-the-art performance on speech recognition tasks, but they did not receive as much attention until their success with image classification.) Figure 1.18 compares AlexNet with LeNet. AlexNet shares many characteristics with LeNet; it is just bigger and adopts ReLU as the nonlinear activation instead of the sigmoid function used in LeNet. Partly due to the influence of this work, Geoffrey Hinton was awarded the 2018 Turing Award.

This early success inspired the machine intelligence community, over the next few years, to explore new variations of and improvements to the network design. In particular, people discovered empirically that the larger and deeper the networks, the better the performance on tasks such as image classification. Many deep network architectures have been tried, tested, and popularized; a few notable ones include VGG [SZ15], GoogLeNet [SLJ+14], ResNet [HZR+16], and more recently Transformers [VSP+17]. Despite fast progress in empirical performance, there has been a lack of theoretical explanation for these empirically discovered architectures, including the relationships among them, if any. One purpose of this book is to reveal what common objective all these networks may serve and why they share certain characteristics, including multiple layers of linear operators interleaved with nonlinear activations (see Chapter 4).

Reinforcement learning.

The early successes of deep networks were mainly on classification tasks in a supervised learning setting, such as speech recognition and image recognition. Deep networks were later adopted by the DeepMind team, led by Demis Hassabis, to learn decision-making or control policies for playing games. In this context, deep networks are used to model the optimal decision/control policy or the associated optimal value function, as shown in Figure 1.19. The network parameters are incrementally optimized (say, based on back propagation) using the reward returned from the success or failure of playing the game with the current policy. This learning method is generally referred to as reinforcement learning [SB18], which originated from the practice of control systems in the late 1960s [WF65, MM70]. Its earlier roots can be traced back to a much longer and richer history of dynamic programming by Richard Bellman [Bel57] and trial-and-error learning by Marvin Minsky [Min54] in the 1950s.

From an implementation perspective, the combination of deep networks and reinforcement learning turned out to be rather powerful: deep networks can be used to approximate the control policy and value function for real-world environments that are difficult to model analytically. This practice eventually led to the AlphaGo system, developed by DeepMind, which surprised the world in 2016 by beating a top human player, Lee Sedol, at the game of Go, and then the world champion Ke Jie in 2017. (In 1997, IBM's Deep Blue system made history by defeating Russian grandmaster Garry Kasparov in chess. It mainly used traditional machine learning techniques, such as tree search and pruning, which were not so scalable and had not been proven successful for more challenging games such as Go for over 20 years.)

The success of AlphaGo came as a big surprise to the computing community, which had generally believed that the state space of Go is prohibitively large to admit any efficient solution, in terms of computation and sample size. The only reasonable explanation for its success is that there must be very good structure in the optimal value/policy function of the game: its intrinsic dimension is not so high, and it can be approximated well by a neural network, learnable from a not-so-prohibitive number of samples.

Figure 1.19: AlphaGo: Using deep neural networks to model the optimal policy or the optimal value function for the game Go.
Generation and prediction.

One may view the early practice of deep networks in the 2010s as focused more on extracting relevant information from the data 𝑿 and encoding it into a certain task-specific representation 𝒁 (say, 𝒁 represents class labels in classification tasks):

\bm{X} \xrightarrow{\;f\;} \bm{Z}.    (1.3.32)

In this setting, the mapping f to be learned does not need to preserve most of the distributional information about 𝑿, only the sufficient statistics needed for a specific task. For example, a sample 𝒙 in 𝑿 could be an image of an apple, which is mapped by f to its class label 𝒛 = "apple." The information bottleneck framework [TZ15] was proposed in 2015 to analyze the role of deep networks in such a context.

However, in many modern settings, such as the so-called large foundation models, people often need to decode 𝒁 to recover the corresponding 𝑿 to a certain degree of precision:

\bm{Z} \xrightarrow{\;g\;} \hat{\bm{X}}.    (1.3.33)

As 𝑿 typically represents data observed from the external world, a good decoder would allow us to simulate or predict what happens in the world. For example, in a "text to image" or "text to video" task, 𝒛 normally represents the text that describes the content of a desired image 𝒙, and the decoder should be able to generate an 𝒙̂ that has the same content as 𝒙. For instance, given an object class 𝒛 = "apple", the decoder g should generate an image 𝒙̂ that looks like an apple, although not necessarily exactly the same as the original 𝒙.

Generation via discriminative approaches.

In order for the generated images 𝑿̂ to be similar to the true natural images 𝑿, we need to be able to evaluate and minimize some distance:

\min\, d(\bm{X}, \hat{\bm{X}}).    (1.3.34)

As it turns out, most theoretically motivated distances are extremely difficult, if not impossible, to compute and optimize for distributions that live in a high-dimensional space but have a low intrinsic dimension.\footnote{This is the case even if a parametric family of distributions for $\bm{X}$ is given. The distance often becomes ill-conditioned or ill-defined for distributions with low-dimensional supports. What could be even worse is that the chosen family might not be able to approximate the true distribution of interest well.}

In 2007, Zhuowen Tu, a former student of Song-Chun Zhu, perhaps disappointed by early analytical attempts to model and generate natural images (discussed earlier), decided to try a drastically different approach. In a paper published at CVPR 2007 [Tu07], he was the first to suggest that one could learn a generative model for images via a discriminative approach. The idea is simple: if it is difficult to evaluate the distance $d(\bm{X}, \hat{\bm{X}})$, one could instead try to learn a discriminator $d$ to separate $\hat{\bm{X}}$ from $\bm{X}$:

\bm{Z} \xrightarrow{\; g \;} \hat{\bm{X}}, \qquad \hat{\bm{X}}, \bm{X} \xrightarrow{\; d \;} \bm{0}, \bm{1}, \qquad (1.3.35)

where $\bm{0}, \bm{1}$ indicate whether an image is generated or real. Intuitively, the harder it is to separate $\hat{\bm{X}}$ from $\bm{X}$, the closer the two are likely to be.

Tu’s work [Tu07] was the first to demonstrate the feasibility of learning a generative model through a discriminative approach. However, it adopted traditional methods to generate images and classify distributions (such as boosting), which were slow and hard to implement. After 2012, deep neural networks became very popular for image classification. In 2014, Ian Goodfellow and colleagues proposed again to generate natural images with a discriminative approach [GPM+14], this time using deep neural networks to model the generator $g$ and the discriminator $d$. Moreover, they proposed to learn the generator $g$ and the discriminator $d$ via a minimax game:

\min_{g}\max_{d}\ \ell(\bm{X}, \hat{\bm{X}}), \qquad (1.3.36)

where $\ell(\cdot)$ is some natural loss function associated with the classification. In words, the discriminator $d$ tries to maximize its success in separating $\bm{X}$ from $\hat{\bm{X}}$, whereas the generator $g$ tries to do the opposite. For this reason, this approach is called generative adversarial networks (GANs). It was shown that, once trained on a large dataset, GANs can indeed generate photo-realistic images. Partially due to the influence of this work, Yoshua Bengio was awarded the 2018 Turing Award.
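To make the minimax game (1.3.36) concrete, below is a minimal PyTorch sketch of the alternating training of a generator $g$ and a discriminator $d$ on toy two-dimensional data. The network sizes, optimizer settings, and toy data distribution are illustrative assumptions rather than the setup of [GPM+14], and the generator update uses the common non-saturating variant of the objective.

```python
import math
import torch
import torch.nn as nn

# Toy "real" data X: points on a low-dimensional structure (the unit circle in R^2).
def sample_real(n):
    theta = 2 * math.pi * torch.rand(n, 1)
    return torch.cat([torch.cos(theta), torch.sin(theta)], dim=1)

g = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # generator g: Z -> X_hat
d = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator d: X -> logit
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(d.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    x, z = sample_real(128), torch.randn(128, 8)
    x_hat = g(z)

    # Discriminator step: maximize its success at telling X apart from X_hat.
    loss_d = bce(d(x), torch.ones(128, 1)) + bce(d(x_hat.detach()), torch.zeros(128, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to make X_hat indistinguishable from X.
    loss_g = bce(d(g(z)), torch.ones(128, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The alternating updates mirror the maximization over $d$ and the minimization over $g$ in (1.3.36).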

The discriminative approach seems to be a rather clever way to bypass a fundamental difficulty in distribution learning. Rigorously speaking, however, it does not fully resolve this difficulty. It is shown in [GPM+14] that, with a properly chosen loss, the minimax formulation is mathematically equivalent to minimizing the Jensen-Shannon divergence (see [CT91]) between $\bm{X}$ and $\hat{\bm{X}}$, which is known to be a hard problem for two low-dimensional distributions in a high-dimensional space. As a result, in practice GANs typically rely on many heuristics and engineering tricks and often suffer from instability issues such as mode collapse.\footnote{Nevertheless, such a minimax formulation provides a practical approximation of the distance. It simplifies the implementation and avoids certain caveats of computing the distance directly.} Overall, there has been a lack of theoretical guidance on how to improve the practice of GANs.

Generation via denoising and diffusion.

In 2015, shortly after GANs were introduced and became popular, Surya Ganguli and his students realized and suggested that an iterative denoising process modeled by a deep network can be used to learn a general distribution, such as that of natural images [SWM+15]. Their method was inspired by properties of special Gaussian and binomial diffusion processes studied by William Feller back in 1949 [Fel49].\footnote{Again, in the magical era of the 1940s!}

Figure 1.20: Illustration of an iterative denoising and compressing process that, starting from a Gaussian distribution $q^{0}=\mathcal{N}(\bm{0},\bm{I})$, converges to an arbitrary low-dimensional data distribution $q^{L}=p(\bm{x})$.

Soon, denoising operators based on the score function [Hyv05], briefly introduced in Section 1.3.1, were shown to be more general and to unify the denoising and diffusion processes and algorithms [SE19, SSK+21, HJA20]. Figure 1.20 illustrates the process that transforms a generic Gaussian distribution $q^{0}=\mathcal{N}(\bm{0},\bm{I})$ into an (arbitrary) empirical distribution $p(\bm{x})$ through a sequence of iterative denoising (or compressing) operations:

\bm{z}^{0} \sim \mathcal{N}(\bm{0},\bm{I}) \xrightarrow{\; g^{0} \;} \bm{z}^{1} \xrightarrow{\; g^{1} \;} \cdots \xrightarrow{\; g^{L-1} \;} \bm{z}^{L} \sim p(\bm{x}). \qquad (1.3.37)

By now, denoising (and diffusion) methods have replaced GANs as the mainstream approach to learning distributions of images and videos, powering popular commercial image-generation engines such as Midjourney and Stability AI. In Chapter 3 we will systematically introduce and study the denoising and diffusion methods for learning a general low-dimensional distribution.
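To convey the flavor of the iterative process (1.3.37), here is a small, self-contained Python sketch that pushes samples from a standard Gaussian toward a toy one-dimensional target distribution. In practice the score function is approximated by a trained deep network; here the target is a simple two-component Gaussian mixture whose noise-smoothed score is available in closed form, so the sketch is purely illustrative and all numerical settings are assumptions.

```python
import numpy as np

# Toy target p(x): a mixture of two 1-D Gaussians. In practice the score
# grad_x log p(x) is approximated by a trained deep network; here we use the
# analytical score of the noise-smoothed mixture so the sketch is self-contained.
mus, sigma, w = np.array([-2.0, 2.0]), 0.3, np.array([0.5, 0.5])

def score(x, noise_std):
    var = sigma**2 + noise_std**2
    comp = w * np.exp(-(x[:, None] - mus)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = comp / comp.sum(axis=1, keepdims=True)          # posterior over the two components
    return (resp * (mus - x[:, None]) / var).sum(axis=1)   # score of the smoothed mixture

# Iterative denoising: z^0 ~ N(0, I) is pushed toward p(x) by Langevin-style
# updates at gradually decreasing noise levels, cf. (1.3.37).
rng = np.random.default_rng(0)
z = rng.standard_normal(5000)
for noise_std in np.linspace(1.0, 0.01, 50):
    step = 0.5 * noise_std**2
    for _ in range(10):
        z = z + step * score(z, noise_std) + np.sqrt(2 * step) * rng.standard_normal(z.shape)

print("cluster means:", z[z < 0].mean(), z[z > 0].mean())   # close to -2 and +2
```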

1.4 A Unifying Approach

So far, we have given a brief account of the main objective and history of machine intelligence and of many important ideas and approaches associated with it. In recent years, after the empirical success of deep neural networks, tremendous efforts have been made to develop theoretical frameworks that can help us understand empirically designed deep networks, whether their seemingly necessary components (e.g., dropout, normalization, attention) or their overall behaviors (e.g., double descent, neural collapse).

Partly motivated by this, the book aims to achieve several important and challenging goals:

  • Develop a theoretical framework that would allow us to derive rigorous mathematical interpretation of deep neural networks.

  • Ensure correctness of the learned data distribution and consistency with the learned representation.

  • Demonstrate that the framework can lead to performant architectures and can guide further improvements in practice.

Within the past few years, there has been mounting evidence that these goals can indeed be achieved, by leveraging the theory of, and solutions to, the classical analytical low-dimensional models discussed briefly earlier (and more thoroughly in Chapter 2) and by integrating fundamental ideas from a few related fields, namely information/coding theory, control/game theory, and optimization. This book aims to give a systematic introduction to this new approach.

1.4.1 Learning Parsimonious Representations

One necessary condition for any learning task to be possible is that the sequences of interest must be computable, at least in the sense of Alan Turing [Tur36]. That is, a sequence can be computed by a program on a typical computer.\footnote{There are indeed well-defined sequences that are not computable.} In addition to being computable, we require computation to be tractable.\footnote{We do not need to consider predicting things whose computational complexity is intractable, say grows exponentially in the length or dimension of the sequence.} That is, the computational cost (space and time) of learning and computing the sequence should not grow exponentially in its length. Furthermore, as we see in nature (and in the modern practice of machine intelligence), for most practical tasks an intelligent system needs to learn what is predictable from massive data in a very high-dimensional space, such as vision, sound, and touch. Hence, for intelligence, we do not need to consider all computable and tractable sequences or structures. We should focus only on predictable sequences and structures that admit scalable realizations of their learning and computing algorithms:

\mbox{computable} \;\Longrightarrow\; \mbox{tractable} \;\Longrightarrow\; \mbox{scalable}. \qquad (1.4.1)

This is because whatever algorithms intelligent beings use to learn useful information must be scalable. More specifically, the computational complexity of these algorithms should scale gracefully, typically linearly or even sublinearly, in the size and dimension of the data. On a technical level, this requires that the operations the algorithms rely on utilize only oracle information that can be efficiently computed from the data. When the dimension is high and the scale is large, the only oracles one can afford to compute are either first-order geometric information about the data\footnote{Such as approximating a nonlinear structure locally with linear subspaces and computing the gradient of an objective function.} or second-order statistical information.\footnote{Such as the covariance or correlation of the data or their features.} The main goal of this book is to develop a theoretical and computational framework within which we can systematically develop efficient and effective solutions, or algorithms, that use such scalable oracles and operations to learn low-dimensional structures from sampled data and, subsequently, the predictive function.

Pursuing low-dimensionality via compression.

From the examples of sequences we gave in Section 1.2.1, it is clear that some sequences are easy to model and compute while others are more difficult. Obviously, the computational cost of a sequence depends on how complex the predicting function $f$ is. The higher the degree $d$ of the regression, the more costly it is to compute; $f$ can be a simple linear function, or it can be a nonlinear function that is arbitrarily difficult to specify and compute.

It is reasonable to believe that if a sequence is harder, by whatever measure we may choose, to specify and compute, then it will also be more difficult to learn from its sampled segments. Nevertheless, for any given predictable sequence, there are in fact many, often infinitely many, ways to specify it. For example, for a simple sequence $x_{n+1}=ax_{n}$, we could also define the same sequence with $x_{n+1}=ax_{n}+bx_{n-1}-bx_{n-1}$. Hence it would be very useful if we could develop an objective and rigorous notion of “complexity” for any given computable sequence.
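As a quick sanity check of this point, the following small Python sketch (with illustrative values of $a$ and $b$) verifies that the simple recurrence and its redundant variant specify exactly the same sequence.

```python
# Two specifications of the same sequence x_{n+1} = a * x_n:
# the redundant recurrence x_{n+1} = a*x_n + b*x_{n-1} - b*x_{n-1}
# generates identical values, so a specification is far from unique.
a, b, x0 = 0.9, 3.7, 1.0

def seq_simple(n):
    x = [x0]
    for _ in range(n - 1):
        x.append(a * x[-1])
    return x

def seq_redundant(n):
    x = [x0, a * x0]
    for _ in range(n - 2):
        x.append(a * x[-1] + b * x[-2] - b * x[-2])
    return x

assert all(abs(u - v) < 1e-12 for u, v in zip(seq_simple(20), seq_redundant(20)))
```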

Andrey Kolmogorov, a Russian mathematician, was one of the first to give a definition of complexity for any computable sequence.\footnote{Many have contributed to this notion of sequence complexity, most notably Ray Solomonoff and Gregory Chaitin. All three are believed to have developed and studied algorithmic information theory independently: Ray Solomonoff in 1960, Andrey Kolmogorov in 1965 [Kol98], and Gregory Chaitin around 1966 [Cha66].} He suggested that, among all programs that can compute the same sequence, we may use the length of the shortest program as a measure of its complexity. This idea is very much in line with the famous “Occam’s Razor” principle of parsimony: one always chooses the simplest among all theories that can explain the same observation. To be more precise, let $p$ represent a computer program that can generate a sequence $S$ on a universal computer $\mathcal{U}$. The Kolmogorov complexity of the sequence $S$ is defined to be:

K(S) = \min_{p\,:\,\mathcal{U}(p)=S} L(p). \qquad (1.4.2)

Therefore, the complexity of a sequence is measured by how “parsimoniously” we can specify or compute it. This definition of complexity, or parsimony, for sequences is of great conceptual importance and has many interesting properties. Historically, it has inspired many profound studies in the theory of computation, especially in the field of algorithmic information theory.

The length of the shortest program can be viewed as the ultimate compression of the sequence considered, providing a quantitative measure of how much we gain by having learned the correct generative mechanism of the sequence. However, despite its theoretical importance, the Kolmogorov complexity is in general not a computable function [CT91] (and is even intractable to approximate accurately). As a result, this measure of complexity is of little practical use: it can tell us neither in advance how difficult a given sequence is to learn, nor how well we have learned it.
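Although Kolmogorov complexity itself is uncomputable, a crude but computable stand-in is the length that a general-purpose compressor achieves on a (quantized) sequence: it upper-bounds how parsimoniously the sequence can be specified, up to the size of the decompressor. The Python sketch below, using the standard-library zlib compressor on two illustrative sequences, shows that a predictable sequence compresses far better than a purely random one.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# A highly predictable sequence versus an i.i.d. random one, both quantized to
# bytes so that a general-purpose compressor can be applied to them.
predictable = (127 * np.cos(0.01 * np.arange(n)) + 128).astype(np.uint8)
random_seq = rng.integers(0, 256, size=n, dtype=np.uint8)

for name, s in [("predictable", predictable), ("random", random_seq)]:
    compressed = len(zlib.compress(s.tobytes(), level=9))
    print(f"{name}: {n} bytes -> {compressed} bytes")
# The compressed length is only a loose upper bound on the shortest program
# in (1.4.2), but it already separates structured data from structureless data.
```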

Computable measure of parsimony.

Hence, for practical purposes, we need an efficiently computable measure of complexity for sequences that are generated from the same predicting function.\footnote{Note that in practice, we typically care about learning the predicting function $f$, rather than any particular sequence generated by $f$.} Note that part of the reason why Kolmogorov complexity is not computable is that its definition is non-constructive.

So, to introduce a computable measure of complexity, we may take a more constructive approach, as advocated by Claude Shannon through the framework of information theory [Sha48, CT91].\footnote{Which has successfully guided the engineering practice of the communication industry for over 80 years.} In essence, assuming the sequence $S$ is drawn from a probability distribution $p(S)$, the so-called entropy of the distribution\footnote{Here we consider differential entropy, as we assume the sequence consists of continuous variables. If it consists of discrete variables, we could consider the entropy $H(S)=-\sum_{i}p(s_{i})\log p(s_{i})$.}

h(S) \doteq -\int p(s)\log p(s)\,\mathrm{d}s \qquad (1.4.3)

gives a natural measure of its complexity. This measure also has a natural interpretation as the average number of binary bits needed to encode such a sequence, as we will see in Chapter 3.
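For a concrete special case, the differential entropy of a multivariate Gaussian $\mathcal{N}(\bm{0},\bm{\Sigma})$ has the standard closed form $\tfrac{1}{2}\log\big((2\pi e)^{d}\det\bm{\Sigma}\big)$. The small sketch below (with illustrative covariances) evaluates it and shows that the measure degenerates toward $-\infty$ when the data concentrate on a low-dimensional subspace, one reason why lossy, rate-based measures will be needed later.

```python
import numpy as np

def gaussian_diff_entropy(cov):
    """Differential entropy (in nats) of N(0, cov): 0.5 * log((2*pi*e)^d * det(cov))."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

d = 10
full_rank = np.eye(d)                               # data spread over all 10 dimensions
nearly_low_dim = np.diag([1.0] * 3 + [1e-12] * 7)   # data (almost) confined to 3 dimensions

print(gaussian_diff_entropy(full_rank))       # finite
print(gaussian_diff_entropy(nearly_low_dim))  # very negative: h -> -infinity as the support degenerates
```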

To illustrate the main ideas of this view, let us take a large number of long sequence segments:

\{S_{1}, S_{2}, \ldots, S_{i}, \ldots, S_{N}\} \subset \mathbb{R}^{D}, \qquad (1.4.4)

generated by a predicting function $f$. Note that, without loss of generality, we have assumed all sequences are of the same length $D$; hence each sequence can be viewed as a vector in $\mathbb{R}^{D}$. Next, we may introduce a coding scheme (with a code book), denoted $\mathcal{E}$, that converts every segment $S_{i}$ into a unique stream of binary bits $\mathcal{E}(S_{i})$. The simplest coding scheme is to fill the space spanned by all segments with $\epsilon$-balls, as shown in Figure 1.21. We then number all the balls, and each sequence is encoded as the (binary) number of its closest ball. Hence, each segment can be recovered\footnote{Up to precision $\epsilon$, as such an encoding scheme is lossy.} from its corresponding bit stream.

Figure 1.21: Comparison between two coding schemes. Imagine the true distribution of the data is concentrated around the two arrowed lines. One can code the samples from the two lines with a code book consisting of all the blue balls; one can also code the samples with a code book consisting of only the green balls. Obviously, the second coding scheme will result in a much lower coding length/rate, subject to the same precision.

Then the complexity of the predicting function $f$ can be evaluated as the average coding length of all sequences, known as the coding rate:\footnote{One may make this more precise by taking $R(f\mid\mathcal{E})$ to be the expected coding length for all segments of length $D$.}

R(f\mid\mathcal{E}) = \mathbb{E}[L(\mathcal{E}(S))] \approx \frac{1}{N}\sum_{i=1}^{N} L(\mathcal{E}(S_{i})). \qquad (1.4.5)

Obviously, the coding rate will change if one uses a different coding scheme (or code book). In practice, the better we know the low-dimensional structure around which the segments are distributed, the more efficient a code book we can design, as the example in Figure 1.21 shows. Astute readers may have recognized that, conceptually, the denoising process illustrated in Figure 1.20 closely resembles going from the coding scheme with the blue balls to the one with the green balls.
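To give a feel for this in numbers, the sketch below contrasts the coding rates of data spread over all dimensions and data confined near a three-dimensional subspace. It uses a log-det expression as a closed-form surrogate for the average coding length of Gaussian-like data at precision $\epsilon$, of the kind derived in Chapter 3; the dimensions, precision, and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def coding_rate(X, eps=0.5):
    """A closed-form surrogate (in nats) for the average coding length of the
    columns of X under a Gaussian codebook with precision eps (cf. Chapter 3)."""
    D, N = X.shape
    _, logdet = np.linalg.slogdet(np.eye(D) + (D / (N * eps**2)) * X @ X.T)
    return 0.5 * logdet

D, N = 50, 1000
X_full = rng.standard_normal((D, N))               # samples spread over all D dimensions
U = np.linalg.qr(rng.standard_normal((D, 3)))[0]   # a random 3-dimensional subspace
X_low = U @ rng.standard_normal((3, N))            # samples confined to that subspace

print("coding rate, full-dimensional data:", coding_rate(X_full))
print("coding rate, low-dimensional data: ", coding_rate(X_low))
# The second rate is far smaller: a codebook adapted to the low-dimensional
# structure (the green balls of Figure 1.21) encodes the data much more efficiently.
```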

Given two coding schemes $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ for the segments, if the difference in their coding rates is positive:

R(f\mid\mathcal{E}_{1}) - R(f\mid\mathcal{E}_{2}) > 0, \qquad (1.4.6)

we may say the coding scheme $\mathcal{E}_{2}$ is better. This difference can be viewed as a measure of how much more information $\mathcal{E}_{2}$ has over $\mathcal{E}_{1}$ about the distribution of the data. To a large extent, the goal of learning $f$ is equivalent to finding the most efficient encoding scheme that minimizes the coding rate:

\min_{\mathcal{E}} R(f\mid\mathcal{E}). \qquad (1.4.7)

As we will see in Chapter 3, the achievable minimal rate is closely related to the notion of entropy $h(S)$ defined in (1.4.3).

Remark 1.1.

The perspective of measuring data complexity with explicit encoding schemes has motivated several learning objectives proposed to revise the Kolmogorov complexity for better computability [WD99], including the minimum message length (MML) proposed in 1968 [WB68] and the minimum description length (MDL) proposed in 1978 [Ris78, HY01]. These objectives normally count the coding length of the coding scheme $\mathcal{E}$ itself (including its code book) in addition to that of the data $S$ of interest: $L(\mathcal{E}(S)) + L(\mathcal{E})$. However, if the goal is to learn a finite-sized code book and apply it to a large number of sequences, the amortized cost of the code book can be ignored, since

\frac{1}{N}\Big(L(\mathcal{E}) + \sum_{i=1}^{N} L(\mathcal{E}(S_{i}))\Big) \approx \frac{1}{N}\sum_{i=1}^{N} L(\mathcal{E}(S_{i}))

as $N$ becomes large.

Again, one may view the resulting optimal coding scheme as the one that achieves the best compression of the observed data. In general, compared to the Kolmogorov complexity, the coding length given by any encoding scheme will always be larger:

K(S) < L(\mathcal{E}(S)). \qquad (1.4.8)

Therefore, minimizing the coding rate/length essentially minimizes an upper bound on the otherwise uncomputable Kolmogorov complexity.

1.4.2 Learning Informative Representations

Note that if the goal were simply to compress the given data for the sake of compression, then in theory the optimal codes that approach the Kolmogorov complexity would become nearly random, or structureless [Cha66].\footnote{Because any codes with structure can be further compressed.} However, our true purpose in learning the predictive function $f$ is to use it repeatedly, with ease, in future predictions. Hence, while compression allows us to identify the low-dimensional distribution in the data, we would like to encode the distribution in a structured and organized way, so that the resulting representation is informative and efficient to use.\footnote{For example, to sample the distribution under different conditions.} Figure 1.22 shows an example that intuitively explains why such a transformation is desired.

Figure 1.22: Transform the identified low-dimensional data distribution to a more informative and structured representation.

As we will show in Chapter 3, these desired structures in the final representation can be precisely promoted by choosing a natural measure of information gain based on the coding rates of the chosen coding schemes. As we will see throughout this book, such an explicit and constructive coding approach provides a powerful computational framework for learning good representations of low-dimensional structures in real-world data, since in many cases of practical importance the coding length function can be efficiently computed or accurately approximated. In some benign cases, such as subspaces and Gaussians, we can even obtain closed-form formulae (see Chapter 3).

In addition, such a computational framework leads to a principled approach that naturally reveals the role that deep networks play in this learning process. As we will derive systematically in Chapter 4, the layers of a deep network perform operations that optimize the objective function of interest in an incremental manner. From this perspective, the role of deep networks can be precisely interpreted as emulating a certain iterative optimization algorithm, say gradient descent, to optimize the objective of information gain. Layers of the resulting deep architectures can be endowed with precise statistical and geometric interpretations, namely performing incremental compressive encoding and decoding operations. As a result, the derived deep networks become transparent “white boxes” that are mathematically fully explainable.

1.4.3 Learning Consistent Representations

To summarize our discussions so far, let us denote the data as:

\bm{X} = \{S_{1}, S_{2}, \ldots, S_{i}, \ldots, S_{N}\} \subset \mathbb{R}^{D}, \qquad (1.4.9)

and let $\bm{Z} = \mathcal{E}(\bm{X})$ be the codes of $\bm{X}$ via some encoder $\mathcal{E}$:

\bm{X} \xrightarrow{\; \mathcal{E} \;} \bm{Z}. \qquad (1.4.10)

In the machine learning context, $\bm{Z}$ is often called the “features” or “latent representation.” Note that, without knowing the underlying distribution of $\bm{X}$, we do not know which encoder $\mathcal{E}$ can retain the most useful information about that distribution. In practice, one often starts by trying a compact encoding scheme that serves a specific task well. In particular, one would try to learn an encoder that optimizes a certain (empirical) measure of parsimony for the learned representation:

\min \rho(\bm{Z}). \qquad (1.4.11)
Example 1.2.

For example, image classification is such a case: we assign all images in the same class to a single code and images in different classes to different codes, say “one-hot” vectors:

\bm{x} \mapsto \bm{z} \in \{[1,0,0,\ldots,0,0],\; [0,1,0,\ldots,0,0],\; \ldots,\; [0,0,0,\ldots,0,1]\}. \qquad (1.4.12)

Now, a classifier $f(\cdot)$ can be modeled as a function that predicts the probability of a given $\bm{x}$ belonging to each of the $K$ classes: $\hat{\bm{z}} = f(\bm{x}) \in \mathbb{R}^{K}$. Then the “goodness” of a classifier can be measured by the so-called cross entropy:\footnote{The cross entropy can be viewed as a distance measure between the ground-truth distribution of $\bm{z}$ and that of the prediction $\hat{\bm{z}}$. It can also be viewed as the expected coding length of $\bm{z}$ if we use the optimal code book for $\hat{\bm{z}}$ to encode $\bm{z}$. The cross entropy reaches its minimum when $\bm{z}$ and $\hat{\bm{z}}$ have the same distribution.}

L(\hat{\bm{z}}, \bm{z}) = -\sum_{k=1}^{K} z_{k}\log\hat{z}_{k}, \qquad (1.4.13)

where $z_{k}$ denotes the $k$-th entry of the vector $\bm{z}$. As the early practice of deep networks indicated [KSH12], given enough data, such an encoding scheme can often be represented by a deep network and learned in an end-to-end fashion by optimizing the cross entropy. $\blacksquare$
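As a small numerical illustration of (1.4.13), the sketch below evaluates the cross entropy of two hypothetical predictions against a one-hot label; the class probabilities are made up purely for illustration.

```python
import numpy as np

def cross_entropy(z_hat, z):
    """L(z_hat, z) = -sum_k z_k * log(z_hat_k), cf. (1.4.13)."""
    return -np.sum(z * np.log(z_hat))

z = np.array([0.0, 1.0, 0.0])               # one-hot code of the true class ("apple")
z_hat_good = np.array([0.05, 0.90, 0.05])   # confident and correct prediction
z_hat_bad = np.array([0.70, 0.20, 0.10])    # confident and wrong prediction

print(cross_entropy(z_hat_good, z))   # ~0.105: small loss
print(cross_entropy(z_hat_bad, z))    # ~1.609: large loss
# For a one-hot z the loss reduces to -log(z_hat_k) for the true class k.
```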

The cross-entropy loss $L(\hat{\bm{z}}, \bm{z})$ can be viewed as a special measure of parsimony $\rho(\bm{z})$ associated with a particular family of encoding schemes suitable for classification. However, such an encoding is obviously very lossy: the learned $\bm{z}$ contains no information about $\bm{x}$ other than its class. For example, once an image is assigned (a code representing) the class label “apple,” we can no longer tell from the label alone which specific type of apple appears in the original image.

Of course, the other extreme is to require the coding scheme to be lossless; that is, there is a one-to-one mapping between $\bm{x}$ and its code $\bm{z}$. However, as we will see in Chapter 3, lossless coding (or compression) is impractical unless $\bm{x}$ is discrete. For a continuous random variable, we may only consider lossy coding schemes so that the coding length of the data can be finite; that is, we only encode the data up to a certain prescribed precision. As we will elaborate in Chapter 3, lossy coding is not merely a practical choice; it plays a fundamental role in making it possible to learn the underlying continuous distribution from finitely many samples.

For many purposes of learning, we want the feature $\bm{z}$, although lossy, to retain more information about $\bm{x}$ than just its class. In this book, we will introduce a more general measure of parsimony based on the coding length/rate associated with a more general family of coding schemes: coding with a mixture of subspaces or Gaussians. This family can closely approximate arbitrary real-world distributions up to a certain precision. As we will see in Chapter 3 and Chapter 4, such a measure will not only preserve most information about the distribution of $\bm{X}$ but will also promote certain desirable geometric and statistical structures in the learned representation $\bm{Z}$.

Bidirectional autoencoding for consistency.

In a broader learning context, the main goal of a compressive coding scheme $\mathcal{E}$ is to identify the low-dimensional structures in the data $\bm{X}$ so that they can be used to predict things in the original data space. This requires that the learned encoding scheme $\mathcal{E}$ admit an efficient decoding scheme, denoted $\mathcal{D}$, that maps $\bm{Z}$, often known as a latent representation, back to the data space:

\bm{X} \xrightarrow{\; \mathcal{E} \;} \bm{Z} \xrightarrow{\; \mathcal{D} \;} \hat{\bm{X}}. \qquad (1.4.14)

In general, we call such an encoding and decoding pair $(\mathcal{E}, \mathcal{D})$ an autoencoding. Figure 1.23 illustrates the process of such an autoencoder.

Figure 1.23: Illustration of the architecture of an autoencoder.

Generally, we would prefer the decoding to be approximately an “inverse” of the encoding, such that the data (distribution) $\hat{\bm{X}}$ decoded from $\bm{Z}$ is similar to the original data (distribution) $\bm{X}$ to some extent.\footnote{We will make precise later what we mean by “similar.”} If so, we would be able to recover or predict from $\bm{Z}$ what is going on in the original data space. In this case, we say the pair $(\mathcal{E}, \mathcal{D})$ gives a consistent autoencoding. For most practical purposes, not only does such a decoding need to exist, it must also be efficiently realizable and physically implementable. For example, if $\bm{x}$ is a real-valued variable, quantization is needed for any decoding scheme to be realizable on a finite-state machine, as we will explain in Chapter 3. Hence, in general, one should expect that realizable encoding and decoding schemes are necessarily lossy. A central question is how to learn a compact (lossy) representation $\bm{Z}$ such that it can be used to predict $\bm{X}$ well.

Generally speaking, as we will see, both the encoder and the decoder can be modeled and realized by deep networks and learned by solving an optimization problem of the following form:

\min\; d(\bm{X}, \hat{\bm{X}}) + \rho(\bm{Z}), \qquad (1.4.15)

where $d(\cdot,\cdot)$ is a distance function that promotes similarity between $\bm{X}$ and $\hat{\bm{X}}$,\footnote{Either sample-wise or distribution-wise, depending on the choice of the distance function $d$.} and $\rho(\bm{Z})$ is some measure that promotes parsimony and information richness of $\bm{Z}$. The classic principal component analysis (PCA) [Jol02] is a typical example of a consistent autoencoding, which we will study in great detail in Chapter 2 as a precursor to more general low-dimensional structures. In Chapter 5, we will study how to learn consistent autoencodings for general (say, nonlinear) low-dimensional distributions.
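Since PCA is the prototypical consistent autoencoding, here is a brief numpy sketch of PCA in exactly this form: a linear encoder $\mathcal{E}(\bm{x})=\bm{U}^{\top}\bm{x}$ and decoder $\mathcal{D}(\bm{z})=\bm{U}\bm{z}$, where $\bm{U}$ spans the top principal subspace. The data dimensions and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data X: N points in R^D lying near a k-dimensional subspace, plus small noise.
D, k, N = 20, 3, 500
U_true = np.linalg.qr(rng.standard_normal((D, k)))[0]
X = U_true @ rng.standard_normal((k, N)) + 0.01 * rng.standard_normal((D, N))

# PCA as a linear autoencoder: encoder E(x) = U^T x, decoder D(z) = U z,
# with U spanning the top-k principal subspace of the data.
U = np.linalg.svd(X, full_matrices=False)[0][:, :k]
Z = U.T @ X        # latent representation Z (k-dimensional codes)
X_hat = U @ Z      # decoded data X_hat

rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print("relative reconstruction error:", rel_err)   # small: (E, D) is (nearly) consistent
```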

1.4.4 Learning Self-Consistent Representations

Note that in the above autoencoding objective, one needs to evaluate how close, or consistent, the decoded data $\hat{\bm{X}}$ is to the original $\bm{X}$. This often requires external supervision or knowledge about which similarity measure to use. Computing the similarity between $\hat{\bm{X}}$ and $\bm{X}$ can be very expensive, if not entirely impossible or intractable.\footnote{Say, if one wants to minimize a certain distributional distance between the two.} Note that in nature, animals are capable of learning all by themselves, without comparing their estimate $\hat{\bm{X}}$ with the ground truth $\bm{X}$ in the data space; they typically do not even have that option.

Then how is a system able to learn autonomously, without external supervision or comparison? How can it know that $\hat{\bm{X}}$ is consistent with $\bm{X}$ without directly comparing them? This leads to the idea of “closing the loop.” As it turns out, under mild conditions that we will make precise in Chapter 5, to ensure that $\bm{X}$ and $\hat{\bm{X}}$ are consistent, one only has to encode $\hat{\bm{X}}$ as $\hat{\bm{Z}}$ and check whether $\bm{Z}$ and $\hat{\bm{Z}}$ are consistent. We call this notion of consistency self-consistency, which can be illustrated by the following diagram:

\bm{X} \xrightarrow{\; \mathcal{E} \;} \bm{Z} \xrightarrow{\; \mathcal{D} \;} \hat{\bm{X}} \xrightarrow{\; \mathcal{E} \;} \hat{\bm{Z}}. \qquad (1.4.16)

We refer to this process as a closed-loop transcription,\footnote{Inspired by the transcription process between DNA and RNA or other proteins.} which is illustrated in Figure 1.24.

Figure 1.24: Illustration of a closed-loop transcription. Here we use a mapping $f$ to represent the encoder $\mathcal{E}$ and $g$ to represent the decoder $\mathcal{D}$.

It is arguably true that any autonomous intelligent being only needs to learn a self-consistent representation $\bm{Z}$ of the observed data $\bm{X}$, because checking consistency in the original data space (often meaning in the external world) is either too expensive or not physically feasible. The closed-loop formulation allows one to learn an optimal encoder $f(\cdot,\theta)$ and decoder $g(\cdot,\eta)$ via a minimax game that depends only on the internal (learned) features $\bm{Z}$:

\max_{\theta}\min_{\eta}\ \ell(\bm{Z}, \hat{\bm{Z}}) + \rho(\bm{Z}), \qquad (1.4.17)

where $\ell(\bm{Z}, \hat{\bm{Z}})$ is a loss function based on the coding rates of the features $\bm{Z}$ and $\hat{\bm{Z}}$, which, as we will see, can be much easier to compute. Here again, $\rho(\bm{Z})$ is some measure that promotes parsimony and information richness of $\bm{Z}$. Somewhat surprisingly, as we will see in Chapter 5, under rather mild conditions, such as $\bm{X}$ being sufficiently low-dimensional, self-consistency between $(\bm{Z}, \hat{\bm{Z}})$ implies consistency between $(\bm{X}, \hat{\bm{X}})$! In addition, we will see that a closed-loop system allows us to learn the distribution in a continuous and incremental manner,\footnote{That is, to learn one class at a time, or even one sample at a time.} without suffering from problems such as catastrophic forgetting that are associated with open-loop models.
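To illustrate only the structure of the closed-loop game (1.4.16)-(1.4.17), here is a toy PyTorch sketch in which the rate-based loss $\ell(\bm{Z},\hat{\bm{Z}})$ and the richness/parsimony measure $\rho(\bm{Z})$ of Chapter 5 are replaced by simple stand-ins (a feature-space distance and a log-det coding rate on normalized features). Everything here, from the networks to the toy data, is an assumption for illustration; it is not the actual formulation developed later in the book.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, d = 20, 4
f = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, d))   # encoder E, parameters theta
g = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, D))   # decoder D, parameters eta
opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)

def coding_rate(Z, eps=0.5):
    N, dim = Z.shape
    return 0.5 * torch.logdet(torch.eye(dim) + (dim / (N * eps**2)) * Z.T @ Z)

def objective(X):
    Z = F.normalize(f(X), dim=1)          # X -> Z (features kept on the unit sphere)
    X_hat = g(Z)                          # Z -> X_hat
    Z_hat = F.normalize(f(X_hat), dim=1)  # X_hat -> Z_hat: close the loop, no data-space comparison
    return ((Z - Z_hat) ** 2).mean() + coding_rate(Z)

A = torch.randn(D, 3)                     # toy data near a 3-dimensional subspace of R^D
for step in range(1000):
    X = torch.randn(256, 3) @ A.T
    # Decoder (eta) minimizes the objective ...
    loss_g = objective(X)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    # ... while the encoder (theta) maximizes it, yielding the minimax game (1.4.17).
    loss_f = -objective(X)
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
```

The essential point of the sketch is structural: all comparisons happen in the feature space, never between $\bm{X}$ and $\hat{\bm{X}}$ in the data space.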

1.5 Bridging Theory and Practice for Machine Intelligence

So far, we have introduced three related frameworks for learning a compact and structured representation $\bm{Z}$ for a given data distribution $\bm{X}$:

  • The open-ended encoding (1.4.10);

  • The bi-directional autoencoding (1.4.14);

  • The closed-loop transcription (1.4.16).

In this book, we will systematically study all three frameworks, one after another:

\mbox{open-ended} \;\Longrightarrow\; \mbox{bi-directional} \;\Longrightarrow\; \mbox{closed-loop}, \qquad (1.5.1)

in Chapter 4, Section 5.1 and Section 5.2 of Chapter 5, respectively.

In the past few years, many theoretical frameworks have been proposed and developed to help understand deep networks. However, many were unable to provide scalable solutions that matched the performance of empirical methods on real-world data and tasks, and many did not provide useful guidance on how to further improve practice. Chapters 6 and 7 will show how the framework presented in this book may help bridge the gap between theory and practice. Chapter 6 will show how to use the learned distribution and its representation to conduct (Bayesian) inference for almost all practical tasks that depend on (conditional) generation, estimation, and prediction. Chapter 7 will provide convincing experimental evidence that networks and systems designed from first principles can achieve competitive and potentially better performance on a variety of tasks, such as visual representation learning, image classification, image completion, segmentation, and text generation.

Back to intelligence.

As we mentioned at the beginning, a common and fundamental task of any intelligent being is to learn predictable information from its sensed data. Now that we understand a little about the computational nature of this task, one should realize that it is a never-ending process, for the following reasons:

  • The knowledge learned so far from the data, say by the encoding and decoding schemes, is unlikely to be correct or optimal. Intelligence should have the ability to improve if there are still errors in predicting new observations.

  • The data observed so far do not yet cover all the predictable information. Intelligence should be able to recognize that current knowledge is inadequate and have the capability to learn and acquire new information whenever available.

Hence, intelligence is not about simply collecting all data in advance and training a model to memorize all the predictable information in the data. Rather, it is about being equipped with computational mechanisms that can constantly improve current knowledge and acquire new information whenever available and needed. That is, a fundamental characteristic of any intelligent being or system\footnote{An animal, a human being, an intelligent robot, the scientific community, and even the entire civilization.} is the ability to continuously improve or gain information (or knowledge) on its own. Conceptually, we may illustrate the relationship between intelligence and information (or knowledge) symbolically as follows:

\operatorname{Intelligence}(t) = \frac{\mathrm{d}}{\mathrm{d}t}\operatorname{Information}(t), \qquad \operatorname{Information}(t) = \int_{0}^{t}\operatorname{Intelligence}(s)\,\mathrm{d}s. \qquad (1.5.2)

We believe that the closed-loop framework is a universal mechanism that enables self-improvement and self-learning, via feedback\footnote{Reinforcement can be viewed as a primitive form of feedback, say natural selection by nature.} or gaming.\footnote{Scientific inquiry can be viewed as the most advanced form of gaming, through hypothesis formulation and verification.} All intelligent beings or systems in nature utilize closed-loop mechanisms for learning, at all levels and on all scales. Its ubiquity inspired early studies that tried to model and emulate intelligence with machines and computers, particularly the Cybernetics movement initiated by Norbert Wiener in the 1940s.

We hope that this book will help people better understand the objectives, principles, and computational mechanisms behind intelligence. It serves as a foundation for further study of higher-level human intelligence, the true “artificial” intelligence, in the future; we lay out several significant open problems in these new directions at the end of the book, in Chapter 8.