Chapter 1 An Informal Introduction

Just as the constant increase of entropy is the basic law of the universe, so it is the basic law of life to be ever more highly structured and to struggle against entropy.

  – Václav Havel

1.1 Intelligence, Cybernetics, and Artificial Intelligence

The world we inhabit is neither fully random nor completely unpredictable. (If the world were fully random, an intelligent being would have no need to learn or memorize anything.) Instead, it follows certain orders, patterns, and laws, some deterministic and some probabilistic, that render it largely predictable. The very emergence and persistence of life depend on this predictability. Only by learning and memorizing what is predictable in the environment can life survive and thrive, since sound decisions and actions hinge on reliable predictions. Because the world offers seemingly unlimited predictable phenomena, intelligent beings—animals and humans—have evolved ever more acute senses: vision, hearing, touch, taste, and smell. These senses harvest high-throughput sensory data to perceive environmental regularities. Hence, a fundamental task for all intelligent beings is to

learn and memorize predictable information from massive amounts of sensed data.

Before we can understand how this is accomplished, we must address three questions:

  • How can predictable information be modeled and represented mathematically?

  • How can such information be computationally learned effectively and efficiently from data?

  • How should this information be best organized to support future prediction and inference?

This book aims to provide some answers to these questions. These answers will help us better understand intelligence, especially the computational principles and mechanisms that enable it. Evidence suggests that all forms of intelligence—from low-level intelligence seen in early primitive life to the highest form of intelligence, the practice of modern science—share a common set of principles and mechanisms. We elaborate below.

Emergence and evolution of intelligence.

A necessary condition for the emergence of life on Earth about four billion years ago is that the environment is largely predictable. Life has developed mechanisms that allow it to learn what is predictable about the environment, encode this information, and use it for survival. Generally speaking, we call this ability to learn knowledge of the world intelligence. To a large extent, the evolution of life is the mechanism of intelligence at work [Ben23]. In early stages of life, intelligence is mainly developed through two types of learning mechanisms: phylogenetic and ontogenetic [Wie61].

Phylogenetic intelligence refers to learning through the evolution of species. Species inherit and survive mainly based on knowledge encoded in the DNA or genes of their parents. To a large extent, we may call DNA nature’s pre-trained large models because they play a very similar role. The main characteristic of phylogenetic intelligence is that individuals have limited learning capacity. Learning is carried out through a “trial-and-error” mechanism based on random mutation of genes, and species evolve based on natural selection—survival of the fittest—as shown in Figure 1.1. This can be viewed as nature’s implementation of what is now known as “reinforcement learning” [SB18] or potentially “neural architecture search” [ZL17]. However, such a “trial-and-error” process can be extremely slow, costly, and unpredictable. From the emergence of the first life forms, about 4.4–3.8 billion years ago [BBH+15], life has relied on this form of evolution. (Astute readers may notice an uncanny similarity between how early life evolves and how large language models evolve today.)

Figure 1.1: Evolution of phylogenetic intelligence: Knowledge of the external world is encoded and passed on via DNA (left), and decoded from DNA to RNA and to proteins. In the early stage of life evolution (right), intelligence develops knowledge at the species level via (random) gene mutation and natural selection—“may the fittest survive”—which can be viewed as a primitive form of reinforcement learning.
Figure 1.2: Evolution of life, from the ancestor of all life today (LUCA—last universal common ancestor), a single-cell-like organism that lived 3.5–4.3 billion years ago [MÁM+24], to the emergence of the first nervous system in worm-like species (middle), about 550 million years ago [WVM+25], to the explosion of life forms in the Cambrian period (right), about 530 million years ago.

Ontogenetic intelligence refers to the learning mechanisms that allow an individual to learn through its own senses, memories, and predictions within its specific environment, and to improve and adapt its behaviors. Ontogenetic learning became possible after the emergence of the nervous system about 550–600 million years ago (in worm-like organisms) [WVM+25], as shown in Figure 1.2 (middle). With a sensory and nervous system, an individual can continuously form and improve its own knowledge about the world (memory), in addition to what is inherited from DNA or genes. This capability significantly enhanced individual survival and contributed to the Cambrian explosion of life forms about 530 million years ago [Par04]. Compared to phylogenetic learning, ontogenetic learning is more efficient and predictable, and can be realized within an individual’s resource limits.

Both types of learning rely on feedback from the external environment—penalties (death) or rewards (food)—applied to a species’ or an individual’s actions (gene mutations of the species, or actions taken by the individual). This insight inspired Norbert Wiener to conclude in his Cybernetics program [Wie48] that all intelligent beings, whether species or individuals, rely on closed-loop feedback mechanisms to learn and improve their knowledge about the world. Furthermore, from plants to fish, birds, and mammals, more advanced species increasingly rely on ontogenetic learning: they remain with and learn from their parents for longer periods after birth, because individuals of the same species must survive in very diverse environments.

Evolution of human intelligence.

Since the emergence of the genus Homo about 2.5 million years ago [Har15], a new, higher form of intelligence has emerged that evolves more efficiently and economically. Human societies developed languages—first spoken, later written—as shown in Figure 1.3. Language enables individuals to communicate and share useful information, allowing a human community to behave as a single intelligent organism that learns faster and retains more knowledge than any individual. Written texts thus play a role analogous to DNA and genes, enabling societies to accumulate and transmit knowledge across generations. We may refer to this type of intelligence as societal intelligence, distinguishing it from the phylogenetic intelligence of species and the ontogenetic intelligence of individuals. This knowledge accumulation underpins (ancient) civilizations.

Figure 1.3: The development of verbal communication and spoken languages (70,000–30,000 years ago), written languages (about 3000 BC) [Sch14], and abstract mathematics (around 500–300 BC) [Hea+56] mark three key milestones in the evolution of human intelligence.

About two to three thousand years ago, human intelligence took another major leap, enabling philosophers and mathematicians to develop knowledge that goes far beyond organizing empirical observations. The development of abstract concepts and symbols, such as numbers, time, space, logic, and geometry, gave rise to an entirely new and rigorous language of mathematics. In addition, the ability to generate hypotheses and verify their correctness through logical deduction or experimentation laid the foundation for modern science. For the first time, humans could proactively and systematically discover and develop new knowledge. We will call this advanced form of intelligence “scientific intelligence” because it underlies deductive reasoning and scientific discovery.

Hence, from what we can learn from nature, whenever we use the word “intelligence,” we must be specific about which level or form we mean:

phylogenetic ⟹ ontogenetic ⟹ societal ⟹ scientific intelligence.    (1.1.1)

Clear characterization and distinction are necessary because we want to study intelligence as a scientific and mathematical subject. Although all forms may share the common objective of learning useful knowledge about the world, the specific computational mechanisms and physical implementations behind each level could differ. We believe the reader will better understand and appreciate these differences after studying this book. Therefore, we leave further discussion of general intelligence to the final chapter, Chapter 8.

Origin of machine intelligence—cybernetics.

In the 1940s, spurred by the war effort, scientists inspired by natural intelligence sought to emulate animal intelligence with machines, giving rise to the “Cybernetics” movement championed by Norbert Wiener [Kli11]. Wiener studied zoology at Harvard as a young graduate student before becoming a mathematician and control theorist. He devoted his life to understanding and building autonomous systems that could reproduce animal-like intelligence. Today, the Cybernetics program is often narrowly interpreted as being mainly about feedback control systems, the area in which Wiener made his most significant technical contributions. Yet the program was far broader and deeper: it aimed to understand intelligence as a whole—at least at the animal level—and influenced the work of an entire generation of renowned scientists, including Warren McCulloch, Walter Pitts, Claude Shannon, John von Neumann, and Alan Turing.

Figure 1.4: Norbert Wiener’s book “Cybernetics” (1948) [Wie48] (left) and its second edition (1961) [Wie61] (right).

Wiener was arguably the first to study intelligence as a system, rather than focusing on isolated components or aspects of intelligence. His comprehensive views appeared in the celebrated 1948 book Cybernetics: or Control and Communication in the Animal and the Machine [Wie48]. In that book and its second edition published in 1961 [Wie61] (see Figure 1.4), he attempted to identify several necessary characteristics and mechanisms of intelligent systems, including (but not limited to):

  • How to measure and store information (in the brain) and how to communicate with others. (Wiener was the first to point out that “information” is neither matter nor energy, but an independent quantity worthy of study. This insight led Claude Shannon to formulate information theory in 1948 [Sha48].)

  • How to correct errors in prediction and estimation based on existing information. Wiener himself helped formalize the theory of (closed-loop) feedback control in the 1940s.

  • How to learn to make better decisions when interacting with a non-cooperative or even adversarial environment. John von Neumann formalized this as game theory in 1944 [NMR44].

In 1943, inspired by Wiener’s Cybernetics program, cognitive scientist Warren McCulloch and logician Walter Pitts jointly formalized the first computational model of a neuron [MP43], called an artificial neuron and illustrated later in Figure 1.13. Building on this model, Frank Rosenblatt constructed the Mark I Perceptron in the 1950s—a physical machine containing hundreds of such artificial neurons [Ros57]. The Perceptron was the first physically realized artificial neural network; see Figure 1.15. Notably, John von Neumann’s universal computer architecture, proposed in 1945, was also designed to facilitate the goal of building computing machines that could physically realize the mechanisms suggested by the Cybernetics program [Neu58].

Astute readers will notice that the 1940s were truly magical: many fundamental ideas were invented and influential theories formalized, including the mathematical model of neurons, artificial neural networks, information theory, control theory, game theory, and computing machines. Figure 1.5 portrays some of the pioneers. Each of these contributions has grown to become the foundation of a scientific or engineering field and continues to have tremendous impact on our lives. All were inspired by the goal of developing machines that emulate intelligence in nature. Historical records show that Wiener’s Cybernetics movement influenced nearly all of these pioneers and works. To a large extent, Wiener’s program can be viewed as the true predecessor of today’s “embodied intelligence” program; in fact, Wiener himself articulated a vision for such a program with remarkable clarity and concreteness [Wie61].

Figure 1.5: Pioneers of theoretical and computational foundations for intelligence: Norbert Wiener (cybernetics and control theory), Claude Shannon (information theory), John von Neumann (game theory), and Alan Turing (computing theory).

Although Wiener identified many key characteristics and mechanisms of (embodied) intelligence, he offered no clear recipe for integrating them into a complete autonomous intelligent system. From today’s perspective, some of his views were incomplete or inaccurate. In particular, in the last chapter of the second edition of Cybernetics [Wie61], he stressed the need to deal with nonlinearity if machines are to emulate typical learning mechanisms in nature, yet he provided no concrete, effective solution for this difficult issue. In fairness, even the theory of linear systems was in its infancy at the time, and nonlinear systems seemed far less approachable.

Nevertheless, we cannot help but marvel at Wiener’s prescience about the importance of nonlinearity. As this book will show, the answer came only recently: nonlinearity can be handled effectively through progressive linearization and transformation realized by deep networks and representations (see Chapter 4). Moreover, we will demonstrate in this book how all the mechanisms listed above can be naturally integrated into a complete system that exhibits certain characteristics of an autonomous intelligent system (see Chapter 5).

Origin of artificial intelligence.

The subtitle of Wiener’s Cybernetics, “Control and Communication in the Animal and the Machine,” reveals that 1940s research aimed primarily at emulating animal-level intelligence. As noted earlier, the agendas of that era were dominated by Wiener’s Cybernetics movement.

Alan Turing was among the first to recognize this limitation. In his celebrated 1950 paper “Computing Machinery and Intelligence” [Tur50], Turing formally posed the question of whether machines could imitate human-level intelligence, to the point of machine intelligence being indistinguishable from human intelligence—now known as the Turing test.

Around 1955, a group of ambitious young scientists sought to break away from the then-dominant Cybernetics program and establish their own legacy. They accepted Turing’s challenge of imitating human intelligence and proposed a workshop at Dartmouth College to be held in the summer of 1956. Their proposal stated [MMR+06]:

“The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.”

They aimed to formalize and study the higher-level intelligence that distinguishes humans from animals. Their agenda included abstraction, symbolic methods, natural language, and deductive reasoning (causal inference, logical deduction, etc.), i.e., scientific intelligence. To formally describe the problems and goals of the workshop, its organizer, John McCarthy—then a young assistant professor of mathematics at Dartmouth College—coined the now-famous term “Artificial Intelligence” (AI) for machines that exhibit such scientific intelligence.

Renaissance of “artificial intelligence” or “cybernetics”?

Over the past decade, machine intelligence has undergone explosive development, driven largely by deep artificial neural networks and sparked by the 2012 work of Geoffrey Hinton and his students [KSH12]. This period is hailed as the “Renaissance of AI.” Yet, in terms of the tasks actually tackled (recognition, generation, prediction) and the techniques developed (reinforcement learning, imitation learning, encoding, decoding, denoising, and compression), we are largely emulating mechanisms common to the intelligence of early life and animals. Even in this regard, as we will clarify in this book, current “AI” models and systems have not fully or correctly implemented all the necessary mechanisms for intelligence at the animal level that were already known to the 1940s Cybernetics movement.

Strictly speaking, the recent advances of machine intelligence in the past decade do not align well with the 1956 Dartmouth AI program. Instead, what has been achieved is closer to the objectives of Wiener’s classic 1940s Cybernetics program. It is probably more appropriate to call the current era the “Renaissance of Cybernetics.” (The recent rise of so-called “Embodied AI” for autonomous robots aligns even more closely with the goals of the Cybernetics program.) Only after we fully understand, from scientific and mathematical perspectives, what we have truly accomplished can we determine what remains and which directions lead to the true nature of intelligence. That is one of the main purposes of this book.

1.2 What to Learn?

1.2.1 Predictability

Data that carry useful information manifest in many forms. In their most natural form, they can be modeled as sequences that are predictable and computable. The notion of a predictable and computable sequence was central to the theory of computing and largely led to the invention of computers [Tur36]. The role of predictable sequences in (inductive) inference was studied by Ray Solomonoff, Andrey Kolmogorov, and others in the 1960s [Kol98] as a generalization of Claude Shannon’s classic information theory [Sha48]. To understand the concept of predictable sequences, we begin with some concrete examples.

Scalar case.

The simplest predictable discrete sequence is arguably the sequence of natural numbers:

S = 1, 2, 3, 4, 5, 6, \ldots, n, n+1, \ldots    (1.2.1)

in which the next number x_{n+1} is defined as the previous number x_n plus 1:

x_{n+1} = x_n + 1.    (1.2.2)

One may generalize the notion of predictability to any sequence \{x_n\}_{n=1}^{\infty} with x_n \in \mathbb{R} if the next number x_{n+1} can always be computed from its predecessor x_n:

x_{n+1} = f(x_n), \quad x_n \in \mathbb{R}, \; n = 1, 2, 3, \ldots    (1.2.3)

where f(\cdot) is a computable (scalar) function. (Here we emphasize that the function f(\cdot) itself is computable, meaning it can be implemented as a program on a computer. Alan Turing’s seminal work in 1936 [Tur36] gives a rigorous definition of computability.) In practice, we often further assume that f is efficiently computable and has nice properties such as continuity and differentiability. The necessity of these properties will become clear later once we understand more refined notions of computability and their roles in machine learning and intelligence.

Multivariable case.

The next value may also depend on two predecessors. For example, the famous Fibonacci sequence

S = 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, \ldots    (1.2.4)

satisfies

x_{n+2} = x_{n+1} + x_n, \quad x_n \in \mathbb{R}, \; n = 1, 2, 3, \ldots    (1.2.5)

More generally, we may write

x_{n+2} = f(x_{n+1}, x_n), \quad x_n \in \mathbb{R}, \; n = 1, 2, 3, \ldots    (1.2.6)

for any computable function f that takes two inputs. Extending further, the next value may depend on the preceding d values:

x_{n+d} = f(x_{n+d-1}, \ldots, x_n), \quad x_n \in \mathbb{R}, \; n = 1, 2, 3, \ldots    (1.2.7)

The integer d is called the degree of the recursion. The expression (1.2.7) is called an autoregression, and the resulting sequence is autoregressive. When f is linear, we say it is a linear autoregression.
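
To make the notion concrete, here is a minimal sketch in Python (with illustrative names of our own choosing, not notation from the text) that generates a predictable sequence from a computable predictor f of degree d; choosing f(a, b) = a + b with seed (1, 1) reproduces the Fibonacci sequence (1.2.4).

    # Sketch: generate a predictable sequence x_{n+d} = f(x_{n+d-1}, ..., x_n).
    # The names `autoregress` and `fib_step` are illustrative, not from the text.
    def autoregress(f, seed, length):
        """Extend the initial values `seed` (length d) to `length` terms using f."""
        x = list(seed)
        d = len(seed)
        while len(x) < length:
            # f takes the d most recent values, newest first, as in (1.2.7).
            x.append(f(*x[-1:-d-1:-1]))
        return x

    def fib_step(x_prev, x_prev2):
        return x_prev + x_prev2          # f(x_{n+1}, x_n) = x_{n+1} + x_n

    print(autoregress(fib_step, seed=[1, 1], length=10))
    # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]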

Vector case.

To simplify notation, we define a vector \bm{x} \in \mathbb{R}^d that collects d consecutive values of the sequence:

\bm{x}_n \doteq [x_{n+d-1}, \ldots, x_n]^\top, \quad \bm{x}_n \in \mathbb{R}^d, \; n = 1, 2, 3, \ldots    (1.2.8)

With this notation, the recursive relation (1.2.7) becomes

\bm{x}_{n+1} = g(\bm{x}_n) \in \mathbb{R}^d, \quad n = 1, 2, 3, \ldots    (1.2.9)

where g(\cdot) is uniquely determined by the function f in (1.2.7) and maps a d-dimensional vector to a d-dimensional vector. In different contexts, such a vector is sometimes called a “state” or a “token.” Note that (1.2.7) defines a mapping \mathbb{R}^d \to \mathbb{R}, whereas here we have g: \mathbb{R}^d \to \mathbb{R}^d.
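
The change of variables from (1.2.7) to (1.2.9) can be written out explicitly. The following sketch (again with illustrative names) builds the state map g from a scalar predictor f by shifting the window one step and inserting the newly predicted value.

    import numpy as np

    # Sketch: lift a scalar recursion x_{n+d} = f(x_{n+d-1}, ..., x_n)
    # to a vector map g : R^d -> R^d acting on states x_n = [x_{n+d-1}, ..., x_n]^T.
    def make_state_map(f):
        def g(state):
            new_value = f(*state)                    # state is ordered newest first
            return np.concatenate(([new_value], state[:-1]))
        return g

    g = make_state_map(lambda a, b: a + b)           # Fibonacci again, d = 2
    state = np.array([1.0, 1.0])                     # x_1 = [x_2, x_1]^T
    for _ in range(5):
        state = g(state)
    print(state)                                     # [13.  8.] = [x_7, x_6]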

Controlled prediction.

We may also define a predictable sequence that depends on another predictable sequence as input:

\bm{x}_{n+1} = f(\bm{x}_n, \bm{u}_n) \in \mathbb{R}^d, \quad n = 1, 2, 3, \ldots,    (1.2.10)

where \{\bm{u}_n\} with \bm{u}_n \in \mathbb{R}^k is a (computable) predictable sequence. In other words, the next vector \bm{x}_{n+1} \in \mathbb{R}^d depends on both \bm{x}_n \in \mathbb{R}^d and \bm{u}_n \in \mathbb{R}^k. In control theory, the sequence \{\bm{u}_n\} is often referred to as the “control input” and \bm{x}_n as the “state” or “output” of the system (1.2.10). A classic example is a linear dynamical system:

\bm{x}_{n+1} = \bm{A}\bm{x}_n + \bm{B}\bm{u}_n, \quad \bm{A} \in \mathbb{R}^{d \times d}, \; \bm{B} \in \mathbb{R}^{d \times k},    (1.2.11)

which is widely studied in control theory [CD91].

Often the control input is given by a computable function of the state \bm{x}_n itself:

\bm{u}_n = h(\bm{x}_n), \quad n = 1, 2, 3, \ldots    (1.2.12)

As a result, the sequence \{\bm{x}_n\} is given by composing the two computable functions f and h:

\bm{x}_{n+1} = f(\bm{x}_n, h(\bm{x}_n)), \quad n = 1, 2, 3, \ldots    (1.2.13)

In this way, the sequence \{\bm{x}_n\} again becomes an autoregressive predictable sequence. When the input \bm{u}_n depends on the output \bm{x}_n, we say the resulting sequence is produced by a “closed-loop” system (1.2.13). As the closed-loop system no longer depends on any external input, we say such a system has become autonomous. It can be viewed as a special case of autoregression. For instance, if we choose \bm{u}_n = \bm{F}\bm{x}_n in the above linear system (1.2.11), the closed-loop system becomes

\bm{x}_{n+1} = \bm{A}\bm{x}_n + \bm{B}\bm{u}_n = \bm{A}\bm{x}_n + \bm{B}\bm{F}\bm{x}_n = (\bm{A} + \bm{B}\bm{F})\bm{x}_n,    (1.2.14)

which is a linear autoregression.
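
As a hedged numerical illustration of (1.2.14), with matrices A, B, and a feedback gain F chosen arbitrarily, the sketch below shows that feeding back u_n = F x_n turns the controlled system into an autonomous linear autoregression.

    import numpy as np

    # Sketch: closed-loop linear system x_{n+1} = (A + B F) x_n, as in (1.2.14).
    # The particular matrices below are arbitrary illustrative choices.
    A = np.array([[1.0, 0.1],
                  [0.0, 1.0]])
    B = np.array([[0.0],
                  [0.1]])
    F = np.array([[-2.0, -3.0]])      # state-feedback gain, u_n = F x_n

    x = np.array([[1.0], [0.0]])      # initial state
    for n in range(5):
        u = F @ x                     # control computed from the state itself
        x = A @ x + B @ u             # identical to (A + B F) @ x
    print(x.ravel())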

Continuous processes.

Predictable sequences have natural continuous counterparts, which we call predictable processes. The simplest such process is time itself, x(t) = t.

More generally, a process \bm{x}(t) is predictable if, at every time t, its value at t + \delta t is determined by its value at t, where \delta t is an infinitesimal increment. Typically, \bm{x}(t) is continuous and smooth, so the change \delta\bm{x}(t) = \bm{x}(t + \delta t) - \bm{x}(t) is infinitesimally small. Such processes are typically described by (multivariate) differential equations

\dot{\bm{x}}(t) = f(\bm{x}(t)), \quad \bm{x} \in \mathbb{R}^d.    (1.2.15)

In systems theory [CD91, Sas99], the equation (1.2.15) is known as a state-space model. A controlled process is given by

\dot{\bm{x}}(t) = f(\bm{x}(t), \bm{u}(t)), \quad \bm{x} \in \mathbb{R}^d, \; \bm{u} \in \mathbb{R}^k,    (1.2.16)

where \bm{u}(t) is a computable input process.
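
In numerical practice, a controlled process such as (1.2.16) is often approximated by discretizing time. The sketch below uses a simple forward-Euler step with illustrative dynamics and input; it is only meant to show how a computable f and u determine the process.

    import numpy as np

    # Sketch: simulate the controlled process dx/dt = f(x, u) by forward Euler,
    # x(t + dt) ~ x(t) + dt * f(x(t), u(t)).  Dynamics and input are illustrative.
    def f(x, u):
        return np.array([x[1], -x[0]]) + u            # a harmonic oscillator plus input

    def u(t):
        return np.array([0.0, 0.1 * np.sin(t)])       # a computable input process

    dt, T = 0.01, 5.0
    x = np.array([1.0, 0.0])
    for k in range(round(T / dt)):
        x = x + dt * f(x, u(k * dt))
    print(x)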

Example 1.1.

Newton’s second law predicts the trajectory \bm{x}(t) \in \mathbb{R}^3 of a moving object under a force \bm{F}(t) \in \mathbb{R}^3:

m\ddot{\bm{x}}(t) = \bm{F}(t).    (1.2.17)

When there is no force, i.e., \bm{F}(t) \equiv \bm{0}, this reduces to Newton’s first law: the object moves at a constant velocity \bm{v} \in \mathbb{R}^3:

\ddot{\bm{x}}(t) = \bm{0} \iff \text{there exists } \bm{v} \in \mathbb{R}^3 \text{ such that } \dot{\bm{x}}(t) = \bm{v}.    (1.2.18)

\blacksquare

1.2.2 Low Dimensionality

Learning to predict.

Now suppose you have observed or have been given many sequence segments:

\{S_1, S_2, \ldots, S_i, \ldots, S_N\}    (1.2.19)

drawn from a predictable sequence \{x_n\}_{n=1}^{\infty}. Without loss of generality, assume each segment has length D \gg d, so

S_i = [x_{j(i)}, x_{j(i)+1}, \ldots, x_{j(i)+D-1}]^\top \in \mathbb{R}^D    (1.2.20)

for some j \in \mathbb{N}. You are then given a new segment S_t and asked to predict its future values.

The difficulty is that the generating function f and its order d are unknown:

x_{n+d} = f(x_{n+d-1}, \ldots, x_n).    (1.2.21)

The goal is therefore to learn f and d from the sample segments S_1, S_2, \ldots, S_N. The central task of learning to predict is:

Given many sampled segments of a predictable sequence, how can we effectively and efficiently identify the function f?
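
To give a flavor of the problem in its simplest setting, the sketch below assumes f is linear, fits autoregressions of increasing order d by least squares, and reads off the smallest order that already explains the data. This is only an illustration of the idea, not the general methods developed later in the book.

    import numpy as np

    # Sketch: given a sequence generated by an unknown *linear* autoregression,
    # estimate its order d and coefficients by least squares on lagged values.
    # The true model below (order 2, coefficients 1.6 and -0.9) is illustrative.
    x = [1.0, 0.0]
    for _ in range(100):
        x.append(1.6 * x[-1] - 0.9 * x[-2])
    x = np.array(x)

    for d in range(1, 4):
        # Regress x_{n+d} on [x_{n+d-1}, ..., x_n], as in (1.2.21).
        X = np.column_stack([x[d - 1 - j : len(x) - 1 - j] for j in range(d)])
        y = x[d:]
        coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
        residual = np.linalg.norm(X @ coeffs - y)
        print(d, coeffs.round(3), f"residual = {residual:.1e}")
    # The residual first drops to (numerically) zero at d = 2, recovering (1.6, -0.9).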

Predictability and low-dimensionality.

To identify the predictive function f, we may notice a common characteristic of segments of any predictable sequence given by (1.2.21). If we take a long segment, say of length D \gg d, and view it as a vector

\bm{x}_i = [x_i, x_{i+1}, \ldots, x_{i+D-1}]^\top \in \mathbb{R}^D,    (1.2.22)

then the set of all such vectors \{\bm{x}_i\} is far from random and cannot occupy the entire space \mathbb{R}^D. Instead, it has at most d degrees of freedom—given the first d entries of any \bm{x}_i, the remaining entries are uniquely determined. In other words, all \{\bm{x}_i\} lie on a d-dimensional surface. In mathematics, such a surface is called a submanifold, denoted \mathcal{S} \subset \mathbb{R}^D.

Figure 1.6: A two-dimensional subspace in a ten-dimensional ambient space.

In practice, if we choose the segment length D large enough, all segments sampled from the same predicting function lie on a surface with intrinsic dimension d, significantly lower than the ambient dimension D. For example, if the sequence is given by the linear autoregression

x_{n+2} = a x_{n+1} + b x_n,    (1.2.23)

for some constants a, b \in \mathbb{R}, and we sample segments of length D = 10, then all samples lie on a two-dimensional plane in \mathbb{R}^{10}, as illustrated in Figure 1.6. Identifying this two-dimensional subspace fully determines the constants a and b in (1.2.23).
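
This claim is easy to check numerically. In the sketch below (with illustrative constants a and b), many length-10 segments of a linear autoregression are stacked as rows of a matrix, whose rank indeed comes out as 2.

    import numpy as np

    # Sketch: length-10 segments of x_{n+2} = a x_{n+1} + b x_n span only a
    # 2-dimensional subspace of R^10.  Constants a, b are illustrative.
    a, b = 1.6, -0.9
    x = [1.0, 0.0]
    for _ in range(200):
        x.append(a * x[-1] + b * x[-2])
    x = np.array(x)

    D = 10
    segments = np.stack([x[i:i + D] for i in range(len(x) - D + 1)])  # one segment per row
    print(segments.shape)                                 # (193, 10)
    print(np.linalg.matrix_rank(segments, tol=1e-8))      # 2 (up to numerical tolerance)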

More generally, when the predicting function f is linear, as in the systems given in (1.2.11) and (1.2.14), the long segments always lie on a low-dimensional linear subspace. Identifying the predicting function is then largely equivalent to identifying this subspace, a problem known as principal component analysis. We will discuss such classic models and methods in Chapter 2.

This observation extends to general predictable sequences: if we can identify the low-dimensional surface on which the segment samples lie, we can identify the predictive function f. (Under mild conditions, there is a one-to-one mapping between the low-dimensional surface and the function f.) This fact has been exploited in problems such as system identification, which we will discuss later. We cannot overemphasize the importance of this property: All samples of long segments of a predictable sequence lie on a low-dimensional submanifold. As we will see in this book, all modern learning methods exploit this property, implicitly or explicitly.

In real-world scenarios, observed data often come from multiple predictable sequences. For example, a video sequence may contain several moving objects. In such cases, the data lie on a mixture of low-dimensional linear subspaces or nonlinear submanifolds, as illustrated in Figure 1.7.

Figure 1.7: Data distributed on a mixture of (orthogonal) subspaces (left) or submanifolds (right).

Properties of low-dimensionality.

Temporal correlation in predictable sequences is not the only reason data are low-dimensional. For example, the space of all images is vast, yet most of it consists of structureless random images, as shown in Figure 1.8 (left). Natural images and videos, however, are highly redundant because of strong spatial and temporal correlations among pixel values. This redundancy allows us to recognize easily whether an image is noisy or clean, as shown in Figure 1.8 (middle and right). Consequently, the distribution of natural images has a very low intrinsic dimension relative to the total number of pixels.

Figure 1.8: An image of random noise (left) versus a noisy image (middle) and the original clean image (right).

Because learning low-dimensional structures is both important and ubiquitous, the book High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications [WM22] begins with the statement: “The problem of identifying the low-dimensional structure of signals or data in high-dimensional spaces is one of the most fundamental problems that, through a long history, interweaves many engineering and mathematical fields such as system theory, signal processing, pattern recognition, machine learning, and statistics.

By constraining the observed data point \bm{x} to lie on a low-dimensional surface, we make its entries highly dependent on one another and, in a sense, “predictable” from the values of other entries. For example, if we know the data are constrained to a d-dimensional surface in \mathbb{R}^D, we can perform several useful tasks beyond prediction:

  • completion: given more than d entries of a typical sample \bm{x}, the remaining entries can usually be uniquely determined (prediction becomes a special case of this property);

  • denoising: if the entries of a sample \bm{x} are perturbed by small noise, the noise can be effectively removed by projecting \bm{x} back onto the surface;

  • error correction: if a small number of unknown entries of \bm{x} are arbitrarily corrupted, they can be efficiently corrected.

Figure 1.9 illustrates these properties using a low-dimensional linear structure—a one-dimensional line in a two-dimensional plane.

Figure 1.9: Illustration of properties of a low-dimensional (linear) structure: it enables completion (left), denoising (middle), and error correction (right).
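
The sketch below illustrates the completion and denoising properties listed above in the simplest setting of a known d-dimensional linear subspace (here d = 2 and D = 10, with a randomly generated orthonormal basis U): completion solves a small least-squares problem on the observed entries, and denoising projects onto the subspace.

    import numpy as np

    # Sketch: completion and denoising when data lie on a known 2-dimensional
    # subspace of R^10 with orthonormal basis U (randomly generated here).
    rng = np.random.default_rng(0)
    D, d = 10, 2
    U, _ = np.linalg.qr(rng.standard_normal((D, d)))    # basis of the subspace
    x = U @ rng.standard_normal(d)                       # a point on the subspace

    # Completion: observe only the first 3 (> d) entries, recover the rest.
    obs = slice(0, 3)
    z_hat = np.linalg.lstsq(U[obs], x[obs], rcond=None)[0]
    print(np.allclose(U @ z_hat, x))                     # True

    # Denoising: project a noisy copy back onto the subspace.
    x_noisy = x + 0.01 * rng.standard_normal(D)
    x_denoised = U @ (U.T @ x_noisy)
    print(np.linalg.norm(x_denoised - x) < np.linalg.norm(x_noisy - x))  # True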

Under mild conditions, these properties generalize to many other low-dimensional structures in high-dimensional spaces [WM22]. As we will see, these useful properties—completion and denoising, for example—inspire effective methods for learning such structures.

For simplicity, we have so far used the deterministic case to introduce the notions of predictability and low-dimensionality, where data lie precisely on geometric structures such as subspaces or surfaces. In practice, however, data always contain some uncertainty or randomness. In this case, we may assume the data follow a probability distribution with density p(\bm{x}). A distribution is considered “low-dimensional” if its density concentrates around a low-dimensional geometric structure—a subspace, a surface, or a mixture thereof—as shown in Figure 1.7. Once learned, such a density p(\bm{x}) serves as a powerful prior for estimating \bm{x} from partial, noisy, or corrupted observations:

\bm{y} = f(\bm{x}) + \bm{n},    (1.2.24)

by computing the conditional estimate \hat{\bm{x}}(\bm{y}) = \mathbb{E}(\bm{x} \mid \bm{y}) or by sampling the conditional distribution \hat{\bm{x}}(\bm{y}) \sim p(\bm{x} \mid \bm{y}). (Modern generative AI technologies, such as conditioned image generation, rely heavily on this principle, as we will discuss in Chapter 6.)
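
In the simplest case, where the prior is an (almost) degenerate Gaussian supported on a subspace with orthonormal basis U and the observation is y = x + n, standard Gaussian conditioning gives the conditional mean in closed form. The sketch below illustrates this special case; it is not a method specific to this book.

    import numpy as np

    # Sketch: conditional-mean estimate of x from y = x + n, assuming the prior
    # x = U z with z ~ N(0, I) supported on a d-dimensional subspace (orthonormal
    # basis U) and noise n ~ N(0, sigma^2 I).  Standard Gaussian conditioning gives
    #   E[x | y] = U U^T y / (1 + sigma^2),
    # i.e., project onto the subspace, then shrink slightly toward the origin.
    rng = np.random.default_rng(1)
    D, d, sigma = 10, 2, 0.1
    U, _ = np.linalg.qr(rng.standard_normal((D, d)))
    x = U @ rng.standard_normal(d)
    y = x + sigma * rng.standard_normal(D)

    x_hat = U @ (U.T @ y) / (1 + sigma**2)
    print(np.linalg.norm(x_hat - x), np.linalg.norm(y - x))
    # The conditional estimate is typically much closer to x than the raw observation y.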

These discussions lead to the single assumption on which this book bases its deductive approach to understanding intelligence and deep networks:

The main assumption: Any intelligent system or learning method should (and can) rely on the predictability of the world; hence, the distribution of observed high-dimensional data samples has low-dimensional support.

The remaining question is how to learn these low-dimensional structures correctly and computationally efficiently from high-dimensional data. As we will see, parametric models studied in classical analytical approaches and non-parametric models such as deep networks, popular in modern practice, are simply different means to the same end.

1.3 How to Learn?

1.3.1 Analytical Approaches

Note that even if a predictive function is tractable to compute, it does not follow that the function is tractable or scalable to learn from a finite number of sampled segments. One classical approach to ensuring tractability is to make explicit assumptions about the family of low-dimensional structures involved. Historically, when computation and data were scarce, simple and idealized analytical models were the first to be studied, because they often admit efficient closed-form or numerical solutions and, for a long time, were the only models that could actually be implemented. They also provide insight into more general problems and already yield useful solutions to important, though limited, cases. Linear structures became the first class of models to be thoroughly studied.

For example, arguably the simplest case is to assume the data are distributed around a single low-dimensional subspace in a high-dimensional space. Somewhat equivalently, one may assume the data are distributed according to an almost degenerate low-dimensional Gaussian. Identifying such a subspace or Gaussian from a finite number of (noisy) samples is the classical problem of principal component analysis (PCA), and effective algorithms have been developed for this class of models [Jol02]. One can make the family of models increasingly more complex and expressive. For instance, one may assume the data are distributed around a mixture of low-dimensional components (subspaces or low-dimensional Gaussians), as in independent component analysis (ICA) [BJC85], dictionary learning (DL), generalized principal component analysis (GPCA) [VMS05], or the more general class of sparse low-dimensional models that have been studied extensively in recent years in fields such as compressive sensing [WM22].

Across all these analytical model families, the central problem is to identify the most compact model within each family that best fits the given data. Below, we give a brief account of these classical analytical models but leave a more systematic study to Chapter 2. In theory, these analytical models have provided tremendous insights into the geometric and statistical properties of low-dimensional structures. They often yield closed-form solutions or efficient and scalable algorithms, which are very useful for data whose distributions can be well approximated by such models. More importantly, for more general problems, they provide a sense of how easy or difficult the problem of identifying low-dimensional structures can be and what the basic ideas are to approach such a problem.

Linear Dynamical Systems

Wiener filter.

As discussed in Section 1.2.1, a central task of intelligence is to learn what is predictable in sequences of observations. The simplest class of predictable sequences, or signals, consists of those generated by a linear time-invariant (LTI) process:

x[n] = h[n] * z[n] + \epsilon[n],    (1.3.1)

where * denotes convolution, z is the input, and h is the impulse response function. (Typically, h is assumed to have certain desirable properties, such as finite length or a band-limited spectrum.) Here \epsilon[n] denotes additive observation noise. Given the input process \{z[n]\} and observations of the output process \{x[n]\}, the goal is to find the optimal h[n] such that \hat{x}[n] = h[n] * z[n] predicts x[n] optimally. Prediction quality is measured by the mean squared error (MSE):

\min_h \mathbb{E}[\epsilon[n]^2] = \mathbb{E}[\|x[n] - h[n] * z[n]\|_2^2].    (1.3.2)

The optimal solution h[n] is called a (denoising) filter. Norbert Wiener—who also initiated the Cybernetics movement—studied this problem in the 1940s and derived an elegant closed-form solution known as the Wiener filter [Wie42, Wie49]. This result, also known as least-variance estimation and filtering, became one of the cornerstones of signal processing. Interested readers may refer to [MKS+04, Appendix B] for a detailed derivation of this type of estimator.
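
As a hedged, time-domain illustration of the idea (not Wiener's classical frequency-domain derivation), one can estimate a finite impulse response h directly from paired input and output samples by linear least squares:

    import numpy as np

    # Sketch: estimate a length-L FIR filter h from input z and noisy output
    # x[n] = (h * z)[n] + eps[n] by linear least squares.  This is a time-domain
    # stand-in for the idea behind the Wiener filter, not its classical derivation.
    rng = np.random.default_rng(0)
    L, N = 4, 2000
    h_true = np.array([0.5, -0.3, 0.2, 0.1])
    z = rng.standard_normal(N)
    x = np.convolve(z, h_true)[:N] + 0.01 * rng.standard_normal(N)

    # Build the regression matrix: row n holds z[n], z[n-1], ..., z[n-L+1].
    Z = np.column_stack([np.concatenate([np.zeros(k), z[:N - k]]) for k in range(L)])
    h_hat = np.linalg.lstsq(Z, x, rcond=None)[0]
    print(h_hat.round(3))          # close to h_true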

Kalman filter.

The idea of denoising or filtering a dynamical system was later extended by Rudolph Kalman in the 1960s to a linear time-invariant system described by a finite-dimensional state-space model:

\bm{z}[n] = \bm{A}\bm{z}[n-1] + \bm{B}\bm{u}[n] + \bm{\epsilon}[n].    (1.3.3)

The problem is to estimate the system state \bm{z}[n] from noisy observations of the form

\bm{x}[n] = \bm{C}\bm{z}[n] + \bm{w}[n],    (1.3.4)

where \bm{w} is (white) noise. The optimal causal state estimator (a causal estimator may use only observations up to the current time step n; the Kalman filter is always causal, whereas the Wiener filter need not be) that minimizes the mean squared prediction error

\min \; \mathbb{E}[\|\bm{x}[n] - \bm{C}\bm{z}[n]\|_2^2]    (1.3.5)

is given in closed form by the so-called Kalman filter [Kal60]. This result is a cornerstone of modern control theory because it enables estimation of a dynamical system’s state from noisy observations. One can then introduce linear state feedback, for example \bm{u}[n] = \bm{F}\hat{\bm{z}}[n], and render the closed-loop system fully autonomous, as shown in Equation (1.2.13). Interested readers may refer to [MKS+04, Appendix B] for a detailed derivation of the Kalman filter.
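
For concreteness, here is a minimal sketch of the textbook predict-and-update form of the Kalman filter for the model (1.3.3)-(1.3.4), with the control term omitted and illustrative noise covariances; it is a generic implementation, not the derivation referenced above.

    import numpy as np

    # Sketch: standard Kalman filter for z[n] = A z[n-1] + eps[n], x[n] = C z[n] + w[n],
    # with eps ~ N(0, Q) and w ~ N(0, R).  The control term B u[n] is omitted.
    def kalman_filter(A, C, Q, R, xs, z0, P0):
        z_hat, P = z0, P0
        estimates = []
        for x in xs:
            # Predict.
            z_hat = A @ z_hat
            P = A @ P @ A.T + Q
            # Update with the new observation x.
            S = C @ P @ C.T + R
            K = P @ C.T @ np.linalg.inv(S)               # Kalman gain
            z_hat = z_hat + K @ (x - C @ z_hat)
            P = (np.eye(len(z_hat)) - K @ C) @ P
            estimates.append(z_hat)
        return np.array(estimates)

    # Toy constant-velocity model: state = (position, velocity), observe position only.
    A = np.array([[1.0, 1.0], [0.0, 1.0]])
    C = np.array([[1.0, 0.0]])
    Q, R = 0.01 * np.eye(2), np.array([[0.25]])
    rng = np.random.default_rng(0)
    z = np.array([0.0, 1.0])
    xs = []
    for _ in range(50):
        z = A @ z + rng.multivariate_normal(np.zeros(2), Q)
        xs.append(C @ z + rng.multivariate_normal(np.zeros(1), R))
    est = kalman_filter(A, C, Q, R, xs, z0=np.zeros(2), P0=np.eye(2))
    print(est[-1])      # tracks the final true state (position, velocity)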

Identification of linear dynamical systems.

To derive the Kalman filter, the system parameters \bm{A}, \bm{B}, \bm{C} are assumed to be known. If they are not given in advance, the problem becomes more challenging and is known as system identification: how to learn the parameters \bm{A}, \bm{B}, \bm{C} from many samples of the input sequence \{\bm{u}[n]\} and observation sequence \{\bm{x}[n]\}. This is a classic problem in systems theory. If the system is linear, the input and output sequences \{\bm{u}[n], \bm{x}[n]\} jointly lie on a certain low-dimensional subspace (whose dimension equals the order of the state-space model (1.3.3)). Hence, the identification problem is essentially equivalent to identifying this low-dimensional subspace [VM96, LV09, LV10].

Note that the above problems have two things in common: first, the noise-free sequences or signals are assumed to be generated by an explicit family of parametric models; second, these models are essentially linear. Conceptually, let \bm{x}_o be a random variable whose “true” distribution is supported on a low-dimensional linear subspace, say S. To a large extent, the Wiener filter and the Kalman filter both try to estimate such an \bm{x}_o from its noisy observations:

\bm{x} = \bm{x}_o + \bm{\epsilon}, \quad \bm{x}_o \sim S,    (1.3.6)

where \bm{\epsilon} is typically random Gaussian noise (or a Gaussian process). Hence, their solutions rely on identifying a low-dimensional linear subspace that best fits the observed noisy data. By projecting the data onto this subspace, one obtains the optimal denoising operations, all in closed form.

Linear and Mixed Linear Models

Principal component analysis.

The problems above from classical signal processing and system identification share one main task: learning a single low-dimensional linear subspace from noisy observations. Mathematically, we may model such a structure as

$\bm{x}=\bm{u}_1 z_1+\bm{u}_2 z_2+\cdots+\bm{u}_d z_d+\bm{\epsilon}=\bm{U}\bm{z}+\bm{\epsilon},\quad \bm{U}\in\mathbb{R}^{D\times d},$  (1.3.7)

where $\bm{\epsilon}\in\mathbb{R}^{D}$ is small random noise. Figure 1.10 illustrates such a distribution with two principal components.

Figure 1.10: A distribution with two principal components.

The problem is to find the subspace basis $\bm{U}$ from many samples of $\bm{x}$. A typical approach is to minimize the projection error onto the subspace:

$\min_{\bm{U}} \operatorname{\mathbb{E}}\big[\|\bm{x}-\bm{U}\bm{U}^{\top}\bm{x}\|_{2}^{2}\big].$  (1.3.8)

This is essentially a denoising task: once the basis $\bm{U}$ is correctly found, we can denoise the noisy sample $\bm{x}$ by projecting it onto the low-dimensional subspace spanned by $\bm{U}$:

$\bm{x}\rightarrow\hat{\bm{x}}=\bm{U}\bm{U}^{\top}\bm{x}.$  (1.3.9)

If the noise is small and the low-dimensional subspace $\bm{U}$ is correctly learned, we expect $\bm{x}\approx\hat{\bm{x}}$. Thus, PCA is a special case of a so-called "auto-encoding" scheme, which encodes the data by projecting it into a lower-dimensional space and decodes the lower-dimensional code by lifting it back into the original space:

𝒙𝑼𝒛𝑼𝒙^.\bm{x}\xrightarrow{\hskip 5.69054pt\bm{U}^{\top}\hskip 5.69054pt}\bm{z}\xrightarrow{\hskip 5.69054pt\bm{U}\hskip 5.69054pt}\hat{\bm{x}}.bold_italic_x start_ARROW start_OVERACCENT bold_italic_U start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_OVERACCENT → end_ARROW bold_italic_z start_ARROW start_OVERACCENT bold_italic_U end_OVERACCENT → end_ARROW over^ start_ARG bold_italic_x end_ARG . (1.3.10)

Because of the simple data structure, the encoder $\mathcal{E}$ and decoder $\mathcal{D}$ both become simple linear operators (projecting and lifting).

This classic problem in statistics is known as principal component analysis (PCA). It was first studied by Pearson in 1901 [Pea01] and later independently by Hotelling in 1933 [Hot33]. The topic is systematically summarized in the classic book [Jol86, Jol02]. One may also explicitly assume the data $\bm{x}$ is distributed according to a single low-dimensional Gaussian:

$\bm{x}\sim\mathcal{N}(\bm{0},\bm{U}\bm{U}^{\top}+\sigma\bm{I}),\quad \bm{U}\in\mathbb{R}^{D\times d},$  (1.3.11)

which is equivalent to assuming that $\bm{z}$ in the PCA model (1.3.7) follows a standard normal distribution. This problem is known as probabilistic PCA [TB99] and has the same computational solution as PCA.
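As an illustration, the following short Python sketch computes the PCA basis via the singular value decomposition and uses it exactly as the linear auto-encoding (1.3.10): project to get the code $\bm{z}$, lift to get the reconstruction $\hat{\bm{x}}$. The function name and the synthetic data are our own, for illustration only.

```python
import numpy as np

def pca_autoencode(X, d):
    """PCA as a linear autoencoder: X is (N, D), d is the subspace dimension.
    Returns the basis U (D, d), the codes Z (N, d), and the reconstructions."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # The top-d right singular vectors of the data span the principal subspace.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:d].T            # D x d orthonormal basis
    Z = Xc @ U              # encode: project onto the subspace (U^T x)
    X_hat = Z @ U.T + mu    # decode: lift back and un-center (U z)
    return U, Z, X_hat

# Noisy samples near a 2-dimensional subspace of R^10.
rng = np.random.default_rng(0)
U_true = np.linalg.qr(rng.standard_normal((10, 2)))[0]
X = rng.standard_normal((500, 2)) @ U_true.T + 0.05 * rng.standard_normal((500, 10))
U, Z, X_hat = pca_autoencode(X, d=2)
print(np.mean((X - X_hat) ** 2))  # small residual: denoising by projection
```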

In this book, we revisit PCA in Chapter 2 from the perspective of learning a low-dimensional distribution. Our goal is to use this simple and idealistic model to convey some of the most fundamental ideas for learning a compact representation of a low-dimensional distribution, including the important notions of compression via denoising and auto-encoding for a consistent representation.

Independent component analysis.

Independent component analysis (ICA) was originally proposed by [BJC85] as a classic model for learning a good representation; in fact, it was first conceived as a simple mathematical model for our memory. The ICA model takes a deceptively similar form to the PCA model (1.3.7) above, assuming that the observed random variable $\bm{x}$ is a linear superposition of multiple independent components $z_i$ (in PCA the "components" are the learned vectors, whereas in ICA they are scalars; this is just a difference in naming conventions):

$\bm{x}=\bm{u}_1 z_1+\bm{u}_2 z_2+\cdots+\bm{u}_d z_d+\bm{\epsilon}=\bm{U}\bm{z}+\bm{\epsilon}.$  (1.3.12)

However, here the components $z_i$ are assumed to be independent non-Gaussian variables. For example, a popular choice is

$z_i=\sigma_i\cdot w_i,\quad \sigma_i\sim B(1,p),$  (1.3.13)

where $\sigma_i$ is a Bernoulli random variable and $w_i$ could be a constant value or another random variable, say Gaussian (even if $w$ is Gaussian, $\sigma w$ is no longer a Gaussian variable!). The ICA problem aims to identify both $\bm{U}$ and $\bm{z}$ from observed samples of $\bm{x}$. Figure 1.11 illustrates the difference between ICA and PCA.

Figure 1.11: PCA (left) versus ICA (right). Note that PCA finds the orthogonal vectors that approximately span the data, while ICA finds (not necessarily orthogonal) vectors that independently combine to approximately form the data, as well as the coefficients of this combination.

Although the (decoding) mapping from $\bm{z}$ to $\bm{x}$ seems linear and straightforward once $\bm{U}$ and $\bm{z}$ are learned, the (encoding) mapping from $\bm{x}$ to $\bm{z}$ can be complicated and may not be representable by a simple linear mapping. Hence, ICA generally gives an auto-encoding of the form:

$\bm{x}\xrightarrow{\;\mathcal{E}\;}\bm{z}\xrightarrow{\;\bm{U}\;}\hat{\bm{x}}.$  (1.3.14)

Thus, unlike PCA, ICA is somewhat more difficult to analyze and solve. In the 1990s, researchers such as Erkki Oja and Aapo Hyvärinen [HO97, HO00a] made significant theoretical and algorithmic contributions to ICA. In Chapter 2, we will study and provide a solution to ICA from which the encoding mapping $\mathcal{E}$ will become clear.
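The following sketch generates data from the sparse ICA model (1.3.12)-(1.3.13) and recovers the mixing matrix with an off-the-shelf solver; it assumes scikit-learn's FastICA is available and is only meant to illustrate the model, not the solution developed in Chapter 2.

```python
import numpy as np
from sklearn.decomposition import FastICA  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
n, d = 2000, 3
# Sparse, non-Gaussian sources: a Bernoulli gate times a Gaussian amplitude, as in (1.3.13).
Z = (rng.random((n, d)) < 0.2) * rng.standard_normal((n, d))
U_true = rng.standard_normal((d, d))               # mixing matrix, not orthogonal
X = Z @ U_true.T + 0.01 * rng.standard_normal((n, d))

ica = FastICA(n_components=d, random_state=0)
Z_hat = ica.fit_transform(X)   # estimated independent components
U_hat = ica.mixing_            # estimated mixing matrix (up to permutation and scale)
print(U_hat.shape, Z_hat.shape)
```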

Sparse structures and compressive sensing.

If $p$ in (1.3.13) is very small, the probability that any component is non-zero is small. In this case we say $\bm{x}$ is sparsely generated and concentrates on a union of linear subspaces whose dimension is $k=p\cdot d$. We may therefore extend the ICA model to a more general family of low-dimensional structures known as sparse models.

A $k$-sparse model is the set of all $k$-sparse vectors:

$\mathcal{Z}=\{\bm{z}\in\mathbb{R}^{n}\mid\|\bm{z}\|_{0}\leq k\},$  (1.3.15)

where $\|\cdot\|_{0}$ counts the number of non-zero entries. Thus $\mathcal{Z}$ is the union of all $k$-dimensional subspaces aligned with the coordinate axes, as illustrated in Figure 1.7 (left). A central problem in signal processing and statistics is to recover a sparse vector $\bm{z}$ from its linear observations

$\bm{x}=\bm{A}\bm{z}+\bm{\epsilon},\quad \bm{A}\in\mathbb{R}^{m\times n},$  (1.3.16)

where $\bm{A}$ is given, typically $m<n$, and $\bm{\epsilon}$ is small noise. This seemingly benign problem is NP-hard to solve and even hard to approximate (see [WM22] for details).

Despite a rich history dating back to the eighteenth century [Bos50], no provably efficient algorithm existed for this class of problems, although many heuristic algorithms were proposed between the 1960s and 1990s. Some were effective in practice but lacked rigorous justification. A major breakthrough came in the early 2000s when mathematicians including David Donoho, Emmanuel Candès, and Terence Tao [Don05, CT05a, CT05] established a rigorous theoretical framework that characterizes precise conditions under which the sparse recovery problem can be solved effectively and efficiently via convex $\ell^{1}$ minimization:

$\min \|\bm{z}\|_{1}\quad\text{subject to}\quad\|\bm{x}-\bm{A}\bm{z}\|_{2}\leq\epsilon,$  (1.3.17)

where $\|\cdot\|_{1}$ is the sparsity-promoting $\ell^{1}$ norm of a vector and $\epsilon$ is a small constant. Any solution yields a sparse (auto) encoding

$\bm{x}\xrightarrow{\;\mathcal{E}\;}\bm{z}\xrightarrow{\;\bm{A}\;}\hat{\bm{x}}.$  (1.3.18)

We will describe such an algorithm (and thus mapping) in Chapter 2, revealing fundamental connections between sparse coding and deep learning (similarities between sparse-coding algorithms and deep networks were noted as early as 2010 [GL10]).
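One classical algorithm of this kind is iterative soft-thresholding (ISTA), sketched below for the Lagrangian form of (1.3.17); this is a standard method, not necessarily the specific algorithm presented in Chapter 2, and the step size, penalty, and problem sizes are illustrative choices of ours.

```python
import numpy as np

def ista(x, A, lam=0.1, n_iter=1000):
    """Iterative soft-thresholding for min_z 0.5*||x - A z||_2^2 + lam*||z||_1,
    a Lagrangian relative of the constrained l1 problem (1.3.17)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / Lipschitz constant of the gradient
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        r = z - step * A.T @ (A @ z - x)          # gradient step on the quadratic term
        z = np.sign(r) * np.maximum(np.abs(r) - step * lam, 0.0)  # prox of the l1 norm
    return z

# Recover a 5-sparse vector in R^200 from 60 random measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 200)) / np.sqrt(60)
z_true = np.zeros(200)
z_true[rng.choice(200, 5, replace=False)] = rng.standard_normal(5)
x = A @ z_true + 0.01 * rng.standard_normal(60)
z_hat = ista(x, A, lam=0.02, n_iter=3000)
print(np.linalg.norm(z_hat - z_true))             # small recovery error
```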

Conditions for $\ell^{1}$ minimization to succeed are surprisingly general: the minimum number of measurements $m$ required for a successful recovery is only proportional to the intrinsic dimension $k$. This is the compressive sensing phenomenon [Can06]. It extends to broad families of low-dimensional structures, including low-rank matrices. These results fundamentally changed our understanding of recovering low-dimensional structures in high-dimensional spaces. David Donoho, among others, celebrated this reversal as the "blessing of dimensionality" [DD00], in contrast to the usual pessimistic belief in the "curse of dimensionality." The complete theory and body of results is presented in [WM22].

The computational significance of this framework cannot be overstated. It transformed problems previously believed intractable into ones that are not only tractable but scalably solvable using extremely efficient algorithms:

$\text{intractable}\;\Longrightarrow\;\text{tractable}\;\Longrightarrow\;\text{scalable}.$  (1.3.19)

The algorithms come with rigorous theoretical guarantees of correctness given precise requirements in data and computation. The deductive nature of this approach contrasts sharply with the empirical, inductive practice of deep neural networks. Yet we now know that both approaches share a common goal—pursuing low-dimensional structures in high-dimensional spaces.

Dictionary learning.

Conceptually, an even harder problem than the sparse-coding problem (1.3.16) arises when the observation matrix $\bm{A}$ is unknown and must itself be learned from a set of (possibly noisy) observations $\bm{X}=[\bm{x}_1,\bm{x}_2,\ldots,\bm{x}_n]$:

$\bm{X}=\bm{A}\bm{Z}+\bm{E},\quad \bm{A}\in\mathbb{R}^{m\times n}.$  (1.3.20)

Here we are given only $\bm{X}$, not the corresponding $\bm{Z}=[\bm{z}_1,\bm{z}_2,\ldots,\bm{z}_n]$ or the noise term $\bm{E}=[\bm{\epsilon}_1,\bm{\epsilon}_2,\ldots,\bm{\epsilon}_n]$, except that the $\bm{z}_i$ are known to be sparse. This is the dictionary learning problem, which generalizes the ICA problem (1.3.12) discussed earlier. In other words, given that the distribution of the data $\bm{X}$ is the image of a standard sparse distribution $\bm{Z}$ under a linear transform $\bm{A}$, we wish to learn both $\bm{A}$ and its "inverse" mapping $\mathcal{E}$ so as to obtain an autoencoder:

$\bm{X}\xrightarrow{\;\mathcal{E}\;}\bm{Z}\xrightarrow{\;\bm{A}\;}\hat{\bm{X}}.$  (1.3.21)
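A common heuristic for dictionary learning alternates between a sparse-coding step and a dictionary-update step. The sketch below is a simplified illustration of that alternating scheme (hard thresholding for the sparse step, least squares for the dictionary step); it is our own toy version, not the provably correct methods discussed later in the book.

```python
import numpy as np

def dictionary_learning(X, d, k, n_iter=30, seed=0):
    """Alternating sketch: given data X (D, N), fit a dictionary A (D, d) and
    k-sparse codes Z (d, N) with X ~= A Z.  The sparse step keeps the k largest
    entries per column; the dictionary step is a least-squares fit."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((D, d))
    A /= np.linalg.norm(A, axis=0)
    Z = np.zeros((d, N))
    for _ in range(n_iter):
        # Sparse coding step: least-squares codes, then hard-threshold to k entries.
        Z = np.linalg.lstsq(A, X, rcond=None)[0]
        small = np.argsort(-np.abs(Z), axis=0)[k:]
        np.put_along_axis(Z, small, 0.0, axis=0)
        # Dictionary update step: refit A and re-normalize its columns.
        A = X @ np.linalg.pinv(Z)
        A /= np.linalg.norm(A, axis=0) + 1e-12
    return A, Z

# Synthetic test: columns of X are sparse combinations of an unknown dictionary.
rng = np.random.default_rng(1)
A_true = rng.standard_normal((20, 40))
Z_true = np.zeros((40, 500))
for j in range(500):
    Z_true[rng.choice(40, 3, replace=False), j] = rng.standard_normal(3)
A_hat, Z_hat = dictionary_learning(A_true @ Z_true, d=40, k=3)
print(np.linalg.norm(A_true @ Z_true - A_hat @ Z_hat) / np.linalg.norm(A_true @ Z_true))
```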

PCA, ICA, and dictionary learning all assume that the data distribution is supported on or near low-dimensional linear or piecewise-linear structures. Each method requires learning a (global or local) basis for these structures from noisy samples. In Chapter 2 we study how to identify such low-dimensional structures through these classical models. In particular, we will see that all of these low-dimensional (piecewise) linear models can be learned efficiently by the same family of fast algorithms known as power iteration [ZMZ+20]. Although these linear or piecewise-linear models are somewhat idealized for most real-world data, understanding them is an essential first step toward understanding more general low-dimensional distributions.

General Distributions

The distributions of real-world data such as images, videos, and audio are too complex to be modeled by the above, somewhat idealistic, linear models or Gaussian processes. We normally do not know a priori from which family of parametric models they are generated. Historically, many attempts have been made to develop analytical models for these data. In particular, Fields Medalist David Mumford spent considerable effort in the 1990s trying to understand and model the statistics of natural images [Mum96]. He and his students, including Song-Chun Zhu, drew inspiration and techniques from statistical physics and proposed many statistical and stochastic models for the distribution of natural images, such as random fields or stochastic processes [ZWM97, ZM97a, ZM97, HM99, MG99, LPM03]. However, these analytical models achieved only limited success in producing samples that closely resemble natural images. Clearly, for real-world data like images, we need to develop more powerful and unifying methods to pursue their more general low-dimensional structures.

In practice, for a general distribution of real-world data, we typically only have many samples from the distribution—the so-called empirical distribution. In such cases, we normally cannot expect to have a clear analytical form for their low-dimensional structures, nor for the resulting denoising operators (unlike the cases of PCA, the Wiener filter, and the Kalman filter). We therefore need to develop a more general solution for these empirical distributions, not necessarily in closed form but at least efficiently computable. If we do this correctly, solutions to the aforementioned linear models should become special cases.

Denoising.

In the 1950s, statisticians became interested in the problem of denoising data drawn from an arbitrary distribution. Let $\bm{x}_o$ be a random variable with probability density function $p_o(\cdot)$. Suppose we observe a noisy version of $\bm{x}_o$:

$\bm{x}=\bm{x}_o+\sigma\bm{g},$  (1.3.22)

where $\bm{g}\sim\mathcal{N}(\bm{0},\bm{I})$ is standard Gaussian noise and $\sigma$ is the noise level of the observation. Let $p(\cdot)$ be the probability density function of $\bm{x}$; that is, $p(\bm{x})=\int \varphi_{\sigma}(\bm{x}-\bm{z})\,p_o(\bm{z})\,\mathrm{d}\bm{z}$, where $\varphi_{\sigma}$ is the density of the Gaussian distribution $\mathcal{N}(\bm{0},\sigma^{2}\bm{I})$. Remarkably, the posterior expectation of $\bm{x}_o$ given $\bm{x}$ can be calculated by an elegant formula known as Tweedie's formula [Rob56] (Herbert Robbins credited the formula to Maurice Kenneth Tweedie, based on their personal correspondence):

$\hat{\bm{x}}_o=\operatorname{\mathbb{E}}[\bm{x}_o\mid\bm{x}]=\bm{x}+\sigma^{2}\nabla\log p(\bm{x}).$  (1.3.23)

As can be seen from the formula, the function $\nabla\log p(\bm{x})$ plays a very special role in denoising the observation $\bm{x}$. The noise $\bm{g}$ can be explicitly estimated as

$\hat{\bm{g}}=\dfrac{\bm{x}-\hat{\bm{x}}_o}{\sigma}=-\sigma\nabla\log p(\bm{x}),$  (1.3.24)

for which we only need to know the distribution $p(\cdot)$ of $\bm{x}$, not the ground truth $p_o(\cdot)$ of $\bm{x}_o$. An important implication of this result is that if we add Gaussian noise to any distribution, the denoising can be done easily provided we can somehow obtain the function $\nabla\log p(\bm{x})$.
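Tweedie's formula can be checked numerically in a case where everything is available in closed form: if $\bm{x}_o$ is itself Gaussian, both the score of $p$ and the posterior mean $\operatorname{\mathbb{E}}[\bm{x}_o\mid\bm{x}]$ are explicit, and (1.3.23) reproduces the latter exactly. The short sketch below (our own illustration) performs this check.

```python
import numpy as np

# Check Tweedie's formula (1.3.23) in a case with a closed-form answer:
# if x_o ~ N(mu0, s0^2 I) and x = x_o + sigma * g, then p(x) = N(mu0, (s0^2 + sigma^2) I),
# so grad log p(x) = -(x - mu0) / (s0^2 + sigma^2), while the exact posterior mean is
# E[x_o | x] = mu0 + s0^2 / (s0^2 + sigma^2) * (x - mu0).
rng = np.random.default_rng(0)
mu0, s0, sigma = np.array([1.0, -2.0]), 0.5, 0.8
x_o = mu0 + s0 * rng.standard_normal(2)
x = x_o + sigma * rng.standard_normal(2)

score = -(x - mu0) / (s0**2 + sigma**2)            # grad log p(x)
tweedie = x + sigma**2 * score                     # denoised estimate from (1.3.23)
posterior_mean = mu0 + s0**2 / (s0**2 + sigma**2) * (x - mu0)
print(np.allclose(tweedie, posterior_mean))        # True
```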

Because this is such an important and useful result, it has been rediscovered and used in many different contexts and areas. For example, after Tweedie's formula [Rob56], it was rediscovered a few years later by [Miy61], where it was termed "empirical Bayesian denoising." In the early 2000s, the function $\nabla\log p(\bm{x})$ was rediscovered again in the context of learning a general distribution and was named the "score function" by Aapo Hyvärinen [Hyv05]. Its connection to (empirical Bayesian) denoising was soon recognized by [Vin11]. Generalizations to other measurement distributions (beyond Gaussian noise) have been made by Eero Simoncelli's group [RS11]. Today, the most direct application of Tweedie's formula and denoising is image generation via iterative denoising [KS21, HJA20].

Figure 1.12: Geometric interpretation of a score function $\nabla\log p(\bm{x})$ for a distribution with density $p(\bm{x})$ on the left. The operation generated by the score function pushes the distribution towards areas of higher density. The goal is that, by a certain measure of compactness (e.g., entropy or coding length), the resulting distribution is more "compressed". Eventually, the distribution converges to one that has a lower-dimensional support, as $p^{*}(\bm{x})$ shown on the right.
Entropy minimization.

The score has an intuitive information-theoretic and geometric interpretation. In information theory, $-\log p(\bm{x})$ corresponds to the number of bits needed to encode $\bm{x}$ (at least for discrete variables, as detailed in Chapter 3). The gradient $\nabla\log p(\bm{x})$ points toward higher-probability-density regions, as shown in Figure 1.12 (left). Moving in this direction reduces the number of bits required to encode $\bm{x}$. Thus, the operator $\nabla\log p(\bm{x})$ pushes the distribution to "shrink" toward high-density regions. Formally, one can show that the (differential) entropy

$H(\bm{x})=-\int p(\bm{w})\log p(\bm{w})\,\mathrm{d}\bm{w}$  (1.3.25)

decreases under this operation (see Chapter 3 and Appendix B). With an optimal codebook, the resulting distribution achieves a lower coding rate and is thus more compressed. Repeating this denoising process indefinitely produces a distribution whose probability mass is concentrated on a support of lower dimension. For instance, the distribution $p(\bm{x})$ in Figure 1.12 (left) converges to $p^{*}(\bm{x})$ (right), where, strictly speaking, $p^{*}(\bm{x})$ is a generalized function: $p^{*}(\bm{x})=p^{*}(\bm{x}_1)\delta(\bm{x}-\bm{x}_1)+p^{*}(\bm{x}_2)\delta(\bm{x}-\bm{x}_2)$ with $p^{*}(\bm{x}_1)+p^{*}(\bm{x}_2)=1$. (Notice that in this section we discuss the process of iteratively denoising and compressing a high-entropy noise distribution until it converges to the low-entropy data distribution. As we will see in Chapter 3, this problem is dual to the problem of learning an optimal encoding for the data distribution, which is also a useful application [RAG+24].) Correspondingly, the entropy decreases:

$H(\bm{x})=-\int p(\bm{w})\log p(\bm{w})\,\mathrm{d}\bm{w}\;\xrightarrow{\;\text{decreasing}\;}\;H^{*}(\bm{x})=-\int p^{*}(\bm{w})\log p^{*}(\bm{w})\,\mathrm{d}\bm{w}.$  (1.3.26)

As the distribution converges to $p^{*}(\bm{x})$, its differential entropy approaches negative infinity due to a technical difference between continuous and discrete entropy definitions. Chapter 3 resolves this using a unified rate-distortion measure.

Later in this chapter and Chapter 3, we explore how this simple denoising-compression framework unifies powerful methods for learning low-dimensional distributions in high-dimensional spaces, including natural image distributions.

1.3.2 Empirical Approaches

In practice, it is difficult to model important real-world data—such as images, sounds, and text—with the idealized linear, piecewise-linear, or other analytical models discussed in the previous section. Historically, many empirical models and methods have therefore been proposed. These models often drew inspiration from the biological nervous system, because the brain of an animal or human processes such data with remarkable efficiency and effectiveness.

Classic Artificial Neural Networks

Artificial neuron.
Figure 1.13: The first mathematical model of an artificial neuron (right) that emulates how a neuron (left) processes signals.

Inspired by the nervous system in the brain, the first mathematical model of an artificial neuron (known as the linear threshold unit, or perceptron) was proposed by Warren McCulloch (a professor of psychiatry at the University of Chicago at the time) and Walter Pitts in 1943 [MP43]. It describes the relationship between the inputs $x_i$ and output $o_j$ as:

$o_j=\varphi\Big(\sum_i w_{ji}x_i\Big),$  (1.3.27)

where $\varphi(\cdot)$ is some nonlinear activation, typically modeled by a threshold function. This model is illustrated in Figure 1.13. As we can see, this form already shares the main characteristics of a basic unit in modern deep neural networks. The model is derived from observations of how a single neuron works in our nervous system. However, researchers did not know exactly what functions a collection of such neurons could realize and perform. On a more technical level, they were also unsure which nonlinear activation function $\varphi(\cdot)$ should be used; hence, many variants have been proposed historically (step function, hard or soft thresholding, rectified linear unit (ReLU), sigmoid, etc. [DSC22]).
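In code, the unit (1.3.27) is just a weighted sum followed by a nonlinearity; the sketch below (our own illustration) implements it with a hard threshold and, alternatively, a ReLU.

```python
import numpy as np

def neuron(x, w, phi=lambda a: (a > 0).astype(float)):
    """A McCulloch-Pitts style unit as in (1.3.27): a weighted sum of the inputs
    followed by a nonlinear activation phi (a hard threshold by default)."""
    return phi(w @ x)

x = np.array([0.2, -1.0, 0.5])
w = np.array([1.0, 0.5, 2.0])
print(neuron(x, w))                                      # threshold activation
print(neuron(x, w, phi=lambda a: np.maximum(a, 0.0)))    # the same unit with a ReLU
```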

Figure 1.14: A network with one hidden layer (left) versus a deep network (right).
Artificial neural network.

In the 1950s, Frank Rosenblatt was the first to build a machine with a network of such artificial neurons, shown in Figure 1.15. The machine, called the Mark I Perceptron, consists of an input layer, an output layer, and a single hidden layer of 512 artificial neurons, similar to the architecture illustrated in Figure 1.14 (left). It was designed to classify optical images of letters. However, the capacity of a single-layer network is limited: it can only learn linearly separable patterns. In the 1969 book Perceptrons: An Introduction to Computational Geometry, Marvin Minsky and Seymour Papert [MP69] showed that the single-layer architecture of the Mark I Perceptron cannot learn an XOR function. This result significantly dampened interest in artificial neural networks, even though it was later proven that a multi-layer network can learn an XOR function [RHW86]. In fact, a sufficiently large multi-layer network of such simple neurons, as shown in Figure 1.14 (right), can simulate any finite-state machine, even the universal Turing machine (do not confuse what neural networks are capable of doing in principle with whether it is tractable or easy to learn a network that realizes a certain desired function). Nevertheless, the study of artificial neural networks subsequently entered its first winter in the 1970s.

Figure 1.15: The Mark I Perceptron machine developed by Frank Rosenblatt in the late 1950s.
Convolutional neural networks.

Early experiments with artificial neural networks such as the Mark I Perceptron in the 1950s and 1960s were somewhat disappointing. They suggested that simply connecting neurons in a general fashion, as in multi-layer perceptrons (MLPs), might not suffice. To build effective and efficient networks, it is extremely helpful to understand the collective purpose or function neurons in the network must achieve so that they can be organized and learned in a specialized way. Thus, at this juncture, once again the study of machine intelligence turned to the animal nervous system for inspiration.

It is known that most of our brain is dedicated to processing visual information [Pla99]. In the 1950s and 1960s, David Hubel and Torsten Wiesel systematically studied the visual cortices of cats. They discovered that the visual cortex contains different types of cells—simple cells and complex cells—which are sensitive to visual stimuli of different orientations and locations [HW59]. Hubel and Wiesel won the 1981 Nobel Prize in Physiology or Medicine for this groundbreaking discovery.

On the artificial neural network side, Hubel and Wiesel’s work inspired Kunihiko Fukushima to design the “neocognitron” network in 1980, which consists of artificial neurons that emulate biological neurons in the visual cortices [Fuk80]. This is known as the first convolutional neural network (CNN), and its architecture is illustrated in Figure 1.16. Unlike the perceptron, the neocognitron had more than one hidden layer and could be viewed as a deep network, as shown in Figure 1.14 (right).

Also inspired by how neurons work in the cat’s visual cortex, Fukushima was the first to introduce the rectified linear unit (ReLU):

$\varphi(x)=\max\{0,x\}=\begin{cases}x, & \text{if }x>0,\\ 0, & \text{if }x\leq 0,\end{cases}$  (1.3.28)

as the activation function $\varphi(\cdot)$ in 1969 [Fuk69]. However, it was not until recent years that ReLU became widely used in modern deep (convolutional) neural networks. This book will explain why ReLU is a good choice once we discuss the main operation deep networks implement: compression.

Figure 1.16: Origin of convolutional neural networks: the Neocognitron by Kunihiko Fukushima in 1980. Notice that the interleaving layers of convolutions and pooling emulate the functions of simple cells and complex cells discovered in the visual cortices of cats.

CNN-type networks continued to evolve in the 1980s, and many different variants were introduced and studied. However, despite the remarkable capacity of deep networks and the improved architectures inspired by neuroscience, it remained extremely difficult to train such deep networks for real tasks such as image classification. Getting a network to work depended on many unexplainable heuristics and tricks, which limited the appeal and applicability of neural networks. A major breakthrough came around 1989 when Yann LeCun successfully used backpropagation (BP) to learn a deep convolutional neural network for recognizing handwritten digits [LBD+89], later known as LeNet (see Figure 1.17). After several years of persistent development, his perseverance paid off: LeNet's performance eventually became good enough for practical use in the late 1990s [LBB+98]. It was used by the US Post Office for recognizing handwritten digits (for zip codes). LeNet is considered the "prototype" network for all modern deep neural networks, such as AlexNet [KSH12] and ResNet [HZR+16a], which we will discuss later. For this work, Yann LeCun was awarded the 2018 Turing Award, together with two other pioneers of deep networks, Yoshua Bengio and Geoffrey Hinton.

Figure 1.17: The LeNet-5 convolutional neural network designed by Yann LeCun in 1989.
Backpropagation.

Throughout history, the fate of deep neural networks has been tied to how easily and efficiently they can be trained. Backpropagation (BP) was introduced for this purpose. A multilayer perceptron can be expressed as a composition of linear mappings and nonlinear activations:

$h(\bm{W}_1,\ldots,\bm{W}_L)=f^{L}(\bm{W}_L f^{L-1}(\bm{W}_{L-1}\cdots f^{2}(\bm{W}_2 f^{1}(\bm{W}_1\bm{x})))).$  (1.3.29)

To train the network weights $\{\bm{W}_l\}_{l=1}^{L}$ via gradient descent, we must evaluate the gradients $\partial h/\partial\bm{W}_l$. The chain rule of calculus shows that these gradients can be computed efficiently for such compositional functions, a technique later termed backpropagation (BP); see Appendix A for details. BP was already known in optimal control and dynamic programming during the 1960s and 1970s, appearing in Paul Werbos's 1974 PhD thesis [Wer74, Wer94]. In 1986, David Rumelhart et al. were the first to apply it to train a multilayer perceptron (MLP) [RHW86]. Since then, BP has become the dominant technique for training deep networks, as it is scalable and can be efficiently implemented on parallel and distributed computing platforms. However, nature likely does not use BP: as we have discussed earlier, nature almost ubiquitously learns to correct errors via closed-loop feedback, and end-to-end BP is too expensive for physical implementation; it is computationally intractable for neural structures as complex as the brain, especially if updates need to happen in real time. Localized updates to small sections of the neural circuitry are much more biologically feasible. There is a relatively small body of work on transplanting such "local learning" rules to training deep networks, circumventing BP [BS16, MSS+22, LTP25]. We believe this is an exciting opportunity to improve the scalability of training deep networks, leaving ample room for future improvement.
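The sketch below spells out the chain rule for a two-layer ReLU network with a squared-error loss, i.e., backpropagation in its simplest form; the particular architecture, loss, and variable names are our own illustrative choices rather than (1.3.29) verbatim.

```python
import numpy as np

def forward_backward(x, y, W1, W2):
    """Chain-rule gradients for a two-layer ReLU network h = W2 relu(W1 x)
    with squared-error loss 0.5*||h - y||^2: backpropagation in its simplest form."""
    a1 = W1 @ x
    z1 = np.maximum(a1, 0.0)        # hidden activation (ReLU)
    h = W2 @ z1                     # network output
    loss = 0.5 * np.sum((h - y) ** 2)
    # Backward pass: propagate the error from the output layer toward the input.
    dh = h - y                      # dL/dh
    dW2 = np.outer(dh, z1)          # dL/dW2
    dz1 = W2.T @ dh                 # dL/dz1
    da1 = dz1 * (a1 > 0)            # through the ReLU
    dW1 = np.outer(da1, x)          # dL/dW1
    return loss, dW1, dW2

rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), rng.standard_normal(2)
W1, W2 = rng.standard_normal((8, 4)), rng.standard_normal((2, 8))
loss, dW1, dW2 = forward_backward(x, y, W1, W2)
print(loss, dW1.shape, dW2.shape)
```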

Despite the aforementioned algorithmic advances, training deep networks remained finicky and computationally expensive in the 1980s and 1990s. By the late 1990s, support vector machines (SVMs) [CV95] had gained popularity as a superior alternative for classification tasks (similar ideas for classification problems trace back to Thomas Cover's 1964 PhD dissertation, which was condensed and published in a paper in 1964 [Cov64]). SVMs offered two advantages: a rigorous statistical learning framework known as the Vapnik–Chervonenkis (VC) theory and efficient convex optimization algorithms [BV04]. The rise of SVMs ushered in a second winter for neural networks in the early 2000s.

Compressive auto-encoding.

In the late 1980s and 1990s, artificial neural networks were already being used to learn low-dimensional representations of high-dimensional data such as images. It had been shown that neural networks could learn PCA directly from data [Oja82, BH89], rather than using the classic methods discussed in Section 1.3.1. It was also argued during this period that, due to their ability to model nonlinear transformations, neural networks could learn low-dimensional representations for data with nonlinear distributions. Similar to the linear PCA case, one can simultaneously learn a nonlinear dimension-reduction encoder $f$ and a decoder $g$, each modeled by a deep neural network [RHW86, Kra91]:

$\bm{X}\xrightarrow{\;f\;}\bm{Z}\xrightarrow{\;g\;}\hat{\bm{X}}.$  (1.3.30)

By enforcing consistency between the decoded data $\hat{\bm{X}}$ and the original $\bm{X}$—for example, by minimizing an MSE-type reconstruction error (MSE-type errors are known to be problematic for imagery data with complex nonlinear structures; as we will soon discuss, much recent work on generative methods, including GANs, has focused on finding better surrogate distance functions between the original data $\bm{X}$ and the regenerated $\hat{\bm{X}}$):

$\min_{f,g}\|\bm{X}-\hat{\bm{X}}\|_{2}^{2},\quad\text{where}\quad \hat{\bm{X}}=g(f(\bm{X})),$  (1.3.31)

an autoencoder can be learned directly from the data $\bm{X}$.
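As a minimal illustration of (1.3.30)-(1.3.31), the sketch below trains a one-hidden-layer nonlinear autoencoder by plain gradient descent on the reconstruction error; the architecture (a tanh encoder and a linear decoder), learning rate, and hand-derived gradients are our own simplifications, not a recipe from this book. The caller supplies a data matrix X of shape (N, D).

```python
import numpy as np

def train_autoencoder(X, d, lr=0.1, n_iter=500):
    """Minimal nonlinear autoencoder trained to minimize the reconstruction
    error (1.3.31).  Encoder f(x) = tanh(W_e x), decoder g(z) = W_d z;
    gradients are derived by hand (a deep-learning framework would do the same)."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    W_e = 0.1 * rng.standard_normal((d, D))
    W_d = 0.1 * rng.standard_normal((D, d))
    for _ in range(n_iter):
        Z = np.tanh(X @ W_e.T)            # encode
        X_hat = Z @ W_d.T                 # decode
        R = X_hat - X                     # reconstruction residual
        dW_d = R.T @ Z / N
        dZ = (R @ W_d) * (1 - Z ** 2)     # backprop through tanh
        dW_e = dZ.T @ X / N
        W_d -= lr * dW_d
        W_e -= lr * dW_e
    return W_e, W_d
```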

But how can we guarantee that such an auto-encoding indeed captures the true low-dimensional structures in $\bm{X}$ rather than yielding a trivial redundant representation? For instance, one could simply choose $f$ and $g$ to be identity maps and set $\bm{Z}=\bm{X}$. To ensure the auto-encoding is worthwhile, the resulting representation should be compressive according to some computable measure of complexity. In 1993, Geoffrey Hinton and colleagues proposed using coding length as such a measure, transforming the auto-encoding objective into finding a representation that minimizes coding length [HZ93]. This work established a fundamental connection between the principle of minimum description length [Ris78] and free (Helmholtz) energy minimization. Later work from Hinton's group [HS06] empirically demonstrated that such auto-encoding can learn meaningful low-dimensional representations for real-world images. Pierre Baldi provided a comprehensive survey of autoencoders in 2011 [Bal11], just before deep networks gained widespread popularity. We will discuss measures of complexity and auto-encoding further in Section 1.4, and present a systematic study of compressive auto-encoding in Chapter 5 from a more unified perspective.

Figure 1.18: Architecture of LeNet [LBD+89] versus AlexNet [KSH12].

Modern Deep Neural Networks

For nearly 30 years—from the 1980s to the 2010s—neural networks were not taken seriously by the mainstream machine learning community. Early deep networks such as LeNet showed promising performance on small-scale classification problems like digit recognition, yet their design was largely empirical, the available datasets were tiny, and back-propagation was computationally prohibitive for the hardware of the era. These factors led to waning interest and stagnant progress, with only a handful of researchers persisting.

Classification and recognition.

The tremendous potential of deep networks could be unleashed only once sufficient data and computational power became available. By the 2010s, large datasets such as ImageNet had emerged, and GPUs had become powerful enough to make back-propagation affordable even for networks far larger than LeNet. In 2012, the deep convolutional network AlexNet—named for Alex Krizhevsky, one of the authors [KSH12]—attracted widespread attention by surpassing existing classification methods on ImageNet by a significant margin (deep networks had already achieved state-of-the-art results on speech-recognition tasks, but these successes received far less attention than the breakthrough on image classification). Figure 1.18 compares AlexNet with LeNet. AlexNet retains many characteristics of LeNet—it is simply larger and replaces LeNet's sigmoid activations with ReLUs. This work contributed to Geoffrey Hinton's 2018 Turing Award.

This early success inspired the machine intelligence community to explore new neural network architectures, variations, and improvements. Empirically, it was discovered that larger and deeper networks yield better performance on tasks such as image classification. Notable architectures that emerged include VGG [SZ15], GoogLeNet [SLJ+14], ResNet [HZR+16], and, more recently, Transformers [VSP+17]. Despite rapid empirically-driven performance improvements, theoretical explanations for these architectures—and the relationships among them, if any—remained scarce. One goal of this book is to uncover the common objective these networks optimize and to explain their shared characteristics, such as multiple layers of linear operators interleaved with nonlinear activations (see Chapter 4).

Reinforcement learning.

Early deep network successes were mainly in supervised classification tasks such as speech and image recognition. Later, deep networks were adopted to learn decision-making or control policies for game playing. In this context, deep networks model the optimal decision/control policy (i.e., a probability distribution over actions to take in order to maximize the expected reward) or the associated optimal value function (an estimate of the expected reward from the given state), as shown in Figure 1.19. Network parameters are incrementally optimized (typically via backpropagation) based on rewards returned from the success or failure of playing the game with the current policy. This learning paradigm is called reinforcement learning [SB18]; it originated in control-systems practice of the late 1960s [WF65, MM70] and traces back through a long and rich history to Richard Bellman's dynamic programming [Bel57] and Marvin Minsky's trial-and-error learning [Min54] in the 1950s.

From an implementation standpoint, the marriage of deep networks and reinforcement learning proved powerful: deep networks can approximate control policies and value functions for real-world environments that are difficult to model analytically. This culminated in DeepMind's AlphaGo system, which stunned the world in 2016 by defeating top Go player Lee Sedol and, in 2017, world champion Jie Ke. (In 1997, IBM's Deep Blue had made history by defeating Russian grandmaster Garry Kasparov in chess using traditional machine learning techniques such as tree search and pruning, methods that have proven less scalable and unsuccessful for more complex games like Go.)

AlphaGo’s success surprised the computing community, which had long regarded the game’s state space as too vast for any efficient solution in terms of computation and sample size. The only plausible explanation is that the optimal value and policy function of Go possess significant favorable structure: qualitatively speaking, their intrinsic dimensions are low enough so that they can be well approximated by neural networks learnable from a manageable number of samples.

Figure 1.19: AlphaGo: using deep neural networks to model the optimal policy or value function for the game Go.
Generation and prediction.

From a certain perspective, the early practice of deep networks in the 2010s was focused on extracting relevant information from the data $\bm{X}$ and encoding it into a task-specific representation $\bm{Z}$ (say, $\bm{Z}$ denotes class labels in classification tasks; this is a highly atypical notation for labels—the usual notation is $\bm{y}$—but it is useful for our purposes to consider labels as highly compressed and sparse representations of the data; see Example 1.2 for more details):

$\bm{X}\xrightarrow{\;f\;}\bm{Z}.$  (1.3.32)

Here the learned mapping $f$ need not preserve most distributional information about $\bm{X}$; it suffices to retain only the sufficient statistics for the task. For example, a sample $\bm{x}\in\bm{X}$ might be an image of an apple, mapped by $f$ to the class label $\bm{z}=$ "apple".

In many modern settings—such as the so-called large general-purpose ("foundation") models—we may need to also decode $\bm{Z}$ to recover the corresponding $\bm{X}$ to a prescribed precision:

$\bm{Z}\xrightarrow{\;g\;}\hat{\bm{X}}.$  (1.3.33)

Because $\bm{X}$ typically represents data observed from the external world, a good decoder allows us to simulate or predict what happens in the world. In a "text-to-image" or "text-to-video" task, for instance, $\bm{z}$ is the text describing the desired image $\bm{x}$, and the decoder should generate an $\hat{\bm{x}}$ whose content matches $\bm{x}$. Given an object class $\bm{z}=$ "apple", the decoder $g$ should produce an image $\hat{\bm{x}}$ that looks like an apple, though not necessarily identical to the original $\bm{x}$.

Generation via discriminative approaches.

For the generated images $\hat{\bm{X}}$ to resemble true natural images $\bm{X}$, we must be able to evaluate and minimize some distance:

$\min \operatorname{dist}(\bm{X},\hat{\bm{X}}).$  (1.3.34)

As it turns out, most theoretically motivated distances are extremely difficult—if not impossible—to compute and optimize for distributions in high-dimensional space with low intrinsic dimension. (This remains true even when a parametric family of distributions for $\bm{X}$ is specified: the distance often becomes ill-conditioned or ill-defined for distributions with low-dimensional supports. Worse still, the chosen family might poorly approximate the true distribution of interest.)

In 2007, Zhuowen Tu, disappointed by early analytical attempts to model and generate natural images, decided to try a drastically different approach. In a paper published at CVPR 2007 [Tu07], he first suggested learning a generative model for images via a discriminative approach. The idea is simple: if evaluating the distance $\operatorname{dist}(\bm{X},\hat{\bm{X}})$ proves difficult, one can instead learn a discriminator $d$ to separate $\hat{\bm{X}}$ from $\bm{X}$:

$\bm{Z}\xrightarrow{\;g\;}\hat{\bm{X}}\xrightarrow{\;d\;}\{0,1\},\qquad \bm{X}\xrightarrow{\;d\;}\{0,1\},$  (1.3.35)

where labels 0 and 1 indicate whether an image is generated or real. Intuitively, the harder it becomes to separate $\hat{\bm{X}}$ and $\bm{X}$, the closer they likely are.

Tu's work [Tu07] first demonstrated the feasibility of learning a generative model from a discriminative approach. However, the work employed traditional methods for image generation and distribution classification (such as boosting), which proved slow and difficult to implement. After 2012, deep neural networks gained popularity for image classification. In 2014, Ian Goodfellow and colleagues again proposed generating natural images with a discriminative approach [GPM+14]. They suggested modeling both the generator $g$ and discriminator $d$ with deep neural networks. Moreover, they proposed learning $g$ and $d$ via a minimax game:

$\min_{g}\max_{d}\;\ell(\bm{X},\hat{\bm{X}}),$  (1.3.36)

where $\ell(\cdot)$ is some natural loss function associated with the classification task. In this formulation, the discriminator $d$ maximizes its success in separating $\bm{X}$ and $\hat{\bm{X}}$, while the generator $g$ does the opposite. This approach is named generative adversarial networks (GANs). It was shown that, once trained on a large dataset, GANs can indeed generate photo-realistic images. Partially due to this work's influence, Yoshua Bengio received the 2018 Turing Award.

The discriminative approach appears to cleverly bypass a fundamental difficulty in distribution learning. Rigorously speaking, however, it does not fully resolve that difficulty. It is shown in [GPM+14] that, with a properly chosen loss, the minimax formulation becomes mathematically equivalent to minimizing the Jensen–Shannon distance (see [CT91]) between $\bm{X}$ and $\hat{\bm{X}}$. This remains a hard problem for two low-dimensional distributions in high-dimensional space. Consequently, GANs typically rely on many heuristics and engineering tricks and often suffer from instability issues such as mode collapse. (Nevertheless, the minimax formulation provides a practical approximation of the distance; it simplifies implementation and avoids certain caveats in directly computing the distance.)
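The alternation behind the minimax game (1.3.36) can be illustrated in one dimension with a linear generator and a logistic discriminator trained by stochastic gradient steps. The toy sketch below (using the standard non-saturating generator loss) is only meant to show the alternating updates; every parameter and distribution in it is our own illustrative choice, not a practical GAN.

```python
import numpy as np

# Toy 1-D GAN for the minimax game (1.3.36): generator g(z) = a*z + b,
# discriminator d(x) = sigmoid(w*x + c).  Real data is N(3, 0.5^2).
rng = np.random.default_rng(0)
a, b = 1.0, 0.0                      # generator parameters
w, c = 0.0, 0.0                      # discriminator parameters
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
lr = 0.05

for _ in range(5000):
    x_real = 3.0 + 0.5 * rng.standard_normal()
    z = rng.standard_normal()
    x_fake = a * z + b
    s_real, s_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)

    # Discriminator ascends log d(x_real) + log(1 - d(x_fake)).
    w += lr * ((1 - s_real) * x_real - s_fake * x_fake)
    c += lr * ((1 - s_real) - s_fake)

    # Generator ascends log d(x_fake) (the standard non-saturating loss).
    s_fake = sigmoid(w * x_fake + c)
    g_out = (1 - s_fake) * w         # d log d(x_fake) / d x_fake
    a += lr * g_out * z
    b += lr * g_out

print(a, b)   # the generated distribution should drift toward the real data
```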

Generation via denoising and diffusion.

In 2015, shortly after GANs were introduced and gained popularity, Surya Ganguli and his students proposed that an iterative denoising process modeled by a deep network could be used to learn a general distribution, such as that of natural images [SWM+15]. Their method was inspired by properties of special Gaussian and binomial processes studied by William Feller in 1949 [Fel49] (again, from the magical era of the 1940s!).

Figure 1.20: Illustration of an iterative denoising and compressing process that, starting from a Gaussian distribution $q^{0}=\mathcal{N}(\bm{0},\bm{I})$, converges to an arbitrary low-dimensional data distribution $q^{L}=p$.

Soon, denoising operators based on the score function [Hyv05], briefly introduced in Section 1.3.1, were shown to be more general and to unify the denoising and diffusion processes and algorithms [SE19, SSK+21, HJA20]. Figure 1.20 illustrates the process that transforms a generic Gaussian distribution $q^{0}=\mathcal{N}(\bm{0},\bm{I})$ into an (arbitrary) empirical distribution $p(\bm{x})$ by performing a sequence of iterative denoising (or compressing) operations:

\bm{z}^{0}\sim\mathcal{N}(\bm{0},\bm{I})\xrightarrow{\;g^{0}\;}\bm{z}^{1}\xrightarrow{\;g^{1}\;}\cdots\xrightarrow{\;g^{L-1}\;}\bm{z}^{L}\sim p(\bm{x}). \qquad (1.3.37)

By now, denoising (and diffusion) has replaced GANs and become the mainstream method for learning distributions of images and videos, leading to popular commercial image generation engines such as Midjourney and Stability.ai. In Chapter 3 we will systematically introduce and study the denoising and diffusion method for learning a general low-dimensional distribution.
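
To make the chain (1.3.37) concrete, the following is a heavily simplified sketch of generation by iterative denoising (annealed Langevin-style updates) on a toy one-dimensional distribution with two modes at $\pm 2$, for which the noise-perturbed score can be written in closed form. Real diffusion models instead learn the score with a deep network; the noise schedule and step sizes below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
centers = np.array([-2.0, 2.0])           # the target p(x): mass concentrated near +/-2

def score(z, sigma):
    # closed-form score of p(x) convolved with Gaussian noise of level sigma
    diffs = centers[None, :] - z[:, None]
    w = np.exp(-diffs ** 2 / (2 * sigma ** 2))
    w = w / w.sum(axis=1, keepdims=True)
    return (w * diffs).sum(axis=1) / sigma ** 2

z = 3.0 * rng.standard_normal(5000)        # z^0: samples from a generic Gaussian
for sigma in np.geomspace(3.0, 0.05, 30):  # gradually reduce the noise level
    step = 0.1 * sigma ** 2
    for _ in range(20):                    # a few Langevin denoising updates per level
        z = z + step * score(z, sigma) + np.sqrt(2 * step) * rng.standard_normal(z.shape)

print(round(float(np.mean(z > 0)), 2))     # about 0.5: samples split between the two modes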

1.4 A Unifying Approach

So far, we have given a brief account of the main objective and history of machine intelligence, along with many important ideas and approaches associated with it. In recent years, following the empirical success of deep neural networks, tremendous efforts have been made to develop theoretical frameworks that help us understand empirically designed deep networks—whether specific seemingly necessary components (e.g., dropout [SHK+14], normalization [IS15, BKH16], attention [VSP+17]) or their overall behaviors (e.g., double descent [BHM+19], neural collapse [PHD20]).

Motivated in part by this trend, this book pursues several important and challenging goals:

  • Develop a theoretical framework that yields rigorous mathematical interpretations of deep neural networks.

  • Ensure correctness of the learned data distribution and consistency with the learned representation.

  • Demonstrate that the framework leads to performant architectures and guides further practical improvements.

In recent years, mounting evidence suggests these goals can indeed be achieved by leveraging the theory and solutions of the classical analytical low-dimensional models discussed earlier (treated more thoroughly in Chapter 2) and by integrating fundamental ideas from related fields—namely information/coding theory, control/game theory, and optimization. This book aims to provide a systematic introduction to this new approach.

1.4.1 Learning Parsimonious Representations

One necessary condition for any learning task to be possible is that the sequences of interest must be computable, at least in the sense of Alan Turing [Tur36]. That is, a sequence can be computed via a program on a typical computer.\footnote{There are indeed well-defined sequences that are not computable.} In addition to being computable, we require computation to be tractable.\footnote{We do not need to consider predicting things whose computational complexity is intractable—say, grows exponentially in the length or dimension of the sequence.} That is, the computational cost (space and time) for learning and computing the sequence should not grow exponentially in its length. Furthermore, as we see in nature (and in the modern practice of machine intelligence), for most practical tasks an intelligent system needs to learn what is predictable from massive data in a very high-dimensional space, such as from vision, sound, and touch. Hence, for intelligence we do not need to consider all computable and tractable sequences or structures; we should focus only on predictable sequences and structures that admit scalable realizations of their learning and computing algorithms:

\text{computable}\;\Longrightarrow\;\text{tractable}\;\Longrightarrow\;\text{scalable}. \qquad (1.4.1)

This is because whatever algorithms intelligent beings use to learn useful information must be scalable. More specifically, the computational complexity of the algorithms should scale gracefully, typically linearly or even sublinearly, in the size and dimension of the data. On the technical level, this requires that the operations the algorithms rely on to learn can only utilize oracle information that can be efficiently computed from the data. In particular, when the dimension is high and the scale is large, the only oracles one can afford to compute are either first-order geometric information about the data\footnote{Such as approximating a nonlinear structure locally with linear subspaces and computing the gradient of an objective function.} or second-order statistical information.\footnote{Such as the covariance or correlation of the data or their features.} The main goal of this book is to develop a theoretical and computational framework within which we can systematically develop efficient and effective solutions or algorithms with such scalable oracles and operations to learn low-dimensional structures from the sampled data and, subsequently, the predictive function.
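
As a small illustration of what such scalable oracles look like in code, the sketch below accumulates a first-order quantity (the gradient of a least-squares objective) and a second-order statistic (the data covariance) in a single streaming pass over mini-batches, so the cost grows only linearly in the number of samples $N$. The data, dimensions, and linear model are arbitrary stand-ins for this illustration.

import numpy as np

rng = np.random.default_rng(0)
D, N, batch = 16, 100_000, 1_000
w = np.zeros(D)                          # parameters of a linear predictor
sum_x = np.zeros(D)
sum_xxT = np.zeros((D, D))
grad = np.zeros(D)

for _ in range(N // batch):
    X = rng.standard_normal((batch, D))  # stand-in for a stream of sensed data
    y = X @ np.ones(D) + 0.1 * rng.standard_normal(batch)
    sum_x += X.sum(axis=0)               # accumulators for second-order statistics
    sum_xxT += X.T @ X
    grad += X.T @ (X @ w - y)            # first-order oracle: gradient of 0.5*||Xw - y||^2

mean = sum_x / N
cov = sum_xxT / N - np.outer(mean, mean) # second-order oracle: covariance of the data
grad /= N
print(cov.shape, grad.shape)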

Pursuing low-dimensionality via compression.

From the examples of sequences we gave in Section 1.2.1, it is clear that some sequences are easy to model and compute, while others are more difficult. The computational cost of a sequence depends on the complexity of the predicting function $f$. The higher the degree of regression $d$, the more costly it is to compute. The function $f$ could be a simple linear function, or it could be a nonlinear function that is arbitrarily difficult to specify and compute.

It is reasonable to believe that if a sequence is harder—by whatever measure we choose—to specify and compute, then it will also be more difficult to learn from its sampled segments. Nevertheless, for any given predictable sequence, there are in fact many, often infinitely many, ways to specify it. For example, the simple sequence $x_{n+1}=ax_{n}$ could equally be defined by $x_{n+1}=ax_{n}+bx_{n-1}-bx_{n-1}$. Hence it would be very useful to develop an objective and rigorous notion of "complexity" for any given computable sequence.

Andrey Kolmogorov, a Russian mathematician, was one of the first to define complexity for any computable sequence.\footnote{Many contributed to this notion of sequence complexity, most notably Ray Solomonoff and Greg Chaitin. All three developed algorithmic information theory independently—Solomonoff in 1960, Kolmogorov in 1965 [Kol98], and Chaitin around 1966 [Cha66].} He proposed that among all programs computing the same sequence, the length of the shortest program measures its complexity. This aligns with Occam's Razor: choose the simplest theory that explains the same observation. Formally, let $p$ be a program generating sequence $S$ on a universal computer $\mathcal{U}$. The Kolmogorov complexity of $S$ is:

K(S)=\min_{p\,:\,\mathcal{U}(p)=S}L(p). \qquad (1.4.2)

Thus, complexity measures how parsimoniously we can specify or compute the sequence. This definition is conceptually important and historically inspired profound studies in computational complexity and theoretical computer science.

The shortest program length represents the ultimate compression of the sequence, quantifying our gain from learning its generative mechanism. However, Kolmogorov complexity is generally uncomputable [CT91] and intractable to approximate. Consequently, it has little practical use—it cannot predict learning difficulty or assess how well we have learned.
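
Even so, any off-the-shelf lossless compressor gives a computable upper bound on how parsimoniously a particular sequence can be described, which foreshadows the constructive measures introduced next. The following crude illustration uses zlib purely as such a stand-in, not as an approximation sanctioned by the definition (1.4.2): a sequence generated by a tiny program compresses dramatically, while a random one barely compresses at all.

import zlib
import random

random.seed(0)
n = 10_000
predictable = bytes((i * 7) % 256 for i in range(n))            # generated by a short program
unpredictable = bytes(random.randrange(256) for _ in range(n))  # no short description expected

for name, s in [("predictable", predictable), ("random", unpredictable)]:
    print(name, len(s), "->", len(zlib.compress(s, 9)), "bytes")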

Computable measure of parsimony.

For practical purposes, we need an efficiently computable measure of complexity for sequences generated by the same predicting function.\footnote{In practice, we typically care about learning the predicting function $f$, rather than any particular sequence generated by $f$.} Note that part of the reason Kolmogorov complexity is not computable is that its definition is non-constructive.

To introduce a computable measure of complexity, we may take a more constructive approach, as advocated by Claude Shannon through the framework of information theory [Sha48, CT91].\footnote{This framework has successfully guided engineering practice in the communication industry for over 80 years.} By assuming the sequence $S$ is drawn from a probability distribution $p(S)$, the entropy of the distribution:\footnote{Here we consider differential entropy, as we assume the sequence consists of continuous variables. For discrete variables, we would use $H(S)=-\sum_{i}p(s_{i})\log p(s_{i})$.}

h(S)\doteq-\int p(s)\log p(s)\,\mathrm{d}s \qquad (1.4.3)

provides a natural measure of its complexity. This measure also has a natural interpretation as the average number of binary bits needed to encode such a sequence, as we will see in Chapter 3.
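
As a quick numerical sanity check of (1.4.3), consider a scalar Gaussian with standard deviation $\sigma$, whose differential entropy has the closed form $\frac{1}{2}\log(2\pi e\sigma^{2})$; a Monte Carlo average of $-\log p(s)$ over samples recovers the same value. The Gaussian is chosen here only because its entropy is known exactly.

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
s = rng.normal(0.0, sigma, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * sigma ** 2) - s ** 2 / (2 * sigma ** 2)

h_monte_carlo = -log_p.mean()                               # empirical average of -log p(s)
h_closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
print(round(h_monte_carlo, 3), round(h_closed_form, 3))     # both about 2.11 nats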

To illustrate the main ideas of this view, let us take a large number of long sequence segments:

\{S_{1},S_{2},\ldots,S_{i},\ldots,S_{N}\}\subset\mathbb{R}^{D}, \qquad (1.4.4)

generated by a predicting function $f$. Without loss of generality, we assume all sequences have the same length $D$, so each sequence can be viewed as a vector in $\mathbb{R}^{D}$. We introduce a coding scheme (with a codebook), denoted $\mathcal{E}$, that maps every segment $S_{i}$ to a unique stream of binary bits $\mathcal{E}(S_{i})$, and a corresponding decoding scheme $\mathcal{D}$ that maps binary bit strings back to sequence segments (say, subject to error $\epsilon\geq 0$, as such an encoding scheme can be lossy). The simplest scheme fills the space spanned by all segments with $\epsilon$-balls, as shown in Figure 1.21. We then number the balls, and each sequence is encoded as the binary index of its closest ball, while each binary index is decoded back to the center of the corresponding ball. Thus, each segment can be recovered up to precision $\epsilon$ from its bit stream.

Figure 1.21: Comparison of two coding schemes. Imagine the true data distribution lies along the two arrowed lines. Samples from these lines can be encoded with a codebook of all blue balls, or with a codebook of only the green balls. The second scheme yields a much lower coding length/rate subject to the same precision.

Then, for an encoding $\mathcal{E}$, the complexity of the predicting function $f$ can be evaluated as the average coding length $L(\mathcal{E}(S))$ of all sequences $S$ that it generates, known as the coding rate:\footnote{One may make this more precise by taking $R(S\mid\mathcal{E})$ to be the expected coding length for all segments $S$ of length $D$.}

R(S\mid\mathcal{E})=\operatorname{\mathbb{E}}[L(\mathcal{E}(S))]\approx\frac{1}{N}\sum_{i=1}^{N}L(\mathcal{E}(S_{i})). \qquad (1.4.5)

The coding rate measure will change if we use a different coding scheme (or codebook). In practice, the better we know the low-dimensional structure around which the segments are distributed, the more efficient a codebook we can design, as illustrated in Figure 1.21. Astute readers may recognize that conceptually the denoising process illustrated in Figure 1.20 closely resembles going from the coding scheme with the blue balls to that with the green ones.
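
The following sketch mimics the comparison in Figure 1.21: the data lie along a one-dimensional line inside the square $[-1,1]^{2}$, codebook 1 fills the whole square with $\epsilon$-balls (the blue balls), and codebook 2 places centers only along the line (the green balls). Each sample is encoded by the index of its nearest center, so the rate is roughly $\log_{2}$ of the number of centers; the specific line, precision, and grid spacing are arbitrary choices for illustration.

import numpy as np

eps = 0.05
grid = np.arange(-1, 1 + eps, eps)
codebook1 = np.array([[u, v] for u in grid for v in grid])  # "blue balls": cover the whole square
codebook2 = np.array([[t, t] for t in grid])                # "green balls": cover only the line y = x

x = np.array([0.3, 0.3])                                    # a sample lying on the line
idx = np.argmin(np.linalg.norm(codebook2 - x, axis=1))      # encode: index of the nearest center
x_hat = codebook2[idx]                                      # decode: the center itself (error below eps)

rate1 = np.log2(len(codebook1))                             # bits per sample, generic codebook
rate2 = np.log2(len(codebook2))                             # bits per sample, structure-aware codebook
print(f"R(S|E1) ~ {rate1:.1f} bits, R(S|E2) ~ {rate2:.1f} bits")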

Given two coding schemes $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ for the segments, if the difference in the coding rates is positive:

R(S\mid\mathcal{E}_{1})-R(S\mid\mathcal{E}_{2})>0, \qquad (1.4.6)

we may say the coding scheme $\mathcal{E}_{2}$ is better. This difference can be viewed as a measure of how much more information $\mathcal{E}_{2}$ has over $\mathcal{E}_{1}$ about the distribution of the data. To a large extent, the goal of learning the data distribution is equivalent to finding the most efficient encoding and decoding scheme that minimizes the coding rate subject to a desired precision:

\min_{\mathcal{E},\mathcal{D}}R(S\mid\mathcal{E})\quad\text{subject to}\quad\operatorname{dist}(S,\mathcal{D}(\mathcal{E}(S)))\leq\epsilon. \qquad (1.4.7)

As we will see in Chapter 3, the achievable minimal rate is closely related to the notion of entropy $h(S)$ in (1.4.3).

Remark 1.1.

The perspective of measuring data complexity with explicit encoding schemes has motivated several learning objectives proposed to revise Kolmogorov complexity for better computability [WD99], including the minimum message length (MML) proposed in 1968 [WB68] and the minimum description length (MDL) in 1978 [Ris78, HY01]. These objectives normally count the coding length for the coding scheme $\mathcal{E}$ itself (including its codebook) in addition to that of the data $S$ of interest: $L(\mathcal{E}(S))+L(\mathcal{E})$. However, if the goal is to learn a finite-sized codebook and apply it to a large number of sequences, the amortized cost of the codebook can be ignored, since

\frac{1}{N}\Big(L(\mathcal{E})+\sum_{i=1}^{N}L(\mathcal{E}(S_{i}))\Big)\approx\frac{1}{N}\sum_{i=1}^{N}L(\mathcal{E}(S_{i})) \qquad (1.4.8)

as $N$ becomes large.

Again, one may view the resulting optimal coding scheme as the one that achieves the best compression of the observed data. In general, compared to the Kolmogorov complexity, the coding length given by any encoding scheme will always be larger:

K(S)<L(\mathcal{E}(S)). \qquad (1.4.9)

Therefore, minimizing the coding rate/length essentially minimizes an upper bound of the otherwise uncomputable Kolmogorov complexity.

1.4.2 Learning Informative Representations

If the goal were simply to compress the given data for its own sake, the optimal codes approaching Kolmogorov complexity would in theory become nearly random or structureless [Cha66].\footnote{Any codes with structure can be further compressed.} However, our true purpose in learning the data distribution of sequences $S$ is to use it repeatedly with ease in future predictions. Hence, while compression allows us to identify the low-dimensional distribution in the data, we would like to encode this distribution in a structured and organized way so that the resulting representation is informative and efficient to use.\footnote{For example, to sample the distribution under different conditions.} Figure 1.22 illustrates intuitively why such a transformation is desirable.

Figure 1.22: Transforming the identified low-dimensional data distribution into a more informative and structured representation.

As we will show in Chapter 3, these desired structures in the final representation can be precisely promoted by choosing a natural measure of information gain based on the coding rates of the chosen coding schemes.\footnote{The information gain is a pervasive idea in algorithmic information theory and has been incorporated in several approaches to this topic, most recently $\mathcal{V}$-information [XZS+20]. That particular instantiation, like many other measures of complexity we have discussed, becomes increasingly poorly conditioned as the data support becomes more compact and low-dimensional, and so is not suitable for our needs.} Throughout this book, such an explicit and constructive coding approach provides a powerful computational framework for learning good representations of low-dimensional structures in real-world data. In many cases of practical importance, the coding length function can be efficiently computed or accurately approximated. In some benign cases, we can even obtain closed-form formulae—for example, for subspace and Gaussian models (see Chapter 3).

Moreover, this computational framework leads to a principled approach that naturally reveals the role deep networks play in this learning process. As we will derive systematically in Chapter 4, the layers of a deep network perform operations that incrementally optimize the objective function of interest. From this perspective, the role of deep networks can be precisely interpreted as emulating a certain iterative optimization algorithm—say, gradient descent—to optimize the objective of information gain. Layers of the resulting deep architectures can be endowed with precise statistical and geometric interpretations, namely performing incremental compressive encoding and decoding operations. Consequently, the derived deep networks become transparent “white boxes” that are mathematically fully explainable.

1.4.3 Learning Consistent Representations

To summarize our discussions so far, let us denote the data as

\bm{X}=\{S_{1},S_{2},\ldots,S_{i},\ldots,S_{N}\}\subset\mathbb{R}^{D}, \qquad (1.4.10)

and let $\bm{Z}=\mathcal{E}(\bm{X})$ be the codes of $\bm{X}$ via some encoder $\mathcal{E}$:

\bm{X}\xrightarrow{\;\mathcal{E}\;}\bm{Z}. \qquad (1.4.11)

In the machine learning context, $\bm{Z}$ is often called the "features" of $\bm{X}$. Note that without knowing the underlying distribution of $\bm{X}$, we do not know which encoder $\mathcal{E}$ should be used to retain the most useful information about the distribution of $\bm{X}$. In practice, people often start by trying a compact encoding scheme that serves a specific task well. In particular, they try to learn an encoder that optimizes a certain (empirical) measure of parsimony for the learned representation:

\min\rho(\bm{Z}). \qquad (1.4.12)
Example 1.2.

For example, image classification is such a case: we assign all images in the same class to a single code and images in different classes to different codes, say “one-hot” vectors:

\bm{x}\mapsto\bm{z}\in\{[1,0,0,\ldots,0,0],\;[0,1,0,\ldots,0,0],\;\ldots,\;[0,0,0,\ldots,0,1]\}. \qquad (1.4.13)

Now, a classifier $f(\cdot)$ can be modeled as a function that predicts the probability of a given $\bm{x}$ belonging to each of the $K$ classes: $\hat{\bm{z}}=f(\bm{x})\in\mathbb{R}^{K}$. Then the "goodness" of a classifier can be measured by the so-called cross entropy:\footnote{The cross entropy can be viewed as a distance measure between the ground-truth distribution of $\bm{z}$ and that of the prediction $\hat{\bm{z}}$. It can also be viewed as the expected coding length of $\bm{z}$ if we use the optimal codebook for $\hat{\bm{z}}$ to encode $\bm{z}$. The cross entropy reaches its minimum when $\bm{z}$ and $\hat{\bm{z}}$ have the same distribution.}

\ell(\hat{\bm{z}},\bm{z})=-\sum_{k=1}^{K}z_{k}\log\hat{z}_{k}, \qquad (1.4.14)

where $z_{k}$ denotes the $k$-th entry of the vector $\bm{z}$. As the early practice of deep networks indicated [KSH12], if enough data are given, such an encoding scheme can often be represented by a deep network and learned in an end-to-end fashion by optimizing the cross entropy. $\blacksquare$
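
As a quick numerical illustration of the one-hot codes (1.4.13) and the cross entropy (1.4.14), the sketch below evaluates the loss for a confident correct prediction and a confident wrong one; the class count and logits are arbitrary toy values.

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(z_hat, z):
    # (1.4.14): only the log-probability assigned to the true class contributes
    return -np.sum(z * np.log(z_hat))

K = 5
z = np.eye(K)[2]                                   # one-hot code for the third of K classes
good = softmax(np.array([0., 0., 4., 0., 0.]))     # classifier confident in the correct class
bad = softmax(np.array([4., 0., 0., 0., 0.]))      # classifier confident in a wrong class
print(round(cross_entropy(good, z), 3), round(cross_entropy(bad, z), 3))  # ~0.07 vs ~4.07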

The cross-entropy loss $\ell(\hat{\bm{z}},\bm{z})$ can be viewed as a special measure of parsimony $\rho(\bm{z})$ associated with a particular family of encoding schemes suitable for classification. However, such an encoding is obviously very lossy: the learned $\bm{z}$ contains no information about $\bm{x}$ other than its class type. For example, once an image is assigned (a code representing) the class label "apple", we can no longer tell from the label alone which specific type of apple appears in the original image.

Of course, the other extreme is to require the coding scheme to be lossless. That is, there is a one-to-one mapping between $\bm{x}$ and its code $\bm{z}$. However, as we will see in Chapter 3, lossless coding (or compression) is impractical unless $\bm{x}$ is discrete. For a continuous random variable, we may only consider lossy coding schemes so that the coding length for the data can be finite. That is, we only encode the data up to a certain prescribed precision. As we will elaborate more in Chapter 3, lossy coding is not merely a practical choice; it plays a fundamental role in making learning of the underlying continuous distribution possible from finite samples of the distribution.

For many learning purposes, we want the feature $\bm{z}$, although lossy, to retain more information about $\bm{x}$ than just its class type. In this book, we will introduce a more general measure of parsimony based on the coding length/rate associated with a more general family of coding schemes—coding with a mixture of subspaces or Gaussians. This family has the capability to closely approximate arbitrary real-world distributions up to a certain precision. As we will see in Chapter 3 and Chapter 4, such a measure will not only preserve most information about the distribution of $\bm{X}$ but will also promote certain desirable geometric and statistical structures for the learned representation $\bm{Z}$.

Bidirectional auto-encoding for consistency.

In a broader learning context, the main goal of a compressive coding scheme $\mathcal{E}$ is to identify the low-dimensional structures in the data $\bm{X}$ so that they can be used to predict things in the original data space. This requires that the learned encoding scheme $\mathcal{E}$ admits an efficient decoding scheme, denoted $\mathcal{D}$, which maps the feature or encoded data $\bm{Z}$ back to the data space:

\bm{X}\xrightarrow{\;\mathcal{E}\;}\bm{Z}\xrightarrow{\;\mathcal{D}\;}\hat{\bm{X}}. \qquad (1.4.15)

We call such an encoding–decoding pair $(\mathcal{E},\mathcal{D})$ an auto-encoding. Figure 1.23 illustrates this process.

Figure 1.23: Illustration of the architecture of an autoencoder.

Ideally, the decoding is approximately an "inverse" of the encoding, so that the data (distribution) $\hat{\bm{X}}$ decoded from $\bm{Z}$ is similar to the original data (distribution) $\bm{X}$ to some extent.\footnote{We will make precise what we mean by "similar" later.} If this holds, we can recover or predict from $\bm{Z}$ what occurs in the original data space. In this case, we say the pair $(\mathcal{E},\mathcal{D})$ provides a consistent auto-encoding. For most practical purposes, we not only need such a decoding to exist, but also need it to be efficiently realizable and physically implementable. For example, if $\bm{x}$ is real-valued, quantization is required for any decoding scheme to be realizable on a finite-state machine, as we will explain in Chapter 3. Thus, in general, realizable encoding and decoding schemes are necessarily lossy. A central question is how to learn a compact (lossy) representation $\bm{Z}$ that can still predict $\bm{X}$ well.

Generally speaking, both the encoder and decoder can be modeled and realized by deep networks and learned by solving an optimization problem of the following form:

\min_{f,g}\big[\ell(\bm{X},\hat{\bm{X}})+\rho(\bm{Z})\big],\qquad\text{where}\quad\bm{Z}=f(\bm{X}),\quad\hat{\bm{X}}=g(\bm{Z}), \qquad (1.4.16)

where $\ell(\cdot,\cdot)$ is a distance function that promotes similarity between $\bm{X}$ and $\hat{\bm{X}}$,\footnote{Either sample-wise or distribution-wise, depending on the choice of $\ell$.} and $\rho(\bm{Z})$ is a measure that promotes parsimony and information richness of $\bm{Z}$. The classic principal component analysis (PCA) [Jol02] is a typical example of a consistent auto-encoding, which we will study in great detail in Chapter 2 as a precursor to more general low-dimensional structures. In Chapter 5, we will study how to learn consistent auto-encodings for general (say, nonlinear) low-dimensional distributions.
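
The PCA case can be read as the simplest instance of (1.4.16): a linear encoder $f(\bm{X})=\bm{X}\bm{W}$ and linear decoder $g(\bm{Z})=\bm{Z}\bm{W}^{\top}$, with the squared reconstruction error as $\ell$ and the parsimony term replaced by a hard constraint on the code dimension $d$. Below is a minimal numpy sketch on toy data lying near a $d$-dimensional subspace; the dimensions and noise level are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
N, D, d = 1000, 10, 2
# data lying near a d-dimensional subspace of R^D, plus small noise
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, D)) + 0.01 * rng.standard_normal((N, D))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:d].T                       # top-d principal directions
Z = X @ W                          # encoding: Z = f(X)
X_hat = Z @ W.T                    # decoding: X_hat = g(Z)
print(np.mean((X - X_hat) ** 2))   # reconstruction loss ell(X, X_hat); tiny here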

1.4.4 Learning Self-Consistent Representations

In the auto-encoding objective above, one must evaluate how close the decoded data $\hat{\bm{X}}$ is to the original $\bm{X}$. This typically requires external supervision or knowledge of an appropriate similarity measure. Computing similarity between $\hat{\bm{X}}$ and $\bm{X}$ can be prohibitively expensive, or even impossible.\footnote{For instance, minimizing a distributional distance between the two random variables is statistically and computationally difficult.} In nature, animals learn autonomously without comparing their estimate $\hat{\bm{X}}$ with the ground truth $\bm{X}$ in the data space; they typically lack that option.\footnote{In animals and humans, the eyes process visual data long before it arrives at the brain; the brain does not have a mechanism to process raw visual data and must rely on the eyes' features to do any learning.}

How, then, can a system learn without external supervision or comparison? How can it know that $\hat{\bm{X}}$ is consistent with $\bm{X}$ without direct comparison? This leads to the idea of "closing the loop." Under mild conditions (made precise in Chapter 5), ensuring consistency between $\bm{X}$ and $\hat{\bm{X}}$ reduces to encoding $\hat{\bm{X}}$ as $\hat{\bm{Z}}$ and checking consistency between $\bm{Z}$ and $\hat{\bm{Z}}$. We call this notion self-consistency, illustrated by:

\bm{X}\xrightarrow{\;\mathcal{E}\;}\bm{Z}\xrightarrow{\;\mathcal{D}\;}\hat{\bm{X}}\xrightarrow{\;\mathcal{E}\;}\hat{\bm{Z}}. \qquad (1.4.17)

We call this process closed-loop transcription,\footnote{Inspired by transcription between DNA and RNA or other proteins.} depicted in Figure 1.24.

Figure 1.24: Illustration of a closed-loop transcription. Here we use a mapping $f$ to represent the encoder $\mathcal{E}$ and $g$ to represent the decoder $\mathcal{D}$.

It is arguably true that any autonomous intelligent being only needs to learn a self-consistent representation $\bm{Z}$ of the observed data $\bm{X}$, because checking consistency in the original data space—often meaning in the external world—is either too expensive or even physically infeasible. The closed-loop formulation allows one to learn an optimal encoding $f(\cdot,\bm{\theta})$ and decoding $g(\cdot,\bm{\eta})$ via a min-max game that depends only on the internal (learned) feature $\bm{Z}$:

\max_{\bm{\theta}}\min_{\bm{\eta}}\big[\ell(\bm{Z},\hat{\bm{Z}})+\rho(\bm{Z})\big],\qquad\text{where}\quad\bm{Z}=f(\bm{X}),\quad\hat{\bm{X}}=g(\bm{Z}),\quad\hat{\bm{Z}}=f(\hat{\bm{X}}), \qquad (1.4.18)

where $\ell(\bm{Z},\hat{\bm{Z}})$ is a loss function based on the coding rates of the features $\bm{Z}$ and $\hat{\bm{Z}}$, which, as we will see, can be much easier to compute. Here again, $\rho(\bm{Z})$ is some measure that promotes parsimony and information richness of $\bm{Z}$. Somewhat surprisingly, as we will see in Chapter 5, under rather mild conditions, such as $\bm{X}$ being sufficiently low-dimensional, self-consistency between $(\bm{Z},\hat{\bm{Z}})$ implies consistency of $(\bm{X},\hat{\bm{X}})$! In addition, we will see that a closed-loop system allows us to learn the distribution in a continuous and incremental manner,\footnote{That is, to learn with one class at a time or even one sample at a time.} without suffering from problems such as catastrophic forgetting associated with open-loop models.
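
A schematic sketch of the min-max game (1.4.18) is given below, assuming PyTorch. The feature-space discrepancy and the log-determinant "richness" term are simple placeholders for the coding-rate objectives developed in Chapter 5, and the networks, data, and hyperparameters are arbitrary; the point is only that every comparison happens in the internal feature space.

import torch
import torch.nn as nn

torch.manual_seed(0)
D, d = 8, 2
f = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, d))  # encoder, parameters theta
g = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, D))  # decoder, parameters eta
opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)

def objective(X):
    Z = f(X)                                               # Z = f(X)
    X_hat = g(Z)                                           # X_hat = g(Z)
    Z_hat = f(X_hat)                                       # Z_hat = f(X_hat): no comparison in X-space
    feat_loss = ((Z - Z_hat) ** 2).mean()                  # placeholder for ell(Z, Z_hat)
    rho = torch.logdet(torch.eye(d) + Z.T @ Z / len(X))    # placeholder rate-like "richness" of Z
    return feat_loss + 0.1 * rho

for step in range(1000):
    X = torch.randn(256, D)            # stand-in for samples of the data distribution
    loss = objective(X)                # decoder (eta) minimizes the objective
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    loss = objective(X)                # encoder (theta) maximizes it, closing the game
    opt_f.zero_grad(); (-loss).backward(); opt_f.step()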

1.5 Bridging Theory and Practice for Machine Intelligence

So far, we have introduced three related frameworks for learning a compact and structured representation $\bm{Z}$ for a given data distribution $\bm{X}$:

  • the open-ended encoding (1.4.11);

  • the bi-directional auto-encoding (1.4.15);

  • the closed-loop transcription (1.4.17).

In this book, we will systematically study all three frameworks, one after another:

\text{open-ended}\;\Longrightarrow\;\text{bi-directional}\;\Longrightarrow\;\text{closed-loop}, \qquad (1.5.1)

in Chapter 4, Section 5.1, and Section 5.2 of Chapter 5, respectively.

In recent years, many theoretical frameworks have been proposed to help understand deep networks. However, most have failed to provide scalable solutions that match the performance of empirical methods on real-world data and tasks. Many theories offer little practical guidance for further empirical improvement. Chapter 6 and Chapter 7 will demonstrate how the framework presented in this book can bridge the gap between theory and practice. Chapter 6 will show how to use the learned distribution and its representation to conduct (Bayesian) inference for practical tasks involving (conditional) generation, estimation, and prediction. Chapter 7 will provide compelling experimental evidence that networks and systems designed from first principles can achieve competitive—and potentially superior—performance on tasks such as visual representation learning, image classification, image completion, segmentation, and text generation.

Back to intelligence.

As mentioned at the beginning, a fundamental task of any intelligent being is to learn predictable information from sensed data. We now understand the computational nature of this task and recognize that it is never-ending for two reasons:

  • Knowledge learned from data, say via encoding and decoding schemes, is unlikely to be correct or optimal. Intelligence must be able to improve when predictions of new observations contain errors.

  • Observed data do not yet cover all predictable information. Intelligence must recognize when current knowledge is inadequate and acquire new information whenever it becomes available.

Thus, intelligence is not about collecting all data in advance and training a model to memorize predictable information. Instead, it requires computational mechanisms that continuously improve current knowledge and acquire new information when needed. A fundamental characteristic of any intelligent being or system—whether an animal, human, robot, scientific community, or civilization—is the ability to continuously improve or gain information on its own. We can symbolically express the relationship between intelligence and information as:

\operatorname{Intelligence}(t)=\frac{\mathrm{d}}{\mathrm{d}t}\operatorname{Information}(t),\qquad\operatorname{Information}(t)=\int_{0}^{t}\operatorname{Intelligence}(s)\,\mathrm{d}s. \qquad (1.5.2)

The closed-loop framework provides a universal mechanism for self-improvement and self-learning through feedback—reinforcement being a primitive form, as in natural selection—or gaming, with scientific inquiry representing its most advanced form through hypothesis formulation and verification. All intelligent beings and systems in nature employ closed-loop mechanisms for learning at every level and scale. This ubiquity inspired early attempts to model intelligence with machines and computers, particularly the Cybernetics movement initiated by Norbert Wiener in the 1940s.

We hope this book helps readers better understand the objectives, principles, and computational mechanisms underlying intelligence. It lays the groundwork for future study of higher-level human intelligence—true artificial intelligence. We will outline several significant open problems in these directions in Chapter 8.