“All roads lead to Rome.”
Objective. This book reveals and studies a common and fundamental problem behind almost all modern practices of machine (artificial) intelligence: how can one effectively and efficiently learn a low-dimensional distribution of data in a high-dimensional space and then transform that distribution into a compact and structured representation? For any intelligent system, natural or man-made, such a representation can generally be regarded as a memory or (empirical) knowledge learned from data sensed from the external world. In recent years, people have often informally referred to it as a “world model.”
Intended Audience. This textbook aims to provide a systematic introduction to the mathematical and computational principles for learning (deep) representations of such data distributions, as a computable form of memory, for senior undergraduate students and beginning graduate students. The main prerequisites for this book are undergraduate linear algebra, probability/statistics, and optimization. Some familiarity with basic concepts from signal processing (sparse representation and compressed sensing in particular), information theory, and feedback control would enhance your appreciation.
Motivation. The main motivation for writing this book is that there have been tremendous developments in the past several years, by the authors and many colleagues, that aim to establish a principled and rigorous approach to understanding deep neural networks and, more generally, intelligence itself. The deductive methodology advocated by this new approach is in direct contrast, and highly complementary, to the dominant methodology behind current practices of artificial intelligence, which is largely inductive and based on trial and error. The lack of understanding of such powerful AI models and systems has led to increasing hype and fear in society. We believe that a serious attempt to establish a principled approach to understanding intelligence is needed more than ever. An overarching goal of this book is to provide solid theoretical and experimental evidence showing that it is now possible to study intelligence as a scientific and mathematical subject. As we will argue, intelligence is the fundamental capability to develop new memory (or knowledge) or to correct existing memory. Hence, one may view this book as a first attempt to develop a Mathematical Theory of Intelligence, at the level of learning empirical knowledge as memory, as the subtitle of the book suggests.
At the technical level, the theoretical framework presented in this book helps bridge a long-standing gap between the classical approach to modeling data structures, which is mainly based on analytical geometric, algebraic, and probabilistic models (e.g., subspaces, Gaussians, and equations), and the “modern” approach, which is based on empirically designed non-parametric data-driven models (e.g., deep networks). As it turns out, a unification of the two seemingly separate methodologies becomes possible and even natural once one realizes that both try to learn and represent low-dimensional structures in the data distribution of interest. They are merely different ways to pursue, represent, and exploit those low-dimensional structures. From this perspective, even many seemingly unrelated computational techniques, developed independently in separate fields at different times, can now be better understood under a common theoretical and computational framework and can probably be studied together from now on. As we will see in this book, these techniques include but are not limited to:
Main Content. We believe that the unified conceptual and computational framework presented in this book will be of great value to readers who truly want to clarify mysteries and misunderstandings about deep neural networks and (artificial) intelligence. Furthermore, the framework is meant to provide readers with guiding principles for developing significantly better and truly intelligent systems in the future. More specifically, besides an informal introduction (chapter), the main technical content of the book will be organized as six closely related topics (chapters):
Summary. To summarize, the technical content presented in this book establishes strong conceptual and technical connections between the classical analytical approach and the modern data-driven approach, between simple parametric models and deep non-parametric models, and between diverse inductive practices and a unified deductive framework built from first principles. We will reveal that many seemingly unrelated or even competing approaches, though developed in separate fields with different terminologies and at different times in history, all strive to achieve a common objective:
pursuing and exploiting intrinsic low-dimensional structures of data distributions embedded in high-dimensional spaces.
To this end, the book will take us through a complete journey from theoretical formulation to mathematical deduction, then to computational implementation and practical applications.