Chapter 8 Future Study of Intelligence

The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.

  – Proposal for the Dartmouth AI program, 1956

Generally speaking, this manuscript is meant to systematically introduce mathematical principles and computational mechanisms for how memory or knowledge can be developed from empirical observations. The capability to seek parsimony in a seemingly random world is a fundamental characteristic of any intelligence, natural or man-made. We believe that the principles and mechanisms presented in this book are rather unifying and universal and are applicable to both animals and machines.

We hope that this book helps readers demystify the modern practice of artificial deep neural networks by developing a rigorous understanding of the functions and roles such networks play in learning low-dimensional distributions from high-dimensional data. With such an understanding, we should be clear about both the capabilities and the limitations of existing AI models and systems:

  1. Existing models and systems fall short of being complete in terms of a memory system that is capable of self-learning and self-improving.

  2. Existing realizations of these functions are still rather primitive and brute force, and certainly far from optimal in terms of optimization strategies and hence network architectures.

  3. Existing AI models only learn the data distribution and conduct inductive (Bayesian) inference, which is different from high-level human intelligence.

One goal of this book is to help readers establish an objective and systematic understanding of current machine intelligence technologies and to recognize what open problems and challenges remain for the further advancement of machine intelligence. In this last chapter of the book, we provide some of our views and projections for the future.

8.1 Towards Autonomous Intelligence: Close the Loop?

From the practice of machine intelligence in the past decade, it has become clear that, given sufficient data and computational resources, one can build a large enough model and pre-train it to learn the a priori distribution of all the data, say $p(\bm{x})$. Theoretically, such a large model can memorize almost all existing knowledge about the world that has been encoded in massive languages and texts. As we have discussed at the beginning of the book, such a large model in a way plays a role similar to that of DNA, which life uses to record and pass on knowledge about the world.

The model and distribution learned in this way can then be used to regenerate new data samples based on the same distribution. One can also use the model to conduct inference (e.g., estimation, prediction) with the memorized knowledge under various conditions, say by sampling the a posteriori distribution $p(\bm{x}\mid\bm{y})$ under a new observation $\bm{y}$. Strictly speaking, such inference is statistical.

Any pre-trained model, however large, cannot guarantee that the distribution that it has learned so far is entirely correct or complete. In case our samples $\hat{\bm{x}}_{t}$ from the current a priori $p_{t}(\bm{x})$ or estimates $\hat{\bm{x}}_{t}(\bm{y})$ based on the a posteriori $p_{t}(\bm{x}\mid\bm{y})$ are inconsistent with the truth $\bm{x}$, we would very much like to correct the learned distributions:

pt(𝒙)pt+1(𝒙)orpt(𝒙𝒚)pt+1(𝒙𝒚),p_{t}(\bm{x})\rightarrow p_{t+1}(\bm{x})\quad\mbox{or}\quad p_{t}(\bm{x}\mid\bm{y})\rightarrow p_{t+1}(\bm{x}\mid\bm{y}),italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_x ) → italic_p start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_italic_x ) or italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_x ∣ bold_italic_y ) → italic_p start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_italic_x ∣ bold_italic_y ) , (8.1.1)

based on the error $\bm{e}_{t}=\bm{x}_{t}-\hat{\bm{x}}_{t}$. This is known as error correction based on error feedback, a ubiquitous mechanism in nature for continuous learning. However, as we know, an open-ended model by itself has no mechanism to revise or improve the learned distribution when that distribution is incorrect or incomplete. Improving current AI models still largely depends on human involvement: experimentation, evaluation, and selection. We may call this process the “artificial selection” of large models, as opposed to the natural selection that drives the evolution of life.
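To make the error-feedback correction in (8.1.1) concrete, here is a toy sketch (our illustration only, not an algorithm from this book) in which the internal model is just a running estimate of the mean of a distribution, corrected after every new observation by a small step along the error $\bm{e}_{t}=\bm{x}_{t}-\hat{\bm{x}}_{t}$; all constants are arbitrary.

```python
import numpy as np

# Toy illustration of error correction via error feedback, in the spirit of
# (8.1.1): the "memory" is a running estimate mu_hat of the mean of p(x),
# updated after each new observation by a small step along e_t = x_t - x_hat_t.
rng = np.random.default_rng(0)
true_mean = np.array([2.0, -1.0])          # the "truth" that the world keeps providing

mu_hat = np.zeros(2)                       # internal estimate at time t
eta = 0.05                                 # feedback gain (learning rate)

for t in range(2000):
    x_t = true_mean + rng.normal(size=2)   # new observation from the world
    x_hat_t = mu_hat                       # model's current prediction/sample
    e_t = x_t - x_hat_t                    # error feedback
    mu_hat = mu_hat + eta * e_t            # correct the internal model: p_t -> p_{t+1}

print(mu_hat)                              # approaches true_mean as feedback accumulates
```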

As we have studied earlier in this book (Chapter 5 in particular), closed-loop systems allow us to align an internal representation with the (sensed) observations of the external world. Such a system can continue to improve the internally learned distribution and its representation to achieve consistency or self-consistency. An immediate step forward for the future is to develop and build truly closed-loop memory systems, as shown in Figure 8.1, that are capable of learning and improving more general data distributions autonomously and continuously based on error feedback.

Therefore, the transition from the current popular end-to-end trained open-ended models to continuously learning closed-loop systems:

$$\mbox{\textbf{open-ended} models}\;\Longrightarrow\;\mbox{\textbf{closed-loop} systems}\qquad(8.1.2)$$

is the key for machines to truly emulate how the (animal) brain learns and applies knowledge in an open world. We believe that

open-ended models are for a closed world, however large;
closed-loop systems are for an open world, however small.

In fact, “general intelligence” could never be achieved by simply having memorized all existing knowledge of the world. Instead, it can only be achieved through mechanisms that improve existing memory so that the system can adapt to any new environments and tasks.

Figure 8.1: From an open-ended deep network to a closed-loop system.

8.2 Towards Intelligence of Nature: Beyond Back Propagation?

The practice of machine intelligence in the past few years has led many to believe that one needs to build a single large model to learn the distribution of all data and memorize all knowledge. Even if this might be technologically possible, such a solution is likely far from necessary or efficient. As we know from the practice of deep networks, the only known method to train such networks at scale is back propagation (BP) [RHW86a]. Although BP offers a way to correct errors via gradient signals propagated back through the whole model, it is nevertheless rather brute force and differs significantly from how nature learns: BP is an option that nature either cannot afford, given its high cost, or simply cannot implement, due to physical limitations.

More generally, we cannot truly understand intelligence unless we also understand how it can be efficiently implemented. That is, one needs to address the computational complexity of realizing the mechanisms associated with achieving the objectives of intelligence. Note that, historically, our understanding of (machine) intelligence has evolved precisely through several phases: from the incomputable Kolmogorov complexity to Shannon’s entropy, from Turing’s computability to the later understanding of tractability (we say a problem is tractable if it allows an algorithm whose complexity is polynomial in the size of the problem), and to the strong emphasis on algorithm scalability in the modern practice of artificial intelligence. This evolution can be summarized as the following diagram:

$$\mbox{\textbf{incomputable}}\;\Longrightarrow\;\mbox{\textbf{computable}}\;\Longrightarrow\;\mbox{\textbf{tractable}}\;\Longrightarrow\;\mbox{\textbf{scalable}}.\qquad(8.2.1)$$

To a large extent, the success and popularity of deep learning and back propagation are precisely because they have offered a scalable implementation on modern computing platforms (such as GPUs) for processing and compressing massive data. Nevertheless, such an implementation is still far more expensive than how nature realizes intelligence.

There remains huge room to improve the efficiency of machine intelligence so that it can approach that of natural intelligence, which should be orders of magnitude more efficient than the current brute-force implementations. To this end, we need to discover new learning architectures and optimization mechanisms that enable learning data distributions under natural physical conditions and resource constraints, similar to those faced by intelligent beings in nature, say, without accessing all data at once or updating all model parameters at once (as BP does).

The principled framework and approach laid out in this book can guide us in discovering such new architectures and mechanisms. These new architectures and mechanisms should enable online continuous learning and should admit updates through highly localized and sparse forward or backward optimization. So far, for learning a distribution, the only case for which we know such a solution exists is the simplest case of PCA, via the online PCA method introduced in Chapter 5.
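As a reminder of what such a localized mechanism can look like, the sketch below illustrates online PCA via Oja’s rule: one sample is processed at a time, and a single weight vector is updated with a local, Hebbian-like rule, with no back propagation through a deep model. This is only an illustration; the exact online PCA formulation in Chapter 5 may differ, and the data model here is an arbitrary placeholder.

```python
import numpy as np

# Minimal sketch of online PCA via Oja's rule: stream one sample at a time and
# update a single weight vector locally; no dataset is stored and no gradients
# are back-propagated through a deep model. Illustrative only.
rng = np.random.default_rng(0)

d = 2
A = np.array([[3.0, 1.0],
              [1.0, 1.0]])             # the data covariance will be A @ A.T
w = rng.normal(size=d)
w /= np.linalg.norm(w)                 # initial guess for the top principal direction

eta = 0.01
for t in range(20000):
    x = A @ rng.normal(size=d)         # one streamed sample; never stored
    y = w @ x                          # project onto the current estimate
    w += eta * y * (x - y * w)         # Oja's rule: local, Hebbian-like update

# w converges (up to sign) to the top eigenvector of the data covariance A @ A.T
print(w / np.linalg.norm(w))
```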

Figure 8.2: Conjectured architecture of the brain cortex: The cortex is a massively parallel and distributed auto-encoding system that consists of a hierarchy of closed-loop auto-encoders that extract information from multiple senses and maximize the information gain of the resulting representations at multiple levels of hierarchy and granularity.

As we have learned from neuroscience, the cortex of our brain consists of tens of thousands of cortical columns. All cortical columns have similar physical structures and functions. They are highly parallel and distributed, though sparsely interconnected. Hence, we believe that in order to develop a more scalable and structured memory system, we need to consider architectures that emulate those of the cortex. Figure 8.2 shows such a hypothesized architecture: a massively distributed and hierarchical system that consists of many largely parallel closed-loop auto-encoding modules. These modules learn to encode different sensory modalities, or many projections of data from each sensory modality. Our discussion in Section 6.5 of Chapter 6 suggests that such parallel sensing and learning of a low-dimensional distribution is theoretically possible. Higher-level (lossy) auto-encoders can then be learned from the outputs of lower-level ones to develop sparser, higher-level “abstractions” of the representations learned at the lower levels.
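Purely as a structural illustration of the kind of architecture sketched in Figure 8.2 (our reading of it, not a specification from this book), the snippet below wires several parallel low-level auto-encoders, one per sensory stream, whose codes are concatenated and compressed again by a higher-level auto-encoder; all module sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Structural sketch (illustrative only) of a hierarchy of parallel closed-loop
# auto-encoding modules: one low-level module per sensory stream, plus a
# higher-level module that further compresses the concatenated low-level codes.
class AutoEncoder(nn.Module):
    def __init__(self, d_in: int, d_code: int):
        super().__init__()
        self.enc = nn.Linear(d_in, d_code)   # compressive encoder
        self.dec = nn.Linear(d_code, d_in)   # decoder closing the loop

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)                # code and reconstruction (for local error feedback)

class HierarchicalAE(nn.Module):
    def __init__(self, sensor_dims, d_low=16, d_high=8):
        super().__init__()
        self.low = nn.ModuleList(AutoEncoder(d, d_low) for d in sensor_dims)
        self.high = AutoEncoder(d_low * len(sensor_dims), d_high)

    def forward(self, xs):
        # Each low-level module sees only its own sensory stream.
        codes, recons = zip(*(ae(x) for ae, x in zip(self.low, xs)))
        # The higher level compresses the concatenated low-level codes further.
        z_high, code_recon = self.high(torch.cat(codes, dim=-1))
        return z_high, recons, code_recon

# In principle, each module could be trained locally from its own reconstruction
# error, without back-propagating through the entire hierarchy.
model = HierarchicalAE(sensor_dims=[32, 64, 48])
xs = [torch.randn(4, d) for d in (32, 64, 48)]
z_high, recons, code_recon = model(xs)
print(z_high.shape)   # torch.Size([4, 8])
```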

The distributed, hierarchical, and closed-loop system architecture illustrated in Figure 8.2 shares many characteristics with the cortex of the brain. Such an architecture may open up many more possibilities than the current single-large-model architecture. It makes it possible to explore much more efficient learning and optimization mechanisms, and it yields a more structured, modular organization of the learned data distribution and knowledge. This would allow us to bring the implementation of machine intelligence to the next level of evolution:

$$\mbox{\textbf{incomputable}}\;\Longrightarrow\;\mbox{\textbf{computable}}\;\Longrightarrow\;\mbox{\textbf{tractable}}\;\Longrightarrow\;\mbox{\textbf{scalable}}\;\Longrightarrow\;\mbox{\textbf{natural}}.\qquad(8.2.2)$$

8.3 Towards Artificial Intelligence of Human: Beyond the Turing Test?

As we have discussed at the beginning of this book (Chapter 1), intelligence in nature has evolved through multiple phases and manifested itself in four different forms:

$$\mbox{\textbf{phylogenetic}}\;\Longrightarrow\;\mbox{\textbf{ontogenetic}}\;\Longrightarrow\;\mbox{\textbf{societal}}\;\Longrightarrow\;\mbox{\textbf{artificial intelligence}}.\qquad(8.3.1)$$

All forms of intelligence share the common objective of learning useful knowledge as certain low-dimensional distributions of sensed high-dimensional data about the world. However, they may differ significantly in the specific coding schemes adopted, the information encoded, the computational mechanisms for learning and improving, and the physical implementations of such mechanisms. Using the concepts and terminologies developed in this book, from the perspective of learning and representing information or knowledge from the distribution of the sensed data, the above four stages of intelligence developed in nature differ in the following three aspects:

  1. The codebook that one uses to learn and encode the intended information or knowledge.

  2. The information or knowledge that is encoded and represented using the codebook.

  3. The optimization mechanisms used to improve the information or knowledge encoded.

More specifically, the following table summarizes their main characteristics in the above three aspects:

                 Phylogenetic             Ontogenetic      Societal            Artificial
  Codebook       Amino Acids              Neurons          Alphabet & Words    Mathematics/Logic
  Information    Genes/DNA                Memory           Languages/Texts     Scientific Facts
  Optimization   Reinforcement Learning   Error Feedback   Trial & Error       Hypothesis Testing

As we now know, humans have achieved two quantum leaps in intelligence in history. The first is the development of spoken and written languages, about five to ten thousand years ago, which enabled humans to share and pass on learned knowledge across generations, similar to the role of DNA in nature. The second is the development of mathematics and logic, about three thousand years ago, which has become the precise language of modern science. This new language has freed us from merely summarizing knowledge from observations in empirical forms and has allowed us to formalize knowledge as verifiable or falsifiable theories through mathematical deduction or experimental verification. Through hypothesis formulation, logical deduction, and experimental testing, we are able to proactively discover new knowledge, such as causal relationships, that would be impossible to obtain by passively learning from data distributions.

As we have discussed in the Introduction (Chapter 1), the 1956 “artificial intelligence” (AI) program precisely aimed to study high-level functions such as mathematical abstractions, logical inference, and problem solving that are believed to differentiate humans from animals:

$$\mbox{\textbf{low-level} (animal) intelligence}\;\Longrightarrow\;\mbox{\textbf{high-level} (human) intelligence}.\qquad(8.3.2)$$

As we have clarified repeatedly in this book, much of the technological advance in machine intelligence over the past decades, although carried out under the name “AI,” is actually more closely related to the low-level intelligence shared by both animals and humans, which is mainly inductive. So far, there has been no evidence suggesting that these mechanisms alone would suffice to achieve the high-level human intelligence that the original AI program truly aimed to understand and emulate.

In fact, we know little about how to rigorously verify whether a system is truly capable of certain high-level intelligence, even though the Turing test was proposed as early as 1950 [Tur50]. (In Turing’s proposal, the evaluator is a human. However, most human evaluators’ scientific training and knowledge can be limited, and their conclusions can be subjective.) For a long time such a test was not deemed necessary, since the capabilities of machines were far below those of a human (or even an animal). Given recent technological advances, however, many models and systems have been claimed to reach or even surpass human intelligence. It is therefore high time to give a scientific and executable definition of the Turing test. That is, how do we systematically and objectively evaluate the level of intelligence of a given model or system?

For example, how can we rigorously verify whether an intelligent system has truly grasped an abstract concept, such as the notion of natural or real numbers, or whether it has merely memorized a large number of instances? Note that state-of-the-art large language models still struggle with simple mathematical questions such as: “Is 3.11 larger or smaller than 3.9?” Some models have corrected their answers to questions of this kind via post hoc engineering, and some have incorporated additional reasoning mechanisms that verify and correct the immediate answers produced during reasoning. Nevertheless, we leave it to the reader as an exercise to rigorously test whether any of the state-of-the-art language models truly understand the notion of numbers (natural, rational, real, and complex) and the associated arithmetic; a minimal sketch of such a programmatic probe appears after the list below. How do we verify whether a system truly understands the rules of logic and knows how to apply them rigorously, or has simply memorized a large number of instances of practicing logic? Furthermore, is such a system even capable of correcting its own knowledge or developing new knowledge such as physical laws, mathematical concepts, or causal relationships? In summary, it is high time that we develop rigorous evaluation methods that can tell to which of the following a system or model’s seemingly intelligent capability belongs:

  1. simply having memorized the distribution of some knowledge-carrying data and regenerating it;

  2. being able to autonomously and continuously develop new knowledge from new observations;

  3. truly having understood certain abstract knowledge and knowing how to apply it correctly;

  4. being able to generate new scientific hypotheses or mathematical conjectures and verify them.
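As a purely hypothetical illustration of what an executable, evaluator-free probe might look like, the sketch below generates fresh decimal-comparison questions programmatically and scores the answers, so that success cannot come from memorized instances. The function query_model is an invented placeholder for whatever system is under evaluation, not a real API.

```python
import random

# Hypothetical sketch of a programmatic probe for the decimal-comparison
# question discussed above. `query_model` is an invented placeholder for the
# system under test; it is not a real API.
def query_model(question: str) -> str:
    raise NotImplementedError("connect the system under test here")

def make_question(rng: random.Random) -> tuple[str, str]:
    a = round(rng.uniform(1.0, 10.0), rng.randint(1, 3))
    b = round(rng.uniform(1.0, 10.0), rng.randint(1, 3))
    if a == b:                          # avoid ties; draw again
        return make_question(rng)
    truth = "larger" if a > b else "smaller"
    question = f"Is {a} larger or smaller than {b}? Answer with one word."
    return question, truth

def run_probe(n: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        q, truth = make_question(rng)
        answer = query_model(q).strip().lower()
        correct += int(truth in answer)
    return correct / n                  # accuracy on freshly generated, never-seen instances
```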

Figure 8.3: Three tests for different levels or types of intelligence capabilities: the Wiener test for basic intelligence, the Turing test for human-level artificial intelligence, and the Popper test for scientist-level intelligence.

Figure 8.3 illustrates that there should probably be at least three different types of tests to evaluate and differentiate different types of intelligence capabilities:

  1. The Norbert Wiener Test: to evaluate whether a system is capable of improving and developing new knowledge of its own, or simply receives information through reinforcement or supervised learning;

  2. The Alan Turing Test: to evaluate whether a system can understand abstract knowledge, or simply learns its statistics and uses them for Bayesian inference;

  3. The Karl Popper Test: to evaluate whether a system is capable of exploring new knowledge by forming and verifying new theories based on self-consistency.

We believe that, for such evaluation methods, the evaluator should not be a human but rather a scientifically sound protocol and process.

As we have seen throughout this entire book, compression has played a most fundamental role in learning. It is the governing principle and a unified mechanism for identifying an (empirical) data distribution and organizing information encoded therein. To a large extent, it explains most of the practice of “artificial intelligence” in the past decade or so. Here the word “artificial” largely means “man-made.” An outstanding question for future study is whether compression alone is sufficient to achieve all the higher-level capabilities of intelligence listed above.

Is compression all there is?

Are abstraction, causal inference, logical reasoning, and hypothesis generation with its subsequent deduction merely extended or extreme forms of compression? Is there some fundamental difference between identifying empirical data distributions through compression and forming high-level abstract concepts and theories? The philosopher Sir Karl Popper once suggested:

“Science may be described as the art of systematic oversimplification.”

To a large extent, science, together with its associated codebook, mathematics, can be viewed as the most advanced form of intelligence, hence the true “artificial” part of our intelligence. Here the word “artificial” means what is unique to educated and enlightened humans, almost like a form of high art. We believe that uncovering and understanding the underlying mathematical principles and computational mechanisms of such higher-level intelligence will be the final frontier for Science, Mathematics, and Computation altogether!