A first-principles critique of Deep Learning


A brief history of Deep Learning

Why focus on the past

In order to understand the state of AI research today, it is important to study the key discoveries, challenges and breakthroughs that shaped the trajectory of the field since its inception. History never repeats itself exactly, but it often rhymes: within its cycles we might identify patterns that tell us something about the future. And by studying the problems many of the pioneers of the field grappled with, we may find that they are the very same ones confronting us today.

McCulloch & Pitts, Frank Rosenblatt and the first AI winter

Deep Learning has its roots in the formulation of the artificial neuron by McCulloch and Pitts in 1943. Much like a biological neuron receives input from other neurons through its dendrites and fires an action potential if its membrane potential exceeds a threshold, the McCulloch-Pitts neuron linearly sums input from other neurons and outputs a binary signal after comparing the sum to its threshold. McCulloch and Pitts showed that a network composed of these neurons can compute anything computable given enough neurons and time — it was, in effect, Turing complete. But it was Frank Rosenblatt who took the next big leap and showed that these networks can be trained from data.

Rosenblatt created the perceptron — a simple network composed of two layers of McCulloch-Pitts neurons. The first layer of neurons receives the input and is connected to each output neuron in the second layer through a single layer of feedforward synapses (weights). The simplicity of the perceptron allowed Rosenblatt to devise an equally simple learning rule for training it: if an output neuron underfires or overfires while an input neuron is active, the weight between them is strengthened or weakened, respectively, by an amount proportional to the error in firing activity. If the input neuron was not active, it did not contribute to the error, so the weight does not change.
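To make the rule concrete, here is a minimal sketch in Python (the learning rate, data and variable names are illustrative, not Rosenblattʼs original formulation):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Rosenblatt's rule for a single threshold output unit."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, t_i in zip(X, y):
            out = 1.0 if np.dot(w, x_i) + b > 0 else 0.0   # threshold (McCulloch-Pitts) unit
            err = t_i - out                # +1 if it underfired, -1 if it overfired, 0 if correct
            w += lr * err * x_i            # only weights from active inputs (x_i != 0) change
            b += lr * err
    return w, b

# A linearly separable task (logical OR) that the rule solves easily.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w, b = train_perceptron(X, y)
print(w, b)   # weights and bias of a separating hyperplane for OR
```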

The perceptronʼs simplicity, which allowed it to be trained so easily, was also its greatest weakness. In 1969, Marvin Minsky and Seymour Papert mathematically proved that the perceptron cannot compute functions that are not linearly separable, such as XOR, parity and connectivity. Minsky and Papert intuited that multilayer extensions of the perceptron would be "sterile", but left open the possibility that they could be useful if training algorithms for them were found. Nevertheless, their seminal book was a dagger to the heart of the burgeoning field of connectionist AI research, and interest and funding gradually faded through the 1970s, leading to the first AI winter. This was the first instance of what would become a recurring problem in AI research: the topology of our machine learning models is constrained by the algorithms we are able to devise to train them.

Backpropagation, PDP and the second AI winter

Paul Werbos introduced gradient-based learning for nonlinear multilayered functions in his 1974 thesis, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. His algorithm uses the chain rule of calculus to assign credit to the parameters of a layered function based on their contributions to the error at the output. Werbos demonstrated the efficacy of his algorithm by training multilayered perceptrons on proof-of-concept tasks, but he did not show that it could solve XOR or parity. Despite its significance, Werbosʼs work went unnoticed by the AI community.

But in 1986, connectionism reawakened from its slumber when Rumelhart, Hinton and Williams rediscovered Werbosʼs algorithm and coined the name it is known by today — backpropagation. They showed that a two-layer neural network can solve XOR and n-bit parity problems for small n. That same year, Rumelhart, McClelland and some twenty other pioneers published the first volume of Parallel Distributed Processing (PDP), which to this day reads as a manifesto for the connectionist approach. PDP argued that intelligence is an emergent phenomenon created by the complex interaction of simple units operating in parallel, and that intelligent behavior arises by adjusting how those units interact. PDP laid the foundation for the next generation of AI researchers such as Yann LeCun, Yoshua Bengio and Jürgen Schmidhuber. Backprop removed an obstacle that had hindered progress in the field for over a decade, but the AI community still needed a use case that demonstrated the technologyʼs much-vaunted potential. The absence of large corpora of labeled data to train these models and the lack of compute infrastructure for fast training led to another decade of stagnation, now known as the second AI winter.

The hardware revolution

CPUs are built for sequential processing of instructions, making them ill-suited to the parallel, distributed computations that neural networks perform. Even in the 1990s, computers contained chips built for exactly such computations, but they were mostly used to render pixels on the screen and were thus called Graphics Processing Units (GPUs). Unlike CPUs, which have a few powerful cores that can grind through a queue of tasks quickly, GPUs have thousands of weaker cores designed to carry out small repetitive tasks in parallel. An apt analogy might be to liken a CPU to a handful of world-class sprinters and a GPU to a stadium full of amateurs.

Frustrated by the inability of CPUs to handle parallel computations, researchers in the early 2000s began looking for alternatives. They figured out that they could trick GPUs into performing non-graphical computation by treating numerical data as textures to be rendered. NVIDIA, one of the established GPU manufacturers, noticed this trend and capitalized on it by exposing an API for general-purpose GPU computing — CUDA. CUDA allowed massive matrix operations to be run in parallel across thousands of lightweight cores, yielding far greater throughput than CPUs.

The 2000s also saw the internet become ubiquitous, creating the infrastructure for large-scale data collection. From this ecosystem emerged the large labeled datasets that provided the last missing piece for modern machine learning. ImageNet, created by Fei-Fei Li and colleagues using crowdsourced human labeling, is one such example: its classification benchmark alone contained an unprecedented 1.2 million labeled images collected from the internet. With compute and data both available, the time was ripe for the third AI summer.

In 2012, Alex Krizhevsky, Ilya Sutskever and Geoff Hinton released AlexNet, a deep convolutional neural network trained to classify images. AlexNet had over 60 million trainable parameters and 650,000 neurons, and was trained on two NVIDIA GPUs. It outperformed all existing models for image classification, beating the state of the art by a significant margin. AlexNet was not the first deep convolutional network, but it was the first to combine massive data, a deep architecture and GPU acceleration to crack a long-standing challenge in AI.

Over the past decade, ever-larger models have been trained on ever-larger datasets. Algorithmic developments such as the invention of the transformer architecture in 2017 have accelerated this process further. GPT-3 marked another milestone in the history of AI: for many, the moment computers passed the Turing test and never looked back. Todayʼs ChatGPT is built from dozens of layers and hundreds of billions of parameters.

Scaling laws and Emergence

Is more different?
"More is different"
— Philip W. Anderson

Over seven decades of research, the AI community has learned a consistent lesson: scaling simple models with large data and compute works. This insight, famously captured by Richard Sutton in his essay The Bitter Lesson, has become the fieldʼs guiding principle. Empirical studies supporting this premise have shown that the test loss of large language models follows a power-law relationship with the amount of data and computation used in training.
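Written schematically (in the form reported by, e.g., Kaplan et al., 2020), with $L$ the test loss, $N$ the parameter count, $D$ the dataset size, $C$ the training compute, and $N_c$, $D_c$, $C_c$, $\alpha_N$, $\alpha_D$, $\alpha_C$ constants fitted empirically:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
$$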

Yet the massive investment in compute infrastructure today is driven by something deeper than scaling curves alone — a belief that more is different: that by making models larger and training them on ever-broader corpora, we may unlock qualitatively new capabilities that smaller systems could never exhibit.

Dramatic, non-linear jumps in performance on some tasks have been observed once training compute crosses a threshold that cannot be predicted in advance. At these scales, LLMs begin to exhibit new behaviors — such as in-context learning, chain-of-thought reasoning, or a sudden leap in mathematical ability — that were absent in smaller models. Some have hypothesized that this is due to a phenomenon known as emergence, in which models cross a complexity threshold and acquire general capabilities. The thesis underlying todayʼs trillion-dollar investment in GPU infrastructure is thus that, beyond some critical scale, general intelligence may simply appear.

Many have questioned this premise. Several studies have shown that the purported non-linear jumps in performance are actually binning artifacts that arise from imposing a pass/fail threshold on a continuous metric. Others have likened the AI scaling movement to the dirigible era of aeronautics, when engineers pursued ever-larger airships before realizing that true progress required an entirely new paradigm.
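A toy illustration of the binning argument (purely synthetic numbers, not drawn from any of the cited studies): if a modelʼs per-token accuracy improves smoothly with scale but the reported metric is exact match over a 30-token answer, the reported score sits near zero and then appears to jump.

```python
import numpy as np

# Synthetic demonstration: a smooth, continuous improvement looks "emergent"
# once it is filtered through an all-or-nothing metric.
scales = np.logspace(6, 12, 13)                      # hypothetical compute budgets
per_token = 1 - 0.5 * (scales[0] / scales) ** 0.3    # smooth power-law improvement
exact_match = per_token ** 30                        # all 30 tokens must be correct

for s, p, e in zip(scales, per_token, exact_match):
    print(f"compute={s:8.1e}  per-token={p:.3f}  exact-match={e:.3f}")
# Per-token accuracy rises gradually; exact match hovers near 0, then shoots up.
```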

Fundamental Limitations

What governs neural network dynamics?

In the previous sections I summarized how the field arrived at its current state and the direction it is heading. Here, I will lay out what I believe to be fundamental limitations inherent in the deep learning paradigm, and ways they might be circumvented.

A neural networkʼs dynamics are determined by the complexity of its individual units, their topological arrangement, and their temporal and structural interactions. Inductive biases forced upon the network constrain its dynamics and impose an upper bound on what it can do. For example, a purely feedforward network cannot sustain activity recurrently and hence has no memory. Similarly, a neuron with a fixed gain cannot adapt its activity to changing conditions. It is therefore important to consider the implications of each design choice for network dynamics.

Telegraph Operators vs. Symphony Conductors

A biological neuron is a sophisticated signal processor with a diverse set of compartments that support a large behavioral repertoire. Driven by its ion channels, dendrites and other spike-generation machinery, a neuron can exhibit rich temporal dynamics even under constant input. Consider bursting — a firing pattern in which a neuron repeatedly emits clusters of spikes separated by periods of quiescence. Bursting increases the signal-to-noise ratio, making it easier for downstream neurons to detect and respond to signals amidst background noise; this can be useful for encoding salient visual stimuli or synchronizing activity across the network. Mechanistically, bursting arises when a fast spiking system (voltage-gated Na+ channels) is coupled to a slow negative feedback (Ca2+-activated K+ channels). Spikes admit Ca2+ into the neuron, which engages the K+ "brake" that hyperpolarizes the neuron and halts spiking. The brake slowly releases as Ca2+ is pumped out of the neuron, and the cycle restarts.
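To make the fast/slow picture concrete, below is a minimal sketch of the Hindmarsh-Rose equations, a standard phenomenological bursting model used here as a stand-in for the Na+/Ca2+-K+ mechanism rather than a biophysical model of it, integrated with forward Euler:

```python
import numpy as np

# Hindmarsh-Rose model: a fast spiking subsystem (x, y) coupled to a slow
# adaptation variable z that periodically silences it, producing bursts
# under a constant injected current I.
a, b, c, d = 1.0, 3.0, 1.0, 5.0
r, s, x_rest = 0.006, 4.0, -1.6
I = 2.0                                  # constant input, never changes

dt, steps = 0.01, 200_000
x, y, z = -1.6, 0.0, 0.0
trace = np.empty(steps)
for t in range(steps):
    dx = y + b * x**2 - a * x**3 - z + I   # fast voltage-like variable
    dy = c - d * x**2 - y                  # fast recovery variable
    dz = r * (s * (x - x_rest) - z)        # slow negative feedback (the "brake")
    x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
    trace[t] = x

# `trace` shows clusters of spikes separated by quiescent intervals,
# even though the input I is constant throughout.
```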

An artificial neuronʼs static mapping $y = f(Wx + b + I)$ admits no intrinsic time or state that evolves on its own. Without input from other neurons, its activity either settles to a single fixed point or grows without bound, depending on the transfer function $f$. Any dynamics come from recurrence (in RNNs) or attention across tokens (in transformers). Artificial neurons are like old-school telegraph operators tapping out Morse code with a single key: their communication is clear but confined to one channel, unable to convey context or salience. Biological neurons, on the other hand, are symphony conductors that weave harmonics, pulses and crescendos into a rich, layered performance.
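A sketch of the contrast, under the same constant input (illustrative only; the weights and gain are arbitrary):

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)
W, b, I = np.array([[0.5]]), 0.1, 2.0
x = np.array([1.0])                                   # constant input

# Static unit: constant input in, the identical output out, every step.
static_outputs = [relu(W @ x + b + I) for _ in range(5)]

# Temporal structure has to be bolted on from outside, e.g. through recurrence:
h = np.zeros(1)
W_rec = np.array([[0.9]])
recurrent_outputs = []
for _ in range(5):
    h = relu(W_rec @ h + W @ x + b + I)               # state now evolves step to step
    recurrent_outputs.append(h.copy())
```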

Architectural priors enable self-supervised learning
If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing is supervised learning, and the cherry on top is reinforcement learning.
— Yann LeCun

Most learning in animals is not supervised or driven by rewards. It is rather a lifelong process of building a world model and continually refining it through interactions with the environment. One example of self-supervised learning is spatial navigation — the ability to form internal maps of an environment and use them flexibly for goal-directed behavior. In robotics, this capability is known as Simultaneous Localization and Mapping (SLAM). Studying how animals perform this task without external rewards or error signals offers a window into the intrinsic learning mechanisms of the brain.

In the 1940s, Edward Tolman observed that rats navigating a maze reached a food box progressively faster with each trial. Initially, it was thought that actions leading to a reward were reinforced by Hebbian plasticity. Subsequent experiments, however, revealed that the rats built "cognitive maps" of their environment while exploring, even in the absence of rewards, and later flexibly reused these maps when rewards became available. Tolman called this phenomenon latent learning.

Over the following decades, neuroscientists uncovered the neural architecture supporting this ability. The hippocampus, a curved structure buried in the medial temporal lobe, was found to be a central hub for spatial learning. It contains place cells — neurons that fire only when the animal enters a specific region of space, the cellʼs place field. The hippocampus receives two major streams of input from the entorhinal cortex:

  • The medial entorhinal cortex (MEC), often dubbed the "where" pathway, integrates odometry and vestibular input, creates a position estimate through its grid cells, encodes orientation through head-direction cells, and transmits these estimates to the hippocampus.
  • The lateral entorhinal cortex (LEC), the "what" pathway, transmits processed visual, auditory and tactile input to the hippocampus.

These converging inputs — where and what — form synapses onto hippocampal place cells. Through Hebbian plasticity, place cells associate spatial position with sensory context, forming a spatial map that is continually updated as the animal explores. This modular organization allows rodents to generalize readily to unseen environments using the same neural circuitry.
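A cartoon of that Hebbian binding (a deliberately simplified sketch; the one-cell-per-location coding and the update rule are illustrative, not a model of real hippocampal circuitry):

```python
import numpy as np

rng = np.random.default_rng(0)
n_place, n_what = 100, 40

# Plastic synapses from the "what" (LEC) pathway onto place cells. The "where"
# (MEC) pathway is caricatured as one dedicated place cell per location.
W_what = np.zeros((n_place, n_what))

def visit(location, sensory_context, lr=0.5):
    """Hebbian update: the place cell active at `location` binds the current context."""
    W_what[location] += lr * sensory_context          # pre-post coactivity strengthens synapses

def recall(location):
    """Reactivating a place cell retrieves the context associated with that location."""
    return W_what[location]

# Exploration without any reward: each visit associates a location with its context.
context_A = rng.normal(size=n_what)
visit(location=3, sensory_context=context_A)
print(np.corrcoef(recall(3), context_A)[0, 1])        # ~1.0: the context is recovered
```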

This wiring of networks into functionally specialized circuits does not emerge de novo during an animalʼs lifetime; it was selected and honed over millions of years of evolution. What we call learning is a process enabled by this innate scaffold.

By contrast, deep neural networks begin as a blank slate, initialized with random weights and a homogeneous topology. While in principle gradient descent could play a role analogous to evolution and discover modular subnetworks that facilitate rapid learning, this is very difficult to achieve in practice, because high-dimensional loss landscapes are extremely rugged and nonconvex, riddled with saddle points and local minima.

Unified vs. Fractured Representations

In addition to its role in spatial navigation, the hippocampus has long been known to play a key role in episodic memory. However, recent findings suggest that these two functions may not be separate at all. Place cells have been found to exhibit tuning to non-spatial variables such as sound frequency, elapsed time or task stage, leading to theories that spatial navigation and episodic memory are instances of a more abstract cognitive process supported by the hippocampal-entorhinal system.

These findings suggest that the hippocampal-entorhinal system implements a unified representational scheme in which a common, factorized circuit mechanism encodes diverse cognitive domains. This principle gives the network the ability to transfer learning across domains, stitch together distinct experiences, and imagine new scenarios.

In contrast, artificial neural networks trained by gradient descent tend to memorize patterns present in the training data rather than factorized rules that recombine across domains. Training on a new domain after the fact overwrites old capabilities, a phenomenon known in the AI literature as catastrophic forgetting. Without a unifying architectural principle that links representations across modalities, these models remain powerful yet brittle, lacking the cross-domain generalization so characteristic of biological cognition.
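A toy demonstration of this failure mode (synthetic data and a deliberately naive sequential-training setup, not a claim about any particular production model):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_fit(w, X, y, lr=0.1, steps=2000):
    """Plain SGD on squared error for a linear model y ~ X @ w."""
    for _ in range(steps):
        i = rng.integers(len(X))
        err = X[i] @ w - y[i]
        w = w - lr * err * X[i]
    return w

mse = lambda w, X, y: np.mean((X @ w - y) ** 2)

# Two "domains" that demand different input-output mappings from shared weights.
X_A = rng.normal(size=(200, 5)); y_A = X_A @ np.array([2., 0., 0., 0., 0.])
X_B = rng.normal(size=(200, 5)); y_B = X_B @ np.array([0., -3., 0., 0., 0.])

w = np.zeros(5)
w = sgd_fit(w, X_A, y_A)
print("task-A error after learning A:", mse(w, X_A, y_A))   # ~0: domain A mastered
w = sgd_fit(w, X_B, y_B)
print("task-A error after learning B:", mse(w, X_A, y_A))   # large again: A was overwritten
```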

Summary

Summary of premises and arguments

My arguments for why the deep learning paradigm is still lacking can be summarized as follows:

  1. A neural networkʼs dynamics are governed by the signal-processing capacity of its fundamental units, their mechanisms of interaction, and its topology.
  2. McCulloch-Pitts neurons have a limited behavioral repertoire that restricts their function as signal-processing nodes.
  3. Self-supervised learning requires a network architecture that can build a world model and refine it continuously through experience with the environment.
  4. Evolution has discovered factorized connectivity motifs that allow the brain to reuse the same circuits for a diverse set of cognitive domains. Such unified representational principles, which enable transfer learning and prevent catastrophic forgetting, are absent in deep neural networks.
