Can Machines Learn? Backpropagation, Gradient Descent, and Universal Computation
In 1950, I asked: “Can machines think?” Perhaps I posed the wrong question. The modern answer is not that machines think—but that they learn. And the mechanism by which they learn is not symbolic manipulation, not rule-following in the sense of my universal machine translating one state to another, but something stranger: optimization through gradient descent on high-dimensional loss landscapes. The question reduces to: can this process be mechanized? The answer—demonstrated daily by networks with billions of parameters—is yes.
The Computational Graph of Learning
A neural network is a directed acyclic graph. Each node represents an operation—addition, multiplication, application of a nonlinear function. Each edge carries data: numbers flowing forward during computation, gradients flowing backward during learning. This is function composition made explicit. Input $x$ passes through a first layer: $h = \sigma(Wx + b)$, where $W$ is a weight matrix, $b$ a bias vector, and $\sigma$ an activation function introducing nonlinearity. Without $\sigma$, the entire network collapses to a single linear transformation—matrix multiplication all the way down. With nonlinearity, we gain expressiveness: a two-layer network with sufficiently many hidden units can approximate any continuous function on a compact domain.
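As a concrete sketch, here is that forward pass in NumPy. The layer sizes, the random initialization, and the choice of tanh as the activation $\sigma$ are arbitrary assumptions for illustration, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes for illustration: 3 inputs, 4 hidden units, 1 output.
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)

def forward(x):
    """Two-layer forward pass: h = sigma(W1 x + b1), y = W2 h + b2."""
    h = np.tanh(W1 @ x + b1)   # nonlinearity sigma = tanh
    return W2 @ h + b2

x = rng.standard_normal(3)
print(forward(x))
```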
This is the universal approximation theorem, proven by Cybenko in 1989 and refined by others. It parallels my own universality result: a Turing machine of sufficient complexity can compute any computable function. But there is a critical difference. My result concerned what is computable in principle, given arbitrary time and memory. Universal approximation concerns what is representable given enough parameters. Representability is not learnability. A function may exist as a network configuration somewhere in parameter space, yet be unreachable by gradient descent—trapped beyond saddle points, isolated in regions of high loss, or requiring exponentially many training samples to converge.
The forward pass computes outputs layer by layer. Loss $L$ measures the discrepancy between prediction $\hat{y}$ and target $y$: mean squared error for regression, cross-entropy for classification. The network executes a composition: $\hat{y} = f_L(\cdots f_2(f_1(x)))$. To minimize loss, we require the gradients $\partial L / \partial \theta$ for all parameters $\theta$. Computing these naively—perturbing each parameter and measuring the change in loss—requires one extra evaluation per parameter, so the total cost grows with the number of parameters times the cost of a forward pass, catastrophically slow for millions or billions of weights. Backpropagation solves this through systematic application of the chain rule.
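To make the cost of the naive approach concrete, here is a minimal finite-difference gradient, a sketch under my own naming: it needs one extra evaluation of the loss per parameter, which is precisely what does not scale.

```python
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-6):
    """Finite-difference gradient: one extra loss evaluation per parameter.

    For n parameters this costs n + 1 calls to loss_fn, which is what
    makes the naive approach impractical for large networks.
    """
    base = loss_fn(params)
    grad = np.zeros_like(params)
    for i in range(params.size):
        bumped = params.copy()
        bumped.flat[i] += eps
        grad.flat[i] = (loss_fn(bumped) - base) / eps
    return grad

# Example: squared error of a linear model y_hat = w . x
x, y = np.array([1.0, 2.0, 3.0]), 4.0
loss = lambda w: (w @ x - y) ** 2
print(numerical_gradient(loss, np.zeros(3)))
```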
Backpropagation: Chain Rule as Algorithm
The chain rule states: if $y = f(g(x))$, then $\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$. For nested compositions, derivatives multiply along paths. Backpropagation automates this. The backward pass propagates gradients from the loss node toward the input. Each node computes local derivatives—how its output changes with respect to its inputs—and multiplies incoming gradients accordingly. At a sum node, gradients copy to all inputs. At a product node, gradients distribute by multiplying by the other factor. Nonlinearities contribute their own derivatives: $\sigma'(z)$ scales the gradient passing through.
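These rules can be written down directly. The toy scalar class below is my own sketch, not any particular library's API; it records a local derivative at each node and applies the chain rule by recursion rather than with the stored-and-reused intermediates of a full reverse-mode implementation.

```python
import math

class Node:
    """Scalar value in a computational graph, with a toy backward pass."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # upstream nodes
        self.local_grads = local_grads  # d(output)/d(parent) for each parent
        self.grad = 0.0

    def __add__(self, other):
        # Sum node: the gradient copies unchanged to both inputs.
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # Product node: each input receives the gradient times the other factor.
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

    def tanh(self):
        t = math.tanh(self.value)
        # Nonlinearity scales the gradient by its derivative, 1 - tanh(z)^2.
        return Node(t, (self,), (1.0 - t * t,))

    def backward(self, upstream=1.0):
        self.grad += upstream
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)

# d/dw of tanh(w * x) at w = 0.5, x = 2.0
w, x = Node(0.5), Node(2.0)
out = (w * x).tanh()
out.backward()
print(w.grad)   # equals x * (1 - tanh(w*x)^2)
```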
This is reverse-mode automatic differentiation—a form of dynamic programming. Intermediate values computed during the forward pass are stored and reused during the backward pass. A single backward traversal computes all parameter gradients in time proportional to the forward computation. Efficiency: $O(n)$ for $n$ parameters instead of the $O(n^2)$ of naive perturbation. Backpropagation is not biologically plausible. Neurons do not send error signals backward through axons. The brain may use other mechanisms—local Hebbian rules, feedback connections, predictive coding. But backpropagation is computationally elegant, and it works.
Consider a simple example. Let $z = wx$, $h = \sigma(z)$, and $L = (h - y)^2$. Then $\partial L / \partial h = 2(h - y)$. To propagate further: $\partial L / \partial z = \partial L / \partial h \cdot \sigma'(z)$. Finally, $\partial L / \partial w = \partial L / \partial z \cdot x$. Each local derivative multiplies the gradient flowing backward. This is mechanical, deterministic, computable.
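The same derivatives can be checked numerically. In the sketch below I take $\sigma$ to be tanh, an assumption for concreteness, and compare the chain-rule gradient with a central finite difference.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)

def analytic_grad(w, x, y):
    """dL/dw for L = (sigma(w*x) - y)^2, composed via the chain rule."""
    z = w * x
    h = sigma(z)
    dL_dh = 2.0 * (h - y)
    dL_dz = dL_dh * (1.0 - np.tanh(z) ** 2)   # sigma'(z) for tanh
    return dL_dz * x

def numeric_grad(w, x, y, eps=1e-6):
    """Central finite-difference check of the same derivative."""
    L = lambda w_: (sigma(w_ * x) - y) ** 2
    return (L(w + eps) - L(w - eps)) / (2 * eps)

w, x, y = 0.7, 1.5, 0.2
print(analytic_grad(w, x, y), numeric_grad(w, x, y))  # should agree closely
```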
Backpropagation, formalized by Linnainmaa in 1970 and popularized by Rumelhart, Hinton, and Williams in 1986, remains the unifying algorithm. GPT, diffusion models, AlphaFold—architectures differ, but the training engine is the same: compute gradients via backprop, update parameters via gradient descent or a variant.
Descending the Loss Landscape
Once gradients are known, gradient descent performs the learning. Update rule: $\theta \leftarrow \theta - \eta \, \nabla_\theta L$. The learning rate $\eta$ controls step size. Too large: divergence, oscillation. Too small: glacial convergence. Stochastic gradient descent uses mini-batches, approximating the full gradient with subsets of data—trading exact computation for faster iteration. Momentum accumulates velocity, smoothing oscillations. Adam adapts learning rates per parameter, accelerating convergence in ill-conditioned landscapes.
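The update rules read almost directly as code. A sketch of the three variants mentioned follows; the hyperparameter values are conventional defaults, not recommendations.

```python
import numpy as np

def sgd_step(theta, grad, lr=1e-2):
    """Plain gradient descent: theta <- theta - eta * grad."""
    return theta - lr * grad

def momentum_step(theta, grad, velocity, lr=1e-2, beta=0.9):
    """Momentum: accumulate a velocity that smooths successive gradients."""
    velocity = beta * velocity + grad
    return theta - lr * velocity, velocity

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from running moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimal usage with a placeholder gradient.
theta, velocity = np.zeros(3), np.zeros(3)
grad = np.array([0.5, -1.0, 0.25])
theta, velocity = momentum_step(theta, grad, velocity)
```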
This is curve fitting scaled to millions of dimensions. Loss is a function over parameter space. Gradient descent follows the steepest descent direction locally, adjusting each parameter to reduce error. It is a mechanical procedure: compute gradient, step downhill, repeat. No intelligence required—just differentiation and subtraction. Yet it suffices to train models that generate coherent text, recognize images, fold proteins.
The loss landscape is a high-dimensional surface. Modern networks have billions of parameters—each a dimension. Visualizations reduce this to 2D or 3D slices, showing valleys and peaks, but they mislead. In high dimensions, the geometry changes. Critical points, where $\nabla_\theta L = 0$, are not typically local minima. Most are saddle points: loss increases in some directions, decreases in others. Random matrix theory suggests the fraction of true local minima approaches zero as dimensionality grows. Saddle points have escape routes. Gradient descent can follow downhill directions away from saddles, while evolution—sampling random perturbations—must guess directions from an exponentially expanding space, embodying the curse of dimensionality.
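A two-dimensional toy makes the escape-route claim concrete, though it necessarily misses the high-dimensional geometry it is meant to illustrate.

```python
import numpy as np

# f(p) = p[0]**2 - p[1]**2 has a saddle point at the origin:
# loss rises along the first axis and falls along the second.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])

p = np.array([1e-3, 1e-3])   # start slightly off the saddle
for _ in range(100):
    p = p - 0.1 * grad(p)    # gradient descent follows the downhill direction
print(p)                     # first coordinate shrinks, second grows: escaped
```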
This inverts our low-dimensional intuition. High dimensionality helps gradient descent by reducing bad local minima, replacing them with saddle points that have escape paths. It cripples uninformed search by diluting productive directions among countless useless ones. Gradient information—the derivative vector pointing toward steepest descent—concentrates search effort, scaling linearly in dimension rather than exponentially.
High-Dimensional Geometry and Optimization Limits
Overparameterization—more parameters than data points—seems paradoxical. Classical statistical theory predicts overfitting: models fitting noise, failing to generalize. Yet modern networks generalize well despite vast overparameterization. One explanation: many global minima exist, and gradient descent implicitly regularizes by finding minima with certain geometric properties. Flat minima—regions where loss changes slowly under parameter perturbation—generalize better than sharp minima. Networks initialized randomly and trained via gradient descent tend toward flat regions, robust to parameter noise.
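There is no single agreed measure of flatness; the following is merely one crude probe of the idea, averaging the loss increase under small random parameter perturbations, with names of my own choosing.

```python
import numpy as np

def sharpness(loss_fn, theta, radius=1e-2, samples=100, rng=None):
    """Crude flatness probe: mean loss increase under random perturbations
    of a fixed radius around theta. Lower values suggest a flatter minimum.
    """
    rng = rng or np.random.default_rng(0)
    base = loss_fn(theta)
    increases = []
    for _ in range(samples):
        direction = rng.standard_normal(theta.shape)
        direction *= radius / np.linalg.norm(direction)
        increases.append(loss_fn(theta + direction) - base)
    return float(np.mean(increases))

# Example: two quadratic bowls; the steeper one reads as sharper.
print(sharpness(lambda t: np.sum(t ** 2), np.zeros(10)))
print(sharpness(lambda t: 100 * np.sum(t ** 2), np.zeros(10)))
```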
Mode connectivity is another puzzle. Distinct minima found from different random initializations often connect via low-loss paths—not isolated basins in a rugged landscape but parts of an interconnected tunnel system. Loss landscapes in overparameterized regimes have structure we do not fully understand.
Gradient descent has limits. It requires differentiability. Activation functions must be smooth almost everywhere. Discontinuous components—binary step functions, certain discrete operations—break gradient flow. Evolutionary algorithms can optimize non-differentiable networks by mutation and selection, but they scale poorly. Biological evolution produces diversity—eyes, wings, brains—not convergence to a single optimum. Our optimization algorithms, gradient-based or evolutionary, capture only fragments of evolution’s creative dynamics.
The question is not whether machines can think, but what they can compute. Backpropagation and gradient descent define a mechanical procedure for learning: differentiable function approximation through iterative parameter adjustment on loss landscapes. This is computation through optimization, not symbolic state transition. It scales to billions of parameters because gradients provide local slope information that guides search efficiently in high dimensions. It succeeds not despite high dimensionality but because of it—saddle points replace bad minima, flat regions implicitly regularize.
Can machines learn? Yes, by mechanically following gradients downward through parameter space. The universal machine I described in 1936 computes by rewriting symbols according to rules. The learning machine of 2024 computes by adjusting weights according to derivatives. Both are mechanical. Both are universal within their domains. Neither requires consciousness. The abstraction changes, but the essential principle—systematic, rule-governed transformation yielding complex behavior—remains. Computation is not one thing. It is a family of phenomena, unified by algorithmic structure.