How Do Things Learn? From Neurons to Neural Networks


Here’s a question worth asking: What does it actually mean to learn something? I don’t mean memorize—a hard drive can memorize. I mean really learn, where you adjust what you do based on what went wrong. When a toddler touches a hot stove and pulls back, when you get better at shooting free throws, when your brain figures out how to ride a bicycle—what’s happening in there?

The beautiful thing is that whether we’re talking about a biological neuron in your cortex or an artificial neuron in a deep neural network, learning turns out to be the same fundamental problem: how do you adjust your behavior to reduce errors? The machinery looks different—one uses electrochemical signals and synaptic weights, the other uses floating-point numbers and matrix multiplication—but the mathematics underneath is surprisingly similar. Both are trying to roll downhill in a landscape of mistakes.

What Does It Mean to Learn?

Let’s start simple. Imagine you’re trying to teach someone to classify images. Show them a picture of a cat, they say “dog.” That’s an error. The question is: how should they adjust their internal machinery to do better next time?

For a biological neuron, the answer came from Donald Hebb in the 1940s: “Neurons that fire together, wire together.” If two neurons are active at the same time, strengthen the connection between them. This is wonderfully local—each synapse only needs to know about the activity on both sides of itself. No global supervisor required. When you want to store a memory pattern in a network, you can do it by setting the weight between neuron i and neuron j proportional to the product of their activity levels in that pattern. Co-active pairs get positive weights, anti-correlated pairs get negative weights. Store multiple patterns by simply adding up their contributions.
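
To make that concrete, here is a minimal sketch of the storage rule in code. The bipolar ±1 activity convention, the scaling by the number of neurons, and the zeroed self-connections are my choices for the illustration, nothing essential.

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian storage: the weight between neurons i and j is the sum,
    over stored patterns, of the product of their activities."""
    patterns = np.asarray(patterns, dtype=float)  # shape: (num_patterns, num_neurons)
    num_neurons = patterns.shape[1]
    W = patterns.T @ patterns / num_neurons       # add up one outer product per pattern
    np.fill_diagonal(W, 0.0)                      # no neuron talks to itself
    return W

# Two toy memories over six neurons, +1 for active and -1 for silent.
memories = [[ 1, -1,  1, -1,  1, -1],
            [ 1,  1, -1, -1,  1,  1]]
W = hebbian_weights(memories)
```

Co-active pairs end up with positive weights, anti-correlated pairs with negative ones, just as described above.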

This creates something remarkable: each stored pattern becomes a valley in an energy landscape. The network’s total energy is just the sum of how well all the neurons agree with their neighbors through their weighted connections. When you start near a stored pattern, the network naturally rolls downhill—neurons flip to reduce conflicts—until it settles into that memory. It’s like a ball bearing finding the bottom of a bowl.
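
Continuing that sketch, here is the energy and the settling dynamics in the style of a Hopfield network. The random one-neuron-at-a-time update order is an assumption; `W` and `memories` come from the snippet above.

```python
def energy(W, state):
    """Total disagreement: low when each neuron agrees with what its
    weighted neighbors suggest it should be."""
    return -0.5 * state @ W @ state

def recall(W, start_state, steps=100, seed=0):
    """Flip one neuron at a time toward whatever lowers the energy,
    until the state settles into a stored valley."""
    state = np.array(start_state, dtype=float)
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(len(state))
        state[i] = 1.0 if W[i] @ state >= 0 else -1.0
    return state

noisy = np.array([1, -1, 1, 1, 1, -1], dtype=float)  # first memory, one neuron flipped
settled = recall(W, noisy)
print(energy(W, noisy), "->", energy(W, settled))    # the energy drops as it settles
print(settled)                                       # back to [ 1. -1.  1. -1.  1. -1.]
```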

Now here’s where it gets interesting. That same “rolling downhill” intuition applies to artificial neural networks, but with a twist. Instead of an energy landscape over neuron states, we have a loss landscape over parameters. Imagine an 18-dimensional space—17 dimensions for the weights and biases in a small network, plus one dimension for how wrong the network is. Before training, the network sits somewhere high up in this space, making terrible predictions. Learning means finding your way downhill.
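
In code, that landscape is nothing more exotic than a function from parameters to a number. A deliberately tiny example, with one weight and one bias so the space is only two parameters plus the error:

```python
import numpy as np

def loss(w, b, xs, ys):
    """Height of the landscape at the point (w, b): the average squared
    error of the model y = w * x + b on some training data."""
    return float(np.mean((w * xs + b - ys) ** 2))

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 3.0, 5.0, 7.0])   # data generated by w = 2, b = 1

print(loss(0.0, 0.0, xs, ys))         # somewhere high up the hillside: 21.0
print(loss(2.0, 1.0, xs, ys))         # the bottom of the valley: 0.0
```

Training a real network is the same picture with thousands of axes instead of two.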

Rolling Downhill in Error Space

But a toy with 17 parameters is one thing; a real network has thousands. How do you find downhill when you're in 13,000 dimensions? You can't visualize it. You can't just look around and see which way is down. This is where backpropagation becomes beautiful—it's a way to efficiently compute exactly which direction is steepest downhill from wherever you currently are.

The key insight is the chain rule from calculus. If your network is a composition of functions—layer after layer transforming inputs into outputs—then the derivative of the whole thing is just the product of derivatives of each piece. Start at the output: compare what you predicted to what you should have predicted, measure the error. Now ask: how would tweaking each neuron in the last layer change that error? Then work backward: how would tweaking neurons in the second-to-last layer change the last layer, which changes the error? Keep going, multiplying these sensitivity factors together as you propagate backward through the network.

What you get is a gradient: a vector in that impossibly high-dimensional space pointing in the direction of steepest ascent. Nudge your parameters a little in the opposite direction and you reduce the error as quickly as possible. This is gradient descent, and backpropagation is the efficient algorithm for computing the gradient.
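
Here are both ideas in one small sketch: a two-layer network, the chain rule run backward through it, and a step against the gradient. The architecture, the tanh nonlinearity, the learning rate, and the random data are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # 4 examples, 3 input features
y = rng.normal(size=(4, 1))            # targets
W1, b1 = 0.1 * rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = 0.1 * rng.normal(size=(5, 1)), np.zeros(1)
lr = 0.05                              # how big a step to take downhill

for step in range(500):
    # Forward pass: layer after layer transforming inputs into outputs.
    h_pre = x @ W1 + b1
    h = np.tanh(h_pre)
    pred = h @ W2 + b2
    loss = np.mean((pred - y) ** 2)

    # Backward pass: the chain rule, multiplying sensitivities layer by layer.
    d_pred = 2 * (pred - y) / len(x)   # how the loss changes with the prediction
    dW2 = h.T @ d_pred                 # how the loss changes with the output weights
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T                # push the error back through W2...
    d_h_pre = d_h * (1 - h ** 2)       # ...and through the tanh nonlinearity
    dW1 = x.T @ d_h_pre
    db1 = d_h_pre.sum(axis=0)

    # Gradient descent: step a little way against the gradient.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(loss)                            # far below where it started
```

Every line of the backward pass is one factor of the chain rule; the last four lines are the "move a little bit downhill" step.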

The remarkable thing is that this works at all. In high-dimensional spaces, you’d think you’d get trapped in local minima—little bowls that aren’t the deepest bowl. But it turns out that in really high dimensions, most critical points aren’t local minima—they’re saddle points. Imagine a mountain pass: downhill in some directions, uphill in others. The gradient can usually find a way through. The geometry of the loss landscape becomes cooperative rather than hostile.
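
A mountain pass in miniature, using the made-up function f(x, y) = x² − y², which is uphill along x and downhill along y:

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x**2 - y**2, a textbook saddle point at the origin."""
    return (2 * x, -2 * y)

print(grad_f(0.0, 0.0))     # (0.0, -0.0): the gradient vanishes, yet this is no minimum
print(grad_f(0.0, 0.001))   # (0.0, -0.002): the tiniest nudge reveals the way down along y
```

Stepping against that gradient pushes y outward and the value keeps falling; a true local minimum would offer no such escape.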

And here’s what’s really happening as the network trains: it’s building a hierarchy of features. The first layer might learn to detect simple edges or color boundaries. The second layer combines those edges into corners and curves. The third layer builds textures and simple shapes. Each layer operates on increasingly abstract representations, transforming raw pixels into concepts.

This hierarchical construction happens naturally because each layer sees its input through the lens of all previous transformations. Early layers create simple divisions in the data space. Later layers fold and reshape those already-divided regions into more complex structures. A network with two hidden layers doesn’t just divide the input space twice—it divides it, then subdivides the subdivisions, creating a rich hierarchy of regions that can fit intricate decision boundaries.
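
One way to see the folding is to count linear pieces. The little tent-shaped ReLU network below is my own construction for illustration; the point is that composing it with itself multiplies the number of pieces rather than merely adding to it.

```python
import numpy as np

def tent(x):
    """A one-hidden-layer ReLU net on a 1-D input: 2*relu(x) - 4*relu(x - 0.5).
    On [0, 1] it rises to 1 at x = 0.5 and falls back to 0: one fold."""
    relu = lambda z: np.maximum(z, 0.0)
    return 2 * relu(x) - 4 * relu(x - 0.5)

def count_linear_pieces(xs, ys):
    """Count the linear pieces of a piecewise-linear curve by counting slope changes."""
    slopes = np.round(np.diff(ys) / np.diff(xs), 3)
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))

xs = np.linspace(0.0, 1.0, 2001)
print(count_linear_pieces(xs, tent(xs)))          # 2 pieces: one layer, one fold
print(count_linear_pieces(xs, tent(tent(xs))))    # 4 pieces: the second layer folds the folds
```

Stack more layers and the count keeps doubling, which is the "subdivide the subdivisions" picture put into numbers.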

Why Biology and Machines Found Similar Solutions

Now let me tell you about predictive coding, because this is where biological learning gets even closer to backpropagation. The brain doesn’t have a separate forward pass and backward pass. Instead, imagine every neuron has two signals: its current activity and what higher layers predict it should be doing. The difference between them is the prediction error. The whole network is like a mechanical system with springs connecting neurons to their predictions. When sensory input comes in, all the springs get stretched, creating tension. Learning means adjusting everything—both neural activities and synaptic weights—to minimize that total spring tension.
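
Here is a minimal sketch of that spring picture for a two-layer chain. The linear top-down predictions, the squared-error springs, and the step sizes are all simplifying assumptions; the point is only that activities and weights both follow the gradient of one shared tension.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(8, 4))    # layer 2's prediction of layer-1 activity
W0 = 0.1 * rng.normal(size=(16, 8))   # layer 1's prediction of the sensory input

def settle_and_learn(s, W0, W1, steps=200, lr_x=0.1, lr_w=0.02):
    """Clamp the sensory input s, then relax both activities and weights
    downhill on the total squared prediction error (the spring tension)."""
    x1, x2 = np.zeros(8), np.zeros(4)
    for _ in range(steps):
        e0 = s - W0 @ x1               # spring between the input and layer 1's prediction of it
        e1 = x1 - W1 @ x2              # spring between layer 1 and layer 2's prediction of it
        x1 += lr_x * (W0.T @ e0 - e1)  # each activity is pulled by the springs above and below
        x2 += lr_x * (W1.T @ e1)
        W0 += lr_w * np.outer(e0, x1)  # weight updates are purely local: an error times an activity
        W1 += lr_w * np.outer(e1, x2)
    return float(e0 @ e0 + e1 @ e1)

s = rng.normal(size=16)
print(float(s @ s))                    # tension before anything relaxes (all activities zero)
print(settle_and_learn(s, W0, W1))     # tension after settling: a good deal lower
```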

This is continuous energy minimization. The math is different from backpropagation’s discrete forward-and-backward passes, but the principle is identical: follow the gradient downhill on an error landscape. And critically, it’s biologically plausible—each synapse only needs local information, not some magical error signal teleported from the output layer.

What we’re seeing is convergent evolution of algorithms. Whether you’re a brain made of neurons or a neural network running on GPUs, you’re facing the same optimization problem: adjust your parameters to reduce the mismatch between predictions and reality. The brain does it with electrochemistry and continuous dynamics. Artificial networks do it with calculus and matrix algebra. But both are gradient descent. Both are rolling downhill.

The details matter, of course. Biological learning has temporal dynamics, local plasticity rules, and architectural constraints that backpropagation doesn’t worry about. Artificial networks can calculate exact gradients and update billions of parameters in parallel. But the core idea—learning as error minimization through gradient-following—appears to be a deep principle, something the universe keeps discovering whenever information-processing systems need to adapt.

Here’s what I find beautiful about this: it’s not that we copied the brain to build neural networks. It’s that the problem itself constrains the solution. If you want to adjust a complex system with many interacting parameters to minimize some measure of error, you’re going to end up doing something that looks like gradient descent. The math forces your hand. Hebbian learning, predictive coding, backpropagation—they’re all variations on the theme of “figure out which way is downhill, then go that way.”

Personal Reflection

I’ve spent my life trying to understand things from first principles, and this is as pure an example as I’ve seen. Learning isn’t mysterious—it’s optimization. Whether it’s synaptic weights or network parameters, whether it’s energy functions or loss landscapes, the question is always the same: how do I adjust my configuration to reduce my errors?

The first principle is simple: if you can measure how wrong you are, and you can compute how your wrongness changes as you adjust your parameters, then you can learn. Everything else—the chain rule, the computational graphs, the hierarchical features, the energy landscapes—is just the machinery for making that principle computationally tractable.

What you cannot create, you do not understand. And now that we can create systems that learn—both in silicon and in our mathematical models of neurons—we understand learning a bit better. Not perfectly, not completely, but enough to see the elegant simplicity underneath.

The beauty is that it all comes down to rolling downhill. Nature found that trick with neurons. We found it with calculus. Same hill, different shoes.
