Finding Valleys in Million-Dimensional Space
Alright, let’s talk about learning. Not the kind where you sit in a classroom, but the kind where a computer figures something out. And here’s the thing that fascinates me: when you look at it the right way, it’s all about geography. About hills and valleys. About being lost in a space so big you can’t even picture it, yet somehow finding your way downhill.
The Geography of Error
Imagine you’re standing on a hillside. You can’t see very far—maybe there’s fog, or it’s night—but you can feel which way is down. You take a step downhill. Then another. Eventually, you end up in a valley. That’s it. That’s the whole idea.
Now here’s where it gets interesting. What if the hillside has a million dimensions?
You see, when we train a neural network, we’re not adjusting one number. We’re adjusting thousands, sometimes millions of numbers—all the weights and biases that control how the network responds. Each one of those numbers is like a direction you could move. If you’ve got 13,000 parameters, you’re standing in a 13,000-dimensional space. And somewhere in that space is a valley—a configuration where your network makes good predictions and the error is low.
The landscape itself is defined by error. At every possible combination of parameters, there’s a height—the loss, we call it—measuring how wrong your network’s predictions are. High loss means you’re on a mountain peak, making terrible predictions. Low loss means you’re in a valley, making good ones. The whole training process is just rolling downhill in this absurdly high-dimensional space.
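To make that concrete, here is a toy sketch of my own (not taken from any particular network): a model with only two parameters, a slope and an intercept, so the loss landscape really is a surface you could plot, one height for every combination of parameters.

```python
import numpy as np

# Toy data: y is roughly 3*x + 1, plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3 * x + 1 + 0.1 * rng.normal(size=50)

def loss(w, b):
    """Mean squared error of the two-parameter model y_hat = w*x + b."""
    return np.mean((w * x + b - y) ** 2)

# Sample the landscape on a grid: every (w, b) combination gets a height.
ws = np.linspace(-2, 8, 101)
bs = np.linspace(-4, 6, 101)
surface = np.array([[loss(w, b) for b in bs] for w in ws])

# The lowest point of this surface sits near the true parameters (3, 1).
i, j = np.unravel_index(surface.argmin(), surface.shape)
print(f"lowest grid point: w={ws[i]:.2f}, b={bs[j]:.2f}, loss={surface[i, j]:.4f}")
```

With two parameters you can afford to look at every point like this. With thousands, you can't, and that is where the real problem starts.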
But you can’t see the whole landscape. You’re standing at one point, and all you know is: which way is down from here?
Following the Steepest Path
This is where the gradient comes in, and it’s actually a beautiful piece of mathematics. The gradient is a vector—a direction—that points uphill. More precisely, it points in the direction where the function increases fastest. Each component tells you: if I wiggle this particular parameter, how much does my error change?
So you compute this gradient vector, and then you do something clever: you go the opposite direction. You step downhill. The gradient points up, so negative gradient points down.
Now, in a two-dimensional landscape, this is obvious. The gradient is just two numbers: how steep is it in the x-direction, how steep in the y-direction? You can draw it as an arrow on a map. But when you’ve got 13,000 dimensions, the gradient is a list of 13,000 numbers, each one telling you how sensitive your error is to one specific parameter. You can’t visualize it, but the math doesn’t care. You still just step in the direction that decreases error most quickly.
This algorithm—gradient descent—is almost absurdly simple. Start somewhere random. Compute the gradient. Take a small step downhill. Repeat. And somehow, after thousands of iterations, you end up with a network that can recognize faces or translate languages or play chess.
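In code, that loop is about as short as the description. Here is a minimal sketch on the same kind of two-parameter toy model (the learning rate and step count are arbitrary choices of mine, not anything canonical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3 * x + 1 + 0.1 * rng.normal(size=50)          # data for y roughly 3x + 1

def loss_and_grad(w, b):
    """Mean squared error of y_hat = w*x + b, and its gradient."""
    err = w * x + b - y
    loss = np.mean(err ** 2)
    grad_w = np.mean(2 * err * x)                   # d(loss)/dw
    grad_b = np.mean(2 * err)                       # d(loss)/db
    return loss, grad_w, grad_b

# Start somewhere random. Compute the gradient. Step downhill. Repeat.
w, b = rng.normal(), rng.normal()
learning_rate = 0.5
for step in range(200):
    _, gw, gb = loss_and_grad(w, b)
    w -= learning_rate * gw                         # minus sign: the gradient
    b -= learning_rate * gb                         #   points uphill, we go down

final_loss, _, _ = loss_and_grad(w, b)
print(f"w={w:.2f}, b={b:.2f}, loss={final_loss:.4f}")  # ends up near (3, 1)
```

For a real network the gradient has thousands or millions of components instead of two, and backpropagation is what computes them efficiently, but the loop itself doesn't change.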
The question is: why does this work?
The Problem of Valleys
Let’s think about what could go wrong. The most obvious problem is that you might get stuck in the wrong valley. These are called local minima—places where you’re at the bottom of your particular valley, but there are deeper valleys elsewhere in the landscape. If you start on the wrong hillside, gradient descent will march you down to a mediocre solution and stop, because every direction from there is uphill.
For years, people worried about this. With complex landscapes in high dimensions, surely there must be millions of local minima, each one a trap waiting to catch your algorithm?
But here’s the strange thing that researchers discovered: in very high dimensions, local minima are rare. What you get instead are saddle points—places where you’re at a minimum in some directions but a maximum in others, like a mountain pass. And saddle points don’t trap you. There’s always some direction that’s still downhill, so gradient descent keeps moving.
The math here is actually kind of beautiful. In low dimensions, a critical point—a place where the gradient is zero—is usually either a minimum (a valley bottom) or a maximum (a peak). But in high dimensions, most critical points are saddle points. To be a true minimum, the surface has to curve upward along every single one of your thousands of directions at once, and that's like asking thousands of coin flips to all come up heads. So the very thing that makes the problem seem impossible—the huge number of parameters—actually helps you, because it means you're unlikely to get truly stuck.
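You can check a rough version of this claim numerically. Here is a small sketch under an assumption I'm adding purely for illustration: that the curvature at a critical point looks like a random symmetric matrix. Then ask how often every direction curves upward, which is what a true minimum requires.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_true_minima(dim, trials=2000):
    """Model the curvature (Hessian) at a critical point as a random symmetric
    matrix and count how often *all* its eigenvalues are positive, i.e. how
    often the point is a genuine local minimum rather than a saddle."""
    hits = 0
    for _ in range(trials):
        a = rng.normal(size=(dim, dim))
        hessian = (a + a.T) / 2                 # symmetric, so real eigenvalues
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            hits += 1
    return hits / trials

for dim in (1, 2, 4, 8, 12):
    print(f"{dim:>3} dimensions: {fraction_true_minima(dim):.3f} of critical points are minima")
# Around one half in one dimension, collapsing toward zero as the dimension
# grows: with many directions available, almost everything is a saddle.
```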
That doesn’t mean training always works. You can still fail for other reasons: maybe the gradient becomes so small you barely move, or maybe you hit a “dead zone” where certain neurons stop responding entirely. But the simple fear—that you’ll be trapped in some local minimum far from the good solutions—turns out to be less of a problem than you’d think.
The Curious Case of Depth
Now let’s ask a different question. Suppose I have 130 neurons to work with. I could arrange them in a shallow network—say, two layers, with most neurons in the first layer. Or I could arrange them in a deep network—five layers, with neurons spread across all of them. Same total neuron count. Which is better?
Before we had good computational experiments, you might have guessed the two arrangements would be roughly equivalent: it's the same pool of neurons, just rearranged. But the answer is striking: deep networks are vastly more powerful. A five-layer network with 130 neurons can learn patterns that a two-layer network struggles with even when you hand it 100,000 neurons. That's roughly an 800-fold difference in parameter efficiency.
Why? It comes back to composition. A shallow network stacks operations side by side. Each neuron does its own thing, and you combine the results. But a deep network stacks operations in sequence. The second layer works on the output of the first layer, which means it’s operating on features, not raw inputs. The third layer works on combinations of those features. And so on.
This compounds. Mathematically, the number of different regions a shallow network can carve out in input space grows only polynomially with the number of neurons. But in a deep network, it grows exponentially with the number of layers. Each layer doesn’t just add capacity—it multiplies it.
It’s like the difference between addition and multiplication. If I give you four numbers and ask you to add them, you get one result. If I ask you to multiply them, you get a different result—and as the numbers get larger, the multiplication result grows much faster. Deep networks are doing something analogous: each layer multiplies the representational power, rather than just adding to it.
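There is a classic construction that makes the doubling explicit (this is the standard "hat function" example, not something tied to any one network): a tiny ReLU block computes a triangular bump, and composing that same block layer after layer folds the input back on itself, roughly doubling the number of linear pieces with every layer.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(x):
    """One small ReLU 'layer': a triangle on [0, 1] built from three ReLUs.
    It rises from 0 to 1 and back to 0, so it has two linear pieces."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1)

def hat_tower(x, depth):
    """Feed the output of one hat layer into the next, depth times."""
    for _ in range(depth):
        x = hat(x)
    return x

def count_linear_pieces(f, n=200_001):
    """Crudely count linear pieces on [0, 1] by spotting slope changes."""
    xs = np.linspace(0.0, 1.0, n)
    slopes = np.diff(f(xs)) / np.diff(xs)
    changes = np.sum(~np.isclose(slopes[1:], slopes[:-1], atol=1e-6))
    return changes + 1

for depth in range(1, 6):
    pieces = count_linear_pieces(lambda x: hat_tower(x, depth))
    print(f"depth {depth}: ~{pieces} linear pieces")
# Prints roughly 2, 4, 8, 16, 32: each extra layer multiplies the count,
# while extra width in a single layer only adds pieces one at a time.
```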
So when gradient descent navigates the loss landscape of a deep network, it’s navigating a fundamentally different kind of terrain than it would in a shallow network. The landscape has more useful structure. There are pathways that lead to good solutions more reliably. The geometry of the problem is shaped by the architecture.
Other Ways to Explore
Gradient descent isn’t the only way to search a landscape. You could try evolution. Start with random parameter settings, make copies with small mutations, keep the best ones, repeat. This is called evolutionary search, and it has a certain appeal: it doesn’t need gradients or calculus, it just needs a way to measure fitness.
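Written out, that recipe is about as short as the gradient descent loop. A minimal sketch, with the population size, mutation scale, and stand-in loss function all being arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(params):
    """Stand-in for 'run the network on your data and measure the error':
    here, just the squared distance from some unknown good setting."""
    target = np.linspace(-1, 1, params.size)
    return np.sum((params - target) ** 2)

dim = 100                              # imagine 13,000 or a million instead
best = rng.normal(size=dim)            # start with random parameter settings
best_loss = loss(best)

for generation in range(500):
    # Make copies with small mutations, keep the best, repeat.
    offspring = best + 0.05 * rng.normal(size=(20, dim))
    losses = np.array([loss(child) for child in offspring])
    if losses.min() < best_loss:
        best, best_loss = offspring[losses.argmin()], losses.min()

print(f"loss after {500 * 20} full evaluations: {best_loss:.2f}")
# Every generation costs 20 complete evaluations of the loss, and the progress
# per evaluation is tiny compared with a single gradient step.
```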
But evolution is slow. Each mutation requires a full evaluation—run the network on your data, compute the loss, see if it’s better. And in high dimensions, most random mutations don’t help. You’re wandering semi-randomly through a vast space, hoping to stumble downhill. It works for small problems, but as parameter counts grow into the thousands or millions, evolutionary search becomes painfully inefficient.
Gradient descent, by contrast, tells you exactly which way to go. Instead of trying random directions and seeing what happens, you compute which direction is steepest and move that way. It’s the difference between wandering in the fog and having a compass that always points downhill.
Evolution has its place—there are problems where the landscape isn’t smooth, or where you can’t compute gradients, and then randomized search might be your best option. But for neural networks, where we can compute gradients efficiently using backpropagation, gradient descent is almost always faster.
Energy and Attractors
Here’s a connection that delights me: this whole framework of landscapes and valleys shows up in physics too. When a physical system minimizes energy, it’s doing the same thing neural networks do when they minimize loss. The mathematics is identical.
Think about a ball rolling down a bowl. The ball has potential energy based on its height, and it naturally rolls to the lowest point. Or think about how proteins fold: a protein is a chain of amino acids that twists into a specific three-dimensional shape, and it does this by minimizing its energy. The landscape is defined by all possible configurations, and the protein “finds” the right fold by settling into an energy minimum.
In neuroscience, there are models where brain circuits work the same way. Memory retrieval can be thought of as energy minimization: you start with a partial cue—a few notes of a song, say—and the network dynamics drive the state downhill until it settles into a complete memory. The memories themselves are valleys in the energy landscape, and the process of remembering is just rolling downhill to the nearest one.
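The classic version of that memory model is the Hopfield network, and it fits in a few lines. A minimal sketch (my own toy sizes and patterns): stored patterns dig valleys in an energy function, and updating from a partial cue rolls the state downhill into the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 64                                         # binary (+1/-1) units
patterns = rng.choice([-1, 1], size=(3, n))    # three 'memories' to store

# Hebbian weights: each stored pattern becomes a valley in the energy landscape.
W = (patterns.T @ patterns) / n
np.fill_diagonal(W, 0)

def energy(s):
    """Hopfield energy; the stored memories sit at or near its local minima."""
    return -0.5 * s @ W @ s

# Partial cue: take one memory and scramble a quarter of it.
cue = patterns[0].copy()
cue[: n // 4] = rng.choice([-1, 1], size=n // 4)

state = cue.copy()
for _ in range(10):                            # a few sweeps of updates
    for i in rng.permutation(n):
        state[i] = 1 if W[i] @ state >= 0 else -1   # each flip never raises the energy

print("energy of the cue:      ", round(energy(cue), 3))
print("energy after settling:  ", round(energy(state), 3))
print("recovered stored memory:", np.array_equal(state, patterns[0]))
```

It's the same downhill story as training, just with the roles swapped: here the weights are fixed and the state rolls downhill, whereas in training the data is fixed and the weights roll downhill.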
So when we talk about training neural networks, we’re not just using a metaphor. We’re describing a real geometric process that appears everywhere: in learning, in physics, in biological systems. The landscape is real—it’s just high-dimensional, so we can’t see it. But the math works exactly the same way whether you’re minimizing energy in a protein or loss in a neural network.
Why Landscapes?
Let me step back and ask: why does thinking about landscapes help?
Partly, it’s because our brains are built for spatial reasoning. We understand maps. We understand hills and valleys. And when you translate an abstract optimization problem into geographic terms, suddenly your intuition kicks in. You can ask questions like: Is the landscape smooth or rugged? Are there many valleys or just a few? How do I avoid getting trapped?
But it’s more than just intuition. The landscape perspective reveals structure in the problem. For instance, understanding that deep networks create different landscape geometries than shallow networks explains why depth matters. It’s not just that deep networks are “better”—it’s that they create loss landscapes with more navigable structure, with pathways that lead more reliably to good solutions.
Similarly, understanding saddle points in high dimensions explains why local minima are less of a problem than early researchers feared. The geometry of high-dimensional spaces has properties that low-dimensional spaces don’t, and those properties shape how optimization works.
So the landscape isn’t just a teaching tool. It’s a framework that connects different ideas—gradient descent, network architecture, energy minimization, evolutionary search—into a single coherent picture. And when you see that picture, a lot of mysteries start to make sense.
Training a neural network is about navigating a space too large to imagine. But the principles are the same whether you’re in two dimensions or two million. Follow the gradient downhill. Use the structure of the landscape—the pathways carved by depth, the scarcity of true minima in high dimensions. And eventually, you’ll find a valley where the error is low and the network has learned.
It’s really just rolling downhill. But in a million dimensions, that simple act becomes remarkable.