The Uncertainty Principle of Learning
Here’s a puzzle that’s been nagging at me. You train a neural network to recognize images—cats, dogs, the usual suspects. It learns beautifully. Training error drops. Test accuracy climbs. Everything looks perfect. Then you feed it an image from outside its training distribution—say, complete random noise—and the network doesn’t hesitate. It declares with 99.7% confidence: “That’s definitely a cat.”
This isn’t a bug in your code. It’s not poor engineering. It’s something deeper, something that reminds me of what Heisenberg discovered about quantum mechanics nearly a century ago.
The Overconfidence Puzzle
In quantum mechanics, you run into a hard wall: you cannot simultaneously know a particle’s position and momentum with arbitrary precision. The more precisely you pin down where it is, the less you can say about where it’s going. This isn’t measurement error. It’s not about building better instruments. It’s fundamental to the structure of nature itself—complementary observables that refuse to be simultaneously defined.
Machine learning has its own uncertainty principle, and most people building these systems don’t even realize it exists. You cannot simultaneously minimize training error and maintain calibrated uncertainty. A network that fits your training data perfectly becomes pathologically overconfident on everything—including inputs that are pure nonsense.
Why? Let’s trace the incentive structure built into how these things learn. When you train a classifier, you’re typically minimizing cross-entropy loss. Cross-entropy measures your average surprise when outcomes are generated by the true distribution but scored under your model’s probabilities. The formula is simple: for each outcome, take its true probability and multiply it by the negative log of the probability your model assigned, then sum over outcomes. Lower cross-entropy means your model’s expectations align better with what actually happens.
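In symbols, with p the true distribution and q the model’s predicted distribution:

```latex
H(p, q) = -\sum_{x} p(x) \log q(x)
```

For one-hot training labels, this collapses to the negative log of the probability the model assigned to the correct class.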
Now here’s the rub. To minimize cross-entropy on your training data, gradient descent pushes predicted probabilities toward the extremes—toward zero for wrong classes, toward one for correct classes. Every update nudges the network to be more certain. More decisive. More confident. The training objective actively rewards overconfidence within the training distribution.
But the network has no idea what it doesn’t know. It’s never been taught to say “I don’t know.” The cross-entropy gradient has no component that says “be uncertain about things you’ve never seen.” So when you show it random noise, it extrapolates its confident predictions into the void. Position and momentum. Training fit and calibration. Complementary observables.
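To make that concrete before moving on, here’s a toy version of the effect as a sketch: a linear classifier trained on two well-separated Gaussian blobs (the clusters, seed, and probe points below are my own illustrative choices), then asked about points nowhere near its training data.

```python
# Illustrative sketch: a linear classifier trained on two Gaussian blobs
# becomes arbitrarily confident far away from both blobs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
cats = rng.normal(loc=[-2, 0], scale=0.5, size=(200, 2))   # "cat" cluster
dogs = rng.normal(loc=[+2, 0], scale=0.5, size=(200, 2))   # "dog" cluster
X = np.vstack([cats, dogs])
y = np.array([0] * 200 + [1] * 200)

clf = LogisticRegression().fit(X, y)

# Points nowhere near the training data: the model has never seen anything
# like them, yet its predicted probabilities sit hard against 0 and 1.
far_away = np.array([[40.0, 37.0], [-55.0, 12.0]])
print(clf.predict_proba(far_away))   # rows close to [0, 1] or [1, 0]
```

Nothing in the objective penalized those confident answers, because no gradient ever flowed from that region of input space.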
Trading Fit for Calibration
Let’s think about this from first principles. What does it mean for a model to “know” something? In the Bayesian view—and I find this perspective clarifying—knowledge is quantified belief. A probability distribution captures your degrees of belief over possible states. When you observe data, you update beliefs according to Bayes’ rule: the posterior is proportional to the likelihood times the prior.
Here’s what’s crucial: the posterior uncertainty decreases with data, but it never vanishes completely. Even with infinite data from a fixed distribution, you’ve only constrained beliefs within that distribution’s support. You’ve learned nothing about regions you’ve never visited. The Bayesian framework makes this explicit: you cannot extract certainty from thin air.
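A minimal illustration of that bookkeeping, using a coin-flip (Beta-Bernoulli) model as a stand-in for the general case:

```python
# Beta-Bernoulli illustration: posterior uncertainty shrinks with data
# but never reaches zero for any finite sample.
import numpy as np

rng = np.random.default_rng(1)
true_rate = 0.7
alpha, beta = 1.0, 1.0          # uniform prior over the unknown rate

for n in [10, 100, 10_000]:
    flips = rng.random(n) < true_rate
    a = alpha + flips.sum()
    b = beta + (n - flips.sum())
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))   # Beta posterior variance
    print(f"n={n:>6}  posterior mean={mean:.3f}  posterior std={np.sqrt(var):.4f}")
```

The posterior standard deviation keeps shrinking but stays positive for any finite sample, and the update says nothing at all about a different coin you never flipped.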
But standard neural network training ignores this. We maximize likelihood—or equivalently, minimize cross-entropy—without maintaining proper uncertainty. The network’s outputs are probabilities only in the loosest sense: they’re numbers between zero and one that sum to one, but they don’t represent calibrated degrees of belief. A 90% prediction should mean “if I make many predictions at this confidence level, I’ll be right about 90% of the time.” Modern networks are often wildly miscalibrated. Their 90% predictions might be right 60% of the time, or 99% of the time, depending on whether the inputs are in distribution or out of it.
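One way to make “calibrated” concrete is a reliability check: bin predictions by confidence and compare each bin’s stated confidence with its hit rate. A rough sketch of expected calibration error, assuming you already have arrays of confidences and 0/1 correctness:

```python
# Rough sketch of a reliability check: bin predictions by confidence and
# compare each bin's average confidence to its empirical accuracy.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap          # weight by bin population
    return ece

# A model whose 90%-confidence predictions are right only 60% of the time:
conf = np.full(1000, 0.9)
hits = (np.random.default_rng(2).random(1000) < 0.6).astype(float)
print(expected_calibration_error(conf, hits))   # roughly 0.3
```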
Cross-entropy loss doesn’t care about calibration. It cares about being correct on the training set. And being correct means pushing probabilities toward the extremes. A predicted probability of 0.9 on the correct class contributes more loss than 0.99, which in turn contributes more than 0.999. The gradient never stops pulling toward absolute certainty.
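The numbers make the point; the per-example loss (natural log) keeps shrinking but only reaches zero at a probability of exactly one:

```latex
-\ln(0.9) \approx 0.105, \qquad -\ln(0.99) \approx 0.010, \qquad -\ln(0.999) \approx 0.001
```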
This creates a fundamental tension. Training performance demands confidence—sharp, decisive predictions that minimize loss. But generalization requires humility—knowing what you don’t know, maintaining uncertainty in regions where you have no data. You cannot optimize both simultaneously. They are, mathematically, complementary.
Just as measuring position disturbs momentum in quantum mechanics, reducing training loss disturbs calibration in machine learning. The very process that makes your model fit the data destroys its ability to honestly represent uncertainty. This isn’t a limitation of particular algorithms. It’s the structure of the problem itself.
Teaching Networks to Say “I Don’t Know”
So what do you do? If this is truly an uncertainty principle—a fundamental trade-off—you can’t eliminate it. But you can choose which observable to measure precisely.
One approach: temperature scaling. After training, you fit a single temperature parameter, typically on held-out validation data, that softens the probabilities. Instead of feeding the logits directly through softmax, you divide them by the temperature before normalizing. A temperature greater than one spreads probability mass more evenly, reducing overconfidence. You’re explicitly trading sharpness for calibration. You’ve accepted that you can’t have both.
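A minimal sketch of the mechanism in plain NumPy, with an illustrative temperature rather than one actually fit on validation data:

```python
# Temperature scaling sketch: divide logits by T > 1 before softmax to
# soften the predicted distribution without changing the argmax.
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0])      # an overconfident prediction
print(softmax(logits))                  # ~[0.997, 0.002, 0.001]
print(softmax(logits / 3.0))            # T = 3: same ranking, softer probabilities
```

Note that dividing the logits by a constant leaves the ranking, and therefore the accuracy, unchanged; only the stated confidence moves.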
Another approach: ensembles. Train multiple networks and average their predictions. Each network might be overconfident individually, but their disagreements reveal uncertainty. Where they agree, you can trust the prediction. Where they disagree, you know you’re in uncertain territory. The ensemble doesn’t eliminate overconfidence—each member is still trained to minimize cross-entropy—but the variation between members surfaces hidden uncertainty. You’ve paid computational cost to glimpse the complementary observable.
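A sketch of the aggregation step, assuming each member already returns a softmax probability vector (the numbers below are made up to contrast agreement with conflict):

```python
# Deep-ensemble sketch: average member probabilities, and use the spread
# between members as a signal of how uncertain the ensemble really is.
import numpy as np

def ensemble_predict(prob_list):
    """prob_list: one (n_classes,) probability vector per ensemble member."""
    probs = np.stack(prob_list)              # (n_members, n_classes)
    mean = probs.mean(axis=0)                # the ensemble's prediction
    disagreement = probs.std(axis=0).max()   # crude spread-based uncertainty
    return mean, disagreement

# Members agree -> low disagreement; members conflict -> high disagreement.
print(ensemble_predict([np.array([0.9, 0.1])] * 5))
print(ensemble_predict([np.array([0.9, 0.1]),
                        np.array([0.2, 0.8]),
                        np.array([0.6, 0.4])]))
```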
A third approach: Bayesian neural networks. Instead of point estimates for weights, maintain probability distributions. This is conceptually cleaner—you’re explicitly representing uncertainty about the model itself—but computationally expensive. Exact Bayesian inference is intractable for deep networks, so you resort to approximations. Variational inference. Monte Carlo dropout. Laplace approximation. Each trades exactness for tractability, accepting that you cannot simultaneously have perfect uncertainty quantification and perfect computational efficiency. Another complementary pair.
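Here’s a sketch of one of those approximations, Monte Carlo dropout, in PyTorch; the architecture is a placeholder, and the spread across stochastic forward passes stands in for posterior uncertainty:

```python
# Monte Carlo dropout sketch: dropout stays active at inference time, so
# repeated forward passes sample different subnetworks; their spread is a
# cheap stand-in for uncertainty over the weights.
import torch
import torch.nn as nn

model = nn.Sequential(               # placeholder architecture
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 3),
)

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                    # keeps Dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)

x = torch.randn(1, 20)
mean, std = mc_dropout_predict(model, x)
print(mean, std)                     # high std flags uncertain predictions
```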
Even out-of-distribution detection follows this pattern. You add auxiliary objectives, regularization terms, or architectural constraints that penalize confident predictions on unfamiliar inputs. But these come at a cost: they typically reduce performance on the training distribution. You’re explicitly trading in-distribution fit for out-of-distribution awareness. You cannot have both maximally.
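One common shape for such an auxiliary objective, sketched here as an assumption rather than a recipe from these notes, is outlier-exposure-style training: keep the usual cross-entropy on in-distribution batches and add a term that pulls predictions on known-outlier inputs toward the uniform distribution, with a weight that sets the trade-off explicitly.

```python
# Sketch of an auxiliary out-of-distribution objective: standard cross-entropy
# on in-distribution data, plus a penalty that pulls predictions on auxiliary
# outlier inputs toward the uniform distribution. The weight `lam` is the knob
# that trades in-distribution fit for out-of-distribution humility.
import torch
import torch.nn.functional as F

def loss_with_ood_term(model, x_in, y_in, x_out, lam=0.5):
    ce = F.cross_entropy(model(x_in), y_in)
    log_probs_out = F.log_softmax(model(x_out), dim=-1)
    # Cross-entropy to the uniform distribution (up to a constant): minimized
    # when the model spreads probability evenly over classes on outliers.
    uniformity = -log_probs_out.mean()
    return ce + lam * uniformity
```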
Regularization itself fits this framework. Why add L2 penalties to weights? From the Bayesian perspective, regularization encodes prior beliefs. A Gaussian prior on weights says “most parameters should be small.” This belief constrains the posterior even when data are plentiful. It maintains uncertainty. But maintaining uncertainty means accepting that you won’t fit the training data quite as well as you could without the prior. The regularization strength—that λ parameter—is exactly the knob that trades training fit for uncertainty preservation.
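The correspondence is direct: a zero-mean Gaussian prior on the weights turns maximum a posteriori estimation into L2-regularized loss minimization, with λ playing the role of the inverse prior variance.

```latex
-\log p(w \mid \mathcal{D})
  = \underbrace{-\log p(\mathcal{D} \mid w)}_{\text{data fit}}
  + \underbrace{\tfrac{1}{2\sigma^2}\lVert w \rVert^2}_{\text{L2 penalty}}
  + \text{const},
  \qquad \lambda = \tfrac{1}{2\sigma^2}
```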
The Limits of Knowledge
So here we are. Perfect knowledge is impossible. Not because we’re not smart enough. Not because we need better algorithms. But because knowledge has complementary aspects that cannot be simultaneously maximized.
In quantum mechanics, this leads to wave-particle duality. Light behaves as waves in interference experiments, and as particles in the photoelectric effect. Which one is it really? The question is wrong. These are complementary descriptions that cannot both be fully instantiated in a single measurement. You choose your experiment, and nature responds accordingly.
In machine learning, we have interpolation-extrapolation duality. A model can fit observed data beautifully or maintain calibrated uncertainty on unobserved data, but not both perfectly. Train to minimize loss, and you get overconfident extrapolation. Regularize to preserve uncertainty, and you sacrifice fit. You choose your objective, and the optimization responds accordingly.
The lesson isn’t pessimism. It’s clarity. We should stop expecting networks to be both perfectly accurate and perfectly calibrated. We should stop treating overconfidence as a bug to be eliminated with better training tricks. Instead, we should design systems that explicitly trade off these complementary quantities, choosing the balance appropriate to the application.
A medical diagnosis system should err toward uncertainty—false humility is safer than false confidence. A spam filter can be aggressive—overconfident mistakes are cheap. The right answer depends on the cost structure of your errors.
What I find beautiful is how this connects to epistemology itself. Knowledge isn’t just “having the right model.” It’s quantified belief—probability distributions that honestly represent what you know and what you don’t. Training pushes toward certainty. Generalization requires uncertainty. These are complementary observables in the geometry of learning.
Heisenberg taught us that nature has fundamental limits. Maybe machine learning is teaching us that knowledge does too. And that’s not a failure of our methods. It’s the structure of understanding itself.