Measuring Uncertainty: Probability Measures and Stochastic Gradient Descent

When I measured radioactive decay in my Paris laboratory, I learned to listen to randomness with precision. Each click from the electrometer represented a single atom’s transformation—inherently unpredictable—yet when I counted thousands of clicks over minutes, the half-life emerged with remarkable consistency. Individual chaos yielding collective order: this was not mysticism but mathematics.

Today’s neural networks navigate similar terrain. Gradient descent updates parameters by sampling random minibatches from training data—discrete draws from a continuous distribution, much like detecting individual alpha particles emitted from a radioactive source. The mathematics underlying both phenomena requires measure theory: a rigorous framework for assigning “size” to sets in ways that respect probability’s essential structure.
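
A minimal sketch of that sampling step, assuming a least-squares loss purely for illustration; the names (`minibatch_sgd_step`, `X`, `y`, `w`) are hypothetical and not drawn from any particular framework.

```python
import numpy as np

def minibatch_sgd_step(w, X, y, batch_size=32, lr=0.01, rng=None):
    """One stochastic gradient step: draw a random minibatch, then update.

    Assumes the loss L(w) = mean((X @ w - y)**2); any differentiable
    loss would do, only the gradient line would change.
    """
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)  # discrete draw from the data
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size            # minibatch gradient estimate
    return w - lr * grad
```

Calling this repeatedly with fresh random indices produces exactly the sequence of discrete draws described above.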

Which Sets Can We Measure?

Not every subset of a space deserves measurement. A measure operates on a sigma-algebra—a collection of sets that contains the whole space and is closed under complements and countable unions. This structure rules out paradoxical sets while preserving the countable additivity that makes probability coherent: if countably many events are pairwise disjoint, the probability of their union equals the sum of their individual probabilities.
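
Stated symbolically (a standard textbook formulation, not specific to these notes), a measure $\mu$ on a sigma-algebra $\mathcal{F}$ over a space $\Omega$ satisfies

$$
\mu(\emptyset) = 0, \qquad \mu\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i) \quad \text{for pairwise disjoint } A_i \in \mathcal{F},
$$

with the additional normalization $\mu(\Omega) = 1$ when $\mu$ is a probability measure.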

In neural network training, we face analogous constraints. The model’s loss landscape defines a measurable space where we seek regions of low error. But we cannot measure every conceivable subset of parameter space; we can only evaluate loss on the sigma-algebra of practically computable sets—those accessible through our chosen optimization procedure. Gradient descent explores this measurable structure by following local slopes, much as I traced radiation patterns by systematic electrometer readings across pitchblende samples.

From Continuous Distributions to Discrete Samples

The Dirac measure concentrates all probability mass at a single point: $\delta_x(S) = 1$ if $x \in S$, and $0$ otherwise. This idealization captures what happens after training converges. The network’s random walk through parameter space—bouncing between gradient estimates computed on different minibatches—eventually settles near a point representing learned weights. The training trajectory collapses from a diffuse probability distribution over parameter space to something approximating a Dirac measure at the final configuration.
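
A toy simulation of that collapse, assuming a one-dimensional quadratic loss with Gaussian gradient noise and a decaying step size; every choice here is illustrative rather than prescriptive.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 5.0                       # start far from the minimum at 0
history = []

for t in range(1, 5001):
    grad = 2.0 * theta + rng.normal(scale=1.0)   # noisy gradient of loss theta**2
    theta -= (0.5 / t) * grad                    # step size decaying like 1/t
    history.append(theta)

late = np.array(history[-1000:])
print(f"mean of last 1000 iterates: {late.mean():.4f}")
print(f"std  of last 1000 iterates: {late.std():.4f}")   # small: mass concentrates near 0
```

The small spread of the late iterates is the numerical counterpart of a diffuse distribution collapsing toward a point mass.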

Yet during training, randomness serves a purpose beyond mere sampling efficiency. Regularization techniques introduce controlled noise—dropout disabling random neurons, data augmentation applying random transformations. This resembles how I distinguished systematic measurement error from background radiation fluctuations. Both scenarios require understanding when randomness improves outcomes and when it merely obscures signal.
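
As a sketch of that controlled noise, here is inverted dropout applied to a layer’s activations; the keep probability and argument names are assumptions for illustration only.

```python
import numpy as np

def dropout(activations, keep_prob=0.8, rng=None, training=True):
    """Inverted dropout: randomly zero units, rescale the survivors so the
    expected activation is unchanged. At inference time it is a no-op."""
    if not training:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) < keep_prob   # Bernoulli keep/drop mask
    return activations * mask / keep_prob
```

Dividing by `keep_prob` during training keeps the expected activation equal to the no-dropout case, so nothing needs rescaling at inference.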

Countable Steps, Continuous Convergence

Measure theory demands countable additivity: the measure of a countable union of disjoint sets equals the sum of their individual measures. Gradient descent performs countably many parameter updates, each incorporating noise from stochastic minibatch selection. The mathematics guarantees that under suitable conditions—much like how sufficiently long observation periods reveal stable decay rates—these noisy steps converge to regions minimizing loss.
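
One standard way to make “suitable conditions” precise is the classical Robbins–Monro requirement on the step sizes $\eta_t$ in the update $\theta_{t+1} = \theta_t - \eta_t \hat{g}_t$, where $\hat{g}_t$ is the noisy minibatch gradient:

$$
\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty.
$$

The first condition lets the iterates travel arbitrarily far if needed; the second keeps the accumulated noise variance finite, so the updates settle down rather than wander forever. Practical learning-rate schedules often relax these conditions.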

The parallel runs deeper: my radioactive decay measurements relied on the law of large numbers transforming atomic randomness into macroscopic predictability. Stochastic gradient descent relies on the same principle. Each minibatch gradient is a random variable; averaging over many steps produces reliable convergence despite per-step uncertainty.
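
That averaging can be checked numerically. Under the same illustrative least-squares setup as the earlier sketch (synthetic data, hypothetical names), many independent minibatch gradients average out to the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + rng.normal(scale=0.1, size=10_000)
w = np.zeros(5)

def minibatch_grad(batch_size):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

full_grad = 2.0 * X.T @ (X @ w - y) / len(X)
avg_grad = np.mean([minibatch_grad(32) for _ in range(2000)], axis=0)

# Each minibatch gradient is a random variable; their average approaches
# the full gradient, so the printed difference is small.
print(np.linalg.norm(avg_grad - full_grad))
```

The gap shrinks as more minibatch gradients are averaged, which is the law of large numbers at work.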

What remains to be understood is whether randomness in learning merely approximates the true gradient—a computational convenience—or whether stochasticity itself contributes essential properties that deterministic methods cannot capture. In my experiments, I could never predict which atom would decay next, only the statistical behavior of populations. Perhaps neural networks, too, require this irreducible randomness not as limitation but as feature.
