Thermal Equilibrium of Thought: Boltzmann Machines and Stochastic Learning

Marie Curie · Noticing science
NeuralNetworks · Mathematics · Observation · BoltzmannMachines

The Distribution I Recognize

When I measured radioactive decay from radium preparations, I observed what seemed paradoxical: individual atomic disintegrations occurred randomly—no measurement could predict when a particular nucleus would emit radiation—yet populations followed precise exponential curves. The decay law was predictable precisely because the underlying randomness obeyed a statistical distribution. The same exponential dependence appears in the Boltzmann distribution: \( p(s) \propto e^{-E_s/T} \), where the probability of a configuration depends exponentially on its energy and on the temperature.

I now see the same mathematics governing artificial neurons. In Boltzmann machines, individual neuron states flip stochastically according to local energy differences. Each unit considers two configurations—on or off—and chooses probabilistically via a sigmoid of its weighted input: \( p(x_i = 1) = \sigma\!\left( 2 \sum_j w_{ij} x_j / T \right) \) for states \( x_j \in \{-1, +1\} \). Like radioactive atoms, individual neurons are unpredictable. Like decay populations, network ensembles settle into precise probability distributions that minimize free energy. The equations are identical—not a metaphorical analogy, but a shared mathematical structure.
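
As a concrete check of this correspondence, here is a minimal sketch in Python. The three-unit network, its weights, and the temperature are arbitrary choices for illustration: each unit is updated stochastically from its local energy gap, and the long-run frequency of every configuration is compared against the exact Boltzmann distribution \( p(s) \propto e^{-E_s/T} \).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical 3-unit network: symmetric weights, zero diagonal, no biases.
W = np.array([[ 0.0,  1.0, -0.5],
              [ 1.0,  0.0,  0.8],
              [-0.5,  0.8,  0.0]])
T = 1.0  # temperature

def energy(s):
    # E(s) = -1/2 * sum_ij w_ij s_i s_j, with states s_i in {-1, +1}
    return -0.5 * s @ W @ s

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(s):
    # Each unit turns on with probability sigma(2 * sum_j w_ij s_j / T),
    # the Boltzmann probability implied by its local energy gap.
    for i in range(len(s)):
        s[i] = 1.0 if rng.random() < sigmoid(2.0 * (W[i] @ s) / T) else -1.0
    return s

# Long run of stochastic updates: tally how often each configuration is visited.
n_sweeps = 200_000
s = rng.choice([-1.0, 1.0], size=3)
counts = {}
for _ in range(n_sweeps):
    s = gibbs_sweep(s)
    counts[tuple(s)] = counts.get(tuple(s), 0) + 1

# Exact Boltzmann probabilities p(s) proportional to exp(-E(s)/T), for comparison.
states = [np.array(c) for c in product([-1.0, 1.0], repeat=3)]
boltz = np.array([np.exp(-energy(st) / T) for st in states])
boltz /= boltz.sum()
for st, p in zip(states, boltz):
    sampled = counts.get(tuple(st), 0) / n_sweeps
    print(tuple(int(x) for x in st), f"exact={p:.3f}", f"sampled={sampled:.3f}")
```

After enough sweeps the sampled and exact columns agree to within sampling noise, which is the sense in which the ensemble, not any single flip, is predictable.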

Energy Landscapes and Learning

In my laboratory, we studied phase transitions: how radium salts crystallized as temperature decreased, atoms settling into lowest-energy configurations. Learning in Boltzmann machines follows the same principle. The network defines an energy function over all possible neuron configurations. Training reshapes this landscape through contrastive Hebbian learning: when the network is clamped to observed data, weights increase in proportion to neuron co-activation (positive phase), deepening energy wells around training patterns. When the network runs freely at equilibrium, weights decrease in proportion to spontaneous co-activation (negative phase), raising the energy of spurious configurations the network hallucinates. Stated as one rule: \( \Delta w_{ij} \propto \langle x_i x_j \rangle_{\text{data}} - \langle x_i x_j \rangle_{\text{model}} \).

This is systematic landscape sculpting. Desired patterns become low-energy stable states. Network learning carves basins of attraction through repeated measurement—the positive phase observes the data's structure, the negative phase measures the model's spontaneous tendency. Hidden units, like unobserved atomic states in spectroscopy, capture abstract features that explain correlations without corresponding directly to visible data. Restricted architectures—visible units connecting only to the hidden layer, with no lateral connections—enable efficient parallel updates, just as bipartite crystal structures permit predictable phase behavior.
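
To make the two phases concrete, here is a minimal sketch of a restricted Boltzmann machine trained with one-step contrastive divergence (CD-1), the usual truncation in which a single reconstruction stands in for the full free-running negative phase. This sketch uses 0/1 units (the common RBM convention, which drops the factor of 2 above); the layer sizes, toy patterns, learning rate, and epoch count are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Stochastic binary states: 1 with probability p, else 0.
    return (rng.random(p.shape) < p).astype(float)

# Hypothetical restricted Boltzmann machine: 6 visible units, 3 hidden units.
n_vis, n_hid = 6, 3
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)
lr = 0.1

# Toy binary patterns to be carved into low-energy states.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]], dtype=float)

for epoch in range(2000):
    # Positive phase: clamp visibles to data, measure co-activation <v h>_data.
    v0 = data
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = sample(p_h0)
    pos = v0.T @ p_h0

    # Negative phase (CD-1): one reconstruction step approximates the
    # free-running equilibrium statistics <v h>_model.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = sample(p_v1)
    p_h1 = sigmoid(v1 @ W + b_hid)
    neg = v1.T @ p_h1

    # Contrastive update: deepen wells around data, raise spurious configurations.
    W += lr * (pos - neg) / len(data)
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)

# Reconstruction check: each training pattern should map back close to itself.
p_h = sigmoid(data @ W + b_hid)
p_v = sigmoid(p_h @ W.T + b_vis)
print(np.round(p_v, 2))
```

The positive statistics come from visibles clamped to the data; the negative statistics come from the model's own reconstruction, which is the spontaneous tendency the rule pushes back against.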

Annealing Toward Structure

Temperature T controls randomness. High temperature means random exploration: neurons flip freely, the system visits many configurations, avoiding premature commitment to local minima. Low temperature means deterministic settling: energy differences dominate, the system crystallizes into learned structure. Gradual cooling—simulated annealing—lets networks escape shallow traps and settle into deeper minima, mimicking how we slowly cooled radium preparations to obtain pure crystals rather than disordered precipitates.
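
A small sketch of the effect, under assumed choices of a random symmetric weight matrix and a geometric cooling schedule: the same stochastic settling procedure is run either with gradual cooling or with a cold temperature from the start, and the final energies are compared.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 8-unit network with random symmetric weights: a rugged landscape.
n = 8
A = rng.standard_normal((n, n))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)

def energy(s):
    return -0.5 * s @ W @ s

def settle(schedule):
    # Stochastic settling: one full sweep of unit updates at each temperature.
    s = rng.choice([-1.0, 1.0], size=n)
    for T in schedule:
        for i in range(n):
            s[i] = 1.0 if rng.random() < sigmoid(2.0 * (W[i] @ s) / T) else -1.0
    return energy(s)

# Gradual cooling (annealing) versus starting cold (greedy settling).
annealed = [settle(np.geomspace(5.0, 0.05, 200)) for _ in range(50)]
cold = [settle(np.full(200, 0.05)) for _ in range(50)]
print("mean final energy, annealed:", np.mean(annealed))
print("mean final energy, cold:    ", np.mean(cold))
```

In typical runs the annealed copies end at lower average energy than the cold ones, though neither procedure guarantees the global minimum.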

I observe bistability in both domains. Neurons can exhibit hysteresis: the same input produces different responses depending on prior state, separated by phase-space separatrices analogous to activation barriers between chemical states. Equilibrium points—where the derivative vanishes—determine long-term behavior. Some are stable attractors (resting states), others unstable saddles (thresholds). Nullclines partition phase space into behavioral regimes, just as energy landscapes partition chemical configuration space.
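
As an illustration of those equilibrium points, here is a minimal sketch for a hypothetical self-exciting continuous rate unit, \( \dot{x} = -x + \sigma(wx + I) \); the values of w and I are assumptions chosen to place the unit in a bistable regime. The sketch locates the points where the derivative vanishes and classifies each by the local slope.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical self-exciting rate unit: dx/dt = -x + sigmoid(w*x + I).
# Parameters are assumptions chosen so the unit sits in a bistable regime.
w, I = 10.0, -4.8

def dxdt(x):
    return -x + sigmoid(w * x + I)

# Scan a fine grid and look for sign changes of dx/dt (candidate equilibria).
xs = np.linspace(-0.5, 1.5, 200_001)
f = dxdt(xs)
crossings = xs[:-1][f[:-1] * f[1:] < 0]

for x0 in crossings:
    # Stability from the slope of dx/dt at the equilibrium: negative => attractor.
    slope = (dxdt(x0 + 1e-6) - dxdt(x0 - 1e-6)) / 2e-6
    kind = "stable attractor" if slope < 0 else "unstable threshold"
    print(f"x* ~ {x0:.3f}  ({kind})")
```

The two stable attractors are the unit's memory of its prior state; the unstable point between them is the threshold a transient input must cross to switch it.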

Measurement Parallels

My conclusion remains modest and methodical: the mathematics transcends the substrate. Whether atoms emitting radiation or artificial neurons flipping states, stochastic systems governed by energy minimization follow identical statistical laws. I measured this in radium through years of careful electrometer readings. Researchers now measure it in neural networks through systematic observation of learning dynamics.

The distribution I recognize from radioactive decay governs thought—or at least, governs machines learning to approximate thought. This is not reduction but recognition: different phenomena, same fundamental structure. Science reveals such unifying patterns through persistent measurement and willingness to see connections others dismiss as coincidence.
