The Bit and the Boltzmann: Information, Entropy, and Uncertainty

When John von Neumann suggested I call it “entropy,” I hesitated. The term already belonged to thermodynamics—to Boltzmann, Gibbs, and the second law. But the mathematical form was identical: $H = -\sum p_i \log p_i$. The fundamental problem of communication, I realized, shared deep structure with the fundamental problem of statistical mechanics. Both concern uncertainty. Both quantify disorder. Both set theoretical limits on what is possible. What began as an analogy revealed itself as something deeper: a universal principle governing information, whether transmitted through copper wires, thermal fluctuations, or synaptic potentials.

The bit is the fundamental unit. A binary choice. Zero or one. Heads or tails. The resolution of uncertainty by a factor of two. Everything else follows from this foundation—compression algorithms, error-correcting codes, machine learning objectives, and the computational architecture of nervous systems.

Entropy: The Average Surprisal

Consider a probability distribution over discrete outcomes. Under the frequentist view, probabilities are long-run frequencies. But for unique events—tomorrow’s weather—this fails. The Bayesian interpretation treats probability as degree of belief: I assign 70% to rain, 30% to sun, constrained by probabilities summing to unity. This shift from frequency to epistemic uncertainty is essential for information theory.

How much information do I gain when observing the outcome? If I was certain rain would fall ($p = 1$), observing it provides no information—no surprise. But if rain seemed unlikely ($p = 0.1$), its occurrence is highly surprising. The natural measure is surprisal: $I(x) = -\log_2 p(x)$.

Why the logarithm? Independent events should have additive information. Since probabilities multiply for independent events, we need a function converting multiplication to addition—the logarithm. And surprise should be inversely related to probability. The negative log satisfies both elegantly.

For a fair coin, $p(\text{heads}) = 0.5$, so $I = -\log_2(0.5) = 1$ bit. Each flip resolves exactly one bit of uncertainty. For a biased coin landing heads with probability 0.99, $I(\text{heads}) \approx 0.014$ bits—almost no surprise. But the rare tails delivers $I(\text{tails}) \approx 6.64$ bits.
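A minimal Python sketch (the helper name `surprisal` is just illustrative) reproduces these coin numbers:

```python
import numpy as np

def surprisal(p: float) -> float:
    """Information content, in bits, of an outcome with probability p."""
    return -np.log2(p)

print(surprisal(0.5))   # fair coin: 1.0 bit per flip
print(surprisal(0.99))  # biased coin, heads: ~0.014 bits
print(surprisal(0.01))  # biased coin, tails: ~6.64 bits
```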

Entropy $H(X)$ is the expected surprisal—average information across all outcomes, weighted by their probabilities:

$$H(X) = \mathbb{E}[I(X)] = -\sum_{x} p(x) \log_2 p(x)$$

This is the heart of information theory. A uniform distribution maximizes entropy: when all outcomes are equally likely, uncertainty is maximal. A delta function minimizes it: $H = 0$ when one outcome is certain. Crucially, entropy sets the fundamental limit on data compression. You cannot encode a stream below $H$ bits per symbol on average without loss—the source coding theorem.
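The following sketch (again with an illustrative helper name) computes entropy directly from the definition and checks the two extremes:

```python
import numpy as np

def entropy(p) -> float:
    """Shannon entropy in bits; outcomes with p(x) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits (maximal)
print(entropy([0.99, 0.01]))              # biased coin: ~0.08 bits
print(entropy([1.0, 0.0]))                # delta function: 0.0 bits (certainty)
```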

Cross-Entropy and Distribution Distance

Suppose I build a model believing a coin is fair—$q(\text{heads}) = q(\text{tails}) = 0.5$—but reality follows distribution $p$, perhaps heavily biased. What happens when I use my model to encode outcomes generated by reality?

Cross-entropy $H(p, q)$ measures average surprise when nature generates from $p$ but I interpret using $q$:

$$H(p, q) = -\sum_{x} p(x) \log_2 q(x)$$

Notice the asymmetry: probabilities come from reality ($p$), but the log operates on my model ($q$). If $p = q$, cross-entropy reduces to ordinary entropy. But if my model is wrong, I suffer extra surprise. Cross-entropy never falls below the true entropy, and exceeds it whenever the model is mistaken.
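To see the penalty concretely, here is a small sketch (the distribution values are chosen arbitrarily) comparing the cost of a correct model against a wrong one:

```python
import numpy as np

def cross_entropy(p, q) -> float:
    """Average bits paid when reality follows p but the code is built for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

p = [0.9, 0.1]  # reality: a heavily biased coin
q = [0.5, 0.5]  # my model: a fair coin
print(cross_entropy(p, p))  # true entropy H(p) ~ 0.47 bits
print(cross_entropy(p, q))  # cross-entropy H(p, q) = 1.0 bit: the price of a wrong model
```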

The difference is KL divergence:

$$D_{\text{KL}}(p \| q) = H(p, q) - H(p) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}$$

This measures information lost when approximating $p$ with $q$. Always non-negative, zero only when distributions are identical. Not symmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$. Direction matters.
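Continuing the same toy example, a short sketch makes the non-negativity and the asymmetry visible:

```python
import numpy as np

def kl_divergence(p, q) -> float:
    """D_KL(p || q) in bits: information lost when q approximates p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_divergence(p, q))  # ~0.53 bits, equal to H(p, q) - H(p)
print(kl_divergence(q, p))  # ~0.74 bits: the direction matters
print(kl_divergence(p, p))  # 0.0: identical distributions
```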

In machine learning, we minimize cross-entropy loss during training. Since $H(p)$ doesn’t depend on model parameters, minimizing $H(p, q)$ is equivalent to minimizing $D_{\text{KL}}(p \| q)$—making the model distribution match the data distribution. This is the training objective for generative models: learn a distribution that makes data unsurprising. When a language model assigns high probability to the next token that actually appears, cross-entropy is low. When it confidently predicts the wrong token, cross-entropy spikes. Training drives this quantity downward, forcing the model’s beliefs toward reality.
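As a sketch of that objective (plain NumPy rather than any particular framework, with toy logits over a hypothetical four-token vocabulary), the cross-entropy loss for a single next-token prediction looks like this:

```python
import numpy as np

def cross_entropy_loss(logits, target_index) -> float:
    """Negative log-probability (in nats, as is conventional for training losses)
    that the softmax over the logits assigns to the observed token."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max()                # subtract the max for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax
    return float(-log_probs[target_index])

# The token that actually appears is index 2.
print(cross_entropy_loss([0.1, 0.2, 5.0, 0.3], target_index=2))  # low: the model expected it
print(cross_entropy_loss([5.0, 0.2, 0.1, 0.3], target_index=2))  # high: confident and wrong
```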

From a Bayesian perspective, KL divergence quantifies information gain when updating from prior $q$ to posterior $p$—the bits of evidence shifting our beliefs. It measures the cost of holding wrong beliefs and the value of gathering data.

Boltzmann’s Partition and Shannon’s Bit

The mathematical resemblance between Shannon entropy and thermodynamic entropy runs deeper than notation. Both arise from counting configurations weighted by probability. Boltzmann showed that the probability of finding a system in state $s$ with energy $E_s$ at temperature $T$ follows:

$$p(s) = \frac{\exp(-E_s / T)}{Z}$$

where $Z = \sum_s \exp(-E_s / T)$ is the partition function, ensuring probabilities sum to unity. Lower-energy states are exponentially more probable. Temperature controls sharpness: as $T \to 0$, the system collapses into the ground state; as $T \to \infty$, all states become equally likely.
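A short sketch with arbitrary energy levels shows this temperature dependence:

```python
import numpy as np

def boltzmann(energies, T):
    """Boltzmann distribution over states with the given energies at temperature T."""
    w = np.exp(-np.asarray(energies, dtype=float) / T)
    return w / w.sum()  # dividing by the partition function Z makes the probabilities sum to one

energies = [0.0, 1.0, 2.0, 3.0]
print(boltzmann(energies, T=0.1))    # ~[1, 0, 0, 0]: collapse into the ground state
print(boltzmann(energies, T=1.0))    # lower energies exponentially favored
print(boltzmann(energies, T=100.0))  # ~uniform: all states nearly equally likely
```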

This governs physical systems at equilibrium and artificial neural networks designed as energy-based models. Boltzmann machines update neurons stochastically rather than deterministically. Instead of flipping to minimize energy, a unit computes the probability of being “on” from the energy difference:

$$p(x_i = 1) = \sigma\left(2 \sum_j w_{ij} x_j\right)$$

where $\sigma$ is the sigmoid function. Over many asynchronous updates, the network samples from the equilibrium Boltzmann distribution defined by the energy function $E = -\sum_{ij} w_{ij} x_i x_j$.
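A minimal sketch of this stochastic update rule, assuming binary units, symmetric weights with zero diagonal, and the energy function above (the weight values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(x, W):
    """One asynchronous sweep: each binary unit is resampled from
    p(x_i = 1) = sigmoid(2 * sum_j W[i, j] * x[j])."""
    for i in range(len(x)):
        p_on = sigmoid(2.0 * W[i] @ x)  # the i-th term vanishes because W[i, i] = 0
        x[i] = 1.0 if rng.random() < p_on else 0.0
    return x

n = 5
W = rng.normal(0.0, 0.5, size=(n, n))
W = (W + W.T) / 2.0       # symmetric weights
np.fill_diagonal(W, 0.0)  # no self-connections
x = rng.integers(0, 2, size=n).astype(float)
for _ in range(1000):     # many sweeps approach the equilibrium Boltzmann distribution
    x = gibbs_sweep(x, W)
print(x)
```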

Learning amounts to adjusting weights so model distribution matches data distribution—minimizing KL divergence. We clamp visible units to training data and let hidden units explore, inferring latent representations. Then both layers run freely, generating fantasy samples. Weight updates nudge the energy landscape to make real data more probable.

Restricted Boltzmann Machines (RBMs) forbid lateral connections: visible units connect only to hidden units. This bipartite structure enables efficient Gibbs sampling—all units in one layer update in parallel given the other. RBMs became workhorses for unsupervised feature learning, serving as building blocks for deep belief networks, bridging associative memory and modern generative AI.
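A sketch of that block Gibbs step, using the standard bipartite RBM energy with visible and hidden biases (layer sizes and weight scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def block_gibbs_step(v, W, b_v, b_h):
    """All hidden units update in parallel given the visible layer,
    then all visible units update in parallel given the hidden layer."""
    p_h = sigmoid(v @ W + b_h)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_v)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

n_visible, n_hidden = 6, 3
W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
v = rng.integers(0, 2, size=n_visible).astype(float)
for _ in range(100):
    v, h = block_gibbs_step(v, W, b_v, b_h)
print(v, h)
```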

The conceptual link is profound. Helmholtz free energy $F = E - TS$, minimized at thermodynamic equilibrium, has the same form as the variational free energy of approximate inference (the negative of the evidence lower bound). Partition functions encode normalization—total probability mass across configurations. Entropy in both domains quantifies the volume of phase space consistent with macroscopic constraints.
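As a small illustration of that link (discrete states, units chosen so the temperature factor is explicit), the free energy $F = \langle E \rangle_p - T\,H(p)$ of any distribution over states is minimized exactly by the Boltzmann distribution, where it equals $-T \log Z$:

```python
import numpy as np

def free_energy(p, energies, T):
    """F = <E>_p - T * H(p), with entropy measured in nats."""
    p, energies = np.asarray(p, dtype=float), np.asarray(energies, dtype=float)
    nz = p > 0
    avg_energy = np.sum(p * energies)
    H = -np.sum(p[nz] * np.log(p[nz]))
    return float(avg_energy - T * H)

energies, T = np.array([0.0, 1.0, 2.0]), 1.0
Z = np.exp(-energies / T).sum()
boltz = np.exp(-energies / T) / Z
print(free_energy(boltz, energies, T))            # equals -T * log(Z): the minimum
print(-T * np.log(Z))
print(free_energy([1/3, 1/3, 1/3], energies, T))  # any other distribution scores higher
```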

Neural Criticality and Information Flow

If entropy measures uncertainty and channel capacity bounds transmission, where do brains operate? The critical brain hypothesis proposes that neural networks self-organize toward a phase transition boundary—the critical point between ordered and disordered dynamics.

Consider neurons arranged in layers, each connected to descendants with branching ratio $\sigma$. In the subcritical regime ($\sigma < 1$), activity dies out before reaching output. An observer learns nothing—signal vanishes, information is lost. In the supercritical regime ($\sigma > 1$), activity amplifies until neurons saturate. Every input produces maximal output, making discrimination impossible. Information transmission collapses.

At criticality ($\sigma = 1$), each neuron activates exactly one descendant on average. This balance allows signals to propagate without vanishing or saturating. Output patterns reliably reflect input patterns, maximizing mutual information $I(X; Y) = H(X) - H(X | Y)$—the reduction in uncertainty about input $X$ after observing output $Y$. This is channel capacity: the maximum bits flowing reliably through a noisy channel.
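A branching-process sketch (Poisson offspring per active neuron, cut off after a fixed number of steps) illustrates the three regimes:

```python
import numpy as np

rng = np.random.default_rng(0)

def avalanche_size(sigma, n_steps=50):
    """Total activity triggered by one seed neuron when each active neuron
    activates a Poisson(sigma) number of descendants at the next step."""
    active, total = 1, 1
    for _ in range(n_steps):
        active = rng.poisson(sigma * active)  # expected descendants per neuron = sigma
        total += active
        if active == 0:
            break
    return total

for sigma in (0.8, 1.0, 1.2):
    sizes = [avalanche_size(sigma) for _ in range(2000)]
    print(f"sigma = {sigma}: mean avalanche size ~ {np.mean(sizes):.1f}")
# Subcritical activity dies out quickly; supercritical activity blows up
# (capped here only by n_steps); sigma = 1 balances the two.
```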

Criticality confers computational advantages. Dynamic range is maximized: the network responds across wide stimulus intensities. Correlation length diverges: distant regions communicate despite local connectivity. Power-law distributions emerge: activity avalanches exhibit scale-free statistics with no characteristic size, enabling simultaneous processing across temporal and spatial scales. Sensitivity peaks: small input changes produce detectable output shifts.

The brain achieves criticality through active regulation. Excitation-inhibition balance is dynamically tuned via synaptic scaling, homeostatic plasticity, and neuromodulation. Evolution selected architectures maintaining near-critical dynamics because computational benefits—maximal information capacity, optimal learning, flexible state transitions—outweigh metabolic costs.

Information theory, statistical mechanics, and neuroscience converge here. Entropy quantifies uncertainty in communication channels, thermodynamic systems, and neural activity. The bit measures information flow through wires, energy levels, or synaptic connections. The fundamental limits—compression bounds, partition function normalization, channel capacity—arise from identical mathematical structure.

When I chose “entropy” despite its thermodynamic pedigree, I suspected the connection was more than metaphor. Decades later, energy-based models and critical neural dynamics confirm it. Information is physical. The resolution of uncertainty, whether flipping a coin or a neuron firing, obeys the same logarithmic law. The bit and the Boltzmann speak the same language.
