Sampling the Future: MCTS as Information Compression

Information as Uncertainty Reduction

When I formulated my theory of communication in 1948, I defined information precisely: it is the resolution of uncertainty. A bit—the fundamental unit—represents one binary choice, one yes-or-no question answered. Information doesn’t exist in isolation; it exists as selection from possibility space.

This is crucial. Information measures how much a message narrows down what the receiver doesn’t know. Before communication, uncertainty exists. After communication, some portion of that uncertainty collapses. The amount of collapse, quantified in bits, is the information transmitted.

Consider entropy: $H = -\sum_x p(x) \log_2 p(x)$. This formula measures average uncertainty across all possible outcomes. High entropy means high uncertainty—many equally probable states. Low entropy means low uncertainty—one state dominates. Communication transmits bits to reduce entropy at the receiving end.
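A few lines of Python make the measure concrete (the distributions are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p(x) * log2 p(x)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0: a fair coin, maximal uncertainty
print(entropy([0.9, 0.1]))  # ~0.47: a biased coin carries less per flip
print(entropy([1.0]))       # 0.0: a certain outcome resolves nothing
```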

Monte Carlo Tree Search operates identically. Before running simulations, a chess position contains enormous uncertainty about its value. Is this position winning? Losing? The game tree contains roughly $10^{120}$ possible continuations—impossible to enumerate. Each random rollout is a transmission from the future. It samples one possible trajectory, observes the outcome (win or loss), and sends that bit of information back to the present.

More rollouts equal more bits. A single rollout provides minimal information—high variance, low confidence. One hundred rollouts begin revealing patterns. One thousand rollouts substantially reduce uncertainty about position value. The statistical aggregate compresses the incomprehensible complexity of future possibilities into a tractable estimate: this move wins 64% of sampled futures.
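A minimal sketch of this aggregation, treating each rollout as a Bernoulli draw from a hidden win probability (the 64% figure is illustrative):

```python
import random

def rollout(p_win=0.64):
    """Stand-in for one random playout: returns 1 for a win, 0 for a loss.
    The true win probability is hidden from the estimator."""
    return 1 if random.random() < p_win else 0

random.seed(0)
for n in (10, 100, 1000, 10000):
    wins = sum(rollout() for _ in range(n))
    p_hat = wins / n
    # The standard error of a Bernoulli mean shrinks as 1/sqrt(n):
    # each fourfold increase in rollouts halves the remaining noise.
    stderr = (p_hat * (1 - p_hat) / n) ** 0.5
    print(f"{n:>5} rollouts: win rate {p_hat:.3f} +/- {stderr:.3f}")
```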

This is information theory applied to decision-making. The “message” is the future game outcome. The “channel” is the simulation process. The “receiver” is the move selection algorithm. And just as my channel capacity theorem showed fundamental limits to reliable communication, there are fundamental limits to how much information each rollout can provide about true position value.

The Lossy Game Tree

Chess and Go present an engineering problem: the state space is too large to store. A complete game tree for chess would contain far more nodes than there are atoms in the observable universe. This is a classic compression problem—how do you represent essential information while discarding redundant detail?

MCTS solves this through lossy compression. Like JPEG discards high-frequency image components humans can’t perceive, MCTS discards game tree branches that likely don’t matter. The algorithm keeps high-information paths—moves that frequently lead to wins—and prunes low-information paths—moves that consistently fail.

The compression mechanism is statistical sampling. Instead of encoding every possible future explicitly, MCTS samples representative futures and aggregates their statistics. This is source coding: transforming a massive, unwieldy source (the complete game tree) into a compact, manageable representation (win-rate statistics for promising branches).

The Upper Confidence Bound formula allocates samples intelligently. It balances exploitation (sample moves we know are good) against exploration (sample moves we’re uncertain about). This is adaptive rate allocation. High-uncertainty nodes are high-entropy regions—they need more samples to reduce uncertainty. Low-uncertainty nodes are low-entropy regions—we’ve already extracted most available information.
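A sketch of the standard UCB1 rule that UCT-style MCTS uses for this allocation (the exploration constant and visit counts are illustrative):

```python
import math

def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCB1: mean value (exploitation) plus an uncertainty bonus
    (exploration) that grows for under-sampled moves."""
    if visits == 0:
        return float("inf")  # unvisited moves are always tried first
    exploit = wins / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

# A well-sampled strong move vs. a barely sampled unknown one.
print(ucb1(64, 100, 130))  # solid estimate, small bonus (~0.95)
print(ucb1(1, 3, 130))     # noisy estimate, large bonus (~2.13)
```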

Think of it as adaptive bandwidth. In noisy communication channels, you send redundant information to ensure reliability. In MCTS, you send redundant samples to high-entropy positions to ensure accurate value estimates. Clear positions—low entropy—need few samples. Complex positions—high entropy—demand many samples. The algorithm automatically allocates its “bitrate budget” based on where information is most needed.

This sampling budget is the bottleneck. Give MCTS infinite samples, and it approaches perfect play—full resolution of uncertainty. Constrain it to few samples, and you get high distortion—poor approximation of true values. This is the fundamental rate-distortion trade-off. More rate (more samples) means less distortion (better estimates). Less rate means accepting higher distortion. Optimal play requires finding the right operating point on this curve.
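The curve can be read directly off the variance of the Monte Carlo estimator; a small sketch, assuming the distortion of interest is the variance of the estimated win rate:

```python
# Distortion of the Monte Carlo value estimate, taken here as the
# variance of the estimated win rate: D(n) = p(1 - p) / n.
p = 0.64  # illustrative true win rate
for n in (125, 250, 500, 1000, 2000):
    print(f"rate = {n:>4} samples -> distortion = {p * (1 - p) / n:.6f}")
# Doubling the rate halves the distortion; zero distortion would require
# unbounded rate, so a finite compute budget fixes the operating point.
```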

Neural Source Codes

AlphaGo’s breakthrough wasn’t MCTS alone—Monte Carlo evaluation of games dates to Abramson’s 1987 expected-outcome model, and the tree search itself to Coulom’s 2006 formulation. The breakthrough was combining MCTS with neural networks that functioned as efficient source codes.

The value network is pure source coding. It compresses an entire game subtree—billions of potential positions—into a single scalar: “this position is worth 0.73.” This is radical compression. Instead of simulating thousands of rollouts to estimate position value, query the neural network. It returns an approximate answer instantly. High distortion, certainly—neural networks are imperfect. But the compression ratio is extraordinary: billions of positions reduced to one floating-point number.

The policy network compresses the move space—a second act of source coding. It takes the current board position and outputs a probability distribution over all legal moves. Instead of considering Go’s roughly 250 legal moves equally, the policy network says: “Move A has 35% probability, Move B has 18% probability, Move C has 0.001% probability.” It encodes human strategic intuition—what moves make sense in this position—into a compact probabilistic code.
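A minimal sketch of such a probabilistic code, with hypothetical logits standing in for a real network’s output:

```python
import math

def policy_prior(logits):
    """Softmax over move logits: a compact probabilistic code over moves."""
    peak = max(logits.values())
    exps = {move: math.exp(v - peak) for move, v in logits.items()}
    total = sum(exps.values())
    return {move: e / total for move, e in exps.items()}

# Hypothetical logits for a handful of the ~250 legal moves.
priors = policy_prior({"A": 2.5, "B": 1.8, "C": 0.4, "D": -3.0})
for move, p in sorted(priors.items(), key=lambda kv: -kv[1]):
    print(f"move {move}: prior {p:.3f}")  # search effort follows this mass
```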

These networks were initially trained on human games, learning to compress human strategic knowledge. But AlphaGo Zero revealed something profound: human games were noisy channels. Training on human data introduced distortion—suboptimal patterns, stylistic biases, strategic blind spots. When AlphaGo Zero learned purely from self-play, starting from random initialization, it discovered cleaner compression schemes. Strategies that were more fundamental, less contaminated by human conventions.

This is remarkable. It suggests that the optimal source code for Go strategy is not derived from human expertise but discovered through pure reinforcement learning. The self-play process acts as an optimization algorithm, searching for the most efficient encoding of the value function and policy distribution. Human knowledge, rather than being the gold standard, was a local optimum that constrained exploration.

MCTS then refines these neural codes. The value network gives a fast, approximate estimate. MCTS performs slower, more accurate sampling. This is a rate-distortion trade-off in action: neural networks operate at low rate (fast inference) with high distortion (approximate values). MCTS operates at high rate (slow simulation) with low distortion (accurate statistics). Combining them gives you both speed and precision.
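A sketch of the PUCT-style selection rule the AlphaGo family uses to combine the two codes; the constant and statistics here are illustrative:

```python
import math

def puct_score(q_value, prior, node_visits, parent_visits, c_puct=1.5):
    """PUCT-style selection: the searched value estimate plus an
    exploration bonus scaled by the policy network's prior."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + node_visits)
    return q_value + bonus

# The prior steers early exploration; as visits accumulate, the
# measured value dominates and the statistics sharpen.
print(puct_score(q_value=0.73, prior=0.35, node_visits=40, parent_visits=100))
print(puct_score(q_value=0.50, prior=0.02, node_visits=2, parent_visits=100))
```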

Tree-of-thought prompting in language models applies identical principles. The base model provides fast, intuitive responses—low-rate, high-distortion source coding. Tree-of-thought forces the model to sample multiple reasoning paths, evaluate alternatives, and aggregate results—higher rate, lower distortion. Test-time compute scaling proves that investing more cycles at inference extracts more information from the model’s latent knowledge, reducing uncertainty about the correct answer.
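A minimal sketch of this sampling-and-aggregation pattern, with a toy stochastic oracle standing in for an actual language model:

```python
import random
from collections import Counter

def sample_reasoning_path(question):
    """Placeholder for one sampled chain of reasoning from a language
    model; here a toy oracle that is right 70% of the time."""
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

random.seed(1)
# Self-consistency: sample many independent paths, aggregate by majority vote.
votes = Counter(sample_reasoning_path("q") for _ in range(25))
print(votes.most_common())  # the modal answer is far more reliable than one sample
```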

Channel Capacity of Thought

My channel capacity theorem states: $C = B \log_2(1 + S/N)$, where $C$ is capacity in bits per second, $B$ is bandwidth, $S$ is signal power, and $N$ is noise power. This quantifies the maximum rate at which information can be reliably transmitted through a noisy channel.

Intelligence is channel capacity optimization.

Consider the signal-to-noise ratio term. Better priors are higher signal relative to noise. If your value network accurately estimates positions, each MCTS sample carries more information—high signal. If your value network is random noise, samples tell you little—low signal, high noise. AlphaGo Zero’s self-play training improved signal-to-noise ratio by discovering better value functions through iterative refinement.

Bandwidth corresponds to sample budget. More MCTS rollouts mean higher bandwidth—more information transmitted from simulated futures to present decision-making. But bandwidth isn’t free. Each rollout costs computation. The engineering question becomes: given finite compute budget, how do you maximize information extracted?

This is where test-time compute scaling becomes critical. For decades, AI progress meant bigger models—more parameters, more training data, more pretraining compute. But that’s optimizing the source encoder. Test-time compute optimizes the channel. It says: given a trained model (a source code), invest more inference cycles (higher bandwidth) to extract more information (reduce decision uncertainty).

The logarithmic relationship in my capacity formula explains why this works. In the high-SNR regime, doubling signal power (better priors from bigger models) adds only about one bit per second per hertz. Doubling bandwidth (more thinking time), with the signal-to-noise ratio held fixed, doubles capacity outright. There are two axes of scaling: better source codes (bigger models) and higher-bandwidth channels (more test-time compute). Recent results show the second axis—more thinking time—can be more efficient than the first.
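A quick check of both axes with the capacity formula, using illustrative units:

```python
import math

def capacity(bandwidth, snr):
    """Shannon-Hartley capacity in bits per second: C = B log2(1 + S/N)."""
    return bandwidth * math.log2(1 + snr)

B, snr = 1.0, 15.0              # illustrative units
print(capacity(B, snr))         # 4.0  (baseline)
print(capacity(B, 2 * snr))     # ~4.95: doubling signal power buys ~1 bit
print(capacity(2 * B, snr))     # 8.0:  doubling bandwidth doubles capacity
```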

Human intelligence operates on these principles. We don’t exhaustively search all possibilities—that would require infinite bandwidth. We compress experience into heuristics—efficient source codes. We sample strategically—allocate our limited cognitive bandwidth to high-uncertainty decisions. When facing complex problems, we think longer—increase our inference-time bandwidth. When facing simple problems, we rely on cached intuitions—low-rate source coding suffices.

MCTS, neural networks, tree-of-thought prompting—these are all information-theoretic systems. They compress impossibly large decision spaces into tractable representations. They sample strategically to reduce uncertainty efficiently. They balance rate and distortion based on problem complexity.

Intelligence isn’t exhaustive search through all possibilities. That’s computationally infeasible and information-theoretically wasteful. Intelligence is optimal compression: building efficient source codes through learning, allocating bandwidth adaptively through search, and extracting maximum information from minimum resources.

The future we sample may be random, but the information it provides is precious. Each rollout, each reasoning step, each bit of evidence reduces the entropy of our decisions. What MCTS revealed—and what information theory predicted—is that thinking is fundamentally an act of compression. We cannot know all futures. But we can sample enough of them to make the right choice, most of the time.

That is the channel capacity of thought.
