Removing Redundancy: Data Compression and the Entropy Bound
The fundamental problem of data compression is this: how much can you squeeze information without losing it? My answer, formulated in 1948, established a precise mathematical boundary—the entropy bound—that separates the possible from the impossible.
The Redundancy in Language
English text is remarkably redundant. Delete the vowels from “The quick brown fox jumps over the lazy dog” and you still read “Th qck brwn fx jmps vr th lzy dg” without difficulty. This redundancy arises from three sources, each contributing statistical predictability that compression algorithms exploit.
First, letter frequencies are wildly unequal. In English, ‘E’ appears roughly 12.7% of the time while ‘Z’ appears a mere 0.07%—a difference of two orders of magnitude. This non-uniformity means that if you assign short codes to common letters and long codes to rare ones, you reduce average message length. Morse code implements precisely this principle: ‘E’ is a single dot while ‘Z’ is dash-dash-dot-dot, a four-element sequence. This frequency-based optimization predated my formal theory by a century, but it embodies the core insight: exploit statistical structure.
Second, context constrains possibilities. After ‘Q’ in English, ‘U’ follows with near certainty. After “TH”, the next letter is likely ‘E’, ‘A’, or ‘I’. The trigram “THE” dominates English text. These conditional dependencies mean that knowing what came before dramatically reduces uncertainty about what comes next. A symbol’s information content depends on its context—a fact that context-modeling compressors leverage extensively.
Third, grammar imposes structural constraints. “The cat sat” is valid; “cat the sat” is not. Syntactic rules eliminate vast swaths of the possibility space. When I conducted experiments in 1951 asking subjects to predict the next letter given increasingly long context, I measured the entropy of English at approximately 1 to 1.5 bits per letter. Compare this to random text with uniform letter distribution: log₂(26) ≈ 4.7 bits per letter. English carries less than a third the information of random letter sequences.
ASCII text stores each letter in an 8-bit byte—five to eight times more than necessary. The gap between 8 bits and 1.5 bits is pure redundancy, and compression closes that gap.
Entropy and the Compression Limit
Entropy H(X) = -Σ p(x)log₂p(x) quantifies the irreducible information content per symbol. Consider coin flips: a fair coin has H = 1 bit, the maximum for a binary variable. A biased coin landing heads 90% of the time has H ≈ 0.47 bits—less uncertainty, less information. The more predictable the source, the lower its entropy.
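To make the formula concrete, here is a minimal Python sketch (the function name is mine) that reproduces the coin figures and the uniform-letter baseline:

```python
import math

def entropy(probs):
    """H(X) = -sum p(x) * log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))     # biased coin: ~0.469 bits
print(entropy([1 / 26] * 26))  # uniform letters: log2(26) ~ 4.70 bits
```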
For English letters, if all 26 were equally likely, entropy would be log₂(26) ≈ 4.7 bits. But actual letter frequencies are unequal, reducing first-order entropy (considering only individual letter probabilities) to about 4.1 bits. When we account for context—bigrams, trigrams, word structure, grammar—entropy drops to approximately 1 to 1.5 bits per letter. A 1000-letter English text at H = 1.2 bits/letter contains about 1200 bits of information, compressible from 1000 bytes (8000 bits in ASCII) to about 150 bytes, a compression ratio approaching 6.7:1.
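The arithmetic behind these figures can be checked with a short script; the letter frequencies below are approximate, commonly cited values used only for illustration:

```python
import math

# Approximate English letter frequencies in percent (commonly cited values,
# included here only for illustration); normalized before use.
freq_pct = {
    "E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7, "S": 6.3,
    "H": 6.1, "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8, "U": 2.8, "M": 2.4,
    "W": 2.4, "F": 2.2, "G": 2.0, "Y": 2.0, "P": 1.9, "B": 1.5, "V": 1.0,
    "K": 0.8, "J": 0.15, "X": 0.15, "Q": 0.1, "Z": 0.07,
}
total = sum(freq_pct.values())
probs = [v / total for v in freq_pct.values()]
first_order = -sum(p * math.log2(p) for p in probs)
print(f"first-order entropy: {first_order:.2f} bits/letter")  # ~4.2, near the figure above

# The 1000-letter example: 8 bits per letter stored vs. ~1.2 bits of content.
stored_bits, content_bits = 1000 * 8, 1000 * 1.2
print(f"{stored_bits // 8} bytes -> about {content_bits / 8:.0f} bytes "
      f"({stored_bits / content_bits:.1f}:1)")                # 1000 -> ~150 bytes, ~6.7:1
```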
My source coding theorem establishes the fundamental limit: you can compress a source to H bits per symbol on average, losslessly, but you cannot compress below H without information loss. This is not a limitation of current algorithms—it is a mathematical certainty, as fundamental as the impossibility of perpetual motion.
The proof relies on typical sequences. Most messages from a source cluster around a characteristic probability, and these typical sequences can be encoded efficiently. Atypical sequences—those with improbable patterns—require longer codes, but they occur rarely enough that the average length converges to entropy. Compression is fundamentally the process of identifying and removing predictable patterns, retaining only the unpredictable residue.
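A small simulation makes “typical” concrete for the 90/10 coin from earlier: the per-symbol surprisal of long random sequences clusters tightly around H, which is why coding near H bits per symbol suffices on average. This is an illustrative sketch, not part of the original argument:

```python
import math
import random

p_heads = 0.9
H = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))  # ~0.469 bits

random.seed(0)
n = 10_000
for _ in range(5):
    heads = sum(random.random() < p_heads for _ in range(n))
    # Per-symbol surprisal, -log2 P(sequence) / n, for this particular outcome.
    bits = -(heads * math.log2(p_heads) + (n - heads) * math.log2(1 - p_heads)) / n
    print(f"{bits:.3f} bits/symbol  (entropy H = {H:.3f})")
```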
The best modern text compressors, built on context mixing as in the PAQ family, achieve rates within a few percent of estimated entropy for natural language. For 1000 letters of English, we approach the 150-byte barrier from above, but we can never breach it losslessly.
Compression Algorithms Approaching the Bound
Huffman coding, introduced in 1952, constructs a binary tree assigning short codes to frequent symbols and long codes to rare ones. For a source with symbols A (50%), B (30%), C (15%), and D (5%), Huffman assigns: A=‘0’ (1 bit), B=‘10’ (2 bits), C=‘110’ (3 bits), D=‘111’ (3 bits). Average length: 0.5×1 + 0.3×2 + 0.15×3 + 0.05×3 = 1.7 bits, compared to 2 bits for fixed-length encoding and an entropy of about 1.65 bits. Huffman coding is optimal for integer-length codes but cannot achieve fractional bit lengths when probabilities aren’t powers of 1/2.
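The construction is short enough to sketch in Python (heap-based pairing of the two lightest subtrees; names are mine, and tie-breaking may permute the exact codewords without changing their lengths):

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code for a dict mapping symbol -> probability."""
    # Heap entries: (weight, tie-breaker, {symbol: code-so-far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, lighter = heapq.heappop(heap)
        w2, _, heavier = heapq.heappop(heap)
        # Prepend '0' to codes in the lighter subtree, '1' in the other.
        merged = {s: "0" + c for s, c in lighter.items()}
        merged.update({s: "1" + c for s, c in heavier.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}
codes = huffman_codes(freqs)
avg_len = sum(freqs[s] * len(c) for s, c in codes.items())
print(codes)                                 # code lengths 1, 2, 3, 3 (exact bits depend on ties)
print(f"average length: {avg_len:.2f} bits") # 1.70 bits vs. entropy ~1.65
```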
Arithmetic coding overcomes this limitation by encoding an entire message as a single interval within [0,1), narrowing the interval based on cumulative symbol probabilities. It can achieve entropy exactly, squeezing out every last bit of redundancy permitted by the distribution. Arithmetic coding underpins many modern compression standards.
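A toy version of the interval narrowing, using plain floating point and therefore suitable only for short messages (real coders work with integer ranges and renormalization), might look like this:

```python
def cumulative_ranges(probs):
    """Map each symbol to its [low, high) slice of the unit interval."""
    ranges, cum = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (cum, cum + p)
        cum += p
    return ranges

def arithmetic_encode(message, probs):
    """Toy float-precision encoder: narrow [low, high) once per symbol."""
    ranges = cumulative_ranges(probs)
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        lo, hi = ranges[sym]
        low, high = low + span * lo, low + span * hi
    return (low + high) / 2  # any number inside the final interval identifies the message

def arithmetic_decode(value, probs, length):
    """Recover `length` symbols by repeatedly locating `value` within the sub-intervals."""
    ranges = cumulative_ranges(probs)
    out, low, high = [], 0.0, 1.0
    for _ in range(length):
        span = high - low
        for sym, (lo, hi) in ranges.items():
            s_low, s_high = low + span * lo, low + span * hi
            if s_low <= value < s_high:
                out.append(sym)
                low, high = s_low, s_high
                break
    return "".join(out)

probs = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}
code = arithmetic_encode("ABAC", probs)
print(code, arithmetic_decode(code, probs, 4))  # a single number in [0,1) round-trips to 'ABAC'
```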
Lempel-Ziv algorithms (LZ77, LZ78) take a different approach: dictionary-based compression. Instead of encoding individual symbols, they replace repeated substrings with references to earlier occurrences. Encountering “the cat in the hat,” LZ algorithms recognize that “the” appears twice and encode the second occurrence as a pointer to the first. ZIP and GZIP use DEFLATE, which combines LZ77 with Huffman coding—first finding repetitions, then entropy-coding the result.
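A deliberately naive LZ77-style parser (greedy search only, with none of DEFLATE's sliding-window bookkeeping or entropy coding; names are mine) shows the second occurrence turning into a back-reference:

```python
def lz77_tokens(text, window=1024, min_match=3):
    """Greedy LZ77-style parse: emit literal characters or (distance, length) references."""
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Walk forward while the earlier and current positions keep matching
            # (matches may overlap the current position, as in real LZ77).
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))  # copy best_len chars from best_dist back
            i += best_len
        else:
            tokens.append(("lit", text[i]))
            i += 1
    return tokens

print(lz77_tokens("the cat in the hat"))
# literals for "the cat in ", then ('ref', 11, 4) for the repeated "the ", then 'h', 'a', 't'
```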
Modern lossless compressors like PPM (Prediction by Partial Matching) and PAQ use context modeling: they predict the next symbol based on the preceding k symbols, maintaining separate probability distributions for each context, and encode using arithmetic coding. These systems approach entropy remarkably closely for text and structured data.
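Stripped to its essence, a context model is a table of conditional counts. The sketch below builds an order-2 model (no escape handling or arithmetic coder, so it is far from PPM proper) to show how context sharpens the prediction:

```python
from collections import Counter, defaultdict

def build_order2_model(text):
    """Count, for every 2-character context, how often each next character follows it."""
    model = defaultdict(Counter)
    for i in range(len(text) - 2):
        model[text[i:i + 2]][text[i + 2]] += 1
    return model

def predict(model, context, top=3):
    """Return the most likely next characters observed after this context."""
    counts = model.get(context, Counter())
    total = sum(counts.values())
    return {ch: round(n / total, 2) for ch, n in counts.most_common(top)} if total else {}

corpus = "the cat sat on the mat and the rat saw the bat"
model = build_order2_model(corpus)
print(predict(model, "th"))  # {'e': 1.0} -- after 'th', only 'e' was seen
print(predict(model, "at"))  # {' ': 1.0} -- 'at' is always followed by a space here
```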
Lossy compression ventures beyond the entropy bound by discarding information humans cannot perceive. JPEG quantizes high-frequency DCT coefficients in images, exploiting the eye’s reduced sensitivity to fine detail. MP3 uses psychoacoustic models to mask frequencies the ear cannot detect in the presence of louder tones. Video codecs like H.264 and HEVC combine motion compensation (predicting frames from previous frames) with transform coding and entropy coding. These methods achieve dramatic compression ratios—50:1 or higher—by accepting controlled information loss below perceptual thresholds.
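The JPEG-style transform-and-quantize step can be sketched for a single 8×8 block with plain NumPy (a hand-rolled orthonormal DCT and one uniform quantizer standing in for JPEG's quantization tables, zig-zag ordering, and entropy coding):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix: row k is frequency, column i is sample index."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    scale = np.full(n, np.sqrt(2 / n))
    scale[0] = np.sqrt(1 / n)
    return scale[:, None] * np.cos(np.pi * (2 * i + 1) * k / (2 * n))

C = dct_matrix(8)

# A smooth 8x8 block (gentle gradient plus mild noise) stands in for image data.
rng = np.random.default_rng(0)
x, y = np.meshgrid(np.arange(8), np.arange(8))
block = 16.0 * x + 4.0 * y + rng.normal(0, 2, (8, 8))

coeffs = C @ block @ C.T                # forward 2-D DCT
quantized = np.round(coeffs / 16) * 16  # coarse uniform quantization: the lossy step
restored = C.T @ quantized @ C          # inverse 2-D DCT

print("nonzero coefficients kept:", np.count_nonzero(quantized), "of 64")
print("worst pixel error:", round(float(np.abs(restored - block).max()), 2))
```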
The Limit and Its Implications
Every digital system relies on compression. The internet compresses HTTP traffic with gzip. Images use JPEG, PNG, or WebP. Video streaming on YouTube and Netflix depends on H.264, H.265, VP9, or AV1. Audio employs MP3, AAC, or Opus. File archives use ZIP, 7z, or RAR. Databases compress columnar data. Compression is ubiquitous because bandwidth and storage cost money, and entropy dictates the minimum cost.
But my theorem imposes absolute limits. You cannot compress truly random data—it already has maximum entropy. You cannot compress encrypted data—good encryption is indistinguishable from randomness. You cannot compress a file that’s already compressed; attempting to compress a ZIP file typically enlarges it because no redundancy remains.
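This is easy to verify with any general-purpose compressor; here is a quick check using Python's standard zlib (exact sizes vary with version and level):

```python
import os
import zlib

english = b"the quick brown fox jumps over the lazy dog " * 200
random_bytes = os.urandom(len(english))

for label, data in [("repetitive text", english), ("random bytes", random_bytes)]:
    once = zlib.compress(data, 9)
    twice = zlib.compress(once, 9)  # compressing the already-compressed output
    print(f"{label}: {len(data)} -> {len(once)} -> {len(twice)} bytes")
# The redundant text collapses to a tiny fraction of its size; the random input
# and the compressed stream come back the same size or slightly larger.
```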
The connection to Kolmogorov complexity deepens the insight: the Kolmogorov complexity K(x) of a string x is the length of the shortest program that generates x. For strings drawn from a computable source, the expected Kolmogorov complexity matches the entropy to within an additive constant. Compression and minimal description are two facets of the same concept.
Information theory unifies seemingly disparate domains. Thermodynamic entropy and information entropy share not just a name but a deep mathematical structure—both measure the number of states consistent with macroscopic constraints. Channel capacity sets limits on reliable communication just as source coding sets limits on compression. Cryptography’s perfect secrecy requires key entropy matching message entropy. Statistics identifies sufficient statistics—compressed representations retaining all relevant information.
Compression is the art of removing predictability while preserving information. It reveals that information is not the message itself, but the resolution of uncertainty. Redundancy is not waste—it enables error correction and fault tolerance. But when bandwidth is scarce and storage expensive, redundancy becomes a luxury. The entropy bound tells us precisely how much luxury we can afford to eliminate.