Mutual Information: What Signals Share


How related are two variables? The correlation coefficient offers one answer, measuring linear relationships between continuous quantities. But correlation misses nonlinear dependencies, categorical variables, discrete distributions. I needed something more general—a measure capturing any statistical relationship between variables, whether linear, nonlinear, deterministic, or probabilistic.

Mutual information I(X;Y) provides that measure. It quantifies precisely: how much does learning X reduce my uncertainty about Y? This question applies universally—to continuous signals, discrete symbols, probability distributions of any form. The answer comes in bits: how many binary questions does knowing X save me when determining Y?

Information as Uncertainty Reduction

Let me establish the foundation. Entropy H(X) measures uncertainty in variable X before observing it. For a fair coin flip, H(X) = 1 bit—maximum uncertainty between two equally likely outcomes. For a loaded coin landing heads with probability 0.9, entropy drops below 1 bit—less uncertainty since one outcome dominates.
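
As a quick illustration, here is a minimal sketch in Python (the helper name binary_entropy is my own) that computes these two coin entropies directly from the definition:

    import math

    def binary_entropy(p):
        """Entropy in bits of a coin that lands heads with probability p."""
        if p in (0.0, 1.0):
            return 0.0  # a certain outcome carries no uncertainty
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    print(binary_entropy(0.5))  # 1.0 bit: the fair coin, maximum uncertainty
    print(binary_entropy(0.9))  # ~0.469 bits: the loaded coin, one outcome dominates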

Conditional entropy H(X|Y) measures uncertainty in X after observing Y. If Y tells you nothing about X, then H(X|Y) = H(X)—your uncertainty remains unchanged. If Y tells you everything about X, then H(X|Y) = 0—no uncertainty remains after observing Y.

The difference quantifies what Y reveals about X:

I(X;Y) = H(X) - H(X|Y)

This is mutual information: the reduction in uncertainty about X achieved by learning Y. It measures how much information the variables share.

Notice the symmetry: I(X;Y) = I(Y;X). The information X provides about Y equals the information Y provides about X. This isn't obvious from the definition above, but it falls out of the symmetric formulation below. Knowledge flows both ways equally.

I can also write mutual information as:

I(X;Y) = H(X) + H(Y) - H(X,Y)

Here H(X,Y) is the joint entropy: the uncertainty about the pair (X,Y) taken together. Think of a Venn diagram: H(X) and H(Y) are two circles, and their overlap represents the shared information I(X;Y). The formula follows because H(X) + H(Y) counts that overlap twice while H(X,Y) counts it once, so the difference is exactly the overlap.

This formulation reveals mutual information as information shared between variables. It’s the entropy they have in common.
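
As a sanity check, here is a minimal sketch in Python (the joint distribution is one I made up for illustration) that computes I(X;Y) both ways, as H(X) - H(X|Y) and as H(X) + H(Y) - H(X,Y), and confirms they agree:

    import math
    from collections import defaultdict

    def entropy(dist):
        """Entropy in bits of a distribution given as {outcome: probability}."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    # A made-up joint distribution p(x, y): X and Y are dependent but not identical.
    joint = {('a', 0): 0.4, ('a', 1): 0.1,
             ('b', 0): 0.1, ('b', 1): 0.4}

    # Marginal distributions p(x) and p(y).
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p

    H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(joint)
    H_X_given_Y = H_XY - H_Y          # chain rule: H(X|Y) = H(X,Y) - H(Y)

    print(H_X - H_X_given_Y)          # I(X;Y) = H(X) - H(X|Y)        -> ~0.278 bits
    print(H_X + H_Y - H_XY)           # I(X;Y) = H(X) + H(Y) - H(X,Y) -> same value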

When Variables Are Independent or Determined

Mutual information ranges from zero up to the smaller of the two entropies:

0 ≤ I(X;Y) ≤ min(H(X), H(Y))

The extremes clarify what mutual information measures.

I(X;Y) = 0: X and Y are statistically independent. Knowing one tells you absolutely nothing about the other. Observing X leaves H(Y|X) = H(Y)—uncertainty unchanged. Examples: independent coin flips, unrelated measurements, separate random processes.

I(X;Y) = H(Y): X completely determines Y. Knowing X eliminates all uncertainty about Y, making H(Y|X) = 0. Examples: Y = X (perfect copy), Y = f(X) for deterministic function f, ciphertext-to-plaintext mapping with known key.

Between these extremes lies partial dependence: 0 < I(X;Y) < H(Y). Variables are related but not deterministically. X reduces uncertainty about Y without eliminating it completely. This captures probabilistic relationships, noisy channels, correlated-but-not-identical variables.

Consider concrete examples:

Independent coin flips: X = first flip, Y = second flip. Each has entropy 1 bit. But H(X,Y) = 2 bits—joint entropy is simply the sum. Therefore I(X;Y) = 1 + 1 - 2 = 0. The flips share no information.

Perfect copy: X = coin flip, Y = X exactly. H(X) = 1 bit, H(Y) = 1 bit. But H(X,Y) = 1 bit—knowing the pair is equivalent to knowing just X or just Y. Therefore I(X;Y) = 1 + 1 - 1 = 1 bit. Y is completely determined by X.

Dice parity: X = six-sided die roll (2.58 bits), Y = even/odd indicator (1 bit). Y partitions X’s six outcomes into two groups. H(Y|X) = 0 since knowing X tells you Y exactly. Therefore I(X;Y) = H(Y) - 0 = 1 bit. Despite X having 2.58 bits entropy, it shares only 1 bit with Y—the even/odd distinction.

Noisy channel: X = transmitted bit, Y = received bit with 10% error probability. Most information transfers correctly, but noise corrupts some. I(X;Y) < H(X) because errors destroy information. Channel capacity is exactly max I(X;Y)—the maximum mutual information achievable over input distributions.
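
The following sketch checks these numbers directly from the joint distributions (my own construction; for the noisy channel I assume a uniform input bit, which happens to be the capacity-achieving choice):

    import math
    from collections import defaultdict

    def mutual_information(joint):
        """I(X;Y) in bits from a joint distribution {(x, y): probability}."""
        px, py = defaultdict(float), defaultdict(float)
        for (x, y), p in joint.items():
            px[x] += p
            py[y] += p
        H = lambda d: -sum(p * math.log2(p) for p in d.values() if p > 0)
        return H(px) + H(py) - H(joint)

    # Independent fair coin flips: every pair equally likely.
    flips = {(x, y): 0.25 for x in 'HT' for y in 'HT'}
    print(mutual_information(flips))   # 0.0 bits: the flips share nothing

    # Perfect copy: Y = X, so only the diagonal carries probability.
    copy = {('H', 'H'): 0.5, ('T', 'T'): 0.5}
    print(mutual_information(copy))    # 1.0 bit: Y is completely determined by X

    # Dice parity: X = die roll, Y = X mod 2.
    dice = {(x, x % 2): 1 / 6 for x in range(1, 7)}
    print(mutual_information(dice))    # 1.0 bit out of X's ~2.58 bits

    # Binary symmetric channel with 10% error, uniform input bit.
    bsc = {(x, y): 0.5 * (0.9 if y == x else 0.1) for x in (0, 1) for y in (0, 1)}
    print(mutual_information(bsc))     # ~0.531 bits: noise destroys part of the information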

The dice parity example reveals something important: mutual information captures relevant information. X contains 2.58 bits total, but only 1 bit matters for predicting Y. Mutual information isolates that relevant subset.

Beyond Linear Correlation

Why not just use the correlation coefficient ρ? Because correlation only measures linear relationships between continuous variables. Mutual information measures all statistical dependence, for any variable types.

Consider Y = X². A perfect deterministic relationship: Y is completely determined by X. But if X has a distribution symmetric about zero (a Gaussian, say), the correlation coefficient ρ(X,Y) = 0. Correlation sees no linear relationship and concludes independence. Wrong.

Mutual information correctly identifies the full dependence: for a discrete X, I(X;Y) = H(Y), the maximum possible. Knowing X tells you Y exactly, regardless of whether the relationship is linear.
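
A small sketch makes this concrete. With X uniform on {-2, -1, 0, 1, 2} (a toy distribution of my own choosing) and Y = X², the Pearson correlation is exactly zero while the mutual information equals H(Y):

    import math
    from collections import defaultdict

    xs = [-2, -1, 0, 1, 2]              # X uniform on a support symmetric about zero
    pairs = [(x, x * x) for x in xs]    # Y = X^2: deterministic but nonlinear

    # Pearson correlation over the five equally likely (x, y) pairs.
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs) / n)
    print(cov / (sx * sy))              # 0.0: correlation sees no relationship at all

    # Mutual information from the joint distribution {(x, y): 1/5}.
    joint = {pair: 1 / n for pair in pairs}
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    H = lambda d: -sum(p * math.log2(p) for p in d.values() if p > 0)
    print(H(px) + H(py) - H(joint))     # ~1.52 bits = H(Y): X determines Y completely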

Or consider discrete variables: X = disease status (present/absent), Y = test result (positive/negative). Correlation doesn’t apply directly—these aren’t continuous quantities. But mutual information works perfectly: I(X;Y) quantifies how much the test result informs disease diagnosis.

Mutual information catches:

  • Nonlinear relationships (polynomials, exponentials, modular arithmetic)
  • Discrete variable dependencies (categorical features, symbols, states)
  • Mixed discrete-continuous relationships (continuous signal with discrete labels)
  • Any statistical dependence structure captured by joint probability distribution

The only requirement: X and Y must have well-defined probability distributions. Then I(X;Y) measures their statistical relationship, whatever form it takes.

But mutual information doesn’t tell you causation direction. I(X;Y) = I(Y;X)—the formula is symmetric. If fire and smoke have high mutual information, this tells you they’re strongly related but not whether fire causes smoke or smoke causes fire. You need temporal information, intervention experiments, or causal modeling for that. Mutual information only quantifies association, not causation.

Where This Matters

Mutual information appears throughout information theory and its applications.

Channel capacity: The maximum rate of reliable communication through a noisy channel is C = max I(X;Y), maximized over all possible input distributions. This is the information-theoretic core of the capacity formula. The channel coding theorem says you can communicate reliably at any rate below C by using error-correcting codes. Mutual information between the input and output signals determines the fundamental limit.
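
To see that maximization concretely, here is a brute-force sketch (my own, for a binary symmetric channel with 10% error) that sweeps the input bias and reports where I(X;Y) peaks:

    import math

    def bsc_mutual_information(p_one, error=0.1):
        """I(X;Y) in bits for a binary symmetric channel, given the input bias and error rate."""
        joint = {(x, y): (p_one if x == 1 else 1 - p_one) * (1 - error if y == x else error)
                 for x in (0, 1) for y in (0, 1)}
        px = {0: 1 - p_one, 1: p_one}
        py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
        H = lambda d: -sum(p * math.log2(p) for p in d.values() if p > 0)
        return H(px) + H(py) - H(joint)

    # Grid search over input distributions p(X = 1).
    best = max((bsc_mutual_information(p / 100), p / 100) for p in range(1, 100))
    print(best)   # ~(0.531, 0.5): capacity of about 0.531 bits, achieved by a uniform input

The uniform input wins because the channel is symmetric; the peak value 1 - H(0.1) ≈ 0.531 bits is the familiar capacity of this channel.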

Feature selection in machine learning: Given features X₁, X₂, …, Xₙ and target label Y, which features are most relevant? Those with highest I(Xᵢ;Y). Mutual information quantifies relevance—features sharing more information with the label provide more predictive power. This guides dimensionality reduction and removes irrelevant features.
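
A sketch of that ranking on invented categorical data (the mutual_information helper, the feature names, and the labels are all mine, purely for illustration; libraries such as scikit-learn expose comparable estimators):

    import math
    from collections import Counter

    def mutual_information(xs, ys):
        """Plug-in estimate of I(X;Y) in bits from paired samples of two discrete variables."""
        n = len(xs)
        joint = Counter(zip(xs, ys))
        px, py = Counter(xs), Counter(ys)
        return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
                   for (x, y), c in joint.items())

    # Invented toy data: the label depends completely on "color", partially on "size",
    # and not at all on "id_parity".
    labels = ['spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham']
    features = {
        'color':     ['red', 'red', 'red', 'red', 'blue', 'blue', 'blue', 'blue'],
        'size':      ['big', 'big', 'big', 'small', 'big', 'small', 'small', 'small'],
        'id_parity': ['even', 'odd', 'even', 'odd', 'even', 'odd', 'even', 'odd'],
    }

    scores = {name: mutual_information(values, labels) for name, values in features.items()}
    for name in sorted(scores, key=scores.get, reverse=True):
        print(name, round(scores[name], 3))   # color 1.0, size ~0.189, id_parity 0.0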

Neuroscience: How much information transfers between brain regions? Record neural activity X in one region and Y in another. I(X;Y) measures information flow. High mutual information suggests functional connectivity. Transfer entropy extends this to directed information flow over time.

Information bottleneck: Compress representation X into Z while preserving information about target Y. Optimize trade-off: minimize I(X;Z) (compression) while maximizing I(Z;Y) (retain relevant information). This principle explains deep learning representations—networks compress inputs while preserving task-relevant information.

Genetics: Compare gene expression levels across samples. I(Xᵢ;Xⱼ) measures co-expression between genes i and j. High mutual information suggests a regulatory relationship, a shared pathway, or a functional connection. This builds gene co-expression networks that capture dependencies without assuming linearity.

Cryptography: Perfect secrecy requires I(M;C) = 0 where M is message and C is ciphertext. The ciphertext must reveal zero information about the message. One-time pad achieves this—ciphertext appears completely independent of plaintext to any observer lacking the key.
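
A sketch of that claim on single bits (a deliberately biased message bit, my own toy setup): encrypting with a uniform, independent key bit drives the mutual information between message and ciphertext to zero:

    import math
    from collections import defaultdict

    def mutual_information(joint):
        """I(M;C) in bits from a joint distribution {(m, c): probability}."""
        pm, pc = defaultdict(float), defaultdict(float)
        for (m, c), p in joint.items():
            pm[m] += p
            pc[c] += p
        H = lambda d: -sum(p * math.log2(p) for p in d.values() if p > 0)
        return H(pm) + H(pc) - H(joint)

    p_message_one = 0.8   # heavily biased message bit: plenty of structure that could leak
    p_key_one = 0.5       # one-time-pad key bit: uniform and independent of the message

    # Ciphertext c = m XOR k. Accumulate the joint distribution over (message, ciphertext).
    joint = defaultdict(float)
    for m in (0, 1):
        for k in (0, 1):
            p = (p_message_one if m else 1 - p_message_one) * (p_key_one if k else 1 - p_key_one)
            joint[(m, m ^ k)] += p

    print(mutual_information(joint))   # 0.0 (up to rounding): the ciphertext reveals nothing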

Any time you ask “how related are these variables?”—mutual information provides a precise, quantitative answer measured in bits.

Quantifying Relevance

Mutual information solved a problem I encountered constantly: measuring relatedness rigorously. Correlation handles one special case (linear relationships between continuous variables), but real-world dependencies are messier—discrete, nonlinear, probabilistic.

Information theory provides the general solution. Variables share information when they reduce each other’s uncertainty. Quantify the reduction, and you quantify the relationship. The units are bits—binary questions saved by knowing one variable when determining the other.

The formulation is elegant. Mutual information is:

  • Symmetric: I(X;Y) = I(Y;X)—information sharing is mutual
  • Non-negative: I(X;Y) ≥ 0—can’t share negative information
  • Bounded: I(X;Y) ≤ min(H(X), H(Y))—can’t share more information than exists

These properties follow from entropy’s mathematics, making mutual information a well-behaved measure of statistical dependence.

Modern machine learning, neuroscience, genetics, and communications all rely on mutual information. When you need to quantify relevance, dependence, or relationship strength—and correlation isn’t enough—mutual information provides the answer.

It measures what signals share by measuring what they reveal about each other. Information is the resolution of uncertainty. Shared information is shared uncertainty reduction. Both measured precisely in bits.


This editorial synthesizes concepts of entropy as uncertainty measurement, information as selection and uncertainty reduction, probability distributions, information leakage as dependency revelation, and universal information metrics to examine mutual information I(X;Y) as the quantification of statistical relationships between variables.
