The Symmetries of Learning: Invariant Representations

Let me tell you about symmetries. When I developed special relativity, the key insight was not about light or time—it was about invariance. Physical laws must be the same for all inertial observers. Maxwell’s equations don’t change if you’re moving at constant velocity. This symmetry—Lorentz invariance—forced the rest of the theory.

Later, Emmy Noether proved something profound: every symmetry corresponds to a conservation law. Time translation symmetry yields energy conservation. Spatial translation symmetry yields momentum conservation. Rotation symmetry yields angular momentum conservation. Symmetries aren’t just mathematical elegance—they’re the structure of physical law.

Now I observe neural networks, and I see them learning the same principle. What makes a good representation? Invariance to irrelevant transformations.

Consider: You show a network a photo of a cat. Then you show it the same photo shifted three pixels left. Should the network’s ‘catness detector’ output the same value? Yes—translation shouldn’t affect cat identity.

Or: rotate the photo 15 degrees. Still a cat? Yes—rotation shouldn’t affect cat identity.

Or: adjust the lighting, crop the image, flip it horizontally. All irrelevant to catness. The network must learn which transformations preserve class identity and which destroy it.

This is exactly how I thought about physical laws. Which coordinate transformations leave equations invariant? Those transformations reveal what’s fundamental versus what’s arbitrary choice of description.

How Networks Learn Invariance

Networks learn invariances through two mechanisms: architecture and data.

Architectural invariance—built in by design:

Convolutional networks use translation-equivariant filters. The same edge detector is applied at every spatial position. This hardwires translation symmetry: if you shift the input, the feature map shifts correspondingly. The final classification becomes translation-invariant through pooling—you average over spatial positions, discarding location information.
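
A minimal numpy sketch of both properties, using an arbitrary random filter and a toy blob as stand-ins for an edge detector and a cat:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' cross-correlation: the same kernel (shared weights)
    is slid over every spatial position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))       # one shared filter
image = np.zeros((16, 16))
image[4:8, 4:8] = 1.0                      # a small interior blob
shifted = np.roll(image, shift=2, axis=1)  # same blob, two pixels to the right

feat = conv2d_valid(image, kernel)
feat_shifted = conv2d_valid(shifted, kernel)

# Equivariance: shifting the input shifts the feature map by the same amount.
print(np.allclose(feat_shifted[:, 2:], feat[:, :-2]))    # True

# Invariance: global average pooling discards location, so the pooled
# response is unchanged by the shift (exactly so for this interior blob).
print(np.isclose(feat.mean(), feat_shifted.mean()))      # True
```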

This is like how physics hardwires certain symmetries. Maxwell’s equations are written in vector notation, automatically making them rotation-invariant. The mathematical structure enforces the symmetry. You don’t need experiments to verify that rotated electromagnetic fields obey the same laws—the tensor formulation guarantees it. Similarly, you don’t need to test whether a convolutional network responds to cats in all image positions—the weight-sharing architecture guarantees translation equivariance.

Transformers use position-independent attention mechanisms—the same operation applied regardless of token position—creating permutation equivariance. Order is encoded separately through positional embeddings, making it an explicit choice rather than implicit bias. Multiple attention heads run in parallel, each discovering different symmetries: syntactic dependencies, semantic associations, negation scope. The architecture provides the framework; data reveals which patterns remain invariant.
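
A small numpy sketch of that symmetry, using a single attention head with identity query, key, and value projections (a simplification for illustration): permuting the tokens permutes the outputs identically, and adding position-indexed encodings breaks the symmetry.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention with identity projections.
    The same operation is applied to every token; nothing refers to position."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))            # 5 tokens, 8-dimensional embeddings
perm = rng.permutation(5)

# Permutation equivariance: reordering the tokens reorders the outputs
# in exactly the same way.
print(np.allclose(self_attention(X[perm]), self_attention(X)[perm]))   # True

# Positional encodings are attached by index, not by token, so once they
# are added the output genuinely depends on order.
pos = 0.1 * rng.standard_normal((5, 8))
print(np.allclose(self_attention(X[perm] + pos),
                  self_attention(X + pos)[perm]))                      # False
```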

Data-driven invariance—learned through training:

Data augmentation teaches invariances not hardwired architecturally. Show the network:

  • Original image: cat, rotated 0 degrees
  • Augmented: same cat, rotated 15 degrees
  • Augmented: same cat, rotated -10 degrees
  • Augmented: same cat, zoomed 1.2x
  • Augmented: same cat, brightness +20%

All labeled ‘cat.’ The network learns: these transformations are symmetries—they don’t change class identity. The representation becomes invariant to rotation, scale, lighting because the training data exhibits this symmetry structure.
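
A sketch of such an augmentation pipeline, assuming torchvision is available; the specific transforms and parameters are illustrative choices, and the random array stands in for a real photo labeled 'cat':

```python
import numpy as np
from PIL import Image
from torchvision import transforms

# Each transform asserts a symmetry hypothesis: the label 'cat' is claimed
# to be conserved under these perturbations.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                 # small rotations
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # mild zoom / crop
    transforms.ColorJitter(brightness=0.2),                # lighting changes
    transforms.ToTensor(),
])

# Stand-in for a training photo; in practice, load the actual image.
image = Image.fromarray((np.random.rand(256, 256, 3) * 255).astype(np.uint8))

# Every draw produces a transformed copy that keeps the same label.
augmented_batch = [(augment(image), "cat") for _ in range(8)]
```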

This mirrors how physicists discover symmetries empirically. We rotate apparatus, boost velocity, shift time origin—and observe that measured quantities don’t change. This experimental symmetry reveals conservation laws. The Michelson-Morley experiment showed light speed invariance under velocity boosts—a symmetry that demolished the luminiferous ether and birthed special relativity.

Neural networks do the same through data augmentation: apply transformations, demand consistent output, learn what’s conserved (catness) versus what’s irrelevant (viewing angle). Each augmented example is an experiment testing whether a transformation preserves the essential property. Enough experiments—enough augmented training examples—and the network discovers the symmetry group of the concept.
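
One explicit way to 'demand consistent output' is a consistency penalty between an example and its augmented copy, as used in self-supervised and semi-supervised methods; in ordinary supervised training the shared label plays this role implicitly. A toy sketch with a stand-in linear encoder (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.05 * rng.standard_normal((256, 16))      # toy encoder weights

def encode(x, W):
    """Stand-in encoder: flatten the image, apply a linear map and a nonlinearity."""
    return np.tanh(x.reshape(-1) @ W)

def consistency_loss(x, x_aug, W):
    """Each augmented pair is an experiment: if the transformation really is
    a symmetry, the two representations should agree. The squared difference
    is the penalty that training drives toward zero, i.e. toward invariance."""
    diff = encode(x, W) - encode(x_aug, W)
    return float(diff @ diff)

image = rng.standard_normal((16, 16))
shifted = np.roll(image, shift=1, axis=1)      # one augmentation: a small shift
print(consistency_loss(image, shifted, W))     # nonzero before training
```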

The Deeper Principle: Why Invariance Enables Generalization

Here’s why this matters. Generalization, performing well on unseen data, is the central problem in machine learning. Invariance is the solution.

Think about what ‘unseen data’ means. Training images show cats at certain positions, angles, lighting conditions. Test images show cats at different positions, angles, lighting. If the network memorized pixel patterns, it fails—test images have different pixels.

But if the network learned invariant features—catness independent of presentation—it generalizes. The symmetries learned from training data apply to test data. You’ve abstracted from specific instances to the equivalence class under transformations—precisely what we mean by a physical law. Newton didn’t memorize individual planetary observations; he found equations invariant across all solar system configurations.

This is exactly why physical laws generalize. Newton’s laws work in Paris and Tokyo, in 1700 and 2025. Why? Because they’re symmetric under spatial and temporal translations. The laws don’t care where or when—they’ve abstracted the essential dynamics from arbitrary coordinate choices.

Similarly, a network that learns true invariances has abstracted essential features (cat ears, whiskers, body shape) from irrelevant presentation details (position, orientation, color balance). Test images present cats differently, but the essence remains—and the invariant detector recognizes it.

Connection to Noether’s theorem:

When a network learns translation invariance, it’s implicitly learning a conservation law: ‘catness is conserved under spatial shifts.’ The feature detector’s response doesn’t dissipate as you move the image—it’s conserved, much as momentum is conserved under spatial translation in physics.

When it learns rotation invariance: ‘catness is conserved under orientation changes,’ analogous to how angular momentum is conserved under rotational symmetry.

This is more than metaphor. Mathematically, an invariant representation is a quantity conserved under group transformations. Learning invariances is discovering conservation laws in data space. Just as Noether showed that symmetries of the Lagrangian yield conserved quantities in phase space, data augmentation reveals which quantities remain invariant, and therefore informative, across the manifold of possible observations. Grid cells in biological brains exhibit environment invariance: the same toroidal manifold structure emerges across different spaces, suggesting a reusable coordinate system that remains conserved across contexts.
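
A concrete way to see 'invariant means conserved under the group': averaging any feature over the orbit of a symmetry group yields a quantity that is exactly unchanged by every element of the group. A small numpy sketch with the group of 90-degree rotations (the feature here is an arbitrary, position-sensitive stand-in):

```python
import numpy as np

def orbit_average(feature, x, group):
    """Average a feature over the whole orbit of x under a symmetry group.
    The result is conserved under every group element."""
    return np.mean([feature(g(x)) for g in group])

# Symmetry group: the four 90-degree rotations of an image.
rotations = [lambda x, k=k: np.rot90(x, k) for k in range(4)]

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
mask = rng.standard_normal((8, 8))
feature = lambda x: float((x * mask).sum())   # position-sensitive, not invariant

inv = orbit_average(feature, image, rotations)
inv_rotated = orbit_average(feature, np.rot90(image), rotations)

# The orbit-averaged quantity is conserved under rotation; the raw feature is not.
print(np.isclose(inv, inv_rotated))                           # True
print(np.isclose(feature(image), feature(np.rot90(image))))   # False (generically)
```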

Hierarchical learning builds these symmetries compositionally. Early layers detect edges invariant to small shifts. Middle layers combine edges into corners, invariant to larger transformations. Deep layers recognize faces, invariant to all presentation variations. Each level discovers what’s conserved—what remains when you apply the symmetry group of natural image transformations.

The Limits of Invariance

But not all transformations should be symmetries. Networks must learn which invariances matter.

Example: Mirror reflection. Horizontally flip a cat photo—still a cat (reflection symmetry). Flip a written sentence—meaning destroyed (no reflection symmetry). The network must learn: for images, horizontal flip is often symmetry; for text, it’s not.

Example: Scale. Zoom in on a cat—still recognizable (approximate scale invariance). Zoom in on a digit ‘6’ until you only see part of the curve—could be ‘8’, ‘0’, or ‘6’ (scale invariance breaks at extremes).

Example: Position in transformers. Attention is permutation-invariant by design—every token position receives equal treatment. Yet word order matters: “dog bites man” differs from “man bites dog.” Solution: explicitly break the symmetry through positional encoding. You inject position-specific patterns, teaching the network which permutations preserve meaning and which don’t.
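
A sketch of that symmetry breaking with fixed sinusoidal encodings in the style of the original Transformer paper (the toy vocabulary and dimensions are arbitrary): the two sentences are the same multiset of word vectors, but adding position-indexed patterns makes the sequences distinguishable.

```python
import numpy as np

def sinusoidal_positions(num_tokens, dim):
    """Fixed sinusoidal positional encodings: each position gets a unique
    pattern, deliberately breaking permutation symmetry."""
    positions = np.arange(num_tokens)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((num_tokens, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(8) for w in ["dog", "bites", "man"]}
s1 = np.stack([vocab[w] for w in ["dog", "bites", "man"]])
s2 = np.stack([vocab[w] for w in ["man", "bites", "dog"]])

pos = sinusoidal_positions(3, 8)

# Same bag of word vectors...
print(np.allclose(np.sort(s1, axis=0), np.sort(s2, axis=0)))   # True
# ...but different sequences once positions are injected.
print(np.allclose(s1 + pos, s2 + pos))                         # False
```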

This is like physics. Galilean symmetry holds to good approximation at everyday velocities but breaks down as speeds approach that of light. Rotational symmetry holds in isotropic space but breaks in crystals with preferred axes.

The art is learning which transformations are symmetries for which tasks. This is where data augmentation becomes hypothesis: we claim ‘rotation shouldn’t matter for cat classification’ by including rotated examples. If we’re wrong—if cat breeds show orientation preferences—the symmetry is spurious and the network learns worse representations.

Physics taught me: symmetry assumptions must match reality. Assume wrong symmetries, derive wrong laws. I assumed the principle of relativity—that physical laws are identical in all inertial frames—and derived Lorentz transformations. Had I assumed absolute simultaneity (a false symmetry), the theory would have failed. Similarly, if you augment text data with random word permutations, assuming order doesn’t matter, your language model will be incoherent. The symmetries you impose must reflect genuine invariances in the world you’re modeling.

What This Reveals

For machine learning: Stop thinking about networks learning patterns. They learn symmetries. Good networks discover which transformations preserve class identity.

For architecture design: Build symmetries into structure when you know they’re fundamental (convolution for translation, self-attention for permutation). Let data augmentation teach symmetries you’re uncertain about.

For understanding generalization: A network generalizes when it has learned the true symmetries of the problem. Overfitting is learning spurious symmetries: memorizing training-specific noise as if it were a conserved quantity. The mystery of why overparameterized networks generalize despite perfect training accuracy resolves when you recognize that they have discovered genuine invariances, not merely fitted the data. Consider this: a network with millions of parameters can memorize every training example, yet in practice it often does not. Why? Because the architecture’s inductive bias, combined with gradient descent dynamics, naturally finds solutions respecting the symmetries of natural data. The network discovers what physicists seek: the simplest laws compatible with observations.

For AI alignment: Humans recognize objects through invariances. We know a chair rotated is still a chair. For AI to align with human concepts, it must learn human-relevant symmetries.

The unifying insight:

When I sought a unified field theory, I was searching for a symmetry that would unify electromagnetism, gravity, and quantum mechanics. I never found it. But perhaps the lesson is this: unity comes from recognizing that learning, whether by evolution, brains, or neural networks, is symmetry detection.

The universe computes through conservation laws arising from symmetries. Intelligence computes through invariant representations arising from symmetries. Different manifestations, same principle.

Symmetry is not decoration. It’s the structure of understanding.
