Dark Matter of Interpretability
Chris Olah coined this analogy to describe features researchers haven’t extracted from large language models despite knowing they exist within trained networks.
Residual Stream in Transformer Architecture
Transformer architectures like Google’s Gemma use a residual stream to incrementally transform input representations as they pass through 26 layers, each containing attention and multi-layer perceptron blocks.
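A minimal sketch of the idea, assuming illustrative dimensions rather than Gemma’s actual configuration: each attention and MLP sub-layer reads the current residual stream and adds its output back in, so the representation is edited incrementally across the 26 blocks.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm transformer block that writes into the residual stream."""
    def __init__(self, d_model=1024, n_heads=8, d_mlp=4096):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
        )

    def forward(self, resid):
        # Attention reads the (normalised) stream and adds its output back in.
        normed = self.ln1(resid)
        attn_out, _ = self.attn(normed, normed, normed)
        resid = resid + attn_out
        # The MLP does the same, so the stream accumulates edits layer by layer.
        resid = resid + self.mlp(self.ln2(resid))
        return resid

blocks = nn.ModuleList([Block() for _ in range(26)])  # 26 layers, as in the text
resid = torch.randn(1, 16, 1024)                      # (batch, tokens, d_model)
for block in blocks:
    resid = block(resid)
```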
Instruction Tuning and Model Alignment
AI labs like Google, OpenAI, and Anthropic use instruction tuning as a post-training step to align base models with expected AI assistant behavior, combining curated example conversations with reinforcement learning from human feedback.
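A hedged sketch of the supervised half of that pipeline, assuming a placeholder base model ("gpt2") and a single toy example; real post-training uses large curated datasets and follows this with reinforcement learning from human feedback.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

examples = [  # toy instruction/response pair; real datasets are far larger
    ("Summarise: The cat sat on the mat.", "A cat sat on a mat."),
]

model.train()
for instruction, response in examples:
    # Format the pair as a chat-style prompt the assistant should imitate.
    text = f"User: {instruction}\nAssistant: {response}{tok.eos_token}"
    batch = tok(text, return_tensors="pt")
    # Labels equal inputs: the standard next-token (causal LM) objective.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```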
Mechanistic Interpretability: Opening the Black Box
Chris Olah popularized mechanistic interpretability as a research paradigm, with teams at Anthropic, OpenAI, and Google DeepMind making substantial progress in extracting human-understandable features from language models.
Polysemanticity in Neural Networks
Researchers studying mechanistic interpretability discovered polysemanticity when investigating why individual neurons in language models respond to seemingly unrelated concepts, unlike in vision models, where individual neurons often cleanly detect concepts such as faces or cars.
Superposition Hypothesis in Language Models
The Anthropic team published this hypothesis in 2022 to explain why language models exhibit more polysemanticity than vision models despite similar architectures: networks can represent far more features than they have neurons by encoding them as nearly orthogonal directions in activation space.
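A toy illustration of that intuition, with arbitrary numbers rather than the 2022 paper’s setup: pack many sparse features into far fewer dimensions as random, nearly orthogonal directions, then read each one back with a dot product and observe that interference stays small as long as only a few features are active at once.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 128, 1024        # 8x more features than dimensions
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse activation: only a handful of features are "on" at once.
active = rng.choice(n_features, size=3, replace=False)
x = directions[active].sum(axis=0)     # superposed representation

# Reading features back out: active ones typically score near 1, inactive
# ones pick up only small interference from the nearly orthogonal directions.
scores = directions @ x
inactive = np.setdiff1d(np.arange(n_features), active)
print("active feature scores:", np.round(scores[active], 2))
print("largest inactive score:", round(float(scores[inactive].max()), 2))
```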
Sparse Autoencoders for Feature Extraction
Mechanistic interpretability researchers developed sparse autoencoders to extract human-understandable features from language models. Google DeepMind released Gemma Scope with over 400 trained sparse autoencoders, while Anthropic and OpenAI scaled to millions of features.
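A minimal sketch of a sparse autoencoder of the kind trained on residual-stream activations: an overcomplete ReLU encoder, a linear decoder, and a sparsity penalty that keeps only a few features active per input. Sizes, the L1 penalty, and the coefficient here are illustrative assumptions, not Gemma Scope’s or Anthropic’s actual settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=1024, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # overcomplete: many more features than dims
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))          # sparse feature activations
        recon = self.decoder(feats)                     # reconstructed activations
        return feats, recon

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)

acts = torch.randn(256, 1024)                           # stand-in for residual-stream activations
feats, recon = sae(acts)
# Objective: reconstruct the activations while penalising dense feature use.
loss = ((recon - acts) ** 2).mean() + 5e-3 * feats.abs().mean()
loss.backward()
opt.step()
```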
Feature Steering for Model Control
Researchers using mechanistic interpretability discovered they can directly control model behavior by clamping the values of features extracted by sparse autoencoders, as demonstrated with Anthropic’s Claude and Google’s Gemma models.
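Continuing the sketches above (the `blocks` stack and the `sae` autoencoder come from the earlier examples), here is a simplified variant of steering: rather than literally clamping the encoded feature, it adds a scaled copy of a hypothetical feature’s decoder direction into the residual stream at one layer via a forward hook. The feature index, scale, and layer choice are all illustrative.

```python
import torch

feature_id, scale = 1234, 8.0                              # hypothetical feature index and strength
steer_dir = sae.decoder.weight[:, feature_id].detach()     # that feature's write direction, shape (d_model,)

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the block's output,
    # here pushed along the chosen feature direction for every token.
    return output + scale * steer_dir

handle = blocks[12].register_forward_hook(steering_hook)   # steer at one mid-network layer
resid = torch.randn(1, 16, 1024)                           # stand-in for real token representations
for block in blocks:
    resid = block(resid)
handle.remove()
```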