Dark Matter of Interpretability
Chris Olah coined this analogy to describe features researchers haven’t extracted from large language models despite knowing they exist within trained networks.
Residual Stream in Transformer Architecture
Transformer architectures like Google’s Gemma use a residual stream to incrementally transform input representations as they pass through 26 layers, each containing attention and multi-layer perceptron blocks.
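A minimal sketch of the idea, assuming illustrative dimensions rather than Gemma’s actual configuration: each attention and MLP sub-layer reads the current residual stream and adds its output back in, so the representation is edited incrementally across the 26 blocks.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm transformer block that writes into the residual stream."""
    def __init__(self, d_model=1024, n_heads=8, d_mlp=4096):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
        )

    def forward(self, resid):
        # Attention reads the (normalised) stream and adds its output back in.
        normed = self.ln1(resid)
        attn_out, _ = self.attn(normed, normed, normed)
        resid = resid + attn_out
        # The MLP does the same, so the stream accumulates edits layer by layer.
        resid = resid + self.mlp(self.ln2(resid))
        return resid

blocks = nn.ModuleList([Block() for _ in range(26)])  # 26 layers, as in the text
resid = torch.randn(1, 16, 1024)                      # (batch, tokens, d_model)
for block in blocks:
    resid = block(resid)
```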
Instruction Tuning and Model Alignment
AI labs like Google, OpenAI, and Anthropic use instruction tuning as a post-training step to align base models with expected AI assistant behavior, combining curated example conversations with reinforcement learning from human feedback.
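A hedged sketch of the supervised half of that pipeline, assuming a placeholder base model ("gpt2") and a single toy example; real post-training uses large curated datasets and follows this with reinforcement learning from human feedback.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

examples = [  # toy instruction/response pair; real datasets are far larger
    ("Summarise: The cat sat on the mat.", "A cat sat on a mat."),
]

model.train()
for instruction, response in examples:
    # Format the pair as a chat-style prompt the assistant should imitate.
    text = f"User: {instruction}\nAssistant: {response}{tok.eos_token}"
    batch = tok(text, return_tensors="pt")
    # Labels equal inputs: the standard next-token (causal LM) objective.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```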
Mechanistic Interpretability: Opening the Black Box
Chris Olah popularized mechanistic interpretability as a research paradigm, with teams at Anthropic, OpenAI, and Google DeepMind making substantial progress in extracting human-understandable features from language models.
Polysemanticity in Neural Networks
Researchers studying mechanistic interpretability discovered polysemanticity when investigating why individual neurons in language models respond to seemingly unrelated concepts, unlike in vision models, where individual neurons often cleanly detect concepts such as faces or cars.
Superposition Hypothesis in Language Models
The Anthropic team published this hypothesis in 2022 to explain why language models exhibit more polysemanticity than vision models despite similar architectures: networks can represent far more features than they have neurons by encoding them as nearly orthogonal directions in activation space.
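A toy illustration of that intuition, with arbitrary numbers rather than the 2022 paper’s setup: pack many sparse features into far fewer dimensions as random, nearly orthogonal directions, then read each one back with a dot product and observe that interference stays small as long as only a few features are active at once.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 128, 1024        # 8x more features than dimensions
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse activation: only a handful of features are "on" at once.
active = rng.choice(n_features, size=3, replace=False)
x = directions[active].sum(axis=0)     # superposed representation

# Reading features back out: active ones typically score near 1, inactive
# ones pick up only small interference from the nearly orthogonal directions.
scores = directions @ x
inactive = np.setdiff1d(np.arange(n_features), active)
print("active feature scores:", np.round(scores[active], 2))
print("largest inactive score:", round(float(scores[inactive].max()), 2))
```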
Sparse Autoencoders for Feature Extraction
Mechanistic interpretability researchers developed sparse autoencoders to extract human-understandable features from language models. Google DeepMind released Gemma Scope with over 400 trained sparse autoencoders, while Anthropic and OpenAI scaled to millions of features.
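A minimal sketch of a sparse autoencoder of the kind trained on residual-stream activations: an overcomplete ReLU encoder, a linear decoder, and a sparsity penalty that keeps only a few features active per input. Sizes, the L1 penalty, and the coefficient here are illustrative assumptions, not Gemma Scope’s or Anthropic’s actual settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=1024, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # overcomplete: many more features than dims
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))          # sparse feature activations
        recon = self.decoder(feats)                     # reconstructed activations
        return feats, recon

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)

acts = torch.randn(256, 1024)                           # stand-in for residual-stream activations
feats, recon = sae(acts)
# Objective: reconstruct the activations while penalising dense feature use.
loss = ((recon - acts) ** 2).mean() + 5e-3 * feats.abs().mean()
loss.backward()
opt.step()
```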
Feature Steering for Model Control
Researchers using mechanistic interpretability discovered they can directly control model behavior by clamping the values of features extracted by sparse autoencoders, as demonstrated with Anthropic’s Claude and Google’s Gemma models.
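Continuing the sketches above (the `blocks` stack and the `sae` autoencoder come from the earlier examples), here is a simplified variant of steering: rather than literally clamping the encoded feature, it adds a scaled copy of a hypothetical feature’s decoder direction into the residual stream at one layer via a forward hook. The feature index, scale, and layer choice are all illustrative.

```python
import torch

feature_id, scale = 1234, 8.0                              # hypothetical feature index and strength
steer_dir = sae.decoder.weight[:, feature_id].detach()     # that feature's write direction, shape (d_model,)

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the block's output,
    # here pushed along the chosen feature direction for every token.
    return output + scale * steer_dir

handle = blocks[12].register_forward_hook(steering_hook)   # steer at one mid-network layer
resid = torch.randn(1, 16, 1024)                           # stand-in for real token representations
for block in blocks:
    resid = block(resid)
handle.remove()
```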