The Dark Matter of AI [Mechanistic Interpretability]

Welch Labs
Dec 23, 2024
8 Notes in this Video

Dark Matter of Interpretability

MachineLearning NeuralNetworks Interpretability
01:26

Chris Olah coined this analogy to describe features researchers haven’t extracted from large language models despite knowing they exist within trained networks.

Residual Stream in Transformer Architecture

TransformerArchitecture DeepLearning NeuralNetworks
02:12

Transformer architectures like Google’s Gemma use a residual stream to incrementally transform input representations; in Gemma’s case, through 26 layers of attention and multi-layer perceptron blocks.
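
A minimal sketch of the residual-stream idea, in NumPy with stand-in attention and MLP blocks (the shapes, random weights, and 0.1 scaling are illustrative assumptions, not Gemma’s real components): each block reads from the stream and adds a small update back, so the representation accumulates additively across layers.

import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, n_tokens = 64, 26, 8   # 26 layers, as in the Gemma model discussed

def attn_block(x):   # placeholder for a multi-head attention block
    return 0.1 * x @ rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

def mlp_block(x):    # placeholder for a multi-layer perceptron block
    return 0.1 * np.maximum(x @ rng.normal(size=(d_model, d_model)), 0) / np.sqrt(d_model)

x = rng.normal(size=(n_tokens, d_model))  # embedded input tokens
for _ in range(n_layers):
    x = x + attn_block(x)                 # each block's output is *added* to the stream
    x = x + mlp_block(x)
# x now holds the incrementally transformed representation read by the unembedding.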

Instruction Tuning and Model Alignment

ModelAlignment AISafety MachineLearning
02:36

AI labs like Google, OpenAI, and Anthropic use instruction tuning as a post-training step to align base models with expected AI-assistant behavior, combining curated instruction-response examples with reinforcement learning from human feedback.
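
For the supervised part of this, a hedged toy sketch in PyTorch (the tiny model and token ids are invented for illustration; real pipelines fine-tune a pretrained base model and add RLHF afterwards): the next-token loss is masked so only the response tokens are trained on.

import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
toy_lm = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),   # stand-in for a pretrained base model
    torch.nn.Linear(d_model, vocab_size),
)

prompt_ids = torch.tensor([5, 17, 42])         # "user instruction" tokens (made up)
response_ids = torch.tensor([8, 23, 61, 2])    # desired "assistant" tokens (made up)
tokens = torch.cat([prompt_ids, response_ids])

logits = toy_lm(tokens[:-1])                   # predict each next token
targets = tokens[1:].clone()
targets[: len(prompt_ids) - 1] = -100          # ignore the loss on prompt positions
loss = F.cross_entropy(logits, targets, ignore_index=-100)
loss.backward()                                # an optimizer step would follow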

Mechanistic Interpretability: Opening the Black Box

Interpretability AISafety MachineLearning
03:04

Chris Olah popularized mechanistic interpretability as a research paradigm, with teams at Anthropic, OpenAI, and Google DeepMind making substantial progress extracting human-understandable features from language models.

Polysemanticity in Neural Networks

NeuralNetworks Representation MachineLearning
06:10

Researchers studying mechanistic interpretability ran into polysemanticity when they looked inside language models and found individual neurons responding to seemingly unrelated concepts, unlike vision models, where neurons often cleanly detect faces or cars.
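
One way to see this for yourself, sketched here assuming the TransformerLens library and GPT-2 (not necessarily the tooling or model used in the video), is to record a single MLP neuron's activation across unrelated prompts; a polysemantic neuron fires strongly on several of them with no shared concept. The layer and neuron indices below are arbitrary.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # small, ungated model for illustration
layer, neuron = 5, 123                              # arbitrary picks

prompts = [
    "The Golden Gate Bridge opened in 1937.",
    "def parse_json(payload): return json.loads(payload)",
    "El gato duerme en el sofá.",
]

for text in prompts:
    _, cache = model.run_with_cache(model.to_tokens(text))
    acts = cache[f"blocks.{layer}.mlp.hook_post"][0, :, neuron]   # per-token activations
    print(f"{acts.max().item():6.2f}  {text[:45]}")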

Superposition Hypothesis in Language Models

NeuralNetworks Representation InformationTheory
06:28

The Anthropic team published this hypothesis in 2022 to explain why language models exhibit more polysemanticity than vision models despite similar architectures: a network can represent far more features than it has dimensions by packing them into overlapping, nearly orthogonal directions.
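
A toy NumPy illustration of the packing idea (not Anthropic's actual toy-model setup): with nearly orthogonal random directions and sparse activation, a 128-dimensional space can carry 512 features with only modest interference.

import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 128, 512            # many more features than dimensions

directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

active = rng.choice(n_features, size=3, replace=False)   # sparsity is essential here
x = directions[active].sum(axis=0)        # superposed representation of 3 features

readout = directions @ x                  # project onto every feature direction
others = np.delete(readout, active)
print("readout at active features:", np.round(readout[active], 2))   # typically near 1.0
print("max interference elsewhere:", round(float(others.max()), 2))  # typically noticeably smaller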

Sparse Autoencoders for Feature Extraction

MachineLearning Autoencoders FeatureExtraction
06:56

Mechanistic interpretability researchers developed sparse autoencoders to extract human-understandable features from language models. Google DeepMind released Gemma Scope with over 400 trained sparse autoencoders, while Anthropic and OpenAI scaled to millions of features.
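
A minimal sketch of what such a sparse autoencoder can look like, assuming PyTorch and illustrative sizes rather than Gemma Scope's or Anthropic's actual recipes: an overcomplete ReLU encoder/decoder trained to reconstruct cached model activations, with an L1 penalty that keeps most feature activations at zero.

import torch
import torch.nn.functional as F

d_model, n_features = 2048, 16_384        # illustrative sizes, not any released model's

class SparseAutoencoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, n_features)
        self.decoder = torch.nn.Linear(n_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))       # sparse, non-negative feature codes
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, d_model)                      # stand-in for cached residual-stream activations
recon, feats = sae(acts)
loss = F.mse_loss(recon, acts) + 5e-4 * feats.abs().sum(-1).mean()   # reconstruction + sparsity
loss.backward()
opt.step()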

Feature Steering for Model Control

AISafety ModelControl MachineLearning
09:05

Researchers using mechanistic interpretability discovered they can directly control model behavior by clamping feature values extracted from sparse autoencoders, demonstrated with Anthropic’s Claude and Google’s Gemma models.
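
A hypothetical sketch of the clamping step, reusing the SparseAutoencoder sketch above (the function name, feature_id, and clamp_value are invented for illustration, not any lab's API): encode the residual stream at one layer, pin one feature to a fixed value, decode, and let the edited activations flow through the rest of the model.

import torch

@torch.no_grad()
def steer(resid, sae, feature_id, clamp_value=10.0):
    features = torch.relu(sae.encoder(resid))   # feature activations for this layer's stream
    features[..., feature_id] = clamp_value     # clamp the chosen feature on (or to 0 to ablate it)
    return sae.decoder(features)                # edited residual stream

# In a real demo this runs inside a forward hook at the chosen layer, so every
# later layer, and the final output, sees the steered residual stream.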