Attention in transformers, step-by-step | Deep Learning Chapter 6

3blue1brown
Apr 7, 2024
8 Notes in this Video

Attention Mechanism: Making Word Meanings Context-Dependent

Transformers DeepLearning NaturalLanguageProcessing
01:15

The attention mechanism addresses a fundamental limitation of the initial embedding step: a word is assigned the same vector regardless of context, so the surrounding words must be allowed to refine what each word means.
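
A minimal sketch of the problem in NumPy, using a made-up toy vocabulary and vector size (illustrative only): a static embedding lookup returns exactly the same vector for "mole" whether the context is zoological or chemical.

```python
import numpy as np

# Toy static embedding table (hypothetical vocabulary and dimensions).
rng = np.random.default_rng(0)
vocab = {"mole": 0, "shrew": 1, "carbon": 2, "dioxide": 3}
E = rng.normal(size=(len(vocab), 4))  # one 4-dimensional vector per vocabulary entry

def embed(tokens):
    # Pure table lookup: the surrounding words play no role.
    return np.stack([E[vocab[t]] for t in tokens])

animal = embed(["shrew", "mole"])             # "mole" as an animal
chem = embed(["carbon", "dioxide", "mole"])   # "mole" as a unit of measure
print(np.allclose(animal[-1], chem[-1]))      # True: identical vector for "mole" in both contexts
```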

Query-Key-Value Framework: Three Matrices for Attention

Transformers LinearAlgebra NeuralNetworks
04:22

Attention is implemented through three learned weight matrices that transform each word’s embedding into query, key, and value vectors, loosely analogous to a database operation in which a query is matched against keys to retrieve values.
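
A minimal sketch of the three projections in NumPy, assuming toy sizes (embedding dimension 4, head dimension 3) and random matrices standing in for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head = 4, 3              # hypothetical sizes for illustration
X = rng.normal(size=(5, d_model))   # 5 token embeddings, one per row

# Three weight matrices (random here; learned by gradient descent in practice).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q = X @ W_Q  # queries: "what am I looking for?"
K = X @ W_K  # keys:    "what do I contain?"
V = X @ W_V  # values:  "what do I pass along if attended to?"
print(Q.shape, K.shape, V.shape)  # (5, 3) (5, 3) (5, 3)
```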

Attention Patterns: Computing Query-Key Similarity Scores

Transformers DotProduct AttentionMechanism
08:15

Each word’s query vector is dotted with every word’s key vector to produce similarity scores, forming an attention pattern matrix that reveals which words should influence each other’s representations.
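
A minimal sketch of the scoring step, with random stand-ins for the projected query and key vectors; the scores are typically scaled by the square root of the key dimension, as in standard scaled dot-product attention.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_head = 5, 3
Q = rng.normal(size=(n_tokens, d_head))  # one query per token
K = rng.normal(size=(n_tokens, d_head))  # one key per token

# Dot every query with every key; scale by sqrt(d_head) to keep the scores well-behaved.
scores = Q @ K.T / np.sqrt(d_head)
print(scores.shape)  # (5, 5): entry [i, j] is how strongly token j's key matches token i's query
```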

Softmax Normalization: Converting Scores to Probability Distributions

Transformers Softmax ProbabilityDistribution
10:45

Softmax transforms raw attention scores into normalized weights that sum to one, creating interpretable probability distributions over which words receive attention.
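
A minimal sketch of the normalization, using the usual numerically stable softmax; one common convention treats each row as one token’s query scored against every key, so each row becomes a probability distribution.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability, exponentiate, normalize to sum to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.array([[2.0, 0.5, -1.0],
                   [0.0, 1.0,  1.0],
                   [3.0, 3.0,  3.0]])
weights = softmax(scores, axis=-1)
print(weights.sum(axis=-1))  # [1. 1. 1.]: each row is now a probability distribution
```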

Value Vectors: Weighted Sums Create Context-Enriched Representations

Transformers LinearCombination ContextAggregation
12:30

Once the attention weights are computed, the value vectors are combined in a weighted sum, aggregating contextual information from the relevant words into each word’s updated representation.
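
A minimal sketch of the aggregation step with made-up weights and values: each output row is a weighted average of the value vectors, pooled according to the attention pattern.

```python
import numpy as np

# Attention weights for 3 tokens (each row sums to 1) and their value vectors.
weights = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Row i of the output is sum_j weights[i, j] * V[j]: context pooled from relevant tokens.
context = weights @ V
print(context)
```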

Multi-Head Attention: Parallel Perspectives on Context

Transformers AttentionHeads ParallelProcessing
15:20

Transformers use multiple attention heads running in parallel, each with separate query-key-value matrices that learn to capture different types of relationships between words.
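
A minimal sketch of several heads running in parallel, assuming toy sizes and random weights: each head has its own W_Q, W_K, W_V, and the heads’ outputs are concatenated (a final output projection, omitted here, would mix them back together).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, d_head, rng):
    outputs = []
    for _ in range(n_heads):
        # Separate projections per head (random stand-ins for learned weights).
        W_Q, W_K, W_V = (rng.normal(size=(X.shape[1], d_head)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)  # concatenate the heads along the feature axis

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))                            # 5 tokens, embedding size 8
out = multi_head_attention(X, n_heads=4, d_head=2, rng=rng)
print(out.shape)                                       # (5, 8): 4 heads x 2 dims each
```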

Context-Dependent Embeddings: From Static Vectors to Dynamic Meanings

Transformers WordEmbeddings ContextualRepresentation
18:05

Attention mechanisms transform static word embeddings into context-dependent representations where the same word receives different vector representations depending on surrounding context.
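
A minimal sketch demonstrating the effect with random stand-in weights: the same static word vector, passed through one attention step alongside different neighbors, comes out as a different vector.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(X, W_Q, W_K, W_V):
    # One single-head attention pass over the token rows of X.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(4)
d = 4
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

word = rng.normal(size=d)         # static embedding of the same word
ctx_a = rng.normal(size=(2, d))   # one surrounding context
ctx_b = rng.normal(size=(2, d))   # a different surrounding context

out_a = attend(np.vstack([ctx_a, word]), W_Q, W_K, W_V)[-1]
out_b = attend(np.vstack([ctx_b, word]), W_Q, W_K, W_V)[-1]
print(np.allclose(out_a, out_b))  # False: same word, different contexts, different vectors
```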

Positional Encoding: Injecting Word Order into Transformers

Transformers PositionalEncoding SequenceOrder
20:45

Transformers require explicit positional encoding because attention operations are permutation-invariant: without position information, “dog bites man” and “man bites dog” would produce identical representations.
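
A minimal sketch of the classic sinusoidal positional encoding from the original Transformer paper (many models instead learn position embeddings); it is added to the token embeddings so that otherwise permutation-invariant attention can see word order.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(n_positions)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(n_positions=6, d_model=8)
# Token embeddings X of shape (6, 8) would be combined as X + pe before attention,
# so "dog bites man" and "man bites dog" no longer look identical to the model.
print(pe.shape)  # (6, 8)
```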