Attention in transformers, step-by-step | Deep Learning Chapter 6

3blue1brown
Apr 7, 2024
8 Notes in this Video

Attention Mechanism: Making Word Meanings Context-Dependent

Transformers DeepLearning NaturalLanguageProcessing
01:15

The attention mechanism addresses a fundamental limitation of the initial embedding step: a word is assigned the same vector regardless of context, so the surrounding words must be allowed to refine what each word means.
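
A minimal sketch of the problem in NumPy, using a made-up toy vocabulary and vector size (illustrative only): a static embedding lookup returns exactly the same vector for "mole" whether the context is zoological or chemical.

```python
import numpy as np

# Toy static embedding table (hypothetical vocabulary and dimensions).
rng = np.random.default_rng(0)
vocab = {"mole": 0, "shrew": 1, "carbon": 2, "dioxide": 3}
E = rng.normal(size=(len(vocab), 4))  # one 4-dimensional vector per vocabulary entry

def embed(tokens):
    # Pure table lookup: the surrounding words play no role.
    return np.stack([E[vocab[t]] for t in tokens])

animal = embed(["shrew", "mole"])             # "mole" as an animal
chem = embed(["carbon", "dioxide", "mole"])   # "mole" as a unit of measure
print(np.allclose(animal[-1], chem[-1]))      # True: identical vector for "mole" in both contexts
```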

Query-Key-Value Framework: Three Matrices for Attention

Transformers LinearAlgebra NeuralNetworks
04:22

Attention is implemented through three learned weight matrices that transform each word’s embedding into query, key, and value vectors, loosely analogous to a database operation in which a query is matched against keys to retrieve values.
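
A minimal sketch of the three projections in NumPy, assuming toy sizes (embedding dimension 4, head dimension 3) and random matrices standing in for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head = 4, 3              # hypothetical sizes for illustration
X = rng.normal(size=(5, d_model))   # 5 token embeddings, one per row

# Three weight matrices (random here; learned by gradient descent in practice).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q = X @ W_Q  # queries: "what am I looking for?"
K = X @ W_K  # keys:    "what do I contain?"
V = X @ W_V  # values:  "what do I pass along if attended to?"
print(Q.shape, K.shape, V.shape)  # (5, 3) (5, 3) (5, 3)
```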

Attention Patterns: Computing Query-Key Similarity Scores

Transformers DotProduct AttentionMechanism
08:15

Each word’s query vector is dotted with every word’s key vector to produce similarity scores, forming an attention pattern matrix that reveals which words should influence each other’s representations.
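
A minimal sketch of the scoring step, with random stand-ins for the projected query and key vectors; the scores are typically scaled by the square root of the key dimension, as in standard scaled dot-product attention.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_head = 5, 3
Q = rng.normal(size=(n_tokens, d_head))  # one query per token
K = rng.normal(size=(n_tokens, d_head))  # one key per token

# Dot every query with every key; scale by sqrt(d_head) to keep the scores well-behaved.
scores = Q @ K.T / np.sqrt(d_head)
print(scores.shape)  # (5, 5): entry [i, j] is how strongly token j's key matches token i's query
```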

Softmax Normalization: Converting Scores to Probability Distributions

Transformers Softmax ProbabilityDistribution
10:45

Softmax transforms raw attention scores into normalized weights that sum to one, creating interpretable probability distributions over which words receive attention.
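
A minimal sketch of the normalization, using the usual numerically stable softmax; one common convention treats each row as one token’s query scored against every key, so each row becomes a probability distribution.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability, exponentiate, normalize to sum to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.array([[2.0, 0.5, -1.0],
                   [0.0, 1.0,  1.0],
                   [3.0, 3.0,  3.0]])
weights = softmax(scores, axis=-1)
print(weights.sum(axis=-1))  # [1. 1. 1.]: each row is now a probability distribution
```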

Value Vectors: Weighted Sums Create Context-Enriched Representations

Transformers LinearCombination ContextAggregation
12:30

Once the attention weights are computed, the value vectors are combined in a weighted sum, aggregating contextual information from the relevant words into each word’s updated representation.
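
A minimal sketch of the aggregation step with made-up weights and values: each output row is a weighted average of the value vectors, pooled according to the attention pattern.

```python
import numpy as np

# Attention weights for 3 tokens (each row sums to 1) and their value vectors.
weights = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Row i of the output is sum_j weights[i, j] * V[j]: context pooled from relevant tokens.
context = weights @ V
print(context)
```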

Multi-Head Attention: Parallel Perspectives on Context

Transformers AttentionHeads ParallelProcessing
15:20

Transformers use multiple attention heads running in parallel, each with separate query-key-value matrices that learn to capture different types of relationships between words.
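
A minimal sketch of several heads running in parallel, assuming toy sizes and random weights: each head has its own W_Q, W_K, W_V, and the heads’ outputs are concatenated (a final output projection, omitted here, would mix them back together).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, d_head, rng):
    outputs = []
    for _ in range(n_heads):
        # Separate projections per head (random stand-ins for learned weights).
        W_Q, W_K, W_V = (rng.normal(size=(X.shape[1], d_head)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)  # concatenate the heads along the feature axis

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))                            # 5 tokens, embedding size 8
out = multi_head_attention(X, n_heads=4, d_head=2, rng=rng)
print(out.shape)                                       # (5, 8): 4 heads x 2 dims each
```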

Context-Dependent Embeddings: From Static Vectors to Dynamic Meanings

Transformers WordEmbeddings ContextualRepresentation
18:05

Attention mechanisms transform static word embeddings into context-dependent representations where the same word receives different vector representations depending on surrounding context.
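
A minimal sketch demonstrating the effect with random stand-in weights: the same static word vector, passed through one attention step alongside different neighbors, comes out as a different vector.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(X, W_Q, W_K, W_V):
    # One single-head attention pass over the token rows of X.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(4)
d = 4
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

word = rng.normal(size=d)         # static embedding of the same word
ctx_a = rng.normal(size=(2, d))   # one surrounding context
ctx_b = rng.normal(size=(2, d))   # a different surrounding context

out_a = attend(np.vstack([ctx_a, word]), W_Q, W_K, W_V)[-1]
out_b = attend(np.vstack([ctx_b, word]), W_Q, W_K, W_V)[-1]
print(np.allclose(out_a, out_b))  # False: same word, different contexts, different vectors
```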

Positional Encoding: Injecting Word Order into Transformers

Transformers PositionalEncoding SequenceOrder
20:45

Transformers require explicit positional encoding because attention operations are permutation-invariant: without position information, “dog bites man” and “man bites dog” would produce identical representations.
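
A minimal sketch of the classic sinusoidal positional encoding from the original Transformer paper (many models instead learn position embeddings); it is added to the token embeddings so that otherwise permutation-invariant attention can see word order.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(n_positions)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(n_positions=6, d_model=8)
# Token embeddings X of shape (6, 8) would be combined as X + pe before attention,
# so "dog bites man" and "man bites dog" no longer look identical to the model.
print(pe.shape)  # (6, 8)
```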