Backpropagation as the F=ma of Artificial Intelligence
Harvard graduate student Paul Werbos discovered backpropagation in the early 1970s, likening it to Newton's F = ma as a fundamental principle. AI pioneer Marvin Minsky initially rejected the method, claiming it couldn't learn difficult tasks. Despite the skepticism, backpropagation proved itself by training models to drive cars in the 1980s, recognize handwritten digits in the 1990s, and classify images with unprecedented accuracy in the 2010s.
Linear Models as Neural Network Building Blocks
Each neuron in a neural network, from simple models to massive language models like Llama, operates as a basic linear model. Anyone familiar with high school algebra will recognize it as a y = mx + b equation, extended to many inputs: a weighted sum of the inputs plus a bias.
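As a rough illustration, here is a minimal sketch of such a neuron in NumPy (the inputs, weights, and bias are hypothetical, chosen only for the example):

    import numpy as np

    # A single "neuron" as a linear model: a weighted sum of inputs plus a bias,
    # y = w . x + b, the many-input generalization of y = mx + b.
    def linear_neuron(x, w, b):
        # x: inputs, w: weights (the "m" values), b: bias (the "b")
        return np.dot(w, x) + b

    x = np.array([2.0, -1.0])      # hypothetical inputs
    w = np.array([0.5, 1.5])       # hypothetical weights
    b = 0.25                       # hypothetical bias
    print(linear_neuron(x, w, b))  # 0.5*2.0 + 1.5*(-1.0) + 0.25 = -0.25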
Softmax Function: Converting Neuron Outputs to Probabilities
Neural network practitioners use softmax as the standard final-layer activation function for classification tasks. Language models like Llama apply softmax to convert raw neuron outputs into a probability distribution over tens of thousands of possible next tokens.
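A minimal sketch of softmax in NumPy (the three logits are made-up numbers standing in for a model's next-token scores):

    import numpy as np

    def softmax(logits):
        # Subtracting the max logit keeps exp() from overflowing;
        # softmax is shift-invariant, so the result is unchanged.
        shifted = logits - np.max(logits)
        exps = np.exp(shifted)
        return exps / np.sum(exps)

    logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw neuron outputs
    probs = softmax(logits)
    print(probs, probs.sum())            # ~[0.659 0.242 0.099], sums to 1.0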
Cross Entropy Loss: Measuring Model Confidence in Predictions
Cross entropy is the loss function of choice for Llama and many other modern AI models. It provides the single number that training algorithms attempt to minimize through gradient descent.
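A minimal sketch of cross entropy for a single prediction (the probability vectors below are hypothetical):

    import numpy as np

    def cross_entropy(probs, target_index):
        # The loss is -log of the probability assigned to the correct class/token:
        # near 0 when the model is confidently right, large when it is
        # confidently wrong.
        return -np.log(probs[target_index])

    print(cross_entropy(np.array([0.9, 0.05, 0.05]), 0))  # ~0.105, confident and correct
    print(cross_entropy(np.array([0.4, 0.30, 0.30]), 0))  # ~0.916, unsure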
Gradient Descent: Walking Downhill on Loss Landscapes
Gradient descent is the optimization process behind virtually all modern AI training, from simple models to massive language models. Researchers visualize it as walking downhill on a complex loss landscape, though the metaphor gives only an incomplete picture of how models actually learn.
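A minimal sketch of that downhill walk on a toy one-parameter loss, L(w) = (w - 3)^2 (a made-up landscape, far simpler than a real model's):

    # Toy loss and its exact gradient.
    def loss(w):
        return (w - 3.0) ** 2

    def grad(w):
        return 2.0 * (w - 3.0)

    w = 0.0              # starting point on the loss landscape
    learning_rate = 0.1  # step size
    for step in range(25):
        w -= learning_rate * grad(w)   # step in the direction of steepest descent

    print(w, loss(w))    # w approaches 3.0, where the loss bottoms out near 0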
Partial Derivatives: Computing How Parameters Affect Loss
Backpropagation computes a partial derivative of the loss with respect to every parameter in a neural network, from tiny models with six parameters to massive models with billions. Together, these derivatives form the gradient vector that drives all modern AI training.
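A minimal sketch of what those partial derivatives mean, using brute-force nudging on a made-up two-parameter loss (backpropagation computes the same numbers exactly and far more cheaply):

    import numpy as np

    def numerical_gradient(loss_fn, params, eps=1e-6):
        # Nudge each parameter in turn and watch how the loss changes;
        # the per-parameter slopes stacked together form the gradient vector.
        grad = np.zeros_like(params)
        for i in range(len(params)):
            bumped = params.copy()
            bumped[i] += eps
            grad[i] = (loss_fn(bumped) - loss_fn(params)) / eps
        return grad

    def loss_fn(p):
        # Hypothetical loss over two parameters: L(a, b) = a^2 + 3*b.
        return p[0] ** 2 + 3.0 * p[1]

    print(numerical_gradient(loss_fn, np.array([2.0, -1.0])))  # ~[4.0, 3.0]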
Chain Rule: Composing Rates of Change Through Layers
The chain rule from calculus, applied by Werbos and later researchers, is what makes the efficient computation of gradients in neural networks possible. Bernard Widrow's group at Stanford spent the late 1950s estimating slopes numerically until Widrow and graduate student Ted Hoff stumbled onto an early version of backpropagation in 1959, though they never extended it to multi-layer networks.
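A minimal sketch of the chain rule at work on a hypothetical one-weight, two-step "network", checked against a numerical slope estimate of the kind Widrow's group relied on:

    import math

    # Two composed steps: u = w1 * x (inner layer), y = tanh(u) (outer layer).
    def forward(x, w1):
        u = w1 * x
        return u, math.tanh(u)

    def grad_w1(x, w1):
        u, y = forward(x, w1)
        dy_du = 1.0 - y ** 2    # derivative of tanh at u
        du_dw1 = x              # derivative of the inner step w.r.t. its weight
        return dy_du * du_dw1   # chain rule: multiply the local rates of change

    x, w1 = 0.5, 1.2
    eps = 1e-6
    numeric = (forward(x, w1 + eps)[1] - forward(x, w1)[1]) / eps
    print(grad_w1(x, w1), numeric)   # the two estimates agree to several decimals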
Scalability: From Simple Planes to High-Dimensional Language Maps
Marvin Minsky dismissed backpropagation in the 1970s, claiming it couldn't learn difficult tasks and converged too slowly. He enormously underestimated the algorithm's ability to scale, though the era's limited compute made his concerns about speed temporarily valid. Modern researchers train models with billions of parameters using the same fundamental algorithm Minsky rejected.