Training Data: Learning from Labeled Examples
The MNIST database provides tens of thousands of handwritten digit images, each labeled with the correct digit. The network learns by adjusting parameters to perform better on these labeled examples.
Random Initialization: Starting the Learning Process
Before training begins, all 13,000 weights and biases start with randomly chosen values rather than being set to zero or any particular pattern.
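As a concrete sketch, assuming the commonly described 784-16-16-10 layer sizes (an assumption chosen because it yields roughly 13,000 parameters), random initialization might look like this:

```python
import numpy as np

# Hypothetical layer sizes: 784 input pixels, two hidden layers of 16, 10 outputs.
# This architecture is an assumption; it is chosen so the parameter count
# comes out near the 13,000 figure described above.
layer_sizes = [784, 16, 16, 10]

rng = np.random.default_rng(seed=0)

# One weight matrix and one bias vector per layer transition,
# each filled with random values rather than zeros or a fixed pattern.
weights = [rng.normal(0.0, 1.0, size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.normal(0.0, 1.0, size=n_out) for n_out in layer_sizes[1:]]

n_params = sum(w.size for w in weights) + sum(b.size for b in biases)
print(n_params)  # 13002: 12,960 weights plus 42 biases
```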
Cost Function: Measuring Neural Network Performance
The cost function takes all 13,000 weights and biases of the neural network as input and returns a single number measuring how poorly the network performs across tens of thousands of training examples from the MNIST dataset.
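A minimal sketch of one plausible cost function, assuming the squared-error form often used in introductory treatments: compare the network's ten output activations to the desired one-hot label, square the differences, and average over the training set. The network_output function here is a hypothetical stand-in for a full forward pass.

```python
import numpy as np

def example_cost(output, label):
    """Squared error between the 10 output activations and the one-hot target."""
    target = np.zeros(10)
    target[label] = 1.0
    return np.sum((output - target) ** 2)

def total_cost(network_output, images, labels):
    """Average the per-example cost over the whole training set.
    `network_output(image)` is assumed to return 10 activations in [0, 1]."""
    return np.mean([example_cost(network_output(img), lab)
                    for img, lab in zip(images, labels)])
```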
Valley Descent Metaphor: Visualizing Gradient Descent
The ball-rolling-down-hill metaphor provides intuitive understanding of gradient descent for learners struggling with abstract optimization concepts in high-dimensional spaces.
Local Minima: Valley Destinations in Optimization
Local minima are points in the cost function landscape where gradient descent algorithms naturally settle. They represent stable configurations where small parameter changes would increase the cost.
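To see how different starting points settle into different valleys, here is a small sketch that runs plain gradient descent on a one-variable function with two local minima (the specific function and step size are illustrative choices, not from the source):

```python
def f(x):           # A function with two valleys (local minima).
    return x**4 - 3 * x**2 + x

def df(x):          # Its derivative, computed by hand.
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * df(x)
    return x

print(descend(-2.0))  # settles in the left valley, near x = -1.30
print(descend(+2.0))  # settles in the right valley, near x = 1.13
```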
Learning Rate: Step Size in Gradient Descent
The learning rate is a hyperparameter chosen by the practitioner before training begins. It controls how aggressively the gradient descent algorithm updates parameters at each iteration.
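The trade-off is that a step size that is too small makes progress slowly, while one that is too large overshoots the valley and can diverge. A minimal sketch on a simple one-variable cost (the function and values are illustrative, not from the source):

```python
def descend(lr, x=10.0, steps=25):
    """Minimize f(x) = x**2, whose gradient is 2*x, at a given learning rate."""
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(descend(0.01))  # too small: still far from the minimum after 25 steps
print(descend(0.4))   # moderate: very close to the minimum at x = 0
print(descend(1.1))   # too large: each step overshoots and the iterates blow up
```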
Gradient Vector: Direction of Steepest Ascent
The gradient vector emerges from multivariable calculus as the fundamental tool for understanding how functions change across multiple dimensions. It points in the direction of steepest increase, so stepping against it decreases the cost fastest, and it applies to cost functions with two inputs, 13,000 inputs, or any number of variables.
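A short sketch of estimating a gradient numerically for a two-input function; the same finite-difference idea extends unchanged to a function of 13,000 inputs (the specific function below is illustrative):

```python
def cost(w):                     # Example cost with two inputs, C(w0, w1).
    return (w[0] - 1) ** 2 + 3 * (w[1] + 2) ** 2

def numerical_gradient(f, w, h=1e-6):
    """Estimate each partial derivative by nudging one input at a time.
    Works for any number of inputs, two or 13,000 alike."""
    grad = []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += h
        grad.append((f(bumped) - f(w)) / h)
    return grad

print(numerical_gradient(cost, [0.0, 0.0]))  # roughly [-2.0, 12.0]
```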
High-Dimensional Optimization: Beyond Visual Intuition
Neural network optimization operates in spaces with thousands or millions of dimensions, one for each weight and bias parameter that must be adjusted during training.
Backpropagation: Efficient Gradient Computation
Backpropagation is the algorithm that efficiently computes gradients for neural networks by applying the chain rule layer by layer, working backward from the output. It makes gradient descent practically feasible and represents the computational heart of how neural networks learn.
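As a minimal sketch rather than the full algorithm, here is backpropagation for a tiny one-hidden-layer network with sigmoid activations and a squared-error cost; the chain rule is applied backward from the output so that one pass yields every gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """Return gradients of the squared-error cost for one training example.
    Shapes (illustrative): x (n_in,), y (n_out,), W1 (n_hid, n_in), W2 (n_out, n_hid)."""
    # Forward pass: compute and remember each layer's activations.
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    # Backward pass: chain rule, starting from dC/da2 for C = sum((a2 - y)**2).
    delta2 = 2 * (a2 - y) * a2 * (1 - a2)      # dC/dz2
    grad_W2 = np.outer(delta2, a1)             # dC/dW2
    grad_b2 = delta2                           # dC/db2

    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # dC/dz1, propagated back one layer
    grad_W1 = np.outer(delta1, x)              # dC/dW1
    grad_b1 = delta1                           # dC/db1
    return grad_W1, grad_b1, grad_W2, grad_b2
```

With random inputs of matching shapes, the returned gradients can be checked against finite differences, which is a common sanity test for this kind of implementation.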
Gradient Descent: Minimizing Functions Through Iterative Steps
Gradient descent is the fundamental algorithm used by neural networks and many machine learning systems to find parameter values that minimize a cost function. It operates on cost functions with any number of inputs, from simple single-variable functions to the 13,000-dimensional parameter space of neural networks.
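The core loop is the same regardless of dimensionality: compute the gradient at the current parameters, step a small amount in the opposite direction, and repeat. A minimal single-variable sketch, with an illustrative stand-in for a real cost function:

```python
def cost(w):        # Illustrative single-variable cost, smallest at w = 3.
    return (w - 3) ** 2

def gradient(w):    # Its derivative, dC/dw.
    return 2 * (w - 3)

w = -4.0            # Arbitrary starting point.
learning_rate = 0.1
for step in range(30):
    w -= learning_rate * gradient(w)  # Step downhill, against the gradient.
print(w, cost(w))   # w has moved close to 3, where the cost is near its minimum.
```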
Smooth Cost Function: Why Neural Networks Use Continuous Activations
The requirement for a smooth, differentiable cost function drives fundamental design choices in artificial neural networks, particularly the use of continuous activation functions such as the sigmoid rather than hard thresholds, so that small changes to weights and biases produce small, measurable changes in the cost.
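The contrast is easiest to see by comparing a hard threshold with a sigmoid: nudging the input of a step function usually changes nothing (its derivative is zero almost everywhere), while nudging the input of a sigmoid always changes the output a little, which is exactly the slope information gradient descent needs. A small sketch:

```python
import math

def step(z):                 # Hard threshold: output jumps from 0 to 1.
    return 1.0 if z > 0 else 0.0

def sigmoid(z):              # Smooth alternative: output varies continuously.
    return 1.0 / (1.0 + math.exp(-z))

z, h = 0.5, 0.01
print(step(z + h) - step(z))        # 0.0  -> no signal about which way to nudge
print(sigmoid(z + h) - sigmoid(z))  # ~0.0024 -> small but usable slope information
```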
Convergence: Reaching Stable Network Performance
Convergence represents the endpoint of the training process, reached when the cost stops decreasing appreciably: the network has learned as much as it can from the available data and further iterations yield diminishing returns.
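In practice, "diminishing returns" is often made concrete with a stopping rule: keep iterating until the cost stops improving by more than some tolerance. A minimal sketch of such a check, where the tolerance and the passed-in cost and gradient functions are illustrative:

```python
def train_until_converged(params, cost, gradient, lr=0.1, tol=1e-6, max_steps=10_000):
    """Run gradient descent until the cost improvement per step falls below `tol`."""
    previous = cost(params)
    for _ in range(max_steps):
        params = [p - lr * g for p, g in zip(params, gradient(params))]
        current = cost(params)
        if previous - current < tol:   # Diminishing returns: declare convergence.
            break
        previous = current
    return params
```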
Parameter Importance: Gradient Components as Relative Impact
The gradient vector encodes which of the 13,000 network parameters matter most for reducing the cost function at the current training state: the larger a component's magnitude, the more the cost responds to nudging that parameter.
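One way to read that information, as a sketch: sort the gradient's components by absolute value, and the largest entries name the weights and biases whose small nudges change the cost the most right now. The parameter names and gradient values below are made up for illustration.

```python
# Pretend gradient over a handful of named parameters (values are illustrative).
gradient = {"w_0": 0.02, "w_1": -3.10, "w_2": 0.45, "b_0": -0.80, "b_1": 0.001}

# Rank parameters by the magnitude of their gradient component.
ranked = sorted(gradient.items(), key=lambda item: abs(item[1]), reverse=True)
for name, g in ranked:
    print(f"{name}: {g:+.3f}")
# w_1 comes first: a small nudge to it changes the cost far more than a nudge to b_1.
```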
Parameter Adjustment: Nudging Weights and Biases
During each training iteration, all 13,000 weights and biases receive small adjustments, each proportional and opposite in sign to its gradient component, simultaneously updating the entire network configuration.
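As a sketch of the "all at once" nature of the update: if the roughly 13,000 parameters are flattened into one vector, a single vectorized subtraction nudges every weight and bias simultaneously. The parameter count and cost_gradient function here are assumptions for illustration; in a real network the gradient would come from backpropagation.

```python
import numpy as np

n_params = 13_002                        # Assumed parameter count (see above).
params = np.random.default_rng(0).normal(size=n_params)
learning_rate = 0.1

def cost_gradient(p):
    """Hypothetical placeholder; in a real network this comes from backpropagation."""
    return 2 * p                         # Gradient of the toy cost sum(p**2).

# One training iteration: every parameter is nudged at the same time,
# each by an amount proportional and opposite to its own gradient component.
params -= learning_rate * cost_gradient(params)
```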