Backpropagation Algorithm
Backpropagation is the algorithm that trains neural networks by computing gradients of the loss function with respect to each weight using the chain rule of calculus.
The Core Problem: How Does a Network Learn?
A neural network starts with random weights. After a forward pass, it produces a prediction. The loss function measures how wrong the prediction is. The question is: how should each weight be adjusted to reduce the loss?
Backpropagation answers this by computing the gradient dL/dw for every weight w in the network - telling us in which direction and by how much to adjust each weight.
The Chain Rule of Calculus
Backpropagation is an application of the chain rule of calculus. For a composite function f(g(x)):
``` d/dx f(g(x)) = f'(g(x)) * g'(x) ```
In a neural network, the loss L depends on the output, which depends on hidden activations, which depend on weights. The chain rule lets us propagate gradients backward through this chain:
``` dL/dw1 = dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dw1 ```
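The chain-rule product above can be checked numerically on a tiny network. The sketch below uses a hypothetical one-input, one-hidden-unit, linear-output network (the names w1, w2 and the tanh activation are illustrative assumptions, not taken from the text) and compares the analytic gradient against a finite-difference estimate:

```python
import numpy as np

# Hypothetical tiny network: one input, one hidden tanh unit, linear output.
w1, w2, x, target = 0.5, -0.3, 1.0, 1.0

def forward(w1, w2, x):
    z1 = w1 * x
    a1 = np.tanh(z1)   # hidden activation
    z2 = w2 * a1
    a2 = z2            # linear output
    return z1, a1, z2, a2

z1, a1, z2, a2 = forward(w1, w2, x)
L = 0.5 * (a2 - target) ** 2  # squared-error loss

# Chain rule: dL/dw1 = dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dw1
dL_da2 = a2 - target
da2_dz2 = 1.0                      # linear output
dz2_da1 = w2
da1_dz1 = 1.0 - np.tanh(z1) ** 2   # tanh derivative
dz1_dw1 = x
grad_analytic = dL_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Numerical check: perturb w1 slightly and measure the change in the loss
eps = 1e-6
_, _, _, a2_plus = forward(w1 + eps, w2, x)
L_plus = 0.5 * (a2_plus - target) ** 2
grad_numeric = (L_plus - L) / eps

print(grad_analytic, grad_numeric)  # the two should agree closely
```

This finite-difference check is also a standard way to debug a hand-written backward pass.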
Gradient Descent
Once gradients are computed, weights are updated using gradient descent:
``` w = w - alpha * dL/dw ```
Where alpha is the learning rate - a hyperparameter controlling step size.
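A minimal sketch of this update rule, minimizing the one-dimensional loss L(w) = (w - 3)^2 (an arbitrary illustrative function, not from the text), shows the loop converging to the minimum:

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3).
# The minimum is at w = 3; alpha = 0.1 is an arbitrary learning rate.
w = 0.0      # initial weight
alpha = 0.1  # learning rate

for _ in range(100):
    grad = 2 * (w - 3)    # dL/dw
    w = w - alpha * grad  # the update rule above

print(round(w, 4))  # converges to 3.0
```

Too large an alpha makes the iterates overshoot and diverge; too small an alpha makes convergence needlessly slow.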
Variants of gradient descent:
- Batch Gradient Descent - Uses all training examples per update. Stable but slow.
- Stochastic Gradient Descent (SGD) - Uses one example per update. Fast but noisy.
- Mini-Batch SGD - Uses a small batch (32-256 examples). Best of both worlds.
- Adam - Adaptive learning rates per parameter. Most popular optimizer today.
- RMSprop - Adapts learning rate based on recent gradient magnitudes.
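The mini-batch variant can be sketched as follows. This is an illustrative loop on a synthetic linear-regression task; the data, the batch size of 32, and the learning rate are assumptions made for the example:

```python
import numpy as np

# Synthetic linear-regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=1000)

w = np.zeros(3)  # parameters to learn
alpha, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]  # one mini-batch of indices
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on this batch only
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= alpha * grad                     # same update rule as above

print(np.round(w, 2))  # close to true_w = [1.0, -2.0, 0.5]
```

Each update sees only one batch, so the gradient is a noisy estimate of the full-dataset gradient - cheap per step, and the noise averages out over many steps.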
Implementing Backpropagation from Scratch
Let's implement a complete neural network with backpropagation for the XOR problem:
```python
import numpy as np

class NeuralNetWithBackprop:
    """2-layer neural network trained with backpropagation."""

    def __init__(self, input_size, hidden_size, output_size, lr=0.1):
        np.random.seed(42)
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.1
        self.b2 = np.zeros((1, output_size))
        self.lr = lr

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        # Output layer gradient: with sigmoid output and cross-entropy loss,
        # dL/dz2 simplifies to a2 - y (no separate sigmoid derivative needed)
        dz2 = self.a2 - y
        dW2 = self.a1.T @ dz2 / m
        db2 = dz2.mean(axis=0, keepdims=True)
        # Hidden layer gradient
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (self.z1 > 0)  # ReLU derivative
        dW1 = X.T @ dz1 / m
        db1 = dz1.mean(axis=0, keepdims=True)
        # Update weights
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def train(self, X, y, epochs=1000):
        for epoch in range(epochs):
            self.forward(X)
            self.backward(X, y)
            if epoch % 200 == 0:
                # Binary cross-entropy loss (1e-8 avoids log(0))
                loss = -np.mean(y * np.log(self.a2 + 1e-8)
                                + (1 - y) * np.log(1 - self.a2 + 1e-8))
                print(f"Epoch {epoch}: Loss = {loss:.4f}")

# XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

net = NeuralNetWithBackprop(input_size=2, hidden_size=4, output_size=1, lr=0.5)
net.train(X, y, epochs=1000)
preds = (net.forward(X) > 0.5).astype(int)
print("XOR predictions:", preds.flatten())
# Output: [0 1 1 0]
```
The Vanishing Gradient Problem
In deep networks, gradients can become extremely small as they propagate backward through many layers. This is the vanishing gradient problem:
- Sigmoid and tanh saturate (their outputs flatten toward the extremes), producing near-zero derivatives.
- Multiplying many small numbers together drives the gradient toward zero.
- Early layers learn extremely slowly or not at all.
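The effect is easy to see numerically. The sigmoid derivative is at most 0.25, so even in the best case the chain-rule product shrinks by at least a factor of 4 per layer. The depth of 20 below is an arbitrary illustrative choice:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

# z = 0 is where the sigmoid derivative is largest (exactly 0.25);
# any other pre-activation would shrink the gradient even faster.
z = 0.0
grad = 1.0
for layer in range(20):
    grad *= sigmoid_deriv(z)  # the chain rule multiplies one factor per layer

print(grad)  # 0.25**20, about 9.1e-13
```

After only 20 sigmoid layers the gradient reaching the first layer is on the order of 1e-12 - effectively zero signal for learning.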
Solutions:
- Use ReLU activation (gradient = 1 for positive inputs, never saturates).
- Batch Normalization - Normalizes layer inputs to prevent saturation.
- Residual connections (ResNets) - Skip connections allow gradients to flow directly.
- Careful weight initialization (Xavier, He initialization).
- Gradient clipping - Cap gradient magnitude to prevent explosion.
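As one concrete example from the list above, gradient clipping by global norm can be sketched as follows (the function name and the max_norm threshold of 1.0 are illustrative assumptions):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined norm <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: two gradient arrays with combined norm 5 get rescaled to norm 1.
grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped = clip_by_global_norm(grads)
print(round(float(np.sqrt(sum(np.sum(g ** 2) for g in clipped))), 6))  # 1.0
```

Clipping the global norm (rather than each array separately) preserves the direction of the overall update while capping its magnitude.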
Key Takeaways
- Backpropagation computes gradients of the loss w.r.t. every weight using the chain rule.
- Gradient descent updates weights in the direction that reduces the loss.
- Mini-batch SGD and Adam are the most commonly used optimizers in practice.
- The vanishing gradient problem causes early layers to learn slowly in deep networks.
- ReLU, batch normalization, and residual connections mitigate vanishing gradients.