Backpropagation Algorithm
Backpropagation is the algorithm that trains neural networks by computing gradients of the loss function with respect to each weight using the chain rule of calculus.
The Core Problem: How Does a Network Learn?
A neural network starts with random weights. After a forward pass, it produces a prediction. The loss function measures how wrong the prediction is. The question is: how should each weight be adjusted to reduce the loss?
Backpropagation answers this by computing the gradient dL/dw for every weight w in the network - telling us in which direction and by how much to adjust each weight.
The Chain Rule of Calculus
Backpropagation is an application of the chain rule of calculus. For a composite function f(g(x)):
``` d/dx f(g(x)) = f'(g(x)) * g'(x) ```
In a neural network, the loss L depends on the output, which depends on hidden activations, which depend on weights. The chain rule lets us propagate gradients backward through this chain:
``` dL/dw1 = dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dw1 ```
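The chain-rule product above can be checked numerically on a tiny network. The sketch below uses a hypothetical one-input, one-hidden-unit, linear-output network (the names w1, w2 and the tanh activation are illustrative assumptions, not taken from the text) and compares the analytic gradient against a finite-difference estimate:

```python
import numpy as np

# Hypothetical tiny network: one input, one hidden tanh unit, linear output.
w1, w2, x, target = 0.5, -0.3, 1.0, 1.0

def forward(w1, w2, x):
    z1 = w1 * x
    a1 = np.tanh(z1)   # hidden activation
    z2 = w2 * a1
    a2 = z2            # linear output
    return z1, a1, z2, a2

z1, a1, z2, a2 = forward(w1, w2, x)
L = 0.5 * (a2 - target) ** 2  # squared-error loss

# Chain rule: dL/dw1 = dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dw1
dL_da2 = a2 - target
da2_dz2 = 1.0                      # linear output
dz2_da1 = w2
da1_dz1 = 1.0 - np.tanh(z1) ** 2   # tanh derivative
dz1_dw1 = x
grad_analytic = dL_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Numerical check: perturb w1 slightly and measure the change in the loss
eps = 1e-6
_, _, _, a2_plus = forward(w1 + eps, w2, x)
L_plus = 0.5 * (a2_plus - target) ** 2
grad_numeric = (L_plus - L) / eps

print(grad_analytic, grad_numeric)  # the two should agree closely
```

This finite-difference check is also a standard way to debug a hand-written backward pass.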
Gradient Descent
Once gradients are computed, weights are updated using gradient descent:
``` w = w - alpha * dL/dw ```
Where alpha is the learning rate - a hyperparameter controlling step size.
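A minimal sketch of this update rule, minimizing the one-dimensional loss L(w) = (w - 3)^2 (an arbitrary illustrative function, not from the text), shows the loop converging to the minimum:

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3).
# The minimum is at w = 3; alpha = 0.1 is an arbitrary learning rate.
w = 0.0      # initial weight
alpha = 0.1  # learning rate

for _ in range(100):
    grad = 2 * (w - 3)    # dL/dw
    w = w - alpha * grad  # the update rule above

print(round(w, 4))  # converges to 3.0
```

Too large an alpha makes the iterates overshoot and diverge; too small an alpha makes convergence needlessly slow.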
Variants of gradient descent:
- Batch Gradient Descent - Uses all training examples per update. Stable but slow.
- Stochastic Gradient Descent (SGD) - Uses one example per update. Fast but noisy.
- Mini-Batch SGD - Uses a small batch (32-256 examples). Best of both worlds.
- Adam - Adaptive learning rates per parameter. Most popular optimizer today.
- RMSprop - Adapts learning rate based on recent gradient magnitudes.
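The mini-batch variant can be sketched as follows. This is an illustrative loop on a synthetic linear-regression task; the data, the batch size of 32, and the learning rate are assumptions made for the example:

```python
import numpy as np

# Synthetic linear-regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=1000)

w = np.zeros(3)  # parameters to learn
alpha, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]  # one mini-batch of indices
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on this batch only
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= alpha * grad                     # same update rule as above

print(np.round(w, 2))  # close to true_w = [1.0, -2.0, 0.5]
```

Each update sees only one batch, so the gradient is a noisy estimate of the full-dataset gradient - cheap per step, and the noise averages out over many steps.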
Implementing Backpropagation from Scratch
Let's implement a complete neural network with backpropagation for the XOR problem:
```python
import numpy as np

class NeuralNetWithBackprop:
    """2-layer neural network trained with backpropagation."""

    def __init__(self, input_size, hidden_size, output_size, lr=0.1):
        np.random.seed(42)
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.1
        self.b2 = np.zeros((1, output_size))
        self.lr = lr

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        # Output layer gradient: with sigmoid output and cross-entropy loss,
        # dL/dz2 simplifies to a2 - y (no separate sigmoid derivative needed)
        dz2 = self.a2 - y
        dW2 = self.a1.T @ dz2 / m
        db2 = dz2.mean(axis=0, keepdims=True)
        # Hidden layer gradient
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (self.z1 > 0)  # ReLU derivative
        dW1 = X.T @ dz1 / m
        db1 = dz1.mean(axis=0, keepdims=True)
        # Update weights
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def train(self, X, y, epochs=1000):
        for epoch in range(epochs):
            self.forward(X)
            self.backward(X, y)
            if epoch % 200 == 0:
                # Binary cross-entropy loss (1e-8 avoids log(0))
                loss = -np.mean(y * np.log(self.a2 + 1e-8)
                                + (1 - y) * np.log(1 - self.a2 + 1e-8))
                print(f"Epoch {epoch}: Loss = {loss:.4f}")

# XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

net = NeuralNetWithBackprop(input_size=2, hidden_size=4, output_size=1, lr=0.5)
net.train(X, y, epochs=1000)
preds = (net.forward(X) > 0.5).astype(int)
print("XOR predictions:", preds.flatten())
# Output: [0 1 1 0]
```
The Vanishing Gradient Problem
In deep networks, gradients can become extremely small as they propagate backward through many layers. This is the vanishing gradient problem:
- Sigmoid and tanh saturate (their outputs flatten toward the extremes), producing near-zero derivatives.
- Multiplying many small numbers together drives the gradient toward zero.
- Early layers learn extremely slowly or not at all.
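The effect is easy to see numerically. The sigmoid derivative is at most 0.25, so even in the best case the chain-rule product shrinks by at least a factor of 4 per layer. The depth of 20 below is an arbitrary illustrative choice:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

# z = 0 is where the sigmoid derivative is largest (exactly 0.25);
# any other pre-activation would shrink the gradient even faster.
z = 0.0
grad = 1.0
for layer in range(20):
    grad *= sigmoid_deriv(z)  # the chain rule multiplies one factor per layer

print(grad)  # 0.25**20, about 9.1e-13
```

After only 20 sigmoid layers the gradient reaching the first layer is on the order of 1e-12 - effectively zero signal for learning.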
Solutions:
- Use ReLU activation (gradient = 1 for positive inputs, never saturates).
- Batch Normalization - Normalizes layer inputs to prevent saturation.
- Residual connections (ResNets) - Skip connections allow gradients to flow directly.
- Careful weight initialization (Xavier, He initialization).
- Gradient clipping - Cap gradient magnitude to prevent explosion.
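As one concrete example from the list above, gradient clipping by global norm can be sketched as follows (the function name and the max_norm threshold of 1.0 are illustrative assumptions):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined norm <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: two gradient arrays with combined norm 5 get rescaled to norm 1.
grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped = clip_by_global_norm(grads)
print(round(float(np.sqrt(sum(np.sum(g ** 2) for g in clipped))), 6))  # 1.0
```

Clipping the global norm (rather than each array separately) preserves the direction of the overall update while capping its magnitude.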
Key Takeaways
- Backpropagation computes gradients of the loss w.r.t. every weight using the chain rule.
- Gradient descent updates weights in the direction that reduces the loss.
- Mini-batch SGD and Adam are the most commonly used optimizers in practice.
- The vanishing gradient problem causes early layers to learn slowly in deep networks.
- ReLU, batch normalization, and residual connections mitigate vanishing gradients.