Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are the dominant architecture for image recognition. This article covers convolution operations, pooling, feature maps, and landmark CNN architectures.
Why CNNs for Images?
A 224x224 RGB image contains 224 * 224 * 3 = 150,528 input values. If we fed it to a fully connected layer, every neuron would need 150,528 weights - so even a modest layer has millions of parameters and overfits severely.
CNNs solve this with three key ideas:
1. Local connectivity - Each neuron connects only to a small region (its receptive field).
2. Weight sharing - The same filter is applied across the entire image.
3. Hierarchical features - Early layers detect edges; deeper layers detect shapes and objects.
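The savings from weight sharing can be illustrated with back-of-the-envelope arithmetic. The layer sizes below (1,000 fully connected neurons, 64 conv filters) are illustrative assumptions, not taken from any specific network:

```python
# Rough parameter-count comparison for a 224x224x3 input
fc_inputs = 224 * 224 * 3            # 150,528 values per image
fc_neurons = 1000                    # a hypothetical fully connected layer
fc_params = fc_inputs * fc_neurons   # weights only, ignoring biases

conv_filters = 64                    # a hypothetical conv layer
conv_params = conv_filters * (3 * 3 * 3)  # 64 filters, each 3x3x3

print(f"Fully connected: {fc_params:,} weights")    # 150,528,000
print(f"Convolutional:   {conv_params:,} weights")  # 1,728
```

The conv layer uses roughly 87,000x fewer weights, because each 3x3x3 filter is reused at every spatial position.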
The Convolution Operation
A convolution slides a small filter (kernel) across the input image, computing dot products at each position. This produces a feature map highlighting where the filter pattern appears.
Key parameters:
- Filter size: Typically 3x3 or 5x5
- Stride: How many pixels the filter moves each step
- Padding: Zeros added around the border to control output size
- Number of filters: Each filter detects a different feature
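These parameters determine the output size along each spatial dimension via the standard formula (n - k + 2p) / s + 1. A small helper makes the effect of each parameter concrete:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output size along one spatial dimension: (n - k + 2p) // s + 1."""
    return (n - k + 2 * padding) // stride + 1

# A 3x3 filter with stride 1 and no padding shrinks each side by 2:
print(conv_output_size(28, 3))                        # 26
# Padding of 1 preserves the input size ("same" padding):
print(conv_output_size(28, 3, padding=1))             # 28
# Stride 2 roughly halves the output:
print(conv_output_size(28, 3, stride=2, padding=1))   # 14
```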
```python
import numpy as np

def convolve2d(image, kernel, stride=1, padding=0):
    """Manual 2D convolution (single channel).

    Note: like most deep learning libraries, this actually computes
    cross-correlation (the kernel is not flipped).
    """
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    H, W = image.shape
    kH, kW = kernel.shape
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            region = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            output[i, j] = np.sum(region * kernel)
    return output
```
```python
# Example: edge detection with a Sobel filter
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],  # horizontal edge starts here
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
], dtype=float)

# Sobel G_y kernel: responds to vertical intensity gradients,
# i.e. it detects horizontal edges
sobel_y = np.array([
    [-1, -2, -1],
    [ 0,  0,  0],
    [ 1,  2,  1],
])

feature_map = convolve2d(image, sobel_y, padding=1)
print("Feature map (edge detection):")
print(feature_map.astype(int))
```

Pooling Layers
Pooling layers reduce spatial dimensions, decreasing computation and providing translation invariance.
- Max Pooling: Takes the maximum value in each region - preserves the most prominent feature.
- Average Pooling: Takes the average - smoother, used in some architectures.
A 2x2 max pool with stride 2 halves both height and width.
```python
import numpy as np

def max_pool2d(feature_map, pool_size=2, stride=2):
    """2D max pooling."""
    H, W = feature_map.shape
    out_H = (H - pool_size) // stride + 1
    out_W = (W - pool_size) // stride + 1
    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            region = feature_map[i*stride:i*stride+pool_size,
                                 j*stride:j*stride+pool_size]
            output[i, j] = np.max(region)
    return output
```
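Average pooling differs only in the reduction applied to each region. A minimal variant of the same loop, with `np.mean` in place of `np.max`:

```python
import numpy as np

def avg_pool2d(feature_map, pool_size=2, stride=2):
    """2D average pooling: mean of each region instead of the max."""
    H, W = feature_map.shape
    out_H = (H - pool_size) // stride + 1
    out_W = (W - pool_size) // stride + 1
    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            region = feature_map[i*stride:i*stride+pool_size,
                                 j*stride:j*stride+pool_size]
            output[i, j] = np.mean(region)
    return output

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [3., 2., 8., 1.],
               [4., 1., 3., 7.]])
print(avg_pool2d(fm))  # [[3.75 2.25]
                       #  [2.5  4.75]]
```

Note how the averages are much less extreme than the max-pooled values on the same input below: average pooling blends a strong activation with its neighbors rather than keeping it intact.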
```python
feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [3, 2, 8, 1],
    [4, 1, 3, 7],
])

pooled = max_pool2d(feature_map, pool_size=2, stride=2)
print("Original shape:", feature_map.shape)  # (4, 4)
print("After max pool:", pooled.shape)       # (2, 2)
print(pooled)
# [[6. 4.]
#  [4. 8.]]
```

CNN Architecture
A typical CNN follows this pattern:
```
Input -> [Conv -> ReLU -> Pool] x N -> Flatten -> FC -> Softmax
```
Building a CNN with PyTorch:
```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extraction
        self.features = nn.Sequential(
            # Block 1: 1x28x28 -> 32x26x26 -> 32x13x13
            nn.Conv2d(1, 32, kernel_size=3, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Block 2: 32x13x13 -> 64x11x11 -> 64x5x5
            nn.Conv2d(32, 64, kernel_size=3, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Model summary
model = SimpleCNN(num_classes=10)
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")

# Test with a dummy input (batch of 4 grayscale 28x28 images)
x = torch.randn(4, 1, 28, 28)
output = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")  # torch.Size([4, 10])
```

Landmark CNN Architectures
The evolution of CNN architectures has driven dramatic improvements in image recognition:
- LeNet-5 (1998) - First successful CNN; 5 layers; used for digit recognition.
- AlexNet (2012) - 8 layers; won ImageNet with a 15.3% top-5 error; sparked the deep learning revolution.
- VGGNet (2014) - 16-19 layers; showed depth matters; used only 3x3 convolutions.
- GoogLeNet/Inception (2014) - Inception modules; 22 layers; efficient multi-scale features.
- ResNet (2015) - Residual connections; 152 layers; solved vanishing gradients.
- EfficientNet (2019) - Compound scaling of depth/width/resolution; state-of-the-art efficiency.
- Vision Transformer (ViT, 2020) - Applies Transformer architecture to image patches.
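ResNet's residual connection is simple enough to sketch directly. The block below is a simplified version of ResNet's "basic block" (channel counts are illustrative; real ResNets also use strided/projection shortcuts for downsampling):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified ResNet basic block (no downsampling)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                              # the "skip" path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # residual connection
        return self.relu(out)

block = ResidualBlock(16)
x = torch.randn(2, 16, 8, 8)
print(block(x).shape)  # torch.Size([2, 16, 8, 8])
```

Because the skip path is the identity, gradients flow through `out + identity` unattenuated even when the conv path's gradients are tiny - this is what lets 100+ layer networks train.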
Key Takeaways
- CNNs use local connectivity and weight sharing to efficiently process images.
- Convolution filters slide across the image to produce feature maps detecting patterns.
- Pooling layers reduce spatial dimensions and provide translation invariance.
- Deep CNNs learn hierarchical features: edges -> shapes -> objects.
- ResNet's residual connections solved the vanishing gradient problem for very deep networks.