Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are the dominant architecture for image recognition. This article covers convolution operations, pooling, feature maps, and landmark CNN architectures.
Why CNNs for Images?
A 224x224 RGB image contains 224 * 224 * 3 = 150,528 input values. If we fed it to a fully connected layer, every neuron would need 150,528 weights - so even a modest layer has millions of parameters and overfits severely.
CNNs solve this with three key ideas:
1. Local connectivity - Each neuron connects only to a small region (its receptive field).
2. Weight sharing - The same filter is applied across the entire image.
3. Hierarchical features - Early layers detect edges; deeper layers detect shapes and objects.
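The savings from weight sharing can be illustrated with back-of-the-envelope arithmetic. The layer sizes below (1,000 fully connected neurons, 64 conv filters) are illustrative assumptions, not taken from any specific network:

```python
# Rough parameter-count comparison for a 224x224x3 input
fc_inputs = 224 * 224 * 3            # 150,528 values per image
fc_neurons = 1000                    # a hypothetical fully connected layer
fc_params = fc_inputs * fc_neurons   # weights only, ignoring biases

conv_filters = 64                    # a hypothetical conv layer
conv_params = conv_filters * (3 * 3 * 3)  # 64 filters, each 3x3x3

print(f"Fully connected: {fc_params:,} weights")    # 150,528,000
print(f"Convolutional:   {conv_params:,} weights")  # 1,728
```

The conv layer uses roughly 87,000x fewer weights, because each 3x3x3 filter is reused at every spatial position.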
The Convolution Operation
A convolution slides a small filter (kernel) across the input image, computing dot products at each position. This produces a feature map highlighting where the filter pattern appears.
Key parameters:
- Filter size: Typically 3x3 or 5x5
- Stride: How many pixels the filter moves each step
- Padding: Zeros added around the border to control output size
- Number of filters: Each filter detects a different feature
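These parameters determine the output size along each spatial dimension via the standard formula (n - k + 2p) / s + 1. A small helper makes the effect of each parameter concrete:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output size along one spatial dimension: (n - k + 2p) // s + 1."""
    return (n - k + 2 * padding) // stride + 1

# A 3x3 filter with stride 1 and no padding shrinks each side by 2:
print(conv_output_size(28, 3))                        # 26
# Padding of 1 preserves the input size ("same" padding):
print(conv_output_size(28, 3, padding=1))             # 28
# Stride 2 roughly halves the output:
print(conv_output_size(28, 3, stride=2, padding=1))   # 14
```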
```python
import numpy as np

def convolve2d(image, kernel, stride=1, padding=0):
    """Manual 2D convolution (single channel).

    Note: like most deep learning libraries, this actually computes
    cross-correlation (the kernel is not flipped).
    """
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    H, W = image.shape
    kH, kW = kernel.shape
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            region = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            output[i, j] = np.sum(region * kernel)
    return output
```
```python
# Example: edge detection with a Sobel filter
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],  # horizontal edge starts here
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
], dtype=float)

# Sobel G_y kernel: responds to vertical intensity gradients,
# i.e. it detects horizontal edges
sobel_y = np.array([
    [-1, -2, -1],
    [ 0,  0,  0],
    [ 1,  2,  1],
])

feature_map = convolve2d(image, sobel_y, padding=1)
print("Feature map (edge detection):")
print(feature_map.astype(int))
```

Pooling Layers
Pooling layers reduce spatial dimensions, decreasing computation and providing translation invariance.
- Max Pooling: Takes the maximum value in each region - preserves the most prominent feature.
- Average Pooling: Takes the average - smoother, used in some architectures.
A 2x2 max pool with stride 2 halves both height and width.
```python
import numpy as np

def max_pool2d(feature_map, pool_size=2, stride=2):
    """2D max pooling."""
    H, W = feature_map.shape
    out_H = (H - pool_size) // stride + 1
    out_W = (W - pool_size) // stride + 1
    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            region = feature_map[i*stride:i*stride+pool_size,
                                 j*stride:j*stride+pool_size]
            output[i, j] = np.max(region)
    return output
```
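Average pooling differs only in the reduction applied to each region. A minimal variant of the same loop, with `np.mean` in place of `np.max`:

```python
import numpy as np

def avg_pool2d(feature_map, pool_size=2, stride=2):
    """2D average pooling: mean of each region instead of the max."""
    H, W = feature_map.shape
    out_H = (H - pool_size) // stride + 1
    out_W = (W - pool_size) // stride + 1
    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            region = feature_map[i*stride:i*stride+pool_size,
                                 j*stride:j*stride+pool_size]
            output[i, j] = np.mean(region)
    return output

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [3., 2., 8., 1.],
               [4., 1., 3., 7.]])
print(avg_pool2d(fm))  # [[3.75 2.25]
                       #  [2.5  4.75]]
```

Note how the averages are much less extreme than the max-pooled values on the same input below: average pooling blends a strong activation with its neighbors rather than keeping it intact.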
```python
feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [3, 2, 8, 1],
    [4, 1, 3, 7],
])

pooled = max_pool2d(feature_map, pool_size=2, stride=2)
print("Original shape:", feature_map.shape)  # (4, 4)
print("After max pool:", pooled.shape)       # (2, 2)
print(pooled)
# [[6. 4.]
#  [4. 8.]]
```

CNN Architecture
A typical CNN follows this pattern:
```
Input -> [Conv -> ReLU -> Pool] x N -> Flatten -> FC -> Softmax
```
Building a CNN with PyTorch:
```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extraction
        self.features = nn.Sequential(
            # Block 1: 1x28x28 -> 32x26x26 -> 32x13x13
            nn.Conv2d(1, 32, kernel_size=3, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Block 2: 32x13x13 -> 64x11x11 -> 64x5x5
            nn.Conv2d(32, 64, kernel_size=3, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Model summary
model = SimpleCNN(num_classes=10)
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")

# Test with a dummy input (batch of 4 grayscale 28x28 images)
x = torch.randn(4, 1, 28, 28)
output = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")  # torch.Size([4, 10])
```

Landmark CNN Architectures
The evolution of CNN architectures has driven dramatic improvements in image recognition:
- LeNet-5 (1998) - First successful CNN; 5 layers; used for digit recognition.
- AlexNet (2012) - 8 layers; won ImageNet with a 15.3% top-5 error; sparked the deep learning revolution.
- VGGNet (2014) - 16-19 layers; showed depth matters; used only 3x3 convolutions.
- GoogLeNet/Inception (2014) - Inception modules; 22 layers; efficient multi-scale features.
- ResNet (2015) - Residual connections; 152 layers; solved vanishing gradients.
- EfficientNet (2019) - Compound scaling of depth/width/resolution; state-of-the-art efficiency.
- Vision Transformer (ViT, 2020) - Applies Transformer architecture to image patches.
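ResNet's residual connection is simple enough to sketch directly. The block below is a simplified version of ResNet's "basic block" (channel counts are illustrative; real ResNets also use strided/projection shortcuts for downsampling):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified ResNet basic block (no downsampling)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                              # the "skip" path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # residual connection
        return self.relu(out)

block = ResidualBlock(16)
x = torch.randn(2, 16, 8, 8)
print(block(x).shape)  # torch.Size([2, 16, 8, 8])
```

Because the skip path is the identity, gradients flow through `out + identity` unattenuated even when the conv path's gradients are tiny - this is what lets 100+ layer networks train.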
Key Takeaways
- CNNs use local connectivity and weight sharing to efficiently process images.
- Convolution filters slide across the image to produce feature maps detecting patterns.
- Pooling layers reduce spatial dimensions and provide translation invariance.
- Deep CNNs learn hierarchical features: edges -> shapes -> objects.
- ResNet's residual connections solved the vanishing gradient problem for very deep networks.