AI · Advanced · 30 min read · Updated March 2025

Transformers & Attention

The Transformer architecture, introduced in "Attention Is All You Need" (2017), revolutionized AI. This article explains self-attention, multi-head attention, positional encoding, and how GPT and BERT are built on Transformers.

The Problem with RNNs

Before Transformers, Recurrent Neural Networks (RNNs) and LSTMs were the standard for sequence tasks. They had critical limitations:

- Sequential processing - tokens must be processed one at a time, so training cannot be parallelized and is slow on long sequences.
- Long-range dependencies - information from early tokens fades as the sequence grows.
- Vanishing gradients - gradients diminish when backpropagated through many time steps.

The Transformer architecture (Vaswani et al., 2017) solved all three problems with a single mechanism: attention.

The Attention Mechanism

Attention allows every token in a sequence to directly attend to every other token, regardless of distance.

For each token, attention computes three vectors:

1. Query (Q) - what am I looking for?
2. Key (K) - what do I contain?
3. Value (V) - what information do I provide?

Scaled dot-product attention combines these into an output for every token:

```
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
```

The sqrt(d_k) scaling prevents dot products from growing too large in high dimensions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (seq_len, d_k)
    K: (seq_len, d_k)
    V: (seq_len, d_v)
    """
    d_k = Q.shape[-1]

    # Compute attention scores
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)

    # Apply mask (for decoder self-attention)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)

    # Softmax to get attention weights (shifted for numerical stability)
    scores_shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores_shifted)
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Weighted sum of values
    output = attention_weights @ V
    return output, attention_weights

# Example: 4 tokens, d_k=8
np.random.seed(42)
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (each row sums to 1):")
print(weights.round(3))
print("\nOutput shape:", output.shape)  # (4, 8)
```

Multi-Head Attention

Instead of computing attention once, Multi-Head Attention runs h parallel attention operations ("heads"), each with different learned projections:

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
```

Different heads can attend to different aspects of the sequence simultaneously - one head might focus on syntax, another on semantics.

```python
import numpy as np

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Weight matrices (simplified - no bias)
        np.random.seed(42)
        self.W_Q = np.random.randn(d_model, d_model) * 0.1
        self.W_K = np.random.randn(d_model, d_model) * 0.1
        self.W_V = np.random.randn(d_model, d_model) * 0.1
        self.W_O = np.random.randn(d_model, d_model) * 0.1

    def split_heads(self, x, seq_len):
        """Reshape to (num_heads, seq_len, d_k)."""
        x = x.reshape(seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 0, 2)  # (num_heads, seq_len, d_k)

    def attention(self, Q, K, V):
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(self.d_k)
        scores_shifted = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores_shifted)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def forward(self, x):
        seq_len = x.shape[0]
        Q = self.split_heads(x @ self.W_Q, seq_len)
        K = self.split_heads(x @ self.W_K, seq_len)
        V = self.split_heads(x @ self.W_V, seq_len)

        attended = self.attention(Q, K, V)  # (num_heads, seq_len, d_k)
        # Concatenate heads
        concat = attended.transpose(1, 0, 2).reshape(seq_len, self.d_model)
        return concat @ self.W_O

# Test
mha = MultiHeadAttention(d_model=64, num_heads=8)
x = np.random.randn(10, 64)  # 10 tokens, 64-dim embeddings
output = mha.forward(x)
print(f"Input shape:  {x.shape}")   # (10, 64)
print(f"Output shape: {output.shape}")  # (10, 64)
```

Positional Encoding

Unlike RNNs, Transformers process all tokens in parallel and have no inherent sense of order. Positional encoding injects position information into token embeddings:

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

These sinusoidal functions allow the model to learn relative positions and generalize to sequence lengths not seen during training.
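As a sketch, the two formulas above translate directly into NumPy. The function name `positional_encoding` is illustrative, not from any library, and `d_model` is assumed to be even:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal encodings."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2), the "2i" index
    angles = pos / (10000 ** (i / d_model))  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=64)
print(pe.shape)  # (50, 64)
print(pe[0, :4].round(3))  # position 0: sin(0)=0, cos(0)=1 alternating
```

The resulting matrix is simply added to the token embeddings before the first encoder layer.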

The Full Transformer Architecture

The Transformer has an Encoder-Decoder structure:

Encoder (used in BERT):
- N stacked encoder layers
- Each layer: Multi-Head Self-Attention -> Add & Norm -> Feed-Forward -> Add & Norm
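The encoder layer above can be sketched in NumPy. This is a minimal illustration, not a full implementation: the attention sub-layer is passed in as a function (an identity stand-in is used below just to check shapes), and the LayerNorm is simplified, without learned scale and shift:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise two-layer MLP with ReLU."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn_fn, W1, b1, W2, b2):
    # Sub-layer 1: self-attention with residual connection, then LayerNorm
    x = layer_norm(x + attn_fn(x))
    # Sub-layer 2: feed-forward with residual connection, then LayerNorm
    x = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 64, 256, 10
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

out = encoder_layer(x, lambda h: h, W1, b1, W2, b2)  # identity "attention"
print(out.shape)  # (10, 64)
```

In a real encoder, `attn_fn` would be a multi-head self-attention module and N such layers would be stacked.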

Decoder (GPT uses a decoder-only variant):
- N stacked decoder layers
- Each layer: Masked Self-Attention -> Add & Norm -> Cross-Attention -> Add & Norm -> Feed-Forward -> Add & Norm
- Masked attention prevents attending to future tokens (autoregressive generation)
- Decoder-only models like GPT drop the cross-attention sub-layer, since there is no encoder to attend to
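The causal mask behind masked self-attention is easy to demonstrate: a lower-triangular mask is applied before the softmax, so each token's attention weights over future positions collapse to zero. A short sketch:

```python
import numpy as np

# Causal mask: token i may attend only to tokens j <= i.
seq_len = 4
mask = np.tril(np.ones((seq_len, seq_len)))  # 1 = allowed, 0 = blocked

# Apply to random scores, then softmax.
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))
scores = np.where(mask == 0, -1e9, scores)  # blocked positions -> -inf-like
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.round(3))  # lower-triangular rows, each summing to 1
```

Row 0 attends only to itself; row 3 attends to all four tokens. This is exactly the `mask` argument of the `scaled_dot_product_attention` function shown earlier.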

Key models built on Transformers:

  • BERT (2018) - Bidirectional encoder; pre-trained with masked language modeling; used for NLU tasks.
  • GPT series (2018-2024) - Decoder-only; autoregressive text generation; GPT-4's architecture is undisclosed (the ~1.8T parameter figure is an unofficial estimate).
  • T5 (2019) - Encoder-decoder; frames all NLP tasks as text-to-text.
  • Vision Transformer (ViT, 2020) - Applies Transformer to image patches.
  • Gemini, Claude, LLaMA - Modern LLMs all built on the Transformer foundation.

Key Takeaways

  • Transformers solve the sequential processing, long-range dependency, and vanishing gradient problems of RNNs.
  • Attention allows every token to directly attend to every other token, making it parallelizable.
  • Positional encoding injects order information into token embeddings since Transformers process them in parallel.
  • GPT (decoder-only) and BERT (encoder-only) are the two dominant Transformer variants.
  • The Transformer architecture is the foundation of modern large language models.
