Word Embeddings (Word2Vec)
Word embeddings represent words as dense numerical vectors that capture semantic meaning. This article covers the limitations of one-hot encoding, Word2Vec (CBOW and Skip-gram), GloVe, and FastText.
The Problem with One-Hot Encoding
The simplest way to represent words is one-hot encoding: a vector of zeros with a single 1 at the word's index.
Problems:
- Dimensionality: a vocabulary of 50,000 words requires 50,000-dimensional vectors.
- No semantic similarity: "cat" and "kitten" are as different as "cat" and "airplane".
- Sparsity: almost all values are zero, which is computationally wasteful.
Word embeddings solve this by mapping words to dense, low-dimensional vectors (typically 100-300 dimensions) where semantically similar words are close together.
```python
import numpy as np

# One-hot encoding (naive approach)
vocab = ["cat", "kitten", "dog", "airplane", "car"]
word_to_idx = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size):
    vec = np.zeros(vocab_size)
    vec[word_to_idx[word]] = 1
    return vec

cat_vec = one_hot("cat", len(vocab))
kitten_vec = one_hot("kitten", len(vocab))
airplane_vec = one_hot("airplane", len(vocab))

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

print("One-hot similarity:")
print(f"  cat vs kitten:   {cosine_similarity(cat_vec, kitten_vec):.3f}")   # 0.0
print(f"  cat vs airplane: {cosine_similarity(cat_vec, airplane_vec):.3f}") # 0.0
# Both are 0 - one-hot cannot capture semantic similarity!
```

Word2Vec: The Distributional Hypothesis
Word2Vec (Mikolov et al., 2013) is based on the distributional hypothesis: "words that appear in similar contexts have similar meanings."
Word2Vec trains a shallow neural network to predict context from words (or words from context), and the learned weight matrix becomes the word embeddings.
Two architectures:
1. CBOW (Continuous Bag of Words): predict the center word from the surrounding context words.
2. Skip-gram: predict the surrounding context words from the center word.
Skip-gram works better for rare words; CBOW is faster to train.
Skip-gram Training
In Skip-gram, given a center word, we predict words within a window:
```
Sentence: "The quick brown fox jumps"
Center: "brown", Window=2
Context pairs: (brown, The), (brown, quick), (brown, fox), (brown, jumps)
```
The network learns to maximize P(context | center) for real pairs and minimize it for random (negative) samples.
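The context-pair extraction sketched above can be written as a small helper. This is a minimal illustration (the function name `skipgram_pairs` is my own; the window is simply clipped at sentence boundaries):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs for Skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center, excluding itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "The quick brown fox jumps".split()
for center, context in skipgram_pairs(tokens):
    if center == "brown":
        print((center, context))
# ('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')
```

Each pair becomes one positive training example; the negative examples are drawn separately by sampling random words from the vocabulary.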
```python
import numpy as np

class Word2VecSkipGram:
    """Simplified Word2Vec Skip-gram with negative sampling."""

    def __init__(self, vocab_size, embedding_dim=10, lr=0.01):
        self.vocab_size = vocab_size
        self.dim = embedding_dim
        self.lr = lr
        # Input embeddings (center words)
        self.W_in = np.random.randn(vocab_size, embedding_dim) * 0.01
        # Output embeddings (context words)
        self.W_out = np.random.randn(vocab_size, embedding_dim) * 0.01

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def train_pair(self, center_idx, context_idx, neg_indices):
        """Train on one (center, context) pair with negative samples."""
        center_emb = self.W_in[center_idx]  # (dim,)
        # Positive sample: push sigmoid(score) toward 1
        ctx_emb = self.W_out[context_idx]
        score = np.dot(center_emb, ctx_emb)
        grad = self.sigmoid(score) - 1  # Gradient for positive pair
        self.W_in[center_idx] -= self.lr * grad * ctx_emb
        self.W_out[context_idx] -= self.lr * grad * center_emb
        # Negative samples: push sigmoid(score) toward 0
        for neg_idx in neg_indices:
            neg_emb = self.W_out[neg_idx]
            neg_score = np.dot(center_emb, neg_emb)
            neg_grad = self.sigmoid(neg_score)  # Gradient for negative pair
            self.W_in[center_idx] -= self.lr * neg_grad * neg_emb
            self.W_out[neg_idx] -= self.lr * neg_grad * center_emb

    def get_embedding(self, word_idx):
        return self.W_in[word_idx]

    def most_similar(self, word_idx, top_k=3):
        """Return the top_k most similar words by cosine similarity."""
        query = self.W_in[word_idx]
        sims = []
        for i in range(self.vocab_size):
            if i == word_idx:
                continue
            sim = np.dot(query, self.W_in[i]) / (
                np.linalg.norm(query) * np.linalg.norm(self.W_in[i]) + 1e-8
            )
            sims.append((i, sim))
        sims.sort(key=lambda pair: pair[1], reverse=True)
        return sims[:top_k]

# Quick demo
model = Word2VecSkipGram(vocab_size=10, embedding_dim=5)
# In practice, you'd train on millions of word pairs
print("Embedding for word 0:", model.get_embedding(0).round(3))
```

Famous Word Embedding Properties
Trained Word2Vec embeddings exhibit remarkable algebraic properties:
```
vector("King") - vector("Man") + vector("Woman")      ~= vector("Queen")
vector("Paris") - vector("France") + vector("Germany") ~= vector("Berlin")
```
This shows that embeddings capture semantic relationships as geometric directions in vector space.
```python
# Demonstrating word analogy with pre-trained embeddings
# (Using simplified mock embeddings for illustration)
import numpy as np

# Mock 3D embeddings (in reality these are 100-300 dimensional)
embeddings = {
    "king":    np.array([0.9, 0.1, 0.8]),
    "queen":   np.array([0.9, 0.9, 0.8]),
    "man":     np.array([0.1, 0.1, 0.8]),
    "woman":   np.array([0.1, 0.9, 0.8]),
    "paris":   np.array([0.8, 0.5, 0.1]),
    "france":  np.array([0.7, 0.4, 0.1]),
    "berlin":  np.array([0.8, 0.5, 0.9]),
    "germany": np.array([0.7, 0.4, 0.9]),
}

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, embeddings):
    """a is to b as c is to ?"""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -1
    for word, vec in embeddings.items():
        if word in [a, b, c]:
            continue
        sim = cosine_sim(target, vec)
        if sim > best_sim:
            best_sim, best_word = sim, word
    return best_word, best_sim

word, sim = analogy("man", "king", "woman", embeddings)
print(f"man:king :: woman:{word} (similarity: {sim:.3f})")
# Output: man:king :: woman:queen

word, sim = analogy("france", "paris", "germany", embeddings)
print(f"france:paris :: germany:{word} (similarity: {sim:.3f})")
# Output: france:paris :: germany:berlin
```

GloVe and FastText
GloVe (Global Vectors, Stanford 2014):
- Trains on global word co-occurrence statistics from the entire corpus.
- Combines the benefits of matrix factorization (LSA) and Word2Vec.
- Often slightly outperforms Word2Vec on analogy tasks.
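The global statistics GloVe starts from can be sketched as a co-occurrence counting pass. This is an illustrative simplification (the function name is my own, and real GloVe then fits embeddings to these counts with a weighted least-squares objective rather than using them directly):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count weighted word co-occurrences within a context window."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                # Closer context words count more; 1/distance weighting
                counts[(word, tokens[j])] += 1.0 / abs(i - j)
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens)
print(counts[("cat", "sat")])  # adjacent pair -> 1.0
```

One counting pass over the corpus suffices, which is what makes the statistics "global": every sentence contributes to the same matrix before any training happens.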
FastText (Facebook 2016):
- Represents words as bags of character n-grams.
- Can generate embeddings for out-of-vocabulary words.
- Better for morphologically rich languages and rare words.
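Character n-gram extraction in the FastText style can be sketched like this (FastText pads each word with "<" and ">" boundary markers; the n-gram size range here is an illustrative choice):

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams with FastText-style boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

print(char_ngrams("cat", n_min=3, n_max=3))
# ['<ca', 'cat', 'at>']
```

A word's vector is then the sum (or average) of its n-gram vectors, which is how FastText can embed a word it never saw during training.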
Modern alternatives: BERT, GPT, and other contextual embeddings have largely replaced static embeddings for most tasks, as they produce different vectors for the same word in different contexts.
Key Takeaways
- One-hot encoding is sparse and cannot capture semantic similarity between words.
- Word2Vec trains a neural network to predict context, producing dense semantic vectors.
- Skip-gram predicts context from center word; CBOW predicts center from context.
- Trained embeddings capture analogies: king - man + woman ~= queen.
- GloVe uses global co-occurrence statistics; FastText handles out-of-vocabulary words.