Word Embeddings (Word2Vec)
Word embeddings represent words as dense numerical vectors that capture semantic meaning. This article covers the limitations of one-hot encoding, Word2Vec (CBOW and Skip-gram), GloVe, and FastText.
The Problem with One-Hot Encoding
The simplest way to represent words is one-hot encoding: a vector of zeros with a single 1 at the word's index.
Problems:
- Dimensionality: a vocabulary of 50,000 words requires 50,000-dimensional vectors.
- No semantic similarity: "cat" and "kitten" are as different as "cat" and "airplane".
- Sparsity: almost all values are zero, which is computationally wasteful.
Word embeddings solve this by mapping words to dense, low-dimensional vectors (typically 100-300 dimensions) where semantically similar words are close together.
```python
import numpy as np

# One-hot encoding (naive approach)
vocab = ["cat", "kitten", "dog", "airplane", "car"]
word_to_idx = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size):
    vec = np.zeros(vocab_size)
    vec[word_to_idx[word]] = 1
    return vec

cat_vec = one_hot("cat", len(vocab))
kitten_vec = one_hot("kitten", len(vocab))
airplane_vec = one_hot("airplane", len(vocab))

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

print("One-hot similarity:")
print(f"  cat vs kitten:   {cosine_similarity(cat_vec, kitten_vec):.3f}")   # 0.0
print(f"  cat vs airplane: {cosine_similarity(cat_vec, airplane_vec):.3f}") # 0.0
# Both are 0 - one-hot cannot capture semantic similarity!
```

Word2Vec: The Distributional Hypothesis
Word2Vec (Mikolov et al., 2013) is based on the distributional hypothesis: "words that appear in similar contexts have similar meanings."
Word2Vec trains a shallow neural network to predict context from words (or words from context), and the learned weight matrix becomes the word embeddings.
Two architectures:
1. CBOW (Continuous Bag of Words): predict the center word from the surrounding context words.
2. Skip-gram: predict the surrounding context words from the center word.
Skip-gram works better for rare words; CBOW is faster to train.
Skip-gram Training
In Skip-gram, given a center word, we predict words within a window:
```
Sentence: "The quick brown fox jumps"
Center: "brown", Window=2
Context pairs: (brown, The), (brown, quick), (brown, fox), (brown, jumps)
```
The network learns to maximize P(context | center) for real pairs and minimize it for random (negative) samples.
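The context-pair extraction sketched above can be written as a small helper. This is a minimal illustration (the function name `skipgram_pairs` is my own; the window is simply clipped at sentence boundaries):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs for Skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center, excluding itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "The quick brown fox jumps".split()
for center, context in skipgram_pairs(tokens):
    if center == "brown":
        print((center, context))
# ('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')
```

Each pair becomes one positive training example; the negative examples are drawn separately by sampling random words from the vocabulary.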
```python
import numpy as np

class Word2VecSkipGram:
    """Simplified Word2Vec Skip-gram with negative sampling."""

    def __init__(self, vocab_size, embedding_dim=10, lr=0.01):
        self.vocab_size = vocab_size
        self.dim = embedding_dim
        self.lr = lr
        # Input embeddings (center words)
        self.W_in = np.random.randn(vocab_size, embedding_dim) * 0.01
        # Output embeddings (context words)
        self.W_out = np.random.randn(vocab_size, embedding_dim) * 0.01

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def train_pair(self, center_idx, context_idx, neg_indices):
        """Train on one (center, context) pair with negative samples."""
        center_emb = self.W_in[center_idx]  # (dim,)
        # Positive sample: push sigmoid(score) toward 1
        ctx_emb = self.W_out[context_idx]
        score = np.dot(center_emb, ctx_emb)
        grad = self.sigmoid(score) - 1  # Gradient for positive pair
        self.W_in[center_idx] -= self.lr * grad * ctx_emb
        self.W_out[context_idx] -= self.lr * grad * center_emb
        # Negative samples: push sigmoid(score) toward 0
        for neg_idx in neg_indices:
            neg_emb = self.W_out[neg_idx]
            neg_score = np.dot(center_emb, neg_emb)
            neg_grad = self.sigmoid(neg_score)  # Gradient for negative pair
            self.W_in[center_idx] -= self.lr * neg_grad * neg_emb
            self.W_out[neg_idx] -= self.lr * neg_grad * center_emb

    def get_embedding(self, word_idx):
        return self.W_in[word_idx]

    def most_similar(self, word_idx, top_k=3):
        """Return the top_k most similar words by cosine similarity."""
        query = self.W_in[word_idx]
        sims = []
        for i in range(self.vocab_size):
            if i == word_idx:
                continue
            sim = np.dot(query, self.W_in[i]) / (
                np.linalg.norm(query) * np.linalg.norm(self.W_in[i]) + 1e-8
            )
            sims.append((i, sim))
        sims.sort(key=lambda pair: pair[1], reverse=True)
        return sims[:top_k]

# Quick demo
model = Word2VecSkipGram(vocab_size=10, embedding_dim=5)
# In practice, you'd train on millions of word pairs
print("Embedding for word 0:", model.get_embedding(0).round(3))
```

Famous Word Embedding Properties
Trained Word2Vec embeddings exhibit remarkable algebraic properties:
```
vector("King") - vector("Man") + vector("Woman")      ~= vector("Queen")
vector("Paris") - vector("France") + vector("Germany") ~= vector("Berlin")
```
This shows that embeddings capture semantic relationships as geometric directions in vector space.
```python
# Demonstrating word analogy with pre-trained embeddings
# (Using simplified mock embeddings for illustration)
import numpy as np

# Mock 3D embeddings (in reality these are 100-300 dimensional)
embeddings = {
    "king":    np.array([0.9, 0.1, 0.8]),
    "queen":   np.array([0.9, 0.9, 0.8]),
    "man":     np.array([0.1, 0.1, 0.8]),
    "woman":   np.array([0.1, 0.9, 0.8]),
    "paris":   np.array([0.8, 0.5, 0.1]),
    "france":  np.array([0.7, 0.4, 0.1]),
    "berlin":  np.array([0.8, 0.5, 0.9]),
    "germany": np.array([0.7, 0.4, 0.9]),
}

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, embeddings):
    """a is to b as c is to ?"""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -1
    for word, vec in embeddings.items():
        if word in [a, b, c]:
            continue
        sim = cosine_sim(target, vec)
        if sim > best_sim:
            best_sim, best_word = sim, word
    return best_word, best_sim

word, sim = analogy("man", "king", "woman", embeddings)
print(f"man:king :: woman:{word} (similarity: {sim:.3f})")
# Output: man:king :: woman:queen

word, sim = analogy("france", "paris", "germany", embeddings)
print(f"france:paris :: germany:{word} (similarity: {sim:.3f})")
# Output: france:paris :: germany:berlin
```

GloVe and FastText
GloVe (Global Vectors, Stanford 2014):
- Trains on global word co-occurrence statistics from the entire corpus.
- Combines the benefits of matrix factorization (LSA) and Word2Vec.
- Often slightly outperforms Word2Vec on analogy tasks.
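The global statistics GloVe starts from can be sketched as a co-occurrence counting pass. This is an illustrative simplification (the function name is my own, and real GloVe then fits embeddings to these counts with a weighted least-squares objective rather than using them directly):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count weighted word co-occurrences within a context window."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                # Closer context words count more; 1/distance weighting
                counts[(word, tokens[j])] += 1.0 / abs(i - j)
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens)
print(counts[("cat", "sat")])  # adjacent pair -> 1.0
```

One counting pass over the corpus suffices, which is what makes the statistics "global": every sentence contributes to the same matrix before any training happens.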
FastText (Facebook 2016):
- Represents words as bags of character n-grams.
- Can generate embeddings for out-of-vocabulary words.
- Better for morphologically rich languages and rare words.
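Character n-gram extraction in the FastText style can be sketched like this (FastText pads each word with "<" and ">" boundary markers; the n-gram size range here is an illustrative choice):

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams with FastText-style boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

print(char_ngrams("cat", n_min=3, n_max=3))
# ['<ca', 'cat', 'at>']
```

A word's vector is then the sum (or average) of its n-gram vectors, which is how FastText can embed a word it never saw during training.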
Modern alternatives: BERT, GPT, and other contextual embeddings have largely replaced static embeddings for most tasks, as they produce different vectors for the same word in different contexts.
Key Takeaways
- One-hot encoding is sparse and cannot capture semantic similarity between words.
- Word2Vec trains a neural network to predict context, producing dense semantic vectors.
- Skip-gram predicts context from center word; CBOW predicts center from context.
- Trained embeddings capture analogies: king - man + woman ~= queen.
- GloVe uses global co-occurrence statistics; FastText handles out-of-vocabulary words.