Text Preprocessing
Text preprocessing converts raw text into a clean, structured format suitable for NLP models. This article covers tokenization, lowercasing, stop word removal, stemming, and lemmatization.
Why Preprocess Text?
Raw text is messy - it contains punctuation, HTML tags, inconsistent capitalization, abbreviations, and irrelevant words. NLP models need clean, normalized input to learn meaningful patterns.
Text preprocessing is typically the first step in any NLP pipeline:
```
Raw Text -> Clean -> Tokenize -> Normalize -> Feature Extraction -> Model
```
Lowercasing and Punctuation Removal
The simplest preprocessing steps normalize text to a consistent format:
```python
import re
import string

def basic_clean(text: str) -> str:
    """Lowercase and remove punctuation."""
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Remove HTML tags
def remove_html(text: str) -> str:
    return re.sub(r'<[^>]+>', '', text)

# Remove URLs
def remove_urls(text: str) -> str:
    return re.sub(r'http\S+|www\.\S+', '', text)

# Remove numbers (optional)
def remove_numbers(text: str) -> str:
    return re.sub(r'\d+', '', text)

# Example
raw = "<p>Hello, World! Visit https://example.com for 100% FREE tips!</p>"
cleaned = remove_html(raw)
cleaned = remove_urls(cleaned)
cleaned = basic_clean(cleaned)
print(cleaned)
# Output: "hello world visit for 100 free tips"
```
Tokenization
Tokenization splits text into individual units (tokens) - usually words or subwords.
- Word tokenization: split on whitespace and punctuation.
- Sentence tokenization: split text into sentences.
- Subword tokenization: used by modern LLMs (BPE, WordPiece); handles unknown words by splitting them into known subword units.
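The subword idea can be sketched as a greedy longest-match over a vocabulary. The vocabulary below is hand-picked purely for illustration; real BPE/WordPiece vocabularies are learned from a corpus, and WordPiece additionally marks word-internal pieces with '##':

```python
# Toy subword tokenizer: at each position, greedily take the longest
# vocabulary entry (WordPiece-style longest-match-first).
VOCAB = {'un', 'break', 'able', 'token', 'ization'}

def subword_tokenize(word: str, vocab=VOCAB) -> list:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ['[UNK]']  # no vocabulary entry covers this position
        pieces.append(word[start:end])
        start = end
    return pieces

print(subword_tokenize('unbreakable'))   # ['un', 'break', 'able']
print(subword_tokenize('tokenization'))  # ['token', 'ization']
```

Because every piece must be in the vocabulary, unseen words degrade gracefully into known fragments instead of becoming out-of-vocabulary tokens.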
```python
import re

# Word tokenization (simple)
def word_tokenize(text: str) -> list:
    return text.lower().split()

# Better tokenization with regex
def regex_tokenize(text: str) -> list:
    # Keep words and contractions, drop punctuation and digits
    return re.findall(r"\b[a-zA-Z']+\b", text.lower())

text = "I can't believe it's already 2025! AI is amazing."
print("Simple split:", word_tokenize(text))
print("Regex tokens:", regex_tokenize(text))

# Using NLTK (if available)
try:
    import nltk
    nltk.download('punkt', quiet=True)
    from nltk.tokenize import word_tokenize as nltk_tokenize, sent_tokenize
    print("\nNLTK word tokens:", nltk_tokenize(text))
    paragraph = ("AI is transforming the world. Neural networks power "
                 "modern AI. The future is exciting.")
    print("Sentences:", sent_tokenize(paragraph))
except ImportError:
    print("\nNLTK not installed - using regex tokenizer")
```
Stop Word Removal
Stop words are common words (the, is, at, which) that carry little semantic meaning. Removing them reduces noise and vocabulary size.
```python
# Common English stop words
STOP_WORDS = {
    'a', 'an', 'the', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
    'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
    'should', 'may', 'might', 'shall', 'can', 'need', 'dare', 'ought',
    'i', 'me', 'my', 'we', 'our', 'you', 'your', 'he', 'she', 'it',
    'they', 'them', 'their', 'this', 'that', 'these', 'those',
    'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'from', 'about',
    'and', 'or', 'but', 'not', 'so', 'yet', 'both', 'either', 'neither',
}

def remove_stop_words(tokens: list) -> list:
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ['the', 'neural', 'network', 'is', 'learning', 'from', 'data']
filtered = remove_stop_words(tokens)
print("Before:", tokens)
print("After: ", filtered)
# After: ['neural', 'network', 'learning', 'data']
```
Stemming and Lemmatization
Both techniques reduce words to their base form, but differ in approach:
- Stemming: crude rule-based truncation; fast but can produce non-words ("running" -> "run", "studies" -> "studi", "better" -> "better").
- Lemmatization: uses a vocabulary and morphological analysis; slower but accurate ("running" -> "run", "studies" -> "study", "better" -> "good").
```python
# Simple Porter-style stemmer (rule-based)
def simple_stem(word: str) -> str:
    """Simplified stemming rules."""
    word = word.lower()
    # Strip the first matching suffix, keeping a stem of at least 3 letters
    for suffix in ['ing', 'tion', 'ness', 'ment', 'er', 'ed', 'ly', 'es', 's']:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['running', 'studies', 'happiness', 'quickly', 'played', 'cats']
stemmed = [simple_stem(w) for w in words]
print("Original:", words)
print("Stemmed: ", stemmed)

# With NLTK (more accurate)
try:
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    nltk.download('wordnet', quiet=True)
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    test_words = ['running', 'studies', 'better', 'geese', 'went']
    print("\nNLTK stemming:     ", [stemmer.stem(w) for w in test_words])
    print("NLTK lemmatization:", [lemmatizer.lemmatize(w) for w in test_words])
except ImportError:
    print("\nNLTK not installed")
```
Complete Preprocessing Pipeline
Putting it all together into a reusable pipeline:
```python
import re

STOP_WORDS = {
    'a', 'an', 'the', 'is', 'are', 'was', 'in', 'on', 'at', 'to',
    'for', 'of', 'and', 'or', 'but', 'not', 'it', 'this', 'that',
}

def preprocess(text: str, remove_stops=True) -> list:
    """Full NLP preprocessing pipeline."""
    # 1. Remove HTML and URLs
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'http\S+', '', text)
    # 2. Lowercase
    text = text.lower()
    # 3. Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # 4. Tokenize
    tokens = text.split()
    # 5. Remove stop words
    if remove_stops:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # 6. Remove short tokens
    tokens = [t for t in tokens if len(t) > 2]
    return tokens

# Test
text = "<p>The quick brown fox jumps over the lazy dog! Visit https://example.com</p>"
result = preprocess(text)
print("Processed tokens:", result)
# Output: ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', 'visit']
```
Key Takeaways
- Text preprocessing converts raw text to clean, normalized tokens for NLP models.
- Lowercasing, punctuation removal, and URL stripping are basic cleaning steps.
- Tokenization splits text into words or subwords - the unit of NLP processing.
- Stop word removal eliminates common words with little semantic value.
- Stemming is fast but crude; lemmatization is slower but produces real words.
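The normalized tokens a pipeline like the one above produces are exactly what the feature-extraction stage of the diagram at the top consumes. A minimal bag-of-words sketch (the toy documents here are assumed for illustration):

```python
from collections import Counter

# Minimal bag-of-words: map each document's tokens to term counts
# over a shared, sorted vocabulary.
docs = [
    ['neural', 'network', 'learning', 'data'],
    ['network', 'data', 'data', 'model'],
]

# Shared vocabulary across the corpus
vocab = sorted({t for doc in docs for t in doc})

def bag_of_words(tokens: list, vocab: list) -> list:
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

print(vocab)
for doc in docs:
    print(bag_of_words(doc, vocab))
```

Each document becomes a fixed-length count vector, which is the simplest numeric representation a downstream model can train on.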