Text Preprocessing
Text preprocessing converts raw text into a clean, structured format suitable for NLP models. This article covers tokenization, lowercasing, stop word removal, stemming, and lemmatization.
Why Preprocess Text?
Raw text is messy - it contains punctuation, HTML tags, inconsistent capitalization, abbreviations, and irrelevant words. NLP models need clean, normalized input to learn meaningful patterns.
Text preprocessing is typically the first step in any NLP pipeline:
```
Raw Text -> Clean -> Tokenize -> Normalize -> Feature Extraction -> Model
```
Lowercasing and Punctuation Removal
The simplest preprocessing steps normalize text to a consistent format:
```python
import re
import string

def basic_clean(text: str) -> str:
    """Lowercase and remove punctuation."""
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Remove HTML tags
def remove_html(text: str) -> str:
    return re.sub(r'<[^>]+>', '', text)

# Remove URLs
def remove_urls(text: str) -> str:
    return re.sub(r'http\S+|www\.\S+', '', text)

# Remove numbers (optional)
def remove_numbers(text: str) -> str:
    return re.sub(r'\d+', '', text)

# Example
raw = "<p>Hello, World! Visit https://example.com for 100% FREE tips!</p>"
cleaned = remove_html(raw)
cleaned = remove_urls(cleaned)
cleaned = basic_clean(cleaned)
print(cleaned)
# Output: "hello world visit for 100 free tips"
```
Tokenization
Tokenization splits text into individual units (tokens) - usually words or subwords.
- Word tokenization: split on whitespace and punctuation.
- Sentence tokenization: split text into sentences.
- Subword tokenization: used by modern LLMs (BPE, WordPiece); handles unknown words by splitting them into known subword units.
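The subword idea can be sketched as a greedy longest-match over a vocabulary. The vocabulary below is hand-picked purely for illustration; real BPE/WordPiece vocabularies are learned from a corpus, and WordPiece additionally marks word-internal pieces with '##':

```python
# Toy subword tokenizer: at each position, greedily take the longest
# vocabulary entry (WordPiece-style longest-match-first).
VOCAB = {'un', 'break', 'able', 'token', 'ization'}

def subword_tokenize(word: str, vocab=VOCAB) -> list:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ['[UNK]']  # no vocabulary entry covers this position
        pieces.append(word[start:end])
        start = end
    return pieces

print(subword_tokenize('unbreakable'))   # ['un', 'break', 'able']
print(subword_tokenize('tokenization'))  # ['token', 'ization']
```

Because every piece must be in the vocabulary, unseen words degrade gracefully into known fragments instead of becoming out-of-vocabulary tokens.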
```python
import re

# Word tokenization (simple)
def word_tokenize(text: str) -> list:
    return text.lower().split()

# Better tokenization with regex
def regex_tokenize(text: str) -> list:
    # Keep words and contractions, drop punctuation and digits
    return re.findall(r"\b[a-zA-Z']+\b", text.lower())

text = "I can't believe it's already 2025! AI is amazing."
print("Simple split:", word_tokenize(text))
print("Regex tokens:", regex_tokenize(text))

# Using NLTK (if available)
try:
    import nltk
    nltk.download('punkt', quiet=True)
    from nltk.tokenize import word_tokenize as nltk_tokenize, sent_tokenize
    print("\nNLTK word tokens:", nltk_tokenize(text))
    paragraph = ("AI is transforming the world. Neural networks power "
                 "modern AI. The future is exciting.")
    print("Sentences:", sent_tokenize(paragraph))
except ImportError:
    print("\nNLTK not installed - using regex tokenizer")
```
Stop Word Removal
Stop words are common words (the, is, at, which) that carry little semantic meaning. Removing them reduces noise and vocabulary size.
```python
# Common English stop words
STOP_WORDS = {
    'a', 'an', 'the', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
    'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
    'should', 'may', 'might', 'shall', 'can', 'need', 'dare', 'ought',
    'i', 'me', 'my', 'we', 'our', 'you', 'your', 'he', 'she', 'it',
    'they', 'them', 'their', 'this', 'that', 'these', 'those',
    'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'from', 'about',
    'and', 'or', 'but', 'not', 'so', 'yet', 'both', 'either', 'neither',
}

def remove_stop_words(tokens: list) -> list:
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ['the', 'neural', 'network', 'is', 'learning', 'from', 'data']
filtered = remove_stop_words(tokens)
print("Before:", tokens)
print("After: ", filtered)
# After: ['neural', 'network', 'learning', 'data']
```
Stemming and Lemmatization
Both techniques reduce words to their base form, but differ in approach:
- Stemming: crude rule-based truncation; fast but can produce non-words ("running" -> "run", "studies" -> "studi", "better" -> "better").
- Lemmatization: uses a vocabulary and morphological analysis; slower but accurate ("running" -> "run", "studies" -> "study", "better" -> "good").
```python
# Simple Porter-style stemmer (rule-based)
def simple_stem(word: str) -> str:
    """Simplified stemming rules."""
    word = word.lower()
    # Strip the first matching suffix, keeping a stem of at least 3 letters
    for suffix in ['ing', 'tion', 'ness', 'ment', 'er', 'ed', 'ly', 'es', 's']:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['running', 'studies', 'happiness', 'quickly', 'played', 'cats']
stemmed = [simple_stem(w) for w in words]
print("Original:", words)
print("Stemmed: ", stemmed)

# With NLTK (more accurate)
try:
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    nltk.download('wordnet', quiet=True)
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    test_words = ['running', 'studies', 'better', 'geese', 'went']
    print("\nNLTK stemming:     ", [stemmer.stem(w) for w in test_words])
    print("NLTK lemmatization:", [lemmatizer.lemmatize(w) for w in test_words])
except ImportError:
    print("\nNLTK not installed")
```
Complete Preprocessing Pipeline
Putting it all together into a reusable pipeline:
```python
import re

STOP_WORDS = {
    'a', 'an', 'the', 'is', 'are', 'was', 'in', 'on', 'at', 'to',
    'for', 'of', 'and', 'or', 'but', 'not', 'it', 'this', 'that',
}

def preprocess(text: str, remove_stops=True) -> list:
    """Full NLP preprocessing pipeline."""
    # 1. Remove HTML and URLs
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'http\S+', '', text)
    # 2. Lowercase
    text = text.lower()
    # 3. Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # 4. Tokenize
    tokens = text.split()
    # 5. Remove stop words
    if remove_stops:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # 6. Remove short tokens
    tokens = [t for t in tokens if len(t) > 2]
    return tokens

# Test
text = "<p>The quick brown fox jumps over the lazy dog! Visit https://example.com</p>"
result = preprocess(text)
print("Processed tokens:", result)
# Output: ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', 'visit']
```
Key Takeaways
- Text preprocessing converts raw text to clean, normalized tokens for NLP models.
- Lowercasing, punctuation removal, and URL stripping are basic cleaning steps.
- Tokenization splits text into words or subwords - the unit of NLP processing.
- Stop word removal eliminates common words with little semantic value.
- Stemming is fast but crude; lemmatization is slower but produces real words.
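The normalized tokens a pipeline like the one above produces are exactly what the feature-extraction stage of the diagram at the top consumes. A minimal bag-of-words sketch (the toy documents here are assumed for illustration):

```python
from collections import Counter

# Minimal bag-of-words: map each document's tokens to term counts
# over a shared, sorted vocabulary.
docs = [
    ['neural', 'network', 'learning', 'data'],
    ['network', 'data', 'data', 'model'],
]

# Shared vocabulary across the corpus
vocab = sorted({t for doc in docs for t in doc})

def bag_of_words(tokens: list, vocab: list) -> list:
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

print(vocab)
for doc in docs:
    print(bag_of_words(doc, vocab))
```

Each document becomes a fixed-length count vector, which is the simplest numeric representation a downstream model can train on.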