Machine Learning · Intermediate · 20 min read · Updated March 2025

Decision Trees & Random Forests

Decision Trees split data using feature thresholds to make predictions. Random Forests combine hundreds of trees via bagging and feature randomness to create a powerful, robust ensemble that reduces overfitting.

How Decision Trees Work

A Decision Tree is a flowchart-like model that makes predictions by asking a series of yes/no questions about the input features. Each internal node represents a feature test, each branch represents an outcome, and each leaf node represents a prediction.

For example, to predict whether a loan will default:

1. Is income > $50,000? If No -> High Risk
2. Is credit score > 700? If Yes -> Low Risk, else -> Medium Risk

The tree is built by recursively splitting the data to maximize purity - each split should separate classes as cleanly as possible.
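The example rule above can be written as nested conditionals - each if is an internal node and each return is a leaf (the thresholds are the hypothetical ones from the loan example, not learned values):

```python
def loan_risk(income, credit_score):
    """Mirror of the example tree: each if/else is an internal node."""
    if income <= 50_000:       # root split: income > $50,000?
        return "High Risk"
    if credit_score > 700:     # second split: credit score > 700?
        return "Low Risk"
    return "Medium Risk"

print(loan_risk(40_000, 720))  # High Risk
print(loan_risk(80_000, 720))  # Low Risk
print(loan_risk(80_000, 650))  # Medium Risk
```

A learned tree has exactly this structure; training is the process of choosing which feature and threshold to test at each node.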

Splitting Criteria

Decision trees use impurity measures to decide the best split at each node:

  • Gini Impurity - Measures the probability of misclassifying a randomly chosen element: Gini = 1 - sum(p_i^2). Used by the CART algorithm.
  • Entropy / Information Gain - Measures the reduction in uncertainty after a split. Used by the ID3 and C4.5 algorithms.
  • Mean Squared Error - Used for regression trees; minimizes variance within each leaf.

At each node, the feature and threshold that produce the highest information gain (or lowest impurity) are chosen for the split.
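A minimal sketch of the two classification impurity measures and the information gain of a binary split, following the formulas above (NumPy only):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into two children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

mixed = np.array([0, 0, 1, 1])          # perfectly mixed node
print(gini(mixed))     # 0.5
print(entropy(mixed))  # 1.0
pure = np.array([1, 1, 1, 1])           # pure node
print(gini(pure))      # 0.0
# A split that separates the classes perfectly recovers all the entropy:
print(information_gain(mixed, np.array([0, 0]), np.array([1, 1])))  # 1.0
```

A tree builder evaluates every candidate (feature, threshold) pair with a function like information_gain and keeps the best one.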

Overfitting and Pruning

Decision trees are prone to overfitting - they can grow deep enough to memorize the training data perfectly but generalize poorly.

Solutions:

  • max_depth - Limit how deep the tree can grow
  • min_samples_split - Minimum samples required to split a node
  • min_samples_leaf - Minimum samples required at a leaf node
  • Post-pruning - Grow the full tree, then remove branches that don't improve validation performance
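As a sketch of both approaches in scikit-learn, using its cost-complexity API for post-pruning (in practice you would select ccp_alpha by cross-validation rather than on a single held-out split, as done here for brevity):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Unconstrained tree: grows until leaves are pure, likely to overfit
full = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Pre-pruning: constrain growth via hyperparameters
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                             random_state=42).fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then prune with increasing ccp_alpha
path = full.cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(random_state=42, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas[:-1]),    # last alpha prunes down to a stump
    key=lambda t: t.score(X_te, y_te),  # keep the best held-out accuracy
)
print("full depth:", full.get_depth(), "-> pruned depth:", best.get_depth())
```

Larger ccp_alpha values remove more branches; the pruned tree is never deeper than the full one and usually generalizes better.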

Random Forests: Ensemble of Trees

A Random Forest is an ensemble of many decision trees, each trained on a random subset of the data and features. Predictions are made by majority vote (classification) or averaging (regression).

Two key sources of randomness:

1. Bootstrap sampling (Bagging) - Each tree trains on a random sample drawn with replacement
2. Feature randomness - At each split, only a random subset of features is considered

This diversity among trees reduces variance and mitigates overfitting - the hallmark weakness of single decision trees.
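Both sources of randomness can be reproduced by hand. A minimal bagging sketch: bootstrap row sampling for each tree, plus max_features='sqrt' for the per-split feature subsets, combined by majority vote (25 trees here just to keep it fast):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n, trees = len(X_tr), []
for _ in range(25):
    idx = rng.integers(0, n, size=n)  # bootstrap: sample rows with replacement
    # max_features='sqrt': consider a random feature subset at each split
    t = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    trees.append(t.fit(X_tr[idx], y_tr[idx]))

# Majority vote: each column of `votes` holds 25 predictions for one sample
votes = np.stack([t.predict(X_te) for t in trees])
pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("hand-rolled forest accuracy:", (pred == y_te).mean())
```

This is essentially what RandomForestClassifier does internally, minus refinements such as out-of-bag scoring and parallel training.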

Implementing Decision Trees and Random Forests

Full implementation with feature importance analysis:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ---- Decision Tree ----
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))
print(f"Decision Tree Accuracy: {dt_acc:.4f}")
print("\nTree structure (first 3 levels):")
print(export_text(dt, feature_names=list(data.feature_names), max_depth=3))

# ---- Random Forest ----
rf = RandomForestClassifier(n_estimators=100, max_depth=6,
                            random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
rf_acc = accuracy_score(y_test, rf.predict(X_test))
print(f"\nRandom Forest Accuracy: {rf_acc:.4f}")

# Cross-validation on the full dataset
cv_scores = cross_val_score(rf, X, y, cv=5)
print(f"5-Fold CV Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Feature importance (mean impurity decrease across the forest's trees)
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
print("\nTop 5 Most Important Features:")
for i in range(5):
    print(f"  {data.feature_names[indices[i]]}: {importances[indices[i]]:.4f}")
```

Gradient Boosting vs Random Forests

Both are tree ensembles but differ fundamentally:

  • Random Forest - Trees are built in parallel, independently. Fast to train. Reduces variance.
  • Gradient Boosting (XGBoost, LightGBM) - Trees are built sequentially, each correcting the errors of the previous. Reduces bias. Often more accurate but slower.
  • Random Forests are more robust to hyperparameter choices and less prone to overfitting.
  • Gradient Boosting typically achieves higher accuracy on structured/tabular data with proper tuning.
  • For most production use cases, start with Random Forest for speed and reliability.

Key Takeaways

  • Decision Trees split data recursively using Gini impurity or information gain to maximize class purity.
  • Single decision trees overfit easily - control depth with max_depth, min_samples_leaf, and pruning.
  • Random Forests combine 100s of trees using bagging and feature randomness to reduce variance.
  • Feature importance from Random Forests is a powerful tool for understanding which variables drive predictions.
  • Random Forests are robust and require minimal preprocessing (no feature scaling needed); missing-value handling varies by implementation, and many libraries require imputation first.
