Linear Regression
Linear regression is one of the foundational algorithms of supervised machine learning. It models the relationship between a dependent variable and one or more independent variables with a straight line, enabling predictions of continuous numeric outcomes.
What is Linear Regression?
Linear Regression is a statistical method that models the linear relationship between an input variable (or variables) and a continuous output variable. It is one of the oldest and most widely used algorithms in machine learning and statistics.
The goal is to find the best-fitting straight line (or hyperplane in higher dimensions) through the data points. This line is described by the equation:
y = mx + b
Where y is the predicted output, x is the input feature, m is the slope (weight), and b is the y-intercept (bias).
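As a minimal sketch of this equation, the snippet below hard-codes an illustrative slope and intercept (m = 3, b = 4, values chosen arbitrarily) and predicts y for a given x:

```python
# Illustrative parameters (not learned from data): slope m = 3, intercept b = 4
m, b = 3.0, 4.0

def predict(x):
    """Predict y from a single feature x using y = m*x + b."""
    return m * x + b

print(predict(2.0))  # 3*2 + 4 = 10.0
```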
Simple vs Multiple Linear Regression
There are two main forms of linear regression:
- Simple Linear Regression - One input feature predicts one output. Example: predicting house price from square footage.
- Multiple Linear Regression - Multiple input features predict one output. Example: predicting house price from square footage, bedrooms, and location.
The general form for multiple regression is: y = w1*x1 + w2*x2 + ... + wn*xn + b
Each weight (w) represents the contribution of its corresponding feature to the prediction.
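The weighted sum above is simply a dot product between the weight vector and the feature vector. Here is a small sketch with hypothetical weights for three features:

```python
import numpy as np

# Hypothetical weights and bias for three features (illustrative only)
w = np.array([2.0, -1.0, 0.5])   # w1, w2, w3
b = 10.0

x = np.array([3.0, 4.0, 8.0])    # one sample with features x1, x2, x3

# y = w1*x1 + w2*x2 + w3*x3 + b, computed as a dot product
y = np.dot(w, x) + b
print(y)  # 2*3 - 1*4 + 0.5*8 + 10 = 16.0
```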
The Cost Function: Mean Squared Error
To train a linear regression model, we need to measure how wrong our predictions are. The most common metric is Mean Squared Error (MSE):
MSE = (1/n) * sum((y_pred - y_actual)^2)
We square the errors to:
1. Penalize large errors more heavily
2. Ensure the cost is always positive
3. Make the function differentiable for optimization
The training process minimizes this cost function by adjusting the weights and bias.
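The MSE formula above translates directly into a few lines of NumPy. A minimal sketch:

```python
import numpy as np

def mse(y_pred, y_actual):
    """Mean Squared Error: the average of the squared prediction errors."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_actual = np.asarray(y_actual, dtype=float)
    return np.mean((y_pred - y_actual) ** 2)

# Errors are -0.5, 0.5, 0.0 -> squared: 0.25, 0.25, 0.0 -> mean ~= 0.1667
print(mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))
```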
Gradient Descent Optimization
Gradient Descent is the algorithm used to minimize the cost function. It works by iteratively updating the model parameters in the direction that reduces the error:
w = w - learning_rate * (dCost/dw)
The learning rate controls how large each update step is. Too large and the model overshoots; too small and training is very slow.
Variants include:
- Batch Gradient Descent - Uses all training examples per update
- Stochastic Gradient Descent (SGD) - Uses one example per update
- Mini-Batch Gradient Descent - Uses a small batch per update (most common in practice)
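The mini-batch variant can be sketched as follows: shuffle the data each epoch, then apply the weight-update rule to one small batch at a time. All data and hyperparameters here (synthetic data with true slope 3 and intercept 4, lr = 0.1, batch size 32) are illustrative choices, not prescriptions:

```python
import numpy as np

# Synthetic 1-D data: y = 4 + 3x + small noise
rng = np.random.default_rng(0)
X = rng.random(200)
y = 4 + 3 * X + rng.normal(0, 0.1, 200)

m, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for epoch in range(200):
    idx = rng.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        y_pred = m * xb + b
        # Gradients of MSE with respect to m and b on this batch
        dm = (-2 / len(xb)) * np.sum(xb * (yb - y_pred))
        db = (-2 / len(xb)) * np.sum(yb - y_pred)
        m -= lr * dm
        b -= lr * db

print(m, b)  # should land near the true values 3 and 4
```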
Implementing Linear Regression in Python
Here is a complete implementation using both scikit-learn and a manual gradient descent approach:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# --- Generate synthetic data ---
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# --- Train/test split ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# --- Scikit-learn model ---
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Coefficient (slope): {model.coef_[0][0]:.4f}")
print(f"Intercept (bias): {model.intercept_[0]:.4f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R^2: {r2_score(y_test, y_pred):.4f}")
# --- Manual Gradient Descent ---
def gradient_descent(X, y, lr=0.01, epochs=1000):
    m, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        y_pred = m * X + b
        dm = (-2/n) * np.sum(X * (y - y_pred))
        db = (-2/n) * np.sum(y - y_pred)
        m -= lr * dm
        b -= lr * db
    return m, b
m, b = gradient_descent(X_train.flatten(), y_train.flatten())
print(f"\nManual GD -> slope: {m:.4f}, intercept: {b:.4f}")

Assumptions of Linear Regression
Linear regression works best when these assumptions hold:
- Linearity - The relationship between X and y is linear.
- Independence - Observations are independent of each other.
- Homoscedasticity - The variance of residuals is constant across all values of X.
- Normality - Residuals are approximately normally distributed.
- No multicollinearity - In multiple regression, features should not be highly correlated with each other.
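One of these assumptions, no multicollinearity, is easy to screen for with a feature correlation matrix: pairwise correlations near 1 (or -1) are a warning sign. A minimal sketch on synthetic data, where the second feature is deliberately constructed to be nearly collinear with the first:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.random(100)
x2 = 2 * x1 + rng.normal(0, 0.01, 100)   # nearly collinear with x1
x3 = rng.random(100)
X = np.column_stack([x1, x2, x3])

# Pairwise feature correlations; |r| near 1 flags multicollinearity
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))  # the (x1, x2) entry will be close to 1.0
```

In practice, variance inflation factors (VIF) are a more thorough diagnostic, but the correlation matrix is a quick first check.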
Regularization: Ridge and Lasso
When a model overfits (performs well on training data but poorly on test data), regularization adds a penalty term to the cost function:
Ridge Regression (L2): Adds sum of squared weights as penalty. Shrinks all weights but keeps them non-zero.
Lasso Regression (L1): Adds sum of absolute weights as penalty. Can shrink some weights to exactly zero, performing feature selection.
Both are controlled by a hyperparameter alpha - higher alpha means stronger regularization.
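Scikit-learn exposes both as drop-in replacements for LinearRegression. The sketch below fits each on synthetic data where only the first two of five features matter (the data and alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.random((100, 5))
# Only the first two features matter; the other three are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefs:", np.round(ridge.coef_, 2))  # all shrunk but non-zero
print("Lasso coefs:", np.round(lasso.coef_, 2))  # irrelevant coefs pushed toward 0
```

Note how Lasso zeroes out the irrelevant features, which is why it is often used for feature selection, while Ridge merely shrinks every coefficient.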
Key Takeaways
- Linear regression models the relationship between features and a continuous target using a weighted sum plus bias.
- The model is trained by minimizing Mean Squared Error using gradient descent.
- R-squared (R^2) measures how well the model explains variance in the data (1.0 = perfect fit).
- Ridge (L2) and Lasso (L1) regularization prevent overfitting by penalizing large weights.
- Always check the linearity assumption - if the relationship is non-linear, consider polynomial features or tree-based models.