Imagine you’re trying to predict how many hours of sleep you’ll get based on how many cups of coffee you drink. You’ve collected data for a month, and now you have a scatter plot that looks like someone sneezed dots onto a graph.
Regression is the statistical superhero that finds patterns in this chaos. It draws the “best” line (or curve) through your data points, allowing you to make predictions about the future.
In essence: Regression helps us understand relationships between variables and make predictions based on those relationships.
Simple linear regression is the simplest and most elegant member of the regression family. It assumes a straight-line relationship:
y = β₀ + β₁x + ε
Where:
- y is what we’re predicting (the dependent variable)
- x is what we’re using to predict (the independent variable)
- β₀ is the intercept (where the line crosses the y-axis)
- β₁ is the slope (how much y changes when x increases by 1)
- ε is the error term (because life isn’t perfect)

When one variable isn’t enough, we move to multiple linear regression:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Now we’re predicting sleep hours based on coffee cups AND stress level AND exercise minutes.
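For example, with some made-up coefficients (purely illustrative, not fitted to real data), the prediction is just a weighted sum of the features:

# Hypothetical coefficients, just to illustrate the multiple-regression formula
b0, b_coffee, b_stress, b_exercise = 8.0, -0.5, -0.3, 0.02

def predict_sleep(coffee_cups, stress_level, exercise_minutes):
    """Plug the features into y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ (ignoring ε)."""
    return (b0
            + b_coffee * coffee_cups
            + b_stress * stress_level
            + b_exercise * exercise_minutes)

print(predict_sleep(coffee_cups=3, stress_level=5, exercise_minutes=30))  # 8.0 - 1.5 - 1.5 + 0.6 = 5.6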
Sometimes relationships are curvy; polynomial regression fits a curve by adding higher powers of x:
y = β₀ + β₁x + β₂x² + β₃x³ + ... + ε
Logistic regression, despite its name, is for classification! It predicts probabilities between 0 and 1:
p(y=1) = 1 / (1 + e^(-z))
where z = β₀ + β₁x₁ + β₂x₂ + ...
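As a minimal sketch, here is that squashing function in code (the coefficients are made up for illustration):

import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Hypothetical model: probability of passing an exam given hours studied
beta0, beta1 = -4.0, 1.5
hours_studied = np.array([1, 3, 5])
z = beta0 + beta1 * hours_studied
print(sigmoid(z))  # roughly [0.08, 0.62, 0.97]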
How do we find the best-fitting line? We want to minimize the sum of squared differences between predicted and actual values:
Loss = Σ(yᵢ - ŷᵢ)²
Where:
- yᵢ is the actual value
- ŷᵢ is our predicted value

For linear regression, we can solve this directly with the normal equation:
β = (X^T X)^(-1) X^T y
This gives us the exact optimal parameters in one shot! But it requires matrix inversion, which can be computationally expensive for large datasets.
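Here is a rough sketch of that one-liner in numpy (variable names and data are my own):

import numpy as np

# Toy data: y ≈ 2x + 5 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 5 + rng.normal(0, 1, size=50)

X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept β₀
beta = np.linalg.inv(X.T @ X) @ X.T @ y    # β = (XᵀX)⁻¹ Xᵀ y
print(beta)                                # roughly [5, 2]

In practice, np.linalg.pinv or np.linalg.lstsq is numerically safer than forming the inverse explicitly.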
For large datasets or complex models, we use our old friend gradient descent:
import numpy as np

def gradient_descent_regression(X, y, learning_rate=0.01, iterations=1000):
    m = len(y)  # number of samples

    # Initialize parameters
    theta = np.zeros(X.shape[1])

    for i in range(iterations):
        # Predictions
        y_pred = X.dot(theta)

        # Calculate gradients
        gradients = (1/m) * X.T.dot(y_pred - y)

        # Update parameters
        theta = theta - learning_rate * gradients

        # Optional: Calculate and print loss
        if i % 100 == 0:
            loss = (1/(2*m)) * np.sum((y_pred - y)**2)
            print(f"Iteration {i}, Loss: {loss:.4f}")

    return theta
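A quick way to try it out (synthetic data of my own; note that X needs a column of ones prepended so that theta[0] plays the role of the intercept):

np.random.seed(42)
x = np.random.rand(200) * 10
y = 2 * x + 5 + np.random.randn(200)

X = np.column_stack([np.ones_like(x), x])  # bias column + feature
theta = gradient_descent_regression(X, y, learning_rate=0.02, iterations=2000)
print(theta)  # should land near [5, 2]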
Let’s build it from scratch to really understand what’s happening:
import numpy as np
import matplotlib.pyplot as plt

class SimpleLinearRegression:
    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, X, y):
        """
        Fit the regression line using the least squares method
        """
        # Convert to numpy arrays
        X = np.array(X)
        y = np.array(y)

        # Calculate means
        x_mean = np.mean(X)
        y_mean = np.mean(y)

        # Calculate slope (β₁)
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)
        self.slope = numerator / denominator

        # Calculate intercept (β₀)
        self.intercept = y_mean - self.slope * x_mean

        print(f"Equation: y = {self.intercept:.2f} + {self.slope:.2f}x")

    def predict(self, X):
        """
        Make predictions using the fitted line
        """
        X = np.array(X)
        return self.intercept + self.slope * X

    def score(self, X, y):
        """
        Calculate R² score (coefficient of determination)
        """
        X = np.array(X)
        y = np.array(y)
        y_pred = self.predict(X)

        # Total sum of squares
        ss_tot = np.sum((y - np.mean(y)) ** 2)

        # Residual sum of squares
        ss_res = np.sum((y - y_pred) ** 2)

        # R² score
        r2 = 1 - (ss_res / ss_tot)
        return r2

    def plot(self, X, y):
        """
        Visualize the data and regression line
        """
        plt.figure(figsize=(10, 6))

        # Plot data points
        plt.scatter(X, y, color='blue', alpha=0.5, label='Data points')

        # Plot regression line
        X_line = np.linspace(min(X), max(X), 100)
        y_line = self.predict(X_line)
        plt.plot(X_line, y_line, color='red', linewidth=2, label='Regression line')

        plt.xlabel('X')
        plt.ylabel('y')
        plt.title('Simple Linear Regression')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()


# Example usage
if __name__ == "__main__":
    # Generate sample data
    np.random.seed(42)
    X = np.random.rand(100) * 10
    y = 2 * X + 5 + np.random.randn(100) * 2  # y = 2x + 5 + noise

    # Fit the model
    model = SimpleLinearRegression()
    model.fit(X, y)

    # Make predictions
    test_X = [3, 5, 7]
    predictions = model.predict(test_X)
    print(f"\nPredictions for X = {test_X}: {predictions}")

    # Evaluate
    r2 = model.score(X, y)
    print(f"R² score: {r2:.4f}")

    # Visualize
    model.plot(X, y)
When we have multiple features, things get more interesting:
class MultipleLinearRegression:
    def __init__(self, method='normal_equation'):
        self.method = method
        self.coefficients = None
        self.intercept = None

    def fit(self, X, y):
        """
        Fit the model using either normal equation or gradient descent
        """
        X = np.array(X)
        y = np.array(y).reshape(-1, 1)

        # Add bias term (column of ones)
        m = X.shape[0]
        X_with_bias = np.column_stack([np.ones(m), X])

        if self.method == 'normal_equation':
            # Closed-form solution: θ = (X^T X)^(-1) X^T y
            theta = np.linalg.inv(X_with_bias.T @ X_with_bias) @ X_with_bias.T @ y
        else:
            # Gradient descent
            theta = self._gradient_descent(X_with_bias, y)

        # Extract intercept and coefficients
        self.intercept = theta[0, 0]
        self.coefficients = theta[1:].flatten()

    def _gradient_descent(self, X, y, learning_rate=0.01, iterations=1000):
        """
        Perform gradient descent optimization
        """
        m = X.shape[0]
        n_features = X.shape[1]
        theta = np.zeros((n_features, 1))

        for i in range(iterations):
            # Calculate predictions
            y_pred = X @ theta

            # Calculate gradients
            gradients = (1/m) * X.T @ (y_pred - y)

            # Update parameters
            theta = theta - learning_rate * gradients

            # Optional: Print progress
            if i % 100 == 0:
                loss = (1/(2*m)) * np.sum((y_pred - y)**2)
                print(f"Iteration {i}, Loss: {loss:.6f}")

        return theta

    def predict(self, X):
        """
        Make predictions
        """
        X = np.array(X)
        return self.intercept + X @ self.coefficients

    def score(self, X, y):
        """
        Calculate R² score
        """
        y_pred = self.predict(X)
        ss_tot = np.sum((y - np.mean(y))**2)
        ss_res = np.sum((y - y_pred)**2)
        return 1 - (ss_res / ss_tot)
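A quick sanity check on synthetic data (the numbers here are illustrative, echoing the coffee/sleep example):

import numpy as np

np.random.seed(0)
n = 200
coffee = np.random.rand(n) * 6
stress = np.random.rand(n) * 10
exercise = np.random.rand(n) * 60

# True relationship plus noise: sleep = 8 - 0.5*coffee - 0.2*stress + 0.02*exercise
sleep = 8 - 0.5 * coffee - 0.2 * stress + 0.02 * exercise + np.random.randn(n) * 0.3

X = np.column_stack([coffee, stress, exercise])
model = MultipleLinearRegression(method='normal_equation')
model.fit(X, sleep)
print(model.intercept, model.coefficients)  # close to 8 and [-0.5, -0.2, 0.02]
print(model.score(X, sleep))                # R² should be high on this low-noise data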
Before you go regression-crazy, remember these assumptions (a quick diagnostic sketch for a couple of them follows the list):

- Linearity: the relationship between X and y should be linear. Check with scatter plots!
- Independence: observations should be independent of each other.
- Homoscedasticity: the variance of the residuals should be constant across all levels of X.
- Normality of residuals: residuals should be normally distributed (this matters for inference, not prediction).
- No multicollinearity: in multiple regression, features shouldn’t be highly correlated with each other.
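A minimal sketch of two of these checks, reusing the model and synthetic data from the sketch above: residuals plotted against predictions (homoscedasticity) and a feature correlation matrix (multicollinearity):

import numpy as np
import matplotlib.pyplot as plt

residuals = sleep - model.predict(X)

# Homoscedasticity: this scatter should look like a shapeless, even cloud around zero
plt.scatter(model.predict(X), residuals, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.title('Residuals vs. predicted values')
plt.show()

# Multicollinearity: look for off-diagonal correlations close to ±1
print(np.corrcoef(X, rowvar=False))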
Sometimes our model fits the training data TOO well (overfitting). Regularization adds a penalty for complexity:
Ridge regression (L2) is the first flavor. Let’s break down its formula piece by piece:
Loss = Σ(yᵢ - ŷᵢ)² + λΣβⱼ²
What each symbol means:
- Σ(yᵢ - ŷᵢ)² = the original loss (sum of squared errors)
- yᵢ = actual value for sample i
- ŷᵢ = predicted value for sample i
- λ (lambda) = regularization strength (how much we penalize large coefficients)
- Σβⱼ² = sum of squared coefficients
- βⱼ = the coefficient for feature j (the slopes in our regression)

What it does: Ridge regression says “Hey, you can fit the data, but I’m going to charge you for having large coefficients!” This encourages the model to use smaller, more reasonable values.
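In closed form, the Ridge penalty simply adds λ to the diagonal of XᵀX before inverting. A rough sketch (which, for simplicity, also penalizes the intercept, something real implementations usually avoid):

import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge solution β = (XᵀX + λI)⁻¹ Xᵀ y, with a bias column already included in X."""
    n_features = X.shape[1]
    return np.linalg.inv(X.T @ X + lam * np.eye(n_features)) @ X.T @ y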
Lasso regression (L1) looks almost identical:

Loss = Σ(yᵢ - ŷᵢ)² + λΣ|βⱼ|
The only difference from Ridge:
- Σ|βⱼ| = sum of the absolute values of the coefficients
- |βⱼ| = absolute value (removes the sign, so -3 becomes 3)

What it does: Lasso is more aggressive - it can actually shrink coefficients all the way to zero, effectively removing features from the model. It’s like Marie Kondo for your features: “Does this feature spark joy? No? It’s gone!”
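Unlike Ridge, Lasso has no general closed-form solution (the absolute value isn’t differentiable at zero), so in practice you reach for a library. A small comparison sketch using scikit-learn on made-up data:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)  # features 3-5 are pure noise

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all five coefficients shrunk, none exactly zero
print(Lasso(alpha=0.1).fit(X, y).coef_)  # the noise features are typically driven to 0.0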
Elastic Net combines both penalties:

Loss = Σ(yᵢ - ŷᵢ)² + λ₁Σ|βⱼ| + λ₂Σβⱼ²
Now we have:
- λ₁ = how much we want Lasso-style (L1) regularization
- λ₂ = how much we want Ridge-style (L2) regularization

Visual intuition:
No Regularization: "Use any coefficients you want!"
Ridge (L2): "Okay, but keep them reasonably small"
Lasso (L1): "Keep them small, and I might zero some out"
Elastic Net: "Let's do both - small values AND feature selection"
Example with actual numbers:
# Imagine we have coefficients: β₁ = 10, β₂ = 0.1, β₃ = 5

# Ridge penalty (λ = 0.1):
L2_penalty = 0.1 * (10**2 + 0.1**2 + 5**2)        # 0.1 * 125.01 = 12.501

# Lasso penalty (λ = 0.1):
L1_penalty = 0.1 * (abs(10) + abs(0.1) + abs(5))  # 0.1 * 15.1 = 1.51

# The large coefficient (β₁ = 10) contributes much more to the Ridge penalty!
Sometimes relationships are curvy. We can still use linear regression by creating polynomial features:
def create_polynomial_features(X, degree):
    """
    Transform features to polynomial features
    """
    X_poly = X.copy()
    for d in range(2, degree + 1):
        X_poly = np.column_stack([X_poly, X**d])
    return X_poly


# Example: Fitting a quadratic relationship
X = np.array([1, 2, 3, 4, 5])
y = np.array([1, 4, 9, 16, 25])  # y = x²

# Create polynomial features
X_poly = create_polynomial_features(X.reshape(-1, 1), degree=2)

# Fit using multiple linear regression
model = MultipleLinearRegression()
model.fit(X_poly, y)
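A quick check that the fit recovers the quadratic:

print(model.predict(X_poly))                # approximately [1, 4, 9, 16, 25]
print(model.intercept, model.coefficients)  # intercept ≈ 0, coefficients ≈ [0, 1]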
How do we know whether a model is any good? A few standard metrics:

Mean Squared Error: MSE = (1/n) Σ(yᵢ - ŷᵢ)²
Penalizes large errors heavily.

Root Mean Squared Error: RMSE = √MSE
In the same units as y, so it’s easier to interpret.

Mean Absolute Error: MAE = (1/n) Σ|yᵢ - ŷᵢ|
Less sensitive to outliers than MSE.

Coefficient of determination: R² = 1 - (SS_res / SS_tot)
The proportion of variance explained by the model: 1 = perfect, 0 = no better than predicting the mean.
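As a small sketch, all four metrics in a few lines of numpy (the function name is my own):

import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for a set of predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, mae, r2

# e.g. for any of the fitted models above:
# print(regression_metrics(y, model.predict(X)))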
A few classic pitfalls to watch out for:

- Overfitting: your model memorizes the training data instead of learning the underlying pattern.
- Underfitting: your model is too simple to capture the pattern.
- Extrapolation: predicting far outside your training data range is risky.
- Correlation is not causation: just because ice cream sales correlate with shark attacks doesn’t mean one causes the other!
Always visualize your data first - Scatter plots can reveal patterns, outliers, and assumption violations
Feature engineering matters - Sometimes log-transforming a skewed variable or creating interaction terms can dramatically improve your model
Check residuals - Plot residuals vs. predicted values. They should look like random scatter
Use cross-validation - Don’t just evaluate on training data!
Start simple - Begin with linear regression before jumping to complex models
Scale your features - Especially important for regularized regression
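On that last point, a minimal standardization sketch in plain numpy (scikit-learn’s StandardScaler does the same job):

import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit variance (z-scores)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# e.g. before Ridge/Lasso, so that λ penalizes every coefficient on the same scale:
# X_scaled = standardize(X)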
Regression is like having a crystal ball that’s based on math instead of magic. It’s one of the most fundamental tools in machine learning and statistics, and for good reason.
Remember: All models are wrong, but some are useful. Regression gives us a useful way to understand relationships and make predictions in an uncertain world.
“In regression, we trust… but always check the residuals!”
Happy modeling! 📈