Imagine you’re trying to predict how many hours of sleep you’ll get based on how many cups of coffee you drink. You’ve collected data for a month, and now you have a scatter plot that looks like someone sneezed dots onto a graph.
Regression is the statistical superhero that finds patterns in this chaos. It draws the “best” line (or curve) through your data points, allowing you to make predictions about the future.
In essence: Regression helps us understand relationships between variables and make predictions based on those relationships.
Simple linear regression is the simplest and most elegant member of the regression family. It assumes a straight-line relationship:
y = β₀ + β₁x + ε
Where:
- y is what we’re predicting (the dependent variable)
- x is what we’re using to predict (the independent variable)
- β₀ is the intercept (where the line crosses the y-axis)
- β₁ is the slope (how much y changes when x increases by 1)
- ε is the error term (because life isn’t perfect)

When one variable isn’t enough, we move to multiple linear regression:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Now we’re predicting sleep hours based on coffee cups AND stress level AND exercise minutes.
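For example, with some made-up coefficients (purely illustrative, not fitted to real data), the prediction is just a weighted sum of the features:

# Hypothetical coefficients, just to illustrate the multiple-regression formula
b0, b_coffee, b_stress, b_exercise = 8.0, -0.5, -0.3, 0.02

def predict_sleep(coffee_cups, stress_level, exercise_minutes):
    """Plug the features into y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ (ignoring ε)."""
    return (b0
            + b_coffee * coffee_cups
            + b_stress * stress_level
            + b_exercise * exercise_minutes)

print(predict_sleep(coffee_cups=3, stress_level=5, exercise_minutes=30))  # 8.0 - 1.5 - 1.5 + 0.6 = 5.6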
Sometimes relationships are curvy; polynomial regression fits a curve by adding higher powers of x:
y = β₀ + β₁x + β₂x² + β₃x³ + ... + ε
Logistic regression, despite its name, is for classification! It predicts probabilities between 0 and 1:
p(y=1) = 1 / (1 + e^(-z))
where z = β₀ + β₁x₁ + β₂x₂ + ...
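As a minimal sketch, here is that squashing function in code (the coefficients are made up for illustration):

import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Hypothetical model: probability of passing an exam given hours studied
beta0, beta1 = -4.0, 1.5
hours_studied = np.array([1, 3, 5])
z = beta0 + beta1 * hours_studied
print(sigmoid(z))  # roughly [0.08, 0.62, 0.97]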
How do we find the best-fitting line? We want to minimize the sum of squared differences between predicted and actual values:
Loss = Σ(yᵢ - ŷᵢ)²
Where:
- yᵢ is the actual value
- ŷᵢ is our predicted value

For linear regression, we can solve this directly with the normal equation:
β = (X^T X)^(-1) X^T y
This gives us the exact optimal parameters in one shot! But it requires matrix inversion, which can be computationally expensive for large datasets.
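Here is a rough sketch of that one-liner in numpy (variable names and data are my own):

import numpy as np

# Toy data: y ≈ 2x + 5 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 5 + rng.normal(0, 1, size=50)

X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept β₀
beta = np.linalg.inv(X.T @ X) @ X.T @ y    # β = (XᵀX)⁻¹ Xᵀ y
print(beta)                                # roughly [5, 2]

In practice, np.linalg.pinv or np.linalg.lstsq is numerically safer than forming the inverse explicitly.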
For large datasets or complex models, we use our old friend gradient descent:
import numpy as np

def gradient_descent_regression(X, y, learning_rate=0.01, iterations=1000):
    m = len(y)  # number of samples

    # Initialize parameters
    theta = np.zeros(X.shape[1])

    for i in range(iterations):
        # Predictions
        y_pred = X.dot(theta)

        # Calculate gradients
        gradients = (1/m) * X.T.dot(y_pred - y)

        # Update parameters
        theta = theta - learning_rate * gradients

        # Optional: Calculate and print loss
        if i % 100 == 0:
            loss = (1/(2*m)) * np.sum((y_pred - y)**2)
            print(f"Iteration {i}, Loss: {loss:.4f}")

    return theta
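A quick way to try it out (synthetic data of my own; note that X needs a column of ones prepended so that theta[0] plays the role of the intercept):

np.random.seed(42)
x = np.random.rand(200) * 10
y = 2 * x + 5 + np.random.randn(200)

X = np.column_stack([np.ones_like(x), x])  # bias column + feature
theta = gradient_descent_regression(X, y, learning_rate=0.02, iterations=2000)
print(theta)  # should land near [5, 2]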
Let’s build it from scratch to really understand what’s happening:
import numpy as np
import matplotlib.pyplot as plt

class SimpleLinearRegression:
    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, X, y):
        """
        Fit the regression line using the least squares method
        """
        # Convert to numpy arrays
        X = np.array(X)
        y = np.array(y)

        # Calculate means
        x_mean = np.mean(X)
        y_mean = np.mean(y)

        # Calculate slope (β₁)
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)
        self.slope = numerator / denominator

        # Calculate intercept (β₀)
        self.intercept = y_mean - self.slope * x_mean

        print(f"Equation: y = {self.intercept:.2f} + {self.slope:.2f}x")

    def predict(self, X):
        """
        Make predictions using the fitted line
        """
        X = np.array(X)
        return self.intercept + self.slope * X

    def score(self, X, y):
        """
        Calculate R² score (coefficient of determination)
        """
        X = np.array(X)
        y = np.array(y)
        y_pred = self.predict(X)

        # Total sum of squares
        ss_tot = np.sum((y - np.mean(y)) ** 2)

        # Residual sum of squares
        ss_res = np.sum((y - y_pred) ** 2)

        # R² score
        r2 = 1 - (ss_res / ss_tot)
        return r2

    def plot(self, X, y):
        """
        Visualize the data and regression line
        """
        plt.figure(figsize=(10, 6))

        # Plot data points
        plt.scatter(X, y, color='blue', alpha=0.5, label='Data points')

        # Plot regression line
        X_line = np.linspace(min(X), max(X), 100)
        y_line = self.predict(X_line)
        plt.plot(X_line, y_line, color='red', linewidth=2, label='Regression line')

        plt.xlabel('X')
        plt.ylabel('y')
        plt.title('Simple Linear Regression')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()


# Example usage
if __name__ == "__main__":
    # Generate sample data
    np.random.seed(42)
    X = np.random.rand(100) * 10
    y = 2 * X + 5 + np.random.randn(100) * 2  # y = 2x + 5 + noise

    # Fit the model
    model = SimpleLinearRegression()
    model.fit(X, y)

    # Make predictions
    test_X = [3, 5, 7]
    predictions = model.predict(test_X)
    print(f"\nPredictions for X = {test_X}: {predictions}")

    # Evaluate
    r2 = model.score(X, y)
    print(f"R² score: {r2:.4f}")

    # Visualize
    model.plot(X, y)
When we have multiple features, things get more interesting:
class MultipleLinearRegression:
    def __init__(self, method='normal_equation'):
        self.method = method
        self.coefficients = None
        self.intercept = None

    def fit(self, X, y):
        """
        Fit the model using either normal equation or gradient descent
        """
        X = np.array(X)
        y = np.array(y).reshape(-1, 1)

        # Add bias term (column of ones)
        m = X.shape[0]
        X_with_bias = np.column_stack([np.ones(m), X])

        if self.method == 'normal_equation':
            # Closed-form solution: θ = (X^T X)^(-1) X^T y
            theta = np.linalg.inv(X_with_bias.T @ X_with_bias) @ X_with_bias.T @ y
        else:
            # Gradient descent
            theta = self._gradient_descent(X_with_bias, y)

        # Extract intercept and coefficients
        self.intercept = theta[0, 0]
        self.coefficients = theta[1:].flatten()

    def _gradient_descent(self, X, y, learning_rate=0.01, iterations=1000):
        """
        Perform gradient descent optimization
        """
        m = X.shape[0]
        n_features = X.shape[1]
        theta = np.zeros((n_features, 1))

        for i in range(iterations):
            # Calculate predictions
            y_pred = X @ theta

            # Calculate gradients
            gradients = (1/m) * X.T @ (y_pred - y)

            # Update parameters
            theta = theta - learning_rate * gradients

            # Optional: Print progress
            if i % 100 == 0:
                loss = (1/(2*m)) * np.sum((y_pred - y)**2)
                print(f"Iteration {i}, Loss: {loss:.6f}")

        return theta

    def predict(self, X):
        """
        Make predictions
        """
        X = np.array(X)
        return self.intercept + X @ self.coefficients

    def score(self, X, y):
        """
        Calculate R² score
        """
        y_pred = self.predict(X)
        ss_tot = np.sum((y - np.mean(y))**2)
        ss_res = np.sum((y - y_pred)**2)
        return 1 - (ss_res / ss_tot)
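A quick sanity check on synthetic data (the numbers here are illustrative, echoing the coffee/sleep example):

import numpy as np

np.random.seed(0)
n = 200
coffee = np.random.rand(n) * 6
stress = np.random.rand(n) * 10
exercise = np.random.rand(n) * 60

# True relationship plus noise: sleep = 8 - 0.5*coffee - 0.2*stress + 0.02*exercise
sleep = 8 - 0.5 * coffee - 0.2 * stress + 0.02 * exercise + np.random.randn(n) * 0.3

X = np.column_stack([coffee, stress, exercise])
model = MultipleLinearRegression(method='normal_equation')
model.fit(X, sleep)
print(model.intercept, model.coefficients)  # close to 8 and [-0.5, -0.2, 0.02]
print(model.score(X, sleep))                # R² should be high on this low-noise data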
Before you go regression-crazy, remember these assumptions (a quick diagnostic sketch for a couple of them follows the list):

- Linearity: the relationship between X and y should be linear. Check with scatter plots!
- Independence: observations should be independent of each other.
- Homoscedasticity: the variance of the residuals should be constant across all levels of X.
- Normality of residuals: residuals should be normally distributed (this matters for inference, not prediction).
- No multicollinearity: in multiple regression, features shouldn’t be highly correlated with each other.
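A minimal sketch of two of these checks, reusing the model and synthetic data from the sketch above: residuals plotted against predictions (homoscedasticity) and a feature correlation matrix (multicollinearity):

import numpy as np
import matplotlib.pyplot as plt

residuals = sleep - model.predict(X)

# Homoscedasticity: this scatter should look like a shapeless, even cloud around zero
plt.scatter(model.predict(X), residuals, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.title('Residuals vs. predicted values')
plt.show()

# Multicollinearity: look for off-diagonal correlations close to ±1
print(np.corrcoef(X, rowvar=False))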
Sometimes our model fits the training data TOO well (overfitting). Regularization adds a penalty for complexity:
Ridge regression (L2) is the first flavor. Let’s break down its formula piece by piece:
Loss = Σ(yᵢ - ŷᵢ)² + λΣβⱼ²
What each symbol means:
- Σ(yᵢ - ŷᵢ)² = the original loss (sum of squared errors)
- yᵢ = actual value for sample i
- ŷᵢ = predicted value for sample i
- λ (lambda) = regularization strength (how much we penalize large coefficients)
- Σβⱼ² = sum of squared coefficients
- βⱼ = the coefficient for feature j (the slopes in our regression)

What it does: Ridge regression says “Hey, you can fit the data, but I’m going to charge you for having large coefficients!” This encourages the model to use smaller, more reasonable values.
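In closed form, the Ridge penalty simply adds λ to the diagonal of XᵀX before inverting. A rough sketch (which, for simplicity, also penalizes the intercept, something real implementations usually avoid):

import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge solution β = (XᵀX + λI)⁻¹ Xᵀ y, with a bias column already included in X."""
    n_features = X.shape[1]
    return np.linalg.inv(X.T @ X + lam * np.eye(n_features)) @ X.T @ y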
Lasso regression (L1) looks almost identical:

Loss = Σ(yᵢ - ŷᵢ)² + λΣ|βⱼ|
The only difference from Ridge:
- Σ|βⱼ| = sum of the absolute values of the coefficients
- |βⱼ| = absolute value (removes the sign, so -3 becomes 3)

What it does: Lasso is more aggressive - it can actually shrink coefficients all the way to zero, effectively removing features from the model. It’s like Marie Kondo for your features: “Does this feature spark joy? No? It’s gone!”
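Unlike Ridge, Lasso has no general closed-form solution (the absolute value isn’t differentiable at zero), so in practice you reach for a library. A small comparison sketch using scikit-learn on made-up data:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)  # features 3-5 are pure noise

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all five coefficients shrunk, none exactly zero
print(Lasso(alpha=0.1).fit(X, y).coef_)  # the noise features are typically driven to 0.0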
Elastic Net combines both penalties:

Loss = Σ(yᵢ - ŷᵢ)² + λ₁Σ|βⱼ| + λ₂Σβⱼ²
Now we have:
- λ₁ = how much we want Lasso-style (L1) regularization
- λ₂ = how much we want Ridge-style (L2) regularization

Visual intuition:
No Regularization: "Use any coefficients you want!"
Ridge (L2): "Okay, but keep them reasonably small"
Lasso (L1): "Keep them small, and I might zero some out"
Elastic Net: "Let's do both - small values AND feature selection"
Example with actual numbers:
# Imagine we have coefficients: β₁ = 10, β₂ = 0.1, β₃ = 5

# Ridge penalty (λ = 0.1):
L2_penalty = 0.1 * (10**2 + 0.1**2 + 5**2)        # 0.1 * 125.01 = 12.501

# Lasso penalty (λ = 0.1):
L1_penalty = 0.1 * (abs(10) + abs(0.1) + abs(5))  # 0.1 * 15.1 = 1.51

# The large coefficient (β₁ = 10) contributes much more to the Ridge penalty!
Sometimes relationships are curvy. We can still use linear regression by creating polynomial features:
def create_polynomial_features(X, degree):
    """
    Transform features to polynomial features
    """
    X_poly = X.copy()
    for d in range(2, degree + 1):
        X_poly = np.column_stack([X_poly, X**d])
    return X_poly


# Example: Fitting a quadratic relationship
X = np.array([1, 2, 3, 4, 5])
y = np.array([1, 4, 9, 16, 25])  # y = x²

# Create polynomial features
X_poly = create_polynomial_features(X.reshape(-1, 1), degree=2)

# Fit using multiple linear regression
model = MultipleLinearRegression()
model.fit(X_poly, y)
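A quick check that the fit recovers the quadratic:

print(model.predict(X_poly))                # approximately [1, 4, 9, 16, 25]
print(model.intercept, model.coefficients)  # intercept ≈ 0, coefficients ≈ [0, 1]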
How do we know whether a model is any good? A few standard metrics:

Mean Squared Error: MSE = (1/n) Σ(yᵢ - ŷᵢ)²
Penalizes large errors heavily.

Root Mean Squared Error: RMSE = √MSE
In the same units as y, so it’s easier to interpret.

Mean Absolute Error: MAE = (1/n) Σ|yᵢ - ŷᵢ|
Less sensitive to outliers than MSE.

Coefficient of determination: R² = 1 - (SS_res / SS_tot)
The proportion of variance explained by the model: 1 = perfect, 0 = no better than predicting the mean.
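As a small sketch, all four metrics in a few lines of numpy (the function name is my own):

import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for a set of predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, mae, r2

# e.g. for any of the fitted models above:
# print(regression_metrics(y, model.predict(X)))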
A few classic pitfalls to watch out for:

- Overfitting: your model memorizes the training data instead of learning the underlying pattern.
- Underfitting: your model is too simple to capture the pattern.
- Extrapolation: predicting far outside your training data range is risky.
- Correlation is not causation: just because ice cream sales correlate with shark attacks doesn’t mean one causes the other!
Always visualize your data first - Scatter plots can reveal patterns, outliers, and assumption violations
Feature engineering matters - Sometimes log-transforming a skewed variable or creating interaction terms can dramatically improve your model
Check residuals - Plot residuals vs. predicted values. They should look like random scatter
Use cross-validation - Don’t just evaluate on training data!
Start simple - Begin with linear regression before jumping to complex models
Scale your features - Especially important for regularized regression
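On that last point, a minimal standardization sketch in plain numpy (scikit-learn’s StandardScaler does the same job):

import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit variance (z-scores)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# e.g. before Ridge/Lasso, so that λ penalizes every coefficient on the same scale:
# X_scaled = standardize(X)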
Regression is like having a crystal ball that’s based on math instead of magic. It’s one of the most fundamental tools in machine learning and statistics, and for good reason.
Remember: All models are wrong, but some are useful. Regression gives us a useful way to understand relationships and make predictions in an uncertain world.
“In regression, we trust… but always check the residuals!”
Happy modeling! 📈