
Linear Regression with Multiple Variables

Multiple linear regression extends simple linear regression to several independent variables: the dependent variable is modeled as a linear combination of all of them.

Hypothesis Function

The hypothesis in multiple linear regression combines all the features:

$$ h_{\theta}(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \dots + \theta_nx_n $$

For compactness, introduce $x_0 = 1$ (the bias term), so both the feature vector $x$ and the parameter vector $\theta$ become $(n + 1)$-dimensional. The hypothesis is then an inner product:

$$ h_{\theta}(x) = \theta^T x $$
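
As a minimal illustration (with made-up numbers, assuming NumPy), the hypothesis is just a dot product once the bias entry $x_0 = 1$ is prepended to the feature vector:

import numpy as np

# Hypothetical parameters and a single training example with x_0 = 1 prepended
theta = np.array([1.0, 0.5, -2.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 3.0, 4.0])        # x_0 = 1, x_1 = 3, x_2 = 4

prediction = theta.dot(x)            # h_theta(x) = theta^T x
print(prediction)                    # 1.0 + 0.5 * 3 - 2.0 * 4 = -5.5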

Cost Function

The cost function measures the discrepancy between the model's predictions and actual values. It is defined as:

$$ J(\theta_0, \theta_1, ..., \theta_n) = \frac{1}{2m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2 $$
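
Stacking the $m$ training examples into a design matrix $X$ (one example per row) and the targets into a vector $y$, the same cost can be written in vectorized form:

$$ J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y) $$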

Gradient Descent for Multiple Variables

Gradient descent finds the parameters that minimize the cost function by repeatedly stepping in the direction of steepest descent. All $n + 1$ parameters are updated simultaneously on each iteration:

θ = [0] * (n + 1)
while not converged:
  for j in [0, ..., n]:
      θ_j := θ_j - α ∂/∂θ_j J(θ_0, ..., θ_n)   # compute every update from the same θ, then apply them together
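
Expanding the partial derivative of $J$ gives the concrete update rule implemented in the code below:

$$ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} $$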

Here is Python code that demonstrates the gradient descent algorithm for multiple variables on mock data:

import numpy as np

# Mock data
# Features (x0, x1, x2)
X = np.array([
    [1, 2, 3],
    [1, 3, 4],
    [1, 4, 5],
    [1, 5, 6]
])

# Target values
y = np.array([7, 10, 13, 16])

# Parameters
alpha = 0.01  # Learning rate
num_iterations = 1000  # Number of iterations for gradient descent

# Initialize theta (parameters) to zeros
theta = np.zeros(X.shape[1])

# Cost function
def compute_cost(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost

# Gradient descent algorithm
def gradient_descent(X, y, theta, alpha, num_iterations):
    m = len(y)
    cost_history = np.zeros(num_iterations)
    
    for iteration in range(num_iterations):
        # Compute the prediction error
        error = X.dot(theta) - y
        
        # Update theta values simultaneously
        for j in range(len(theta)):
            partial_derivative = (1 / m) * np.sum(error * X[:, j])
            theta[j] = theta[j] - alpha * partial_derivative
        
        # Save the cost for the current iteration
        cost_history[iteration] = compute_cost(X, y, theta)
    
    return theta, cost_history

# Run gradient descent
theta, cost_history = gradient_descent(X, y, theta, alpha, num_iterations)

print("Optimized theta:", theta)
print("Final cost:", cost_history[-1])

# Plotting the cost function history
import matplotlib.pyplot as plt

plt.plot(range(num_iterations), cost_history, 'b')
plt.xlabel('Number of iterations')
plt.ylabel('Cost J')
plt.title('Cost function history')
plt.show()

Figure: cost $J(\theta)$ decreasing over the gradient descent iterations.
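
The inner loop over $j$ can also be replaced by a single matrix operation. A minimal vectorized sketch, assuming the X, y, alpha, and num_iterations defined above:

# Vectorized gradient descent: all parameters are updated in one step
def gradient_descent_vectorized(X, y, theta, alpha, num_iterations):
    m = len(y)
    for _ in range(num_iterations):
        theta = theta - (alpha / m) * X.T.dot(X.dot(theta) - y)
    return theta

theta_vec = gradient_descent_vectorized(X, y, np.zeros(X.shape[1]), alpha, num_iterations)
print("Theta from vectorized gradient descent:", theta_vec)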

Feature Scaling

When features have very different scales, gradient descent can converge slowly: the contours of the cost function become elongated and the updates zig-zag toward the minimum. Rescaling every feature to a comparable range, for example by standardizing it to zero mean and unit variance, speeds up convergence considerably.


Here is Python code that performs this standardization on mock data:

import numpy as np

# Mock data
# Features (x0, x1, x2)
X = np.array([
    [1, 2000, 3],
    [1, 1600, 4],
    [1, 2400, 2],
    [1, 3000, 5]
])

# Function to perform feature scaling (standardization)
def feature_scaling(X):
    # Standardize every feature to zero mean and unit variance,
    # excluding the first column (the intercept term x0 = 1)
    X_scaled = X.astype(float)  # use floats so the scaled values are not truncated to integers
    for i in range(1, X.shape[1]):
        mean = np.mean(X[:, i])
        std = np.std(X[:, i])
        X_scaled[:, i] = (X[:, i] - mean) / std
    return X_scaled

# Apply feature scaling
X_scaled = feature_scaling(X)

print("Original Features:\n", X)
print("Scaled Features:\n", X_scaled)

Mean Normalization

Adjust each feature $x_i$ by subtracting its mean and dividing by its range (max minus min):

$$ x_i := \frac{x_i - \mu_i}{\max(x_i) - \min(x_i)} $$


This transforms the features to have approximately zero mean and a range of roughly one, which helps gradient descent converge faster.

Below is the Python code that demonstrates mean normalization using mock data:

import numpy as np

# Mock data
# Features (x0, x1, x2)
X = np.array([
    [1, 2000, 3],
    [1, 1600, 4],
    [1, 2400, 2],
    [1, 3000, 5]
])

# Function to perform mean normalization
def mean_normalization(X):
    # Normalize every feature by its mean and range, excluding the intercept column x0 = 1
    X_normalized = X.astype(float)  # use floats so the normalized values are not truncated to integers
    for i in range(1, X.shape[1]):
        mean = np.mean(X[:, i])
        min_val = np.min(X[:, i])
        max_val = np.max(X[:, i])
        X_normalized[:, i] = (X[:, i] - mean) / (max_val - min_val)
    return X_normalized

# Apply mean normalization
X_normalized = mean_normalization(X)

print("Original Features:\n", X)
print("Mean Normalized Features:\n", X_normalized)

Learning Rate $\alpha$

Choosing the learning rate $\alpha$ is a trade-off: if $\alpha$ is too small, gradient descent converges very slowly; if it is too large, $J(\theta)$ may fail to decrease on every iteration or even diverge. A practical approach is to try a range of values (for example 0.001, 0.003, 0.01, 0.03, 0.1, ...) and plot $J(\theta)$ against the number of iterations for each.

Automatic Convergence Tests

An automatic convergence test declares convergence when $J(\theta)$ decreases by less than a small threshold $\epsilon$ (for example $10^{-3}$) in one iteration. In practice, inspecting the plot of $J(\theta)$ versus the iteration number is often more informative: a curve that rises or oscillates indicates that $\alpha$ is too large, while a curve that decreases very slowly indicates that $\alpha$ is too small.
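
A small sketch of this workflow, assuming NumPy (np), matplotlib, and the gradient_descent function, X, and y from the first example are still in scope:

import matplotlib.pyplot as plt

# Compare several candidate learning rates by plotting each cost history;
# a curve that rises or oscillates means that alpha is too large
for candidate_alpha in [0.001, 0.01, 0.03]:
    _, history = gradient_descent(X, y, np.zeros(X.shape[1]), candidate_alpha, 100)
    plt.plot(range(100), history, label=f"alpha = {candidate_alpha}")

plt.xlabel('Number of iterations')
plt.ylabel('Cost J')
plt.title('Cost history for different learning rates')
plt.legend()
plt.show()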

Features and Polynomial Regression

Two related ideas can improve a model. First, you can define new features from existing ones, for example replacing the frontage and depth of a plot with a single area feature. Second, polynomial regression fits a non-linear curve with the same linear-regression machinery by treating powers of a feature as separate features, for instance

$$ h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 $$

with $x_1 = x$, $x_2 = x^2$, and $x_3 = x^3$. Because these powers have very different ranges, feature scaling becomes especially important.
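
A minimal sketch of polynomial regression on made-up data, building the powers of $x$ as extra columns and fitting them with NumPy's least-squares solver (the data values and the cubic degree are illustrative assumptions):

import numpy as np

# Made-up 1-D data with a roughly cubic trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_cubic = np.array([1.2, 8.5, 27.9, 65.0, 126.1, 215.8])

# Treat x, x^2, and x^3 as separate features next to the intercept column
X_poly = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

# Standardize the non-intercept columns, since x, x^2, and x^3 have very different ranges
X_poly[:, 1:] = (X_poly[:, 1:] - X_poly[:, 1:].mean(axis=0)) / X_poly[:, 1:].std(axis=0)

# Fit the cubic model as an ordinary linear regression on the new features
theta_poly, _, _, _ = np.linalg.lstsq(X_poly, y_cubic, rcond=None)
print("Cubic regression parameters (on scaled features):", theta_poly)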

Normal Equation

Procedure

The normal equation computes the optimal parameters analytically, without feature scaling, a learning rate, or iteration. Build the design matrix $X$ (one training example per row, with $x_0 = 1$ in the first column) and the target vector $y$, then solve:

$$ \theta = (X^TX)^{-1}X^Ty $$

Example

Consider the house-price data used in the code below (the housing example from the referenced lectures: size in square feet, number of bedrooms, number of floors, age of the home in years, and price in $1000s):

| Size (feet²) | Bedrooms | Floors | Age (years) | Price ($1000) |
| --- | --- | --- | --- | --- |
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 3 | 2 | 30 | 315 |
| 852 | 2 | 1 | 36 | 178 |

The computed $\theta$ values minimize the cost function for the given training data.

Here is Python code that uses this data to solve for $\theta$ with the normal equation and then plots the fit:

import numpy as np
import matplotlib.pyplot as plt

# Given data
# Features (x0, x1, x2, x3, x4)
X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1, 852, 2, 1, 36]
])

# Target values
y = np.array([460, 232, 315, 178])

# Normal equation: theta = (X^T * X)^-1 * X^T * y
def normal_equation(X, y):
    # With only 4 training examples and 5 parameters, X^T * X is singular here,
    # so use the pseudoinverse (np.linalg.pinv) instead of np.linalg.inv
    X_transpose = X.T
    theta = np.linalg.pinv(X_transpose.dot(X)).dot(X_transpose).dot(y)
    return theta

# Calculate theta using the normal equation for multivariable regression
theta = normal_equation(X, y)
print("Calculated theta values for multivariable regression:", theta)

# Using the first feature (size) for plotting the regression line
sizes = X[:, 1]
predicted_prices = X.dot(theta)

# Plotting the regression line for multivariable regression
plt.scatter(sizes, y, color='red', label='Actual Prices')
plt.plot(sizes, predicted_prices, color='blue', label='Predicted Prices', linestyle='--')
plt.xlabel('Size (square feet)')
plt.ylabel('Price ($1000)')
plt.title('House Prices vs. Size (Multivariable Regression)')
plt.legend()
plt.show()

# Simple Linear Regression with size as the only feature
X_simple = X[:, [0, 1]]  # Only intercept term and size feature
theta_simple = normal_equation(X_simple, y)

# Predicting using the model with only size
predicted_prices_simple = X_simple.dot(theta_simple)

# Sorting the data by size for a proper line plot
sorted_indices = X[:, 1].argsort()
sizes_sorted = sizes[sorted_indices]
y_sorted = y[sorted_indices]
predicted_prices_sorted = predicted_prices_simple[sorted_indices]

# Plotting the regression line with size as the only feature
plt.scatter(sizes, y, color='red', label='Actual Prices')
plt.plot(sizes_sorted, predicted_prices_sorted, color='blue', label='Predicted Prices', linestyle='--')
plt.xlabel('Size (square feet)')
plt.ylabel('Price ($1000)')
plt.title('House Prices vs. Size (Simple Linear Regression)')
plt.legend()
plt.show()

print("Calculated theta values for simple linear regression:", theta_simple)

Figure: actual vs. predicted prices for the multivariable and size-only regressions.
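
As a side note, NumPy can solve the same least-squares problem without forming $X^TX$ explicitly; a minimal equivalent using the X and y defined above:

# np.linalg.lstsq minimizes ||X theta - y||^2 and is numerically more stable than inverting X^T X
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("Theta from np.linalg.lstsq:", theta_lstsq)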

Gradient Descent vs Normal Equation

Comparing the two methods clarifies when each is the better choice:

| Aspect | Gradient Descent | Normal Equation |
| --- | --- | --- |
| Learning rate | Requires selecting a learning rate $\alpha$ | No learning rate needed |
| Iterations | Needs many iterations | Direct computation, no iterations |
| Efficiency | Works well even for very large $n$ (millions of features) | Slow for large $n$, since computing $(X^TX)^{-1}$ costs roughly $O(n^3)$ |
| Use case | Preferred for very large feature sets | Ideal for smaller feature sets |

Understanding when to engineer polynomial features and how to choose between gradient descent and the normal equation is crucial for building efficient and effective linear regression models.

Reference

These notes are based on the free video lectures offered by Stanford University, led by Professor Andrew Ng. These lectures are part of the renowned Machine Learning course available on Coursera. For more information and to access the full course, visit the Coursera course page.

Table of Contents

  1. Linear Regression with Multiple Variables
    1. Hypothesis Function
    2. Cost Function
    3. Gradient Descent for Multiple Variables
    4. Feature Scaling
    5. Mean Normalization
    6. Learning Rate $\alpha$
    7. Automatic Convergence Tests
    8. Features and Polynomial Regression
    9. Normal Equation
      1. Procedure
      2. Example
    10. Gradient Descent vs Normal Equation
  2. Reference