Last modified: July 10, 2024

Linear Regression in Depth

Linear regression is a fundamental supervised learning algorithm in statistics and machine learning. It models the relationship between a dependent variable and one or more independent variables, with the goal of predicting continuous output values from those inputs.

Mathematical Model

The hypothesis function in linear regression is given by:

$$ h_{\theta}(x) = \theta_0 + \theta_1x $$

where:

- $h_{\theta}(x)$ is the predicted value (the hypothesis) for input $x$,
- $\theta_0$ is the intercept (bias) term,
- $\theta_1$ is the slope (weight) applied to the input feature $x$.
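
To make the hypothesis concrete, here is a minimal sketch (my own illustration, not from the original notes) that evaluates $h_{\theta}(x)$ for some made-up parameter values:

import numpy as np

def hypothesis(theta_0, theta_1, x):
    """Linear hypothesis h_theta(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# Illustrative (made-up) parameters and inputs
theta_0, theta_1 = 1.0, 2.0
x = np.array([0.0, 1.0, 2.5])
print(hypothesis(theta_0, theta_1, x))  # [1. 3. 6.]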

Cost Function (Mean Squared Error)

The cost function in linear regression measures how far off the predictions are from the actual outcomes. It's commonly represented as:

$$ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2 $$

This is the mean squared error (MSE) cost function, where:

- $m$ is the number of training examples,
- $x^{(i)}$ and $y^{(i)}$ are the input and target of the $i$-th training example,
- $h_{\theta}(x^{(i)})$ is the model's prediction for $x^{(i)}$.

The factor of $\frac{1}{2}$ is conventional; it cancels when the cost is differentiated and does not change the location of the minimum.
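
As a sanity check, the sketch below (my own illustration on a tiny made-up dataset) computes $J(\theta_0, \theta_1)$ directly from the formula:

import numpy as np

def compute_cost(theta_0, theta_1, x, y):
    """MSE cost: J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(x)
    predictions = theta_0 + theta_1 * x
    return (1 / (2 * m)) * np.sum((predictions - y) ** 2)

# Tiny made-up dataset: y is exactly 1 + 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(compute_cost(1.0, 2.0, x, y))  # 0.0 -> a perfect fit
print(compute_cost(0.0, 2.0, x, y))  # 0.5 -> predictions are off by 1 everywhere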

Optimization: Gradient Descent

To find the optimal parameters ($\theta_0$ and $\theta_1$), we use gradient descent to minimize the cost function. The gradient descent algorithm iteratively adjusts the parameters to reduce the cost.
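
Concretely, each iteration applies the following update to every parameter $\theta_j$ (here $j \in \{0, 1\}$), where $\alpha$ is the learning rate and both parameters are updated simultaneously:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$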

Notation

- $m$: number of training examples
- $x$: input (feature) variable
- $y$: output (target) variable
- $(x^{(i)}, y^{(i)})$: the $i$-th training example

Training Process

  1. Input: Training set.
  2. Algorithm: The learning algorithm processes this data.
  3. Output: Hypothesis function $h$ which estimates the value of $y$ for a new input $x$.
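
The following sketch shows this flow end to end with made-up numbers (not from the original notes): a training set goes in, a fitting routine stands in for the learning algorithm, and the resulting hypothesis $h$ predicts $y$ for a new input $x$.

import numpy as np

# 1. Input: a small made-up training set
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([2.9, 5.1, 7.0, 9.2])

# 2. Algorithm: fit theta_0, theta_1 by least squares (np.polyfit stands in for the learner)
theta_1, theta_0 = np.polyfit(x_train, y_train, 1)

# 3. Output: the hypothesis h, used to estimate y for a new input x
h = lambda x: theta_0 + theta_1 * x
print(f"theta_0 = {theta_0:.2f}, theta_1 = {theta_1:.2f}")
print(f"prediction for x = 5: {h(5.0):.2f}")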

Example: House Price Prediction

Using linear regression, we can predict house prices based on house size.

house price table
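
Since the original table is not reproduced here, the sketch below uses hypothetical size/price pairs purely to illustrate the idea: fit a line to (size, price) data and use it to predict the price of a house of a new size.

import numpy as np

# Hypothetical data: house sizes (m^2) and prices (in thousands)
sizes = np.array([50, 70, 80, 100, 120, 150], dtype=float)
prices = np.array([150, 200, 230, 285, 340, 420], dtype=float)

# Fit price = theta_0 + theta_1 * size by least squares
theta_1, theta_0 = np.polyfit(sizes, prices, 1)

new_size = 90.0
predicted_price = theta_0 + theta_1 * new_size
print(f"Predicted price for a {new_size:.0f} m^2 house: {predicted_price:.1f} thousand")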

Analyzing the Cost Function

For intuition, fix $\theta_0 = 0$ so the hypothesis reduces to $h_{\theta}(x) = \theta_1 x$ and the cost becomes a function of $\theta_1$ alone. Different values of $\theta_1$ then yield different costs $J(\theta_1)$, and optimization aims to find the value of $\theta_1$ that minimizes $J(\theta_1)$.

cost_function
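
A quick way to reproduce this picture is to sweep $\theta_1$ over a range (with $\theta_0$ fixed at 0) and evaluate $J(\theta_1)$ at each value. The sketch below (made-up data whose true slope is 2, so the minimum appears near $\theta_1 = 2$) does exactly that:

import numpy as np
import matplotlib.pyplot as plt

# Made-up data generated with a true slope of 2 (and theta_0 = 0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

theta_1_vals = np.linspace(-1, 5, 121)
costs = [np.sum((t1 * x - y) ** 2) / (2 * len(x)) for t1 in theta_1_vals]

plt.plot(theta_1_vals, costs)
plt.xlabel(r'$\theta_1$')
plt.ylabel(r'$J(\theta_1)$')
plt.title('Simplified cost function (theta_0 fixed at 0)')
plt.show()

print(f"Minimum at theta_1 = {theta_1_vals[int(np.argmin(costs))]:.2f}")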

A Deeper Insight into the Cost Function - Simplified Cost Function

In linear regression, the cost function $J(\theta_0, \theta_1)$ is a critical component, involving two parameters: $\theta_0$ and $\theta_1$. Visualization of this function can be achieved through different plots.

3D Surface Plot

The 3D surface plot illustrates the cost function, where:

- the two horizontal axes correspond to $\theta_0$ and $\theta_1$,
- the vertical axis shows the cost $J(\theta_0, \theta_1)$,
- the surface is bowl-shaped with a single global minimum, because $J$ is convex for linear regression.

surface_cost_function
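
The surface can be reproduced with matplotlib's 3D toolkit; the sketch below (my own illustration on made-up data) evaluates $J$ on a grid of $(\theta_0, \theta_1)$ values and plots the resulting bowl.

import numpy as np
import matplotlib.pyplot as plt

# Made-up data: y = 1 + 2x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 3, 30)
y = 1 + 2 * x + 0.3 * rng.standard_normal(30)

theta_0_vals = np.linspace(-3, 5, 80)
theta_1_vals = np.linspace(-1, 5, 80)
T0, T1 = np.meshgrid(theta_0_vals, theta_1_vals)

# Evaluate J(theta_0, theta_1) over the whole grid at once via broadcasting
errors = T0[..., None] + T1[..., None] * x - y        # shape (80, 80, 30)
J = np.mean(errors ** 2, axis=-1) / 2

ax = plt.figure(figsize=(8, 6)).add_subplot(projection='3d')
ax.plot_surface(T0, T1, J, cmap='viridis')
ax.set_xlabel(r'$\theta_0$')
ax.set_ylabel(r'$\theta_1$')
ax.set_zlabel(r'$J(\theta_0, \theta_1)$')
plt.show()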

Contour Plots

Contour plots offer a 2D perspective:

- each contour line connects parameter pairs $(\theta_0, \theta_1)$ with the same cost,
- contours nearer the center correspond to lower cost,
- the minimum of $J(\theta_0, \theta_1)$ lies at the center of the innermost contour.

contour_cost_function

Gradient Descent Algorithm

The gradient descent algorithm iteratively adjusts $\theta_0$ and $\theta_1$ to minimize the cost function, updating both parameters simultaneously at each step. The algorithm proceeds as follows:

θ = [0, 0]
repeat until convergence:
    temp_0 := θ_0 - α * ∂/∂θ_0 J(θ_0, θ_1)
    temp_1 := θ_1 - α * ∂/∂θ_1 J(θ_0, θ_1)
    θ_0 := temp_0
    θ_1 := temp_1

gradient_descent

Here is a complete Python program that generates mock data and implements gradient descent to fit a linear regression model. It also plots the path taken by gradient descent on the cost contours and the cost over iterations.

import numpy as np
import matplotlib.pyplot as plt

# Generate mock data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Feature normalization
X_mean = np.mean(X)
X_std = np.std(X)
X_norm = (X - X_mean) / X_std

# Add bias term (X0 = 1)
X_b = np.c_[np.ones((100, 1)), X_norm]

# Initialize parameters
theta = np.random.randn(2, 1) * 0.01
alpha = 0.1
n_iterations = 1000
m = len(X_b)
tolerance = 1e-7

# Cost function
def compute_cost(theta, X_b, y):
    return (1 / (2 * m)) * np.sum((X_b.dot(theta) - y) ** 2)

# Gradient Descent
cost_history = []
theta_history = [theta]

for iteration in range(n_iterations):
    gradients = (1 / m) * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - alpha * gradients
    cost = compute_cost(theta, X_b, y)
    cost_history.append(cost)
    theta_history.append(theta.copy())
    if len(cost_history) > 1 and np.abs(cost_history[-1] - cost_history[-2]) < tolerance:
        break

# Prepare data for contour plot
theta_0_vals = np.linspace(-10, 10, 100)
theta_1_vals = np.linspace(-4, 4, 100)
cost_vals = np.zeros((len(theta_0_vals), len(theta_1_vals)))

for i in range(len(theta_0_vals)):
    for j in range(len(theta_1_vals)):
        theta_ij = np.array([[theta_0_vals[i]], [theta_1_vals[j]]])
        cost_vals[i, j] = compute_cost(theta_ij, X_b, y)

# Plotting
fig, ax = plt.subplots(figsize=(10, 6))

# Contour plot: cost_vals[i, j] corresponds to (theta_0_vals[i], theta_1_vals[j]),
# so it must be transposed to match the meshgrid layout
theta_0_grid, theta_1_grid = np.meshgrid(theta_0_vals, theta_1_vals)
CS = ax.contour(theta_0_grid, theta_1_grid, cost_vals.T, levels=np.logspace(-2, 3, 20), cmap='viridis')
ax.clabel(CS, inline=True, fontsize=10)

# Plot the path of the gradient descent (each stored theta is a (2, 1) array, so flatten to rows)
theta_history = np.array(theta_history).reshape(-1, 2)
ax.plot(theta_history[:, 0], theta_history[:, 1], 'r-o', label='Gradient Descent Path')

# Annotate the start and end points
ax.annotate('Start', xy=(theta_history[0, 0], theta_history[0, 1]), xytext=(-9, 3),
            arrowprops=dict(facecolor='black', shrink=0.05), fontsize=12, color='black')
ax.annotate('End', xy=(theta_history[-1, 0], theta_history[-1, 1]), xytext=(theta_history[-1, 0] + 1, theta_history[-1, 1]),
            arrowprops=dict(facecolor='black', shrink=0.05), fontsize=12, color='black')

ax.set_xlabel(r'$\theta_0$', fontsize=14)
ax.set_ylabel(r'$\theta_1$', fontsize=14)
ax.set_title('Gradient Descent Path')
ax.legend()

plt.show()

# Plot cost function
plt.figure(figsize=(8, 4))
plt.plot(cost_history, 'b-', label='Cost Function')
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('Cost Function Over Iterations')
plt.legend()
plt.show()

Below is the gradient descent path:

Gradient Descent Path

And here is the cost function over iterations:

Cost Function

Key Elements of Gradient Descent

- The learning rate $\alpha$ controls the step size: too small and convergence is slow, too large and the algorithm can overshoot the minimum or diverge.
- The derivative term $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ determines the direction of the step, always pointing downhill on the cost surface.
- All parameters are updated simultaneously, using the values from the previous iteration.
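
To see the role of the learning rate in isolation, the sketch below (made-up data, simplified to a single parameter $\theta_1$ with $\theta_0 = 0$) runs a few gradient descent steps with a small and a large $\alpha$: the small one converges steadily toward the true slope, while the large one overshoots and diverges.

import numpy as np

# Made-up data with true slope 2 (theta_0 fixed at 0 for simplicity)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x
m = len(x)

def run_gd(alpha, steps=10):
    theta_1 = 0.0
    for _ in range(steps):
        grad = (1 / m) * np.sum((theta_1 * x - y) * x)
        theta_1 -= alpha * grad
    return theta_1

print(f"alpha = 0.05 -> theta_1 = {run_gd(0.05):.3f}")  # approaches 2
print(f"alpha = 0.30 -> theta_1 = {run_gd(0.30):.3e}")  # overshoots; magnitude grows with more steps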

Partial Derivative vs. Derivative

The symbol $\partial$ (partial derivative) is used because $J(\theta_0, \theta_1)$ depends on more than one parameter and we differentiate with respect to one of them while holding the other fixed. An ordinary derivative $\frac{d}{d\theta_1}$ applies in the simplified case where the cost depends on a single parameter, as in $J(\theta_1)$.

At a Local Minimum

At a local minimum the derivative is zero, so the update leaves $\theta_1$ unchanged:

$$\theta_1 := \theta_1 - \alpha \cdot 0 = \theta_1$$

local_minimum
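
A quick numerical illustration (made-up data again): at the least-squares optimum the gradient is essentially zero, so a gradient descent step leaves the parameter unchanged.

import numpy as np

# Made-up data with exact relationship y = 2x, so the optimum is theta_1 = 2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x
m = len(x)

theta_1 = 2.0                                   # already at the minimum
grad = (1 / m) * np.sum((theta_1 * x - y) * x)  # derivative of J at the minimum
alpha = 0.1

print(f"gradient at the minimum: {grad}")                      # 0.0
print(f"theta_1 after one update: {theta_1 - alpha * grad}")   # still 2.0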

Through gradient descent, the optimal $\theta_0$ and $\theta_1$ values are identified, minimizing the cost function and enhancing the linear regression model's performance.

Linear Regression with Gradient Descent

In linear regression, gradient descent is applied to minimize the squared error cost function $J(\theta_0, \theta_1)$. The process involves calculating the partial derivatives of the cost function with respect to each parameter $\theta_0$ and $\theta_1$.

Partial Derivatives of the Cost Function

The gradient of the cost function is computed as follows:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$$

$$= \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1x^{(i)} - y^{(i)})^2$$

The partial derivatives for each $\theta_j$ are:

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})$$

$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})\, x^{(i)}$$
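
One way to gain confidence in these expressions is to compare them against a finite-difference approximation of the derivative; the sketch below (made-up data, my own illustration) does this for both partial derivatives.

import numpy as np

# Made-up data
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=50)
y = 4 + 3 * x + rng.standard_normal(50)
m = len(x)

def cost(t0, t1):
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

t0, t1 = 1.0, 1.5   # arbitrary point at which to check the gradient
eps = 1e-6

# Analytic partial derivatives from the formulas above
d_t0 = np.sum(t0 + t1 * x - y) / m
d_t1 = np.sum((t0 + t1 * x - y) * x) / m

# Central finite-difference approximations
d_t0_num = (cost(t0 + eps, t1) - cost(t0 - eps, t1)) / (2 * eps)
d_t1_num = (cost(t0, t1 + eps) - cost(t0, t1 - eps)) / (2 * eps)

print(f"d/d theta_0: analytic {d_t0:.6f} vs numeric {d_t0_num:.6f}")
print(f"d/d theta_1: analytic {d_t1:.6f} vs numeric {d_t1_num:.6f}")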

Normal Equations Method

For solving the minimization problem $\min J(\theta_0, \theta_1)$, the normal equations method offers an alternative to gradient descent. This analytical method yields an exact closed-form solution, avoiding the iterative updates of gradient descent. It can be faster when the number of features is modest, but it is computationally more involved and will be covered in detail later.

Below is the Python code to generate mock data and solve the minimization problem $\min J(\theta_0, \theta_1)$ using the normal equations method.

import numpy as np
import matplotlib.pyplot as plt

# Generate mock data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add bias term (X0 = 1)
X_b = np.c_[np.ones((100, 1)), X]

# Normal Equations method: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# Display the results
print(f"Theta obtained by the Normal Equations method: {theta_best.ravel()}")

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(X, y, 'b.', label='Data Points')
plt.plot(X, X_b.dot(theta_best), 'r-', label='Prediction', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression using Normal Equations')
plt.legend()
plt.show()

Normal Equation Fit

Extension of the Current Model

The linear regression model can be extended to include multiple features. For example, in a housing model, features could include size, age, number of bedrooms, and number of floors. However, a challenge arises when dealing with more than three dimensions, as visualization becomes difficult. To effectively manage multiple features, linear algebra concepts like matrices and vectors are employed, facilitating calculations and interpretations in higher-dimensional spaces.
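
As a glimpse of that extension, the sketch below (with made-up housing-style features, not from the original notes) stacks several features into a design matrix and fits all the parameters at once with a least-squares solve; the same hypothesis-and-cost framework carries over, just in vector form.

import numpy as np

# Made-up data: price depends on size, age, and number of bedrooms
rng = np.random.default_rng(3)
n = 200
size = rng.uniform(50, 200, n)        # m^2
age = rng.uniform(0, 40, n)           # years
bedrooms = rng.integers(1, 6, n)      # count
price = 50 + 2.0 * size - 1.5 * age + 20 * bedrooms + rng.standard_normal(n) * 10

# Design matrix with a column of ones for theta_0
X = np.column_stack([np.ones(n), size, age, bedrooms])

# Least-squares solution of X theta ≈ price (more stable than inverting X^T X)
theta, *_ = np.linalg.lstsq(X, price, rcond=None)
print("theta (intercept, size, age, bedrooms):", np.round(theta, 2))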

Reference

These notes are based on the free video lectures offered by Stanford University, led by Professor Andrew Ng. These lectures are part of the renowned Machine Learning course available on Coursera. For more information and to access the full course, visit the Coursera course page.

Table of Contents

  1. Linear Regression in Depth
    1. Mathematical Model
    2. Cost Function (Mean Squared Error)
    3. Optimization: Gradient Descent
    4. Notation
    5. Training Process
    6. Example: House Price Prediction
    7. Analyzing the Cost Function
    8. A Deeper Insight into the Cost Function - Simplified Cost Function
      1. 3D Surface Plot
      2. Contour Plots
    9. Gradient Descent Algorithm
      1. Key Elements of Gradient Descent
      2. Partial Derivative vs. Derivative
      3. At a Local Minimum
      4. Linear Regression with Gradient Descent
    10. Partial Derivatives of the Cost Function
    11. Normal Equations Method
    12. Extension of the Current Model
  2. Reference