
Multiple Linear Regression

Multiple linear regression is a statistical technique used to model the relationship between a single dependent variable and two or more independent variables. It extends the concept of simple linear regression by incorporating multiple predictors to explain the variability in the dependent variable. This method is widely used in fields such as economics, engineering, social sciences, and natural sciences to predict outcomes and understand the impact of various factors.

The Multiple Linear Regression Model

The general form of the multiple linear regression model is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

Where:

- $y$ is the dependent (response) variable.
- $x_1, x_2, \ldots, x_p$ are the independent (predictor) variables.
- $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_p$ are the regression coefficients.
- $\varepsilon$ is the random error term.

Matrix Representation

In matrix notation, the model can be expressed as:

$$y = X\beta + \varepsilon$$

Where:

- $y$ is the $n \times 1$ vector of observed responses.
- $X$ is the $n \times (p+1)$ design matrix whose first column is all ones (for the intercept).
- $\beta$ is the $(p+1) \times 1$ vector of coefficients.
- $\varepsilon$ is the $n \times 1$ vector of error terms.

Assumptions of the Model

For the multiple linear regression model to provide valid results, several key assumptions must be met:

  1. Linearity means that the relationship between the dependent variable and each independent variable is linear.
  2. Independence assumes that the observations are independent of one another.
  3. Homoscedasticity ensures that the variance of the error terms remains constant across all levels of the independent variables.
  4. Normality requires that the error terms are normally distributed with a mean of zero.
  5. Finally, no multicollinearity ensures that the independent variables are not perfectly correlated with each other.

Estimation of Coefficients

Least Squares Method

The coefficients $\beta$ are estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared residuals (the differences between the observed and predicted values of $y$).

The objective is to find $\hat{\beta}$ such that:

$$\hat{\beta} = \arg\min_{\beta} \, (y - X\beta)^\top (y - X\beta)$$

Solution Using Matrix Algebra

By taking the derivative of the sum of squared residuals with respect to $\beta$ and setting it to zero, we obtain the normal equations:

$$X^\top X \hat{\beta} = X^\top y$$

Solving for $\hat{\beta}$:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
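As a quick illustration, here is a minimal NumPy sketch of the normal-equation solution, using the hours-studied/attendance data from the second worked example later in this article; in practice `np.linalg.lstsq` (or a dedicated statistics library) is preferred for numerical stability.

```python
import numpy as np

# Design matrix with a leading column of ones for the intercept
# (data taken from the worked example later in this article)
X = np.array([[1, 2, 70],
              [1, 3, 80],
              [1, 5, 60],
              [1, 7, 90],
              [1, 9, 95]], dtype=float)
y = np.array([65, 70, 75, 85, 95], dtype=float)

# Normal-equation estimate: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer equivalent
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)   # [intercept, slope for hours studied, slope for attendance]
```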

Conditions for Invertibility:

- $X^\top X$ is invertible only when the columns of $X$ are linearly independent (full column rank).
- This requires at least $p + 1$ observations and no perfect multicollinearity among the predictors.

Interpretation of Coefficients

Each slope $\hat{\beta}_j$ estimates the expected change in $y$ for a one-unit increase in $x_j$, holding all other predictors constant. The intercept $\hat{\beta}_0$ estimates the expected value of $y$ when every predictor equals zero, which may not correspond to a realistic scenario.

Assessing Model Fit

Coefficient of Determination ($R^2$)

$$R^2 = 1 - \frac{SSR}{SST}$$

where $SSR$ is the residual sum of squares and $SST$ is the total sum of squares. An $R^2$ value close to 1 indicates a good fit.

Adjusted $R^2$

Adjusts $R^2$ for the number of predictors in the model:

$$\text{Adjusted } R^2 = 1 - \frac{SSR/(n - p - 1)}{SST/(n - 1)}$$
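A small helper showing both formulas; it assumes `y_hat` holds the fitted values from some regression and `p` is the number of predictors (a minimal sketch, not a full modeling workflow).

```python
import numpy as np

def fit_statistics(y, y_hat, p):
    """Return (R^2, adjusted R^2) given observed y, fitted y_hat, and p predictors."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    ssr = np.sum((y - y_hat) ** 2)         # residual sum of squares (SSR)
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares (SST)
    r2 = 1 - ssr / sst
    adj_r2 = 1 - (ssr / (n - p - 1)) / (sst / (n - 1))
    return r2, adj_r2

# Hypothetical observed and fitted values from a two-predictor model
print(fit_statistics([65, 70, 75, 85, 95], [64.8, 69.8, 75.0, 86.1, 94.3], p=2))
```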

Hypothesis Testing

Testing Individual Coefficients

$$t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$

Under $H_0: \beta_j = 0$, this statistic follows a $t$ distribution with $n - p - 1$ degrees of freedom.

Testing Overall Model Significance

$$F = \frac{(SST - SSR)/p}{SSR/(n - p - 1)}$$

This statistic tests $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$; under $H_0$ it follows an $F$ distribution with $p$ and $n - p - 1$ degrees of freedom.
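The sketch below computes the $t$ statistics and the overall $F$ statistic by hand with NumPy, reusing the same illustrative data as above; statistical packages report these automatically, so this is only to make the formulas concrete.

```python
import numpy as np

# Same illustrative data as in the OLS sketch above
X = np.array([[1, 2, 70], [1, 3, 80], [1, 5, 60], [1, 7, 90], [1, 9, 95]], dtype=float)
y = np.array([65, 70, 75, 85, 95], dtype=float)

n, k = X.shape                  # k = p + 1 columns (intercept plus p predictors)
p = k - 1
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
ssr = residuals @ residuals                          # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)                    # total sum of squares

sigma2_hat = ssr / (n - p - 1)                       # estimated error variance
se_beta = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

t_stats = beta_hat / se_beta                         # t_j = beta_hat_j / SE(beta_hat_j)
f_stat = ((sst - ssr) / p) / (ssr / (n - p - 1))     # overall model F statistic
print(t_stats, f_stat)
```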

Diagnosing Multicollinearity

Variance Inflation Factor (VIF)

Measures how much the variance of an estimated coefficient increases due to multicollinearity:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the coefficient of determination from regressing $x_j$ on all the other predictors.
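A minimal sketch of how $VIF_j$ can be computed by regressing each predictor on the others; the helper name and the small example matrix are illustrative, not part of the original text.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    of X (predictors only, no intercept column) on all remaining columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])   # add intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        r2_j = 1 - np.sum((target - A @ coef) ** 2) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)

# Two moderately correlated predictors (hours studied, attendance rate)
print(vif(np.array([[2, 70], [3, 80], [5, 60], [7, 90], [9, 95]])))
```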

Remedies

- Remove or combine highly correlated predictors.
- Collect additional data, if feasible.
- Use dimensionality-reduction techniques such as principal component analysis.
- Apply regularization (e.g., ridge regression), which handles correlated predictors more gracefully.

Assumption Diagnostics

Residual Analysis

- Plot residuals against fitted values: a random scatter supports linearity and homoscedasticity, while curvature or a funnel shape suggests a violation.
- Use a normal Q-Q plot of the residuals to check the normality assumption.

Durbin-Watson Test

Checks for autocorrelation in residuals:

$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$

Values close to 2 indicate no autocorrelation.
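A tiny helper that evaluates the statistic for a given residual vector (the residual values shown are hypothetical):

```python
import numpy as np

def durbin_watson(residuals):
    """D = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residuals; a value near 2 suggests no first-order autocorrelation
print(durbin_watson([0.16, 0.21, -0.04, -1.09, 0.76]))
```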

Extensions

Interaction Terms

Include products of independent variables to model interactions:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 x_2) + \varepsilon$$
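In a least-squares fit the interaction is simply an extra column of the design matrix equal to $x_1 x_2$; a short sketch (reusing the data from the second worked example below purely for illustration):

```python
import numpy as np

# Illustrative data; the interaction term is just an extra column x1 * x2
x1 = np.array([2.0, 3.0, 5.0, 7.0, 9.0])
x2 = np.array([70.0, 80.0, 60.0, 90.0, 95.0])
y = np.array([65.0, 70.0, 75.0, 85.0, 95.0])

X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # [beta_0, beta_1, beta_2, beta_3]
```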

Polynomial Regression

Model non-linear relationships by including polynomial terms:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon$$
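A quadratic fit ($k = 2$) on hypothetical data, built by adding a squared column to the design matrix; the data values are made up for illustration.

```python
import numpy as np

# Quadratic fit: powers of x become extra columns of the design matrix
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 8.2, 14.1, 22.3])   # hypothetical responses

X = np.column_stack([np.ones_like(x), x, x ** 2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                 # [beta_0, beta_1, beta_2]

# np.polyfit returns the same coefficients, highest degree first
print(np.polyfit(x, y, 2))
```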

Regularization Techniques

Ridge Regression: Adds penalty for large coefficients.

$$\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

Lasso Regression: Encourages sparsity in coefficients.

$$\text{Minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
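A short sketch of the closed-form ridge estimate; the function name `ridge_fit` and the choice to leave standardization to the caller are assumptions for illustration. Lasso has no closed form, so in practice it is fitted iteratively (for example with scikit-learn's `Lasso`).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lambda*I)^{-1} X'y.
    For simplicity this penalizes every column, including any intercept;
    in practice predictors are standardized and the intercept left unpenalized."""
    XtX = X.T @ X
    return np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), X.T @ y)

# Illustrative call on the worked-example data
X = np.array([[1, 2, 70], [1, 3, 80], [1, 5, 60], [1, 7, 90], [1, 9, 95]], dtype=float)
y = np.array([65, 70, 75, 85, 95], dtype=float)
print(ridge_fit(X, y, lam=1.0))
```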

Example: Multicollinearity Between Variables

Suppose we have the following data on the number of hours studied ($x_1$), the number of practice exams taken ($x_2$), and the test scores ($y$):

| Hours Studied ($x_1$) | Practice Exams ($x_2$) | Test Score ($y$) |
|---|---|---|
| 2 | 1 | 50 |
| 4 | 2 | 60 |
| 6 | 3 | 70 |
| 8 | 4 | 80 |

Observations

Before proceeding, it's important to notice that $x_2$ is directly proportional to $x_1$:

$$x_2 = \frac{1}{2} x_1$$

This means that there is perfect multicollinearity between $x_1$ and $x_2$. In multiple linear regression, perfect multicollinearity causes the design matrix to be singular, making it impossible to uniquely estimate the regression coefficients for $x_1$ and $x_2$.

Implications of Multicollinearity

When independent variables are perfectly correlated, the matrix $X^\top X$ (where $X$ is the design matrix) becomes singular (non-invertible). This prevents us from calculating the coefficients using the normal equation:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
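This can be checked numerically; the sketch below builds the design matrix for the table above and shows that $X^\top X$ has deficient rank, so its inverse does not exist and the coefficients are not uniquely determined.

```python
import numpy as np

# Design matrix for the table above: intercept, hours studied, practice exams
X = np.array([[1, 2, 1],
              [1, 4, 2],
              [1, 6, 3],
              [1, 8, 4]], dtype=float)
y = np.array([50, 60, 70, 80], dtype=float)

print(np.linalg.matrix_rank(X.T @ X))   # 2, not 3: X'X is singular

# lstsq still runs, but it returns only one of the infinitely many
# coefficient vectors that fit these data equally well
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```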

Adjusted Approach

Given the perfect linear relationship between $x_1$ and $x_2$, we can simplify the model by combining $x_1$ and $x_2$ into a single variable or by using only one of them in the regression.

Simplifying the Model

Since $x_2 = \frac{1}{2} x_1$, we can express $y$ solely in terms of $x_1$:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 \left(\tfrac{1}{2} x_1\right) = \beta_0 + \left(\beta_1 + \tfrac{\beta_2}{2}\right) x_1$$

Let $\gamma = \beta_1 + \frac{\beta_2}{2}$. The model becomes:

$$y = \beta_0 + \gamma x_1$$

Now, we can proceed with a simple linear regression of $y$ on $x_1$.

Step-by-Step Calculation

1. Calculate the Means

Compute the mean of $x_1$ and $y$:

$$\bar{x}_1 = \frac{2 + 4 + 6 + 8}{4} = \frac{20}{4} = 5$$

$$\bar{y} = \frac{50 + 60 + 70 + 80}{4} = \frac{260}{4} = 65$$

2. Calculate the Sum of Squares

Compute the sum of squares for $x_1$ and the cross-product of $x_1$ and $y$:

$$SS_{x_1 x_1} = \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2 = (2-5)^2 + (4-5)^2 + (6-5)^2 + (8-5)^2 = 20$$

$$SS_{x_1 y} = \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(y_i - \bar{y}) = (2-5)(50-65) + (4-5)(60-65) + (6-5)(70-65) + (8-5)(80-65) = 100$$

3. Calculate the Regression Coefficients

Compute the slope ($\hat{\gamma}$) and intercept ($\hat{\beta}_0$):

$$\hat{\gamma} = \frac{SS_{x_1 y}}{SS_{x_1 x_1}} = \frac{100}{20} = 5$$

$$\hat{\beta}_0 = \bar{y} - \hat{\gamma}\,\bar{x}_1 = 65 - (5)(5) = 40$$

4. Write the Regression Equation

The best-fitting line is:

$$\hat{y} = 40 + 5 x_1$$

5. Verify the Model with the Data

Compute the predicted $y$ values using the regression equation:

For $x_1 = 2$:

$$\hat{y} = 40 + 5(2) = 50$$

For $x_1 = 4$:

$$\hat{y} = 40 + 5(4) = 60$$

For $x_1 = 6$:

$$\hat{y} = 40 + 5(6) = 70$$

For $x_1 = 8$:

$$\hat{y} = 40 + 5(8) = 80$$

The predicted values match the actual test scores perfectly.
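As a cross-check, a one-line NumPy fit (a simple degree-1 polynomial fit, shown here only for verification) recovers the same intercept and slope:

```python
import numpy as np

x1 = np.array([2, 4, 6, 8], dtype=float)
y = np.array([50, 60, 70, 80], dtype=float)

slope, intercept = np.polyfit(x1, y, 1)   # degree-1 fit, highest power first
print(intercept, slope)                   # 40.0 and 5.0, as derived above
```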

Plot: the fitted line $\hat{y} = 40 + 5x_1$ passes exactly through all four data points.

Example: No Perfect Multicollinearity

Suppose we have the following data on the number of hours studied ($x_1$), attendance rate ($x_2$) as a percentage, and test scores ($y$):

| Student ($i$) | Hours Studied ($x_{1i}$) | Attendance Rate ($x_{2i}$) | Test Score ($y_i$) |
|---|---|---|---|
| 1 | 2 | 70 | 65 |
| 2 | 3 | 80 | 70 |
| 3 | 5 | 60 | 75 |
| 4 | 7 | 90 | 85 |
| 5 | 9 | 95 | 95 |

Objective

We aim to fit a multiple linear regression model of the form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

where:

- $y$ is the test score,
- $x_1$ is the number of hours studied,
- $x_2$ is the attendance rate (as a percentage),
- $\varepsilon$ is the error term.

Step-by-Step Calculation

1. Organize the Data

First, compute the necessary sums and products:

| $i$ | $x_{1i}$ | $x_{2i}$ | $y_i$ | $x_{1i}^2$ | $x_{2i}^2$ | $x_{1i} x_{2i}$ | $x_{1i} y_i$ | $x_{2i} y_i$ |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 70 | 65 | 4 | 4,900 | 140 | 130 | 4,550 |
| 2 | 3 | 80 | 70 | 9 | 6,400 | 240 | 210 | 5,600 |
| 3 | 5 | 60 | 75 | 25 | 3,600 | 300 | 375 | 4,500 |
| 4 | 7 | 90 | 85 | 49 | 8,100 | 630 | 595 | 7,650 |
| 5 | 9 | 95 | 95 | 81 | 9,025 | 855 | 855 | 9,025 |
| Total | 26 | 395 | 390 | 168 | 32,025 | 2,165 | 2,165 | 31,325 |

2. Compute the Means

$$\bar{x}_1 = \frac{\sum x_{1i}}{n} = \frac{26}{5} = 5.2$$

$$\bar{x}_2 = \frac{\sum x_{2i}}{n} = \frac{395}{5} = 79$$

$$\bar{y} = \frac{\sum y_i}{n} = \frac{390}{5} = 78$$

3. Compute Sum of Squares and Cross Products

Sum of Squares for $x_1$:

$$SS_{x_1 x_1} = \sum x_{1i}^2 - n\bar{x}_1^2 = 168 - 5(5.2)^2 = 168 - 135.2 = 32.8$$

Sum of Squares for $x_2$:

$$SS_{x_2 x_2} = \sum x_{2i}^2 - n\bar{x}_2^2 = 32{,}025 - 5(79)^2 = 32{,}025 - 31{,}205 = 820$$

Sum of Cross Products between $x_1$ and $x_2$:

$$SS_{x_1 x_2} = \sum x_{1i} x_{2i} - n\bar{x}_1 \bar{x}_2 = 2{,}165 - 5(5.2)(79) = 2{,}165 - 2{,}054 = 111$$

Sum of Cross Products between $x_1$ and $y$:

$$SS_{x_1 y} = \sum x_{1i} y_i - n\bar{x}_1 \bar{y} = 2{,}165 - 5(5.2)(78) = 2{,}165 - 2{,}028 = 137$$

Sum of Cross Products between $x_2$ and $y$:

$$SS_{x_2 y} = \sum x_{2i} y_i - n\bar{x}_2 \bar{y} = 31{,}325 - 5(79)(78) = 31{,}325 - 30{,}810 = 515$$

4. Compute the Regression Coefficients

We use the formulas for multiple linear regression coefficients:

Denominator (Determinant):

$$D = SS_{x_1 x_1} SS_{x_2 x_2} - (SS_{x_1 x_2})^2 = (32.8)(820) - (111)^2 = 26{,}896 - 12{,}321 = 14{,}575$$

Coefficient $\hat{\beta}_1$:

$$\hat{\beta}_1 = \frac{SS_{x_1 y}\, SS_{x_2 x_2} - SS_{x_1 x_2}\, SS_{x_2 y}}{D}$$

$$\hat{\beta}_1 = \frac{(137)(820) - (111)(515)}{14{,}575} = \frac{112{,}340 - 57{,}165}{14{,}575} = \frac{55{,}175}{14{,}575} \approx 3.785$$

Coefficient $\hat{\beta}_2$:

$$\hat{\beta}_2 = \frac{SS_{x_2 y}\, SS_{x_1 x_1} - SS_{x_1 x_2}\, SS_{x_1 y}}{D}$$

$$\hat{\beta}_2 = \frac{(515)(32.8) - (111)(137)}{14{,}575} = \frac{16{,}892 - 15{,}207}{14{,}575} = \frac{1{,}685}{14{,}575} \approx 0.116$$

Intercept $\hat{\beta}_0$:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2$$

$$\hat{\beta}_0 = 78 - (3.785)(5.2) - (0.116)(79) = 78 - 19.682 - 9.164 = 49.154$$

5. Write the Regression Equation

The estimated multiple linear regression model is:

$$\hat{y} = 49.154 + 3.785 x_1 + 0.116 x_2$$

6. Interpret the Coefficients

- $\hat{\beta}_1 \approx 3.785$: holding attendance constant, each additional hour studied is associated with an increase of about 3.785 points in the predicted test score.
- $\hat{\beta}_2 \approx 0.116$: holding hours studied constant, each additional percentage point of attendance is associated with an increase of about 0.116 points in the predicted test score.
- $\hat{\beta}_0 \approx 49.154$: the predicted score when both predictors are zero; this lies outside the range of the data and mainly anchors the regression surface.

7. Verify the Model with the Data

Compute the predicted test scores ($\hat{y}_i$) and residuals ($e_i = y_i - \hat{y}_i$).

For Student 1:

$$\hat{y}_1 = 49.154 + 3.785(2) + 0.116(70) = 49.154 + 7.570 + 8.120 = 64.844$$

$$e_1 = y_1 - \hat{y}_1 = 65 - 64.844 = 0.156$$

For Student 2:

$$\hat{y}_2 = 49.154 + 3.785(3) + 0.116(80) = 49.154 + 11.355 + 9.280 = 69.789$$

$$e_2 = 70 - 69.789 = 0.211$$

For Student 3:

$$\hat{y}_3 = 49.154 + 3.785(5) + 0.116(60) = 49.154 + 18.925 + 6.960 = 75.039$$

$$e_3 = 75 - 75.039 = -0.039$$

For Student 4:

$$\hat{y}_4 = 49.154 + 3.785(7) + 0.116(90) = 49.154 + 26.495 + 10.440 = 86.089$$

$$e_4 = 85 - 86.089 = -1.089$$

For Student 5:

$$\hat{y}_5 = 49.154 + 3.785(9) + 0.116(95) = 49.154 + 34.065 + 11.020 = 94.239$$

$$e_5 = 95 - 94.239 = 0.761$$

The residuals are small, indicating a good fit of the model to the data.
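As a cross-check, solving the least-squares problem numerically reproduces the hand-computed coefficients up to rounding (the small differences come from rounding the slopes before computing the intercept by hand):

```python
import numpy as np

X = np.array([[1, 2, 70],
              [1, 3, 80],
              [1, 5, 60],
              [1, 7, 90],
              [1, 9, 95]], dtype=float)
y = np.array([65, 70, 75, 85, 95], dtype=float)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)            # approximately [49.18, 3.79, 0.12]
print(y - X @ beta)    # residuals, close to the hand-computed values
```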

Plot: observed test scores together with the values predicted from hours studied and attendance rate.

Checking for Multicollinearity

Compute the correlation coefficient between $x_1$ and $x_2$:

$$r_{x_1 x_2} = \frac{SS_{x_1 x_2}}{\sqrt{SS_{x_1 x_1} \times SS_{x_2 x_2}}}$$

$$r_{x_1 x_2} = \frac{111}{\sqrt{32.8 \times 820}} = \frac{111}{\sqrt{26{,}896}} = \frac{111}{164} \approx 0.677$$

A correlation coefficient of approximately 0.677 indicates a moderate correlation, not perfect multicollinearity.
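The same value can be confirmed with NumPy:

```python
import numpy as np

x1 = np.array([2, 3, 5, 7, 9], dtype=float)
x2 = np.array([70, 80, 60, 90, 95], dtype=float)

print(np.corrcoef(x1, x2)[0, 1])   # about 0.677, matching the hand calculation
```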
