Simple linear regression is a fundamental statistical method used to model the relationship between a single dependent variable and one independent variable. It aims to find the best-fitting straight line through the data points, which can be used to predict the dependent variable based on the independent variable.
The Simple Linear Regression Model
The mathematical representation of the simple linear regression model is:
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \dots, n $$
Where:
- $y_i$ is the $i$-th observation of the dependent variable.
- $x_i$ is the $i$-th observation of the independent variable.
- $\beta_0$ is the intercept (the expected value of $y$ when $x = 0$).
- $\beta_1$ is the slope (the average change in $y$ for a one-unit change in $x$).
- $\varepsilon_i$ is the error term, assumed to be independently and identically distributed with mean zero and constant variance $\sigma^2$.
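To make the model concrete, the sketch below simulates data from it. The parameter values $\beta_0 = 40$, $\beta_1 = 5$, and $\sigma = 3$ are arbitrary illustrative choices (not from any real dataset), and NumPy is assumed to be available:

```python
import numpy as np

# Simulate n observations from y_i = beta0 + beta1 * x_i + eps_i.
rng = np.random.default_rng(seed=42)

n = 50
beta0, beta1, sigma = 40.0, 5.0, 3.0   # hypothetical "true" parameters
x = rng.uniform(0, 10, size=n)         # independent variable
eps = rng.normal(0.0, sigma, size=n)   # i.i.d. errors: mean 0, variance sigma^2
y = beta0 + beta1 * x + eps            # dependent variable
```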
Assumptions of the Model
For the simple linear regression model to be valid, several key assumptions must be met (a quick diagnostic sketch follows the list):
- Linearity indicates that the relationship between $x$ and $y$ is linear.
- Independence means the residuals (errors) $\varepsilon_i$ are independent of one another.
- Homoscedasticity means that the residuals have a constant variance ($\sigma^2$) across all values of $x$.
- Normality assumes that the residuals follow a normal distribution.
- Finally, no measurement error in $x$ means that the independent variable is measured without error.
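A quick way to probe some of these assumptions in practice is to fit the line and inspect the residuals. The sketch below uses data simulated as in the previous block and assumes SciPy is available; the Shapiro-Wilk test addresses normality, while the correlation between $|e_i|$ and $x$ is only an informal check for non-constant variance:

```python
import numpy as np
from scipy import stats

# Simulated data as in the previous sketch (hypothetical parameters)
rng = np.random.default_rng(seed=42)
x = rng.uniform(0, 10, size=50)
y = 40.0 + 5.0 * x + rng.normal(0.0, 3.0, size=50)

# Least-squares fit (formulas derived in the next section) and residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Normality check: Shapiro-Wilk test on the residuals
print(stats.shapiro(residuals))

# Informal homoscedasticity check: |residuals| should not trend with x
print(np.corrcoef(np.abs(residuals), x)[0, 1])
```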
Estimation of Coefficients Using the Least Squares Method
The goal is to find estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared residuals (differences between observed and predicted values):
$$ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 $$
Calculating the Slope ($\hat{\beta}_1$) and Intercept ($\hat{\beta}_0$)
The least squares estimates are calculated using the following formulas:
Slope ($\hat{\beta}_1$)
$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)} $$
Intercept ($\hat{\beta}_0$)
$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$
Where:
- $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$ is the mean of the independent variable.
- $\bar{y} = \dfrac{1}{n} \sum_{i=1}^{n} y_i$ is the mean of the dependent variable.
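These formulas translate directly into code. A minimal sketch, assuming NumPy (the function name `fit_simple_ols` is just a label used here):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Least-squares estimates (beta0_hat, beta1_hat) for y = beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# The hours-studied data from the worked example below recovers (40.0, 5.0):
print(fit_simple_ols([2, 4, 6, 8], [50, 60, 70, 80]))
```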
Interpretation of the Coefficients
- The intercept ($\hat{\beta}_0$) represents the expected value of $y$ when $x = 0$, marking the point where the regression line intersects the $y$-axis.
- The slope ($\hat{\beta}_1$) indicates the average change in $y$ for each one-unit increase in $x$.
Assessing the Fit of the Model
Total Sum of Squares (SST)
Measures the total variability in the dependent variable:
$$ \text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$
Regression Sum of Squares (SSR)
Measures the variability explained by the regression:
$$ \text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 $$
Sum of Squared Errors (SSE)
Measures the unexplained variability:
$$ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
Coefficient of Determination ($R^2$)
Indicates the proportion of variance in $y$ explained by $x$:
$$ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} $$
An $R^2$ value close to 1 suggests a strong linear relationship.
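All four quantities are easy to compute once the line is fitted. A minimal sketch, assuming NumPy, which also reproduces the worked example later in this article:

```python
import numpy as np

def fit_summary(x, y):
    """Return SST, SSR, SSE, and R^2 for a simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    sst = np.sum((y - y.mean()) ** 2)      # total variability
    ssr = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the regression
    sse = np.sum((y - y_hat) ** 2)         # unexplained variability
    return sst, ssr, sse, ssr / sst

# The worked example below gives SST = 500, SSR = 500, SSE = 0, R^2 = 1:
print(fit_summary([2, 4, 6, 8], [50, 60, 70, 80]))
```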
Hypothesis Testing
Testing the Significance of the Slope
- Null Hypothesis ($H_0$): $\beta_1 = 0$ (no linear relationship)
- Alternative Hypothesis ($H_a$): $\beta_1 \neq 0$
Test Statistic:
$$ t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} $$
Where:
$$ \text{SE}(\hat{\beta}_1) = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} $$
And:
$$ s = \sqrt{\frac{\text{SSE}}{n - 2}} $$
The test statistic follows a $t$-distribution with $n - 2$ degrees of freedom.
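Putting the pieces together, the statistic and its two-sided $p$-value can be computed as follows (a sketch assuming NumPy and SciPy):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """Two-sided t-test of H0: beta1 = 0 for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - (b0 + b1 * x)) ** 2)
    s = np.sqrt(sse / (n - 2))            # residual standard error
    se_b1 = s / np.sqrt(sxx)              # standard error of the slope
    t = b1 / se_b1
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
    return t, p
```

Note that for a perfect fit, as in the example below, $\text{SE}(\hat{\beta}_1) = 0$ and the statistic is undefined; the test is meaningful only when $s > 0$.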
Example
Suppose we have the following data on the number of hours studied ($x$) and the test scores ($y$):
| Hours Studied ($x$) | Test Score ($y$) |
|---------------------|------------------|
| 2                   | 50               |
| 4                   | 60               |
| 6                   | 70               |
| 8                   | 80               |
Step-by-Step Calculation
1. Calculate the Means
$$ \bar{x} = \frac{2 + 4 + 6 + 8}{4} = 5 $$
$$ \bar{y} = \frac{50 + 60 + 70 + 80}{4} = 65 $$
2. Compute the Sum of Squares and Cross-Products
Create a table to organize calculations:
| $x_i$ | $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ |
|-------|-------|-----------------|-----------------|----------------------------------|---------------------|
| 2     | 50    | -3              | -15             | 45                               | 9                   |
| 4     | 60    | -1              | -5              | 5                                | 1                   |
| 6     | 70    | 1               | 5               | 5                                | 1                   |
| 8     | 80    | 3               | 15              | 45                               | 9                   |
| Total |       |                 |                 | 100                              | 20                  |
Sum of Cross-Products:
$$ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 100 $$
Sum of Squares of $x$:
$$ \sum_{i=1}^{n} (x_i - \bar{x})^2 = 20 $$
3. Calculate the Slope ($\hat{\beta}_1$)
$$ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{100}{20} = 5 $$
4. Calculate the Intercept ($\hat{\beta}_0$)
$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 65 - (5)(5) = 40 $$
5. Formulate the Regression Equation
$$ \hat{y} = 40 + 5x $$
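In code, the fitted equation is a one-line helper (the function name `predict` is ours, for illustration):

```python
def predict(x):
    """Fitted regression line from step 5: y_hat = 40 + 5 * x."""
    return 40 + 5 * x

print(predict(5))  # 65, the predicted score after 5 hours of study
```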
6. Predict the Test Scores and Calculate Residuals
| $x_i$ | $y_i$ | $\hat{y}_i = 40 + 5x_i$ | Residual ($y_i - \hat{y}_i$) |
|-------|-------|-------------------------|------------------------------|
| 2     | 50    | 50                      | 0                            |
| 4     | 60    | 60                      | 0                            |
| 6     | 70    | 70                      | 0                            |
| 8     | 80    | 80                      | 0                            |
7. Calculate the Sum of Squares
Total Sum of Squares (SST):
$$ \text{SST} = \sum (y_i - \bar{y})^2 = (-15)^2 + (-5)^2 + 5^2 + 15^2 = 500 $$
Sum of Squared Errors (SSE):
$$ \text{SSE} = \sum (y_i - \hat{y}_i)^2 = 0 $$
Regression Sum of Squares (SSR):
$$ \text{SSR} = \text{SST} - \text{SSE} = 500 - 0 = 500 $$
8. Compute the Coefficient of Determination ($R^2$)
$$ R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{500}{500} = 1 $$
An $R^2$ value of 1 indicates that the model perfectly explains the variability in the test scores.
9. Calculate the Standard Error of the Estimate ($s$)
$$ s = \sqrt{\frac{\text{SSE}}{n - 2}} = \sqrt{\frac{0}{2}} = 0 $$
Since $s = 0$, the standard errors of the coefficients are zero, which is a result of the perfect fit.
10. Hypothesis Testing (Optional)
In this case, the test statistic for $\hat{\beta}_1$ is undefined due to division by zero in the standard error. However, in practice, with more realistic data where $s > 0$, you would perform a $t$-test to assess the significance of the slope.
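As a final check, `scipy.stats.linregress` reproduces the hand calculation above (assuming SciPy is installed):

```python
from scipy import stats

x = [2, 4, 6, 8]
y = [50, 60, 70, 80]

result = stats.linregress(x, y)
print(result.intercept, result.slope)  # 40.0 and 5.0, matching steps 3-4
print(result.rvalue ** 2)              # R^2 = 1.0, matching step 8
print(result.stderr)                   # 0.0: the perfect fit gives a zero standard error
```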