Last modified: September 16, 2024
This article is written in: 🇺🇸
Logistic regression is a statistical method used for modeling the probability of a binary outcome based on one or more predictor variables. It is widely used in various fields such as medicine, social sciences, and machine learning for classification problems where the dependent variable is dichotomous (e.g., success/failure, yes/no, positive/negative).
The Logistic Regression Model
Binary Outcomes
In logistic regression, the dependent variable $y$ takes on binary values:
$$ y_i = \begin{cases} 1 & \text{if event occurs (e.g., success)} \\ 0 & \text{if event does not occur (e.g., failure)} \end{cases} $$
Logistic Function (Sigmoid Function)
The logistic function models the probability that $y = 1$ given the predictor variables $x_1, x_2, \dots, x_p$:
$$ p(x) = P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)}} $$
This function maps any real-valued number into a value between 0 and 1, making it suitable for probability estimation.
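A minimal NumPy sketch of this mapping (purely illustrative, not tied to any dataset in this article):

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The function saturates toward 0 and 1 for large negative and positive inputs.
print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # approx [0.018, 0.5, 0.982]
```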
Odds and Logit Function
The odds of an event occurring are the ratio of the probability that the event occurs to the probability that it does not occur.
$$ \text{Odds}(x) = \frac{p(x)}{1 - p(x)} $$
The natural logarithm of the odds is called the logit function.
$$ \text{logit}(p(x)) = \ln \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p $$
The logit function establishes a linear relationship between the predictor variables and the log-odds of the outcome.
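As a quick numeric illustration (the probability 0.8 below is arbitrary), a probability can be converted to odds and log-odds, and the sigmoid recovers it again:

```python
import numpy as np

p = 0.8                                  # arbitrary probability that y = 1
odds = p / (1 - p)                       # 4.0: the event is four times as likely to occur as not
logit = np.log(odds)                     # approx 1.386, the log-odds

recovered_p = 1 / (1 + np.exp(-logit))   # applying the sigmoid to the logit gives back 0.8
print(odds, logit, recovered_p)
```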
Interpretation of Coefficients
- The intercept ($\beta_0$) represents the log-odds of the outcome when all predictor variables are zero.
- The coefficients ($\beta_j$) indicate the change in the log-odds of the outcome for a one-unit increase in $x_j$, while keeping all other variables constant.
Exponentiating the coefficients provides the odds ratios:
$$ \text{Odds Ratio for } x_j = e^{\beta_j} $$
An odds ratio greater than 1 indicates that the odds of the outcome occurring increase as $x_j$ increases, while an odds ratio less than 1 indicates that they decrease.
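For instance (using a made-up coefficient, not an estimate from any dataset), exponentiation converts a coefficient into this multiplicative effect on the odds:

```python
import numpy as np

beta_j = 1.23                 # hypothetical coefficient for a predictor x_j
odds_ratio = np.exp(beta_j)   # approx 3.42
# A one-unit increase in x_j multiplies the odds of the outcome by about 3.42,
# holding the other predictors constant.
print(odds_ratio)
```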
Estimation of Coefficients
Unlike linear regression, logistic regression does not have a closed-form solution for estimating the coefficients. Instead, coefficients are estimated using Maximum Likelihood Estimation (MLE).
Likelihood Function
Given $n$ independent observations, the likelihood function $L(\boldsymbol{\beta})$ is the product of the probabilities of observing the data:
$$ L(\boldsymbol{\beta}) = \prod_{i=1}^{n} [p(x_i)]^{y_i} [1 - p(x_i)]^{1 - y_i} $$
Log-Likelihood Function
Taking the natural logarithm simplifies the product into a sum:
$$ \ell(\boldsymbol{\beta}) = \ln L(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \ln p(x_i) + (1 - y_i) \ln (1 - p(x_i)) \right] $$
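A small sketch of this quantity in NumPy (assuming the design matrix $\mathbf{X}$ carries a leading column of ones for the intercept):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Log-likelihood of the logistic model; X must include an intercept column of ones."""
    p = 1 / (1 + np.exp(-X @ beta))   # P(y_i = 1 | x_i) for every observation
    eps = 1e-12                       # guard against log(0) at probabilities of exactly 0 or 1
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```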
Maximizing the Log-Likelihood
The goal is to find the coefficients $\boldsymbol{\beta}$ that maximize $\ell(\boldsymbol{\beta})$. This is typically done using iterative numerical methods such as:
- Newton-Raphson method
- Fisher Scoring
- Gradient Descent
Newton-Raphson Method
An iterative method that updates the coefficient estimates using:
$$ \boldsymbol{\beta}^{(k+1)} = \boldsymbol{\beta}^{(k)} - [\mathbf{H}(\boldsymbol{\beta}^{(k)})]^{-1} \mathbf{g}(\boldsymbol{\beta}^{(k)}) $$
Where:
- $\mathbf{g}(\boldsymbol{\beta})$ is the gradient (first derivative) of the log-likelihood function.
- $\mathbf{H}(\boldsymbol{\beta})$ is the Hessian matrix (second derivative) of the log-likelihood function.
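For the logistic log-likelihood these quantities have the standard closed forms $\mathbf{g}(\boldsymbol{\beta}) = \mathbf{X}^\top (\mathbf{y} - \mathbf{p})$ and $\mathbf{H}(\boldsymbol{\beta}) = -\mathbf{X}^\top \mathbf{W} \mathbf{X}$, where $\mathbf{W} = \operatorname{diag}\big(p_i (1 - p_i)\big)$. A minimal NumPy sketch of the update follows; production software adds safeguards such as step halving and convergence diagnostics:

```python
import numpy as np

def newton_raphson_logistic(X, y, max_iter=25, tol=1e-8):
    """Estimate logistic regression coefficients by Newton-Raphson; X includes an intercept column."""
    beta = np.zeros(X.shape[1])                  # start from beta = 0
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-X @ beta))          # current predicted probabilities
        g = X.T @ (y - p)                        # gradient of the log-likelihood
        W = np.diag(p * (1 - p))                 # diagonal weight matrix
        H = -X.T @ W @ X                         # Hessian of the log-likelihood
        step = np.linalg.solve(H, g)             # H^{-1} g
        beta = beta - step                       # Newton-Raphson update
        if np.max(np.abs(step)) < tol:           # stop when the update is negligible
            break
    return beta
```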
Assumptions of Logistic Regression
- The outcome is binary, meaning the dependent variable takes exactly two values.
- Observations are independent, so each data point is not influenced by the others.
- Linearity in the log-odds holds, meaning the logit of the outcome is a linear function of the predictor variables.
- No severe multicollinearity is present, so the predictor variables are not highly correlated with each other.
- A large sample size is preferred, as maximum likelihood estimation (MLE) performs better with more data.
Example
Problem Statement
We aim to predict whether a student will pass a test ($y = 1$) or fail ($y = 0$) based on:
- $x_1$: Number of hours studied
- $x_2$: Number of practice exams taken
Data
| Student ($i$) | Hours Studied ($x_{1i}$) | Practice Exams ($x_{2i}$) | Pass ($y_i$) |
|---------------|--------------------------|---------------------------|--------------|
| 1             | 2                        | 1                         | 0            |
| 2             | 4                        | 2                         | 0            |
| 3             | 5                        | 2                         | 1            |
| 4             | 6                        | 3                         | 1            |
| 5             | 8                        | 4                         | 1            |
Step-by-Step Estimation
1. Initial Setup
We need to estimate $\beta_0$, $\beta_1$, and $\beta_2$ in the logistic regression model:
$$ p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}} $$
2. Construct the Design Matrix and Response Vector
Let $\mathbf{X}$ be the design matrix including the intercept term:
$$ \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ 1 & x_{14} & x_{24} \\ 1 & x_{15} & x_{25} \\ \end{bmatrix} = \begin{bmatrix} 1 & 2 & 1 \\ 1 & 4 & 2 \\ 1 & 5 & 2 \\ 1 & 6 & 3 \\ 1 & 8 & 4 \\ \end{bmatrix} $$
Response vector $\mathbf{y}$:
$$ \mathbf{y} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 1 \\ \end{bmatrix} $$
3. Initialize Coefficients
Start with initial guesses for $\boldsymbol{\beta}$, say:
$$ \boldsymbol{\beta}^{(0)} = \begin{bmatrix} \beta_0^{(0)} \\ \beta_1^{(0)} \\ \beta_2^{(0)} \\ \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix} $$
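For readers who want to follow along numerically, the setup above translates directly into NumPy arrays (a sketch, with names chosen here only for illustration):

```python
import numpy as np

# Design matrix with a leading intercept column of ones, plus the observed outcomes.
X = np.array([[1, 2, 1],
              [1, 4, 2],
              [1, 5, 2],
              [1, 6, 3],
              [1, 8, 4]], dtype=float)
y = np.array([0, 0, 1, 1, 1], dtype=float)

beta = np.zeros(3)   # initial guess: beta^(0) = (0, 0, 0)
```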
4. Iterative Optimization (Simplified Explanation)
Due to the complexity of MLE calculations, we will provide a simplified explanation. In practice, statistical software performs these calculations.
Iteration Steps:
- Compute predicted probabilities $p_i^{(k)}$ using current coefficients.
$$ p_i^{(k)} = \frac{1}{1 + e^{-(\beta_0^{(k)} + \beta_1^{(k)} x_{1i} + \beta_2^{(k)} x_{2i})}} $$
- Compute the gradient $\mathbf{g}(\boldsymbol{\beta}^{(k)})$ and Hessian $\mathbf{H}(\boldsymbol{\beta}^{(k)})$.
- Update coefficients using the Newton-Raphson formula.
- The process continues until the change in the log-likelihood or coefficients is below a predefined threshold.
5. Estimated Coefficients
Assuming the optimization converges, we obtain estimated coefficients (using statistical software):
$$ \hat{\beta}_0 = -9.28, \quad \hat{\beta}_1 = 1.23, \quad \hat{\beta}_2 = 0.98 $$
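These values should be read as illustrative: with only five observations, and with passes and failures perfectly separated by hours studied, unpenalized maximum likelihood is unstable, so the numbers reported by a given package depend on its solver and regularization settings. A hedged sketch of how such a fit might be run with scikit-learn (assuming it is installed; it adds the intercept itself and applies L2 regularization by default):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2, 1], [4, 2], [5, 2], [6, 3], [8, 4]], dtype=float)  # hours studied, practice exams
y = np.array([0, 0, 1, 1, 1])

model = LogisticRegression(C=1.0)     # C is the inverse regularization strength
model.fit(X, y)
print(model.intercept_, model.coef_)  # estimates will differ somewhat from the values quoted above
```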
6. Logistic Regression Equation
$$ p(x) = \frac{1}{1 + e^{-(-9.28 + 1.23 x_1 + 0.98 x_2)}} $$
Interpretation of Coefficients
Intercept ($\beta_0 = -9.28$): The log-odds of passing when $x_1 = 0$ and $x_2 = 0$.
Coefficient for Hours Studied ($\beta_1 = 1.23$): For each additional hour studied, the log-odds of passing increase by 1.23 units, holding $x_2$ constant.
Odds Ratio:
$$ e^{1.23} \approx 3.42 $$
The odds of passing are multiplied by approximately 3.42 for each additional hour studied.
Coefficient for Practice Exams ($\beta_2 = 0.98$): For each additional practice exam taken, the log-odds of passing increase by 0.98 units, holding $x_1$ constant.
Odds Ratio:
$$ e^{0.98} \approx 2.66 $$
The odds of passing are multiplied by approximately 2.66 for each additional practice exam taken.
Predicting Probabilities
Predict for a Student Who Studied 5 Hours and Took 2 Practice Exams
Compute the probability of passing:
$$ p = \frac{1}{1 + e^{-(-9.28 + 1.23 \times 5 + 0.98 \times 2)}} = \frac{1}{1 + e^{-(-9.28 + 6.15 + 1.96)}} = \frac{1}{1 + e^{-(-1.17)}} = \frac{1}{1 + e^{1.17}} \approx 0.24 $$
The probability of passing is approximately 24%.
Predict for a Student Who Studied 7 Hours and Took 3 Practice Exams
$$ p = \frac{1}{1 + e^{-(-9.28 + 1.23 \times 7 + 0.98 \times 3)}} = \frac{1}{1 + e^{-(-9.28 + 8.61 + 2.94)}} = \frac{1}{1 + e^{-2.27}} \approx 0.906 $$
The probability of passing is approximately 90.6%.
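The same arithmetic can be reproduced in a few lines of Python, using the coefficients quoted above:

```python
import numpy as np

def pass_probability(hours, exams, b0=-9.28, b1=1.23, b2=0.98):
    """Probability of passing from the fitted logistic regression equation."""
    return 1 / (1 + np.exp(-(b0 + b1 * hours + b2 * exams)))

print(round(pass_probability(5, 2), 3))   # approx 0.237
print(round(pass_probability(7, 3), 3))   # approx 0.906
```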
Model Evaluation
Confusion Matrix
Using a threshold of 0.5 to classify predictions:
| Student ($i$) | $y_i$ | Predicted $p_i$     | Predicted Class ($\hat{y}_i$) |
|---------------|-------|---------------------|-------------------------------|
| 1             | 0     | $p_1 \approx 0.003$ | 0                             |
| 2             | 0     | $p_2 \approx 0.083$ | 0                             |
| 3             | 1     | $p_3 \approx 0.24$  | 0                             |
| 4             | 1     | $p_4 \approx 0.74$  | 1                             |
| 5             | 1     | $p_5 \approx 0.99$  | 1                             |
- True Positives (TP): 2 (Students 4 and 5)
- True Negatives (TN): 2 (Students 1 and 2)
- False Negatives (FN): 1 (Student 3)
- False Positives (FP): 0
Accuracy
$$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{Total}} = \frac{2 + 2}{5} = 0.8 $$
Precision
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{2}{2 + 0} = 1.0 $$
Recall (Sensitivity)
$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{2}{2 + 1} = \frac{2}{3} \approx 0.667 $$
F1 Score
$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \times \frac{1.0 \times 0.667}{1.0 + 0.667} \approx 0.8 $$
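If scikit-learn is available, these metrics can be checked directly from the predicted probabilities in the table above (a short sketch):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0, 0, 1, 1, 1])
p_hat = np.array([0.003, 0.083, 0.24, 0.74, 0.99])   # predicted probabilities from the table
y_pred = (p_hat >= 0.5).astype(int)                  # classify at the 0.5 threshold

print(accuracy_score(y_true, y_pred))    # 0.8
print(precision_score(y_true, y_pred))   # 1.0
print(recall_score(y_true, y_pred))      # approx 0.667
print(f1_score(y_true, y_pred))          # 0.8
```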
Receiver Operating Characteristic (ROC) Curve
- Plot the true positive rate (recall) against the false positive rate at various classification thresholds.
- Compute the area under the ROC curve (AUC) to summarize how well the model separates the two classes.
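A short sketch of both steps with scikit-learn, reusing the predicted probabilities from the table above (with only five observations the curve is very coarse, and the AUC here is 1.0 because every passing student is scored above every failing one):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1])
p_hat = np.array([0.003, 0.083, 0.24, 0.74, 0.99])

fpr, tpr, thresholds = roc_curve(y_true, p_hat)   # points of the ROC curve
print(roc_auc_score(y_true, p_hat))               # area under the curve
```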