Last modified: May 18, 2018
This article is written in: 🇺🇸
Evaluation metrics are essential tools for assessing the performance of statistical and machine learning models. They provide quantitative measures that help us understand how well a model is performing and where improvements can be made. In both classification and regression tasks, selecting appropriate evaluation metrics is crucial for model selection, tuning, and validation.
In classification tasks, the goal is to assign input data into predefined categories or classes. Evaluating classification models requires metrics that capture the correctness and reliability of the predictions.
A confusion matrix is a tabular representation of the performance of a classification model. It compares the actual target values with those predicted by the model, providing a breakdown of correct and incorrect predictions. The confusion matrix for a binary classification problem is typically structured as follows:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Accuracy measures the proportion of total correct predictions made by the model:
$$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} $$
While accuracy is intuitive, it can be misleading in imbalanced datasets where one class significantly outnumbers the other.
Precision quantifies the correctness of positive predictions made by the model:
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
High precision indicates a low rate of false positives.
Recall measures the model's ability to identify all actual positive cases:
$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
High recall indicates a low rate of false negatives.
Specificity assesses the model's ability to identify all actual negative cases:
$$ \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} $$
High specificity indicates a low rate of false positives.
The F1 Score is the harmonic mean of precision and recall, providing a balance between the two metrics:
$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
The F1 Score is particularly useful when dealing with imbalanced classes.
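To make these formulas concrete, the short sketch below computes all five metrics directly from the four confusion-matrix counts; the function name and argument order are our own choices for illustration, not part of any standard library.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # also called sensitivity or TPR
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}
```

Passing in the four counts from any binary confusion matrix returns all five metrics at once.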
Imagine you are a librarian tasked with retrieving science fiction books from a large library that contains various genres.
Precision: the proportion of the books you retrieved that are actually science fiction.
Recall: the proportion of all the science fiction books in the library that you managed to retrieve.
High precision means that most of the books you picked are relevant (few false positives), while high recall means you found most of the relevant books (few false negatives).
Suppose we have a binary classification problem with the following confusion matrix:
| | Predicted Positive | Predicted Negative | Total |
|---|---|---|---|
| Actual Positive | TP = 70 | FN = 30 | 100 |
| Actual Negative | FP = 20 | TN = 80 | 100 |
| Total | 90 | 110 | 200 |
Accuracy:
$$ \text{Accuracy} = \frac{70 + 80}{200} = \frac{150}{200} = 0.75 \text{ or } 75\% $$
Precision:
$$ \text{Precision} = \frac{70}{70 + 20} = \frac{70}{90} \approx 0.778 \text{ or } 77.8\% $$
Recall:
$$ \text{Recall} = \frac{70}{70 + 30} = \frac{70}{100} = 0.7 \text{ or } 70\% $$
Specificity:
$$ \text{Specificity} = \frac{80}{80 + 20} = \frac{80}{100} = 0.8 \text{ or } 80\% $$
F1 Score:
$$ \text{F1 Score} = 2 \times \frac{0.778 \times 0.7}{0.778 + 0.7} = 2 \times \frac{0.5446}{1.478} \approx 0.737 \text{ or } 73.7\% $$
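If you want to verify these numbers programmatically, a minimal sketch using scikit-learn (assuming it is installed) reproduces the table above by building label vectors that match the four counts:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Rebuild label vectors that match the counts in the table above:
# TP = 70, FN = 30, FP = 20, TN = 80.
y_true = np.repeat([1, 1, 0, 0], [70, 30, 20, 80])
y_pred = np.repeat([1, 0, 1, 0], [70, 30, 20, 80])

print(confusion_matrix(y_true, y_pred))   # [[80 20]
                                          #  [30 70]]  (rows: actual 0 / 1)
print(accuracy_score(y_true, y_pred))     # 0.75
print(precision_score(y_true, y_pred))    # ≈ 0.778
print(recall_score(y_true, y_pred))       # 0.70
print(f1_score(y_true, y_pred))           # ≈ 0.737
```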
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) at various threshold settings. It provides a comprehensive view of the model's performance across all classification thresholds.
True Positive Rate (TPR):
$$ \text{TPR} = \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
False Positive Rate (FPR):
$$ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} $$
The Area Under the ROC Curve (AUC-ROC) is a single scalar value summarizing the model's ability to discriminate between positive and negative classes. An AUC of 1.0 indicates perfect discrimination, an AUC of 0.5 indicates performance no better than random guessing, and values below 0.5 indicate performance worse than random.
In cases of imbalanced datasets, the Precision-Recall (PR) curve is more informative than the ROC curve. It plots Precision against Recall for different thresholds.
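The sketch below shows one common way to obtain both curves with scikit-learn; the synthetic imbalanced dataset and the logistic-regression classifier are illustrative assumptions, not a prescribed setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_curve, roc_auc_score,
                             precision_recall_curve, average_precision_score)

# Illustrative imbalanced dataset (roughly 10% positives) and a simple classifier.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]    # predicted probability of the positive class

# ROC curve: TPR vs. FPR over all thresholds, summarized by AUC-ROC.
fpr, tpr, _ = roc_curve(y_te, scores)
print("AUC-ROC:", roc_auc_score(y_te, scores))

# Precision-Recall curve: usually more informative than ROC on imbalanced data.
precision, recall, _ = precision_recall_curve(y_te, scores)
print("Average precision:", average_precision_score(y_te, scores))
```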
Evaluation metrics are crucial in assessing the performance of regression models, which predict a continuous outcome variable based on one or more predictor variables. These metrics quantify the discrepancy between the predicted values and the actual observed values, providing insights into the accuracy and reliability of the model.
In regression analysis, the goal is to build a model that accurately predicts the dependent variable $y$ from one or more independent variables $x$. After fitting a regression model, it is essential to assess its performance using appropriate evaluation metrics. The most commonly used regression metrics are the Mean Absolute Error (MAE), the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), the Coefficient of Determination ($R^2$), and Adjusted $R^2$, each of which is covered below.
These metrics provide different perspectives on the model's predictive capabilities and can be used to compare different models or to tune model parameters.
Consider a dataset where we have generated data points from a sine wave with added noise and fit a linear regression model to predict $y$ from $x$. Plotting the data together with the fitted regression line shows how well the linear model captures the underlying pattern and highlights the discrepancies between the predictions and the actual values.
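The exact code behind that example is not reproduced here, but a minimal sketch along the lines described (a noisy sine wave fitted with ordinary least squares) might look like this; the sample size, noise level, and random seed are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)   # sine wave with added noise

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_hat = model.predict(x.reshape(-1, 1))

plt.scatter(x, y, s=15, label="noisy observations")
plt.plot(x, y_hat, color="red", label="fitted linear regression")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```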
Definition:
The Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average, over the test sample, of the absolute differences between the predicted values and the actual observations.
Formula:
$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i | $$
Where: $n$ is the number of observations, $y_i$ is the actual value of the $i$-th observation, and $\hat{y}_i$ is the corresponding predicted value.
Interpretation: the MAE is the average absolute deviation of the predictions from the actual values, expressed in the same units as the target variable; lower values indicate a better fit.
Properties: because every error contributes in proportion to its absolute size, the MAE is less sensitive to outliers than squared-error metrics such as the MSE.
Definition:
The Mean Squared Error (MSE) measures the average of the squares of the errors, that is, the average squared difference between the predicted values and the actual values.
Formula:
$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2 $$
Interpretation: because the errors are squared, the MSE penalizes large errors much more heavily than small ones; its units are the square of the target variable's units, which makes it harder to interpret directly.
Properties: it is smooth and differentiable, which makes it convenient as a loss function for optimization, but it is sensitive to outliers.
Definition:
The Root Mean Squared Error (RMSE) is the square root of the MSE. It brings the error metric back to the same units as the dependent variable, making interpretation more intuitive.
Formula:
$$ \text{RMSE} = \sqrt{\text{MSE}} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2 } $$
Interpretation: the RMSE represents the typical magnitude of the prediction error in the original units of the target variable; lower values indicate a better fit.
Properties: like the MSE, it is sensitive to outliers, and for any dataset the RMSE is always greater than or equal to the MAE.
Definition:
The Coefficient of Determination, denoted as $R^2$, is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
Formula:
$$ R^2 = 1 - \frac{ \text{SS}_{\text{res}} }{ \text{SS}_{\text{tot}} } = 1 - \frac{ \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2 }{ \sum_{i=1}^{n} ( y_i - \bar{y} )^2 } $$
Where: $\text{SS}_{\text{res}}$ is the residual sum of squares, $\text{SS}_{\text{tot}}$ is the total sum of squares, and $\bar{y}$ is the mean of the observed values.
Interpretation: an $R^2$ of 1 means the model explains all of the variance in the dependent variable, while an $R^2$ of 0 means it explains none of it and performs no better than predicting the mean.
Properties: $R^2$ is unitless and never decreases when additional predictors are added to the model, even if those predictors carry no useful information.
Note: because of this, $R^2$ alone can be misleading when comparing models with different numbers of predictors, which is the problem Adjusted $R^2$ addresses.
Definition:
Adjusted $R^2$ adjusts the $R^2$ value based on the number of predictors in the model relative to the number of data points. It penalizes the addition of unnecessary predictors to the model.
Formula:
$$ \text{Adjusted } R^2 = 1 - \left( \frac{ (1 - R^2)(n - 1) }{ n - p - 1 } \right) $$
Where: $n$ is the number of observations, $p$ is the number of predictors, and $R^2$ is the ordinary coefficient of determination.
Interpretation: Adjusted $R^2$ increases only when a new predictor improves the model more than would be expected by chance; unlike $R^2$, it can decrease when an uninformative predictor is added, and it is always less than or equal to $R^2$.
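To tie the formulas above together, here is a small NumPy sketch implementing each metric directly; the function names and signatures are our own and are not taken from any particular library.

```python
import numpy as np

def mae(y, y_hat):
    """Mean Absolute Error."""
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    """Mean Squared Error."""
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    """Root Mean Squared Error."""
    return np.sqrt(mse(y, y_hat))

def r2(y, y_hat):
    """Coefficient of determination."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r2(y, y_hat, p):
    """Adjusted R^2, where p is the number of predictors in the model."""
    n = len(y)
    return 1 - (1 - r2(y, y_hat)) * (n - 1) / (n - p - 1)
```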
Let's consider a small dataset:
| $i$ | $x_i$ | $y_i$ | $\hat{y}_i$ |
|---|---|---|---|
| 1 | 1.0 | 2.0 | 2.5 |
| 2 | 2.0 | 4.0 | 3.8 |
| 3 | 3.0 | 6.0 | 5.9 |
| 4 | 4.0 | 8.0 | 8.2 |
| 5 | 5.0 | 10.0 | 10.1 |
Compute the residuals:
$$ \text{Residuals} = y_i - \hat{y}_i = \{ -0.5,\ 0.2,\ 0.1,\ -0.2,\ -0.1 \} $$
Compute MAE:
$$ \text{MAE} = \frac{1}{5} \left( |2.0 - 2.5| + |4.0 - 3.8| + |6.0 - 5.9| + |8.0 - 8.2| + |10.0 - 10.1| \right) = \frac{1}{5} (0.5 + 0.2 + 0.1 + 0.2 + 0.1) = 0.22 $$
Compute MSE:
$$ \text{MSE} = \frac{1}{5} \left( (2.0 - 2.5)^2 + (4.0 - 3.8)^2 + (6.0 - 5.9)^2 + (8.0 - 8.2)^2 + (10.0 - 10.1)^2 \right) = \frac{1}{5} (0.25 + 0.04 + 0.01 + 0.04 + 0.01) = 0.07 $$
Compute RMSE:
$$ \text{RMSE} = \sqrt{ \text{MSE} } = \sqrt{0.07} \approx 0.265 $$
Compute $R^2$:
First, compute the total sum of squares:
$$ \bar{y} = \frac{1}{5} (2.0 + 4.0 + 6.0 + 8.0 + 10.0) = 6.0 $$
$$ \text{SS}_{\text{tot}} = \sum_{i=1}^{5} ( y_i - \bar{y} )^2 = (2.0 - 6.0)^2 + (4.0 - 6.0)^2 + (6.0 - 6.0)^2 + (8.0 - 6.0)^2 + (10.0 - 6.0)^2 = 16 + 4 + 0 + 4 + 16 = 40 $$
Compute the residual sum of squares:
$$ \text{SS}_{\text{res}} = \sum_{i=1}^{5} ( y_i - \hat{y}_i )^2 = 0.25 + 0.04 + 0.01 + 0.04 + 0.01 = 0.35 $$
Compute $R^2$:
$$ R^2 = 1 - \frac{ \text{SS}_{\text{res}} }{ \text{SS}_{\text{tot}} } = 1 - \frac{0.35}{40} = 1 - 0.00875 = 0.99125 $$
Interpretation: an $R^2$ of approximately 0.991 means that about 99.1% of the variance in $y$ is explained by the model's predictions, indicating an excellent fit on this small dataset.
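These results can be reproduced with scikit-learn's built-in metric functions (assuming the library is available):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_pred = np.array([2.5, 3.8, 5.9, 8.2, 10.1])

print(mean_absolute_error(y_true, y_pred))            # ≈ 0.22
print(mean_squared_error(y_true, y_pred))             # ≈ 0.07
print(np.sqrt(mean_squared_error(y_true, y_pred)))    # ≈ 0.265
print(r2_score(y_true, y_pred))                       # ≈ 0.99125
```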
Although $R^2$ typically ranges from 0 to 1, it can be negative when the model is worse than simply predicting the mean of the observed data. This can happen if the residual sum of squares is greater than the total sum of squares.
Example:
Suppose we have a model that makes poor predictions on the same data, resulting in a residual sum of squares of $\text{SS}_{\text{res}} = 50$ while $\text{SS}_{\text{tot}} = 40$.
Compute $R^2$:
$$ R^2 = 1 - \frac{50}{40} = 1 - 1.25 = -0.25 $$
Interpretation: a negative $R^2$ means the model's predictions are worse than simply predicting the mean of the observed values for every observation.
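As a quick illustration, the sketch below uses hypothetical predictions chosen only so that $\text{SS}_{\text{res}} = 50$, reproducing the negative $R^2$ above:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # mean = 6.0, so SS_tot = 40
# Hypothetical poor predictions chosen so that SS_res = 25 + 25 = 50.
y_bad = np.array([7.0, 9.0, 6.0, 8.0, 10.0])

print(np.sum((y_true - y_bad) ** 2))   # 50.0
print(r2_score(y_true, y_bad))         # 1 - 50/40 = -0.25
```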