Last modified: September 18, 2024

This article is written in: 🇺🇸

Evaluation Metrics

Evaluation metrics are essential tools for assessing the performance of statistical and machine learning models. They provide quantitative measures that help us understand how well a model is performing and where improvements can be made. In both classification and regression tasks, selecting appropriate evaluation metrics is crucial for model selection, tuning, and validation.

Classification Metrics

In classification tasks, the goal is to assign input data into predefined categories or classes. Evaluating classification models requires metrics that capture the correctness and reliability of the predictions.

Confusion Matrix

A confusion matrix is a tabular representation of the performance of a classification model. It compares the actual target values with those predicted by the model, providing a breakdown of correct and incorrect predictions. The confusion matrix for a binary classification problem is typically structured as follows:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Key Metrics Derived from the Confusion Matrix

Accuracy

Accuracy measures the proportion of total correct predictions made by the model:

$$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} $$

While accuracy is intuitive, it can be misleading in imbalanced datasets where one class significantly outnumbers the other.
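To make the pitfall concrete, here is a minimal sketch (with made-up counts) of a dataset in which 95% of the examples are negative: a model that always predicts the negative class reaches 95% accuracy while never identifying a single positive case.

```python
# Hypothetical illustration: accuracy on a heavily imbalanced dataset.
y_true = [1] * 5 + [0] * 95   # 5 actual positives, 95 actual negatives
y_pred = [0] * 100            # a "model" that always predicts the negative class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.0%}")  # 95%, yet every positive case is missed
```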

Precision

Precision quantifies the correctness of positive predictions made by the model:

$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$

High precision indicates a low rate of false positives.

Recall (Sensitivity or True Positive Rate)

Recall measures the model's ability to identify all actual positive cases:

$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$

High recall indicates a low rate of false negatives.

Specificity (True Negative Rate)

Specificity assesses the model's ability to identify all actual negative cases:

$$ \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} $$

High specificity indicates a low rate of false positives.

F1 Score

The F1 Score is the harmonic mean of precision and recall, providing a balance between the two metrics:

$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

The F1 Score is particularly useful when dealing with imbalanced classes.
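As a minimal sketch, the metrics above can be computed directly from the confusion-matrix counts; the function names below are illustrative helpers, not part of any particular library.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```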

Intuitive Explanation of Precision and Recall

Imagine you are a librarian tasked with retrieving science fiction books from a large library that contains various genres.

Precision: Of all the books you pulled off the shelves as science fiction, what fraction actually are science fiction?

Recall: Of all the science fiction books in the library, what fraction did you manage to pull?

High precision means that most of the books you picked are relevant (few false positives), while high recall means you found most of the relevant books (few false negatives).

Example Calculation

Suppose we have a binary classification problem with the following confusion matrix:

|                 | Predicted Positive | Predicted Negative | Total |
|-----------------|--------------------|--------------------|-------|
| Actual Positive | TP = 70            | FN = 30            | 100   |
| Actual Negative | FP = 20            | TN = 80            | 100   |
| Total           | 90                 | 110                | 200   |

Accuracy:

$$ \text{Accuracy} = \frac{70 + 80}{200} = \frac{150}{200} = 0.75 \text{ or } 75\% $$

Precision:

$$ \text{Precision} = \frac{70}{70 + 20} = \frac{70}{90} \approx 0.778 \text{ or } 77.8\% $$

Recall:

$$ \text{Recall} = \frac{70}{70 + 30} = \frac{70}{100} = 0.7 \text{ or } 70\% $$

Specificity:

$$ \text{Specificity} = \frac{80}{80 + 20} = \frac{80}{100} = 0.8 \text{ or } 80\% $$

F1 Score:

$$ \text{F1 Score} = 2 \times \frac{0.778 \times 0.7}{0.778 + 0.7} \approx 2 \times \frac{0.5446}{1.478} \approx 0.737 \text{ or } 73.7\% $$
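These figures can be reproduced programmatically. The sketch below assumes scikit-learn is available and rebuilds label vectors consistent with the confusion matrix above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Label vectors matching TP = 70, FN = 30, FP = 20, TN = 80.
y_true = np.array([1] * 100 + [0] * 100)
y_pred = np.array([1] * 70 + [0] * 30 + [1] * 20 + [0] * 80)

print(confusion_matrix(y_true, y_pred))   # rows = actual (0, 1), columns = predicted (0, 1)
print(accuracy_score(y_true, y_pred))     # 0.75
print(precision_score(y_true, y_pred))    # ~0.778
print(recall_score(y_true, y_pred))       # 0.70
print(f1_score(y_true, y_pred))           # ~0.737
```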

Receiver Operating Characteristic (ROC) Curve and AUC

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) at various threshold settings. It provides a comprehensive view of the model's performance across all classification thresholds.

True Positive Rate (TPR):

$$ \text{TPR} = \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$

False Positive Rate (FPR):

$$ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} $$

The Area Under the ROC Curve (AUC-ROC) is a single scalar value summarizing the model's ability to discriminate between positive and negative classes. An AUC of:

- 1.0 corresponds to a perfect classifier,
- 0.5 corresponds to a classifier that performs no better than random guessing, and
- less than 0.5 indicates performance worse than random (the predicted scores are inversely related to the true labels).
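A minimal sketch of computing ROC points and the AUC with scikit-learn, assuming a classifier that exposes predicted probabilities (the dataset and model here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Illustrative data and model; any classifier with predict_proba would work.
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print(f"AUC-ROC: {roc_auc_score(y_test, scores):.3f}")
```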

Interpretation

A ROC curve that bows toward the top-left corner reflects a model that achieves high recall while keeping the false positive rate low; a curve lying along the diagonal reflects a model that is no better than chance. When comparing classifiers, the one with the larger AUC generally discriminates better across the full range of thresholds.

Precision-Recall Curve and AUC

In cases of imbalanced datasets, the Precision-Recall (PR) curve is more informative than the ROC curve. It plots Precision against Recall for different thresholds.
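Analogously, the PR curve and its summary (average precision) can be computed with scikit-learn; this sketch reuses the illustrative `y_test` and `scores` arrays from the ROC example above.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

precisions, recalls, pr_thresholds = precision_recall_curve(y_test, scores)
print(f"Average precision (PR-AUC): {average_precision_score(y_test, scores):.3f}")
```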

Regression Metrics

Evaluation metrics are crucial in assessing the performance of regression models, which predict a continuous outcome variable based on one or more predictor variables. These metrics quantify the discrepancy between the predicted values and the actual observed values, providing insights into the accuracy and reliability of the model.

Evaluating Regression Models

In regression analysis, the goal is to build a model that accurately predicts the dependent variable $y$ from one or more independent variables $x$. After fitting a regression model, it is essential to assess its performance using appropriate evaluation metrics. The most commonly used regression metrics include:

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Coefficient of Determination ($R^2$)
- Adjusted $R^2$

These metrics provide different perspectives on the model's predictive capabilities and can be used to compare different models or to tune model parameters.

Visualizing Regression Performance

Consider a dataset where we have generated data points from a sine wave with added noise. We fit a linear regression model to this data to predict $y$ based on $x$. The following plot illustrates the data and the fitted regression line:

Figure: Linear Regression on Sine Wave with Noise.

This visualization helps to understand how well the linear model captures the underlying pattern in the data and highlights the discrepancies between the predictions and actual values.
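A plot like the one described above can be recreated with the following sketch (assuming NumPy, Matplotlib, and scikit-learn; the noise level and sample size are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)   # sine wave with added noise

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))

plt.scatter(x, y, s=15, label="Noisy observations")
plt.plot(x, y_pred, color="red", label="Linear regression fit")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linear Regression on Sine Wave with Noise")
plt.legend()
plt.show()
```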

Regression Evaluation Metrics

Mean Absolute Error (MAE)

Definition:

The Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average over the test sample of the absolute differences between predicted and actual observations.

Formula:

$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i | $$

Where:

- $y_i$ is the actual observed value for observation $i$,
- $\hat{y}_i$ is the corresponding predicted value, and
- $n$ is the number of observations.

Interpretation:

MAE is expressed in the same units as the dependent variable, so it can be read directly as the average size of a prediction error; lower values indicate a better fit.

Properties:

Because every error contributes in proportion to its absolute magnitude, MAE is less sensitive to outliers than squared-error metrics, and it treats over-prediction and under-prediction symmetrically.

Mean Squared Error (MSE)

Definition:

The Mean Squared Error (MSE) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.

Formula:

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2 $$

Interpretation:

MSE is expressed in squared units of the dependent variable, which makes its absolute value harder to interpret directly; lower values still indicate a better fit.

Properties:

Because the errors are squared before averaging, MSE penalizes large errors much more heavily than small ones, making it sensitive to outliers. It is smooth and differentiable, which makes it convenient as a loss function for optimization.

Root Mean Squared Error (RMSE)

Definition:

The Root Mean Squared Error (RMSE) is the square root of the MSE. It brings the error metric back to the same units as the dependent variable, making interpretation more intuitive.

Formula:

$$ \text{RMSE} = \sqrt{\text{MSE}} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2 } $$

Interpretation:

RMSE can be read as the typical size of a prediction error, expressed in the same units as the dependent variable; lower values indicate a better fit.

Properties:

RMSE inherits MSE's sensitivity to outliers because the errors are squared before averaging. For any set of predictions, RMSE is always greater than or equal to MAE, and the gap widens as the errors become more unevenly distributed.
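As a minimal sketch, MAE, MSE, and RMSE can be computed by hand with NumPy or via scikit-learn; the `y_true` and `y_pred` arrays below are illustrative placeholders.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # observed values (illustrative)
y_pred = np.array([2.5,  0.0, 2.1, 7.8])   # model predictions (illustrative)

mae = mean_absolute_error(y_true, y_pred)   # equals np.mean(np.abs(y_true - y_pred))
mse = mean_squared_error(y_true, y_pred)    # equals np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)                         # back in the original units of y

print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```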

Coefficient of Determination ($R^2$)

Definition:

The Coefficient of Determination, denoted as $R^2$, is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variables.

Formula:

$$ R^2 = 1 - \frac{ \text{SS}_{\text{res}} }{ \text{SS}_{\text{tot}} } $$

$$ = 1 - \frac{ \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2 }{ \sum_{i=1}^{n} ( y_i - \bar{y} )^2 } $$

Where:

- $\text{SS}_{\text{res}} = \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2$ is the residual sum of squares,
- $\text{SS}_{\text{tot}} = \sum_{i=1}^{n} ( y_i - \bar{y} )^2$ is the total sum of squares, and
- $\bar{y}$ is the mean of the observed values.

Interpretation:

$R^2$ represents the fraction of the variance in $y$ explained by the model: a value of 0.8, for example, means 80% of the variance is accounted for, and values closer to 1 indicate a better fit.

Properties:

$R^2$ is unitless, which makes it convenient for comparing models of the same target variable. However, it never decreases when additional predictors are added, even if those predictors contribute no genuine explanatory power.

Note:

Because $R^2$ can only increase as predictors are added, it may reward overfitting; the Adjusted $R^2$ described next corrects for the number of predictors.
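A short sketch computing $R^2$ both from its definition and with scikit-learn's `r2_score`, reusing the illustrative arrays from the previous sketch:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.1, 7.8])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))       # the two values agree
```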

Adjusted $R^2$

Definition:

Adjusted $R^2$ adjusts the $R^2$ value based on the number of predictors in the model relative to the number of data points. It penalizes the addition of unnecessary predictors to the model.

Formula:

$$ \text{Adjusted } R^2 = 1 - \left( \frac{ (1 - R^2)(n - 1) }{ n - p - 1 } \right) $$

Where:

- $n$ is the number of observations,
- $p$ is the number of predictors (independent variables), and
- $R^2$ is the unadjusted coefficient of determination.

Interpretation:

Adjusted $R^2$ increases only when a newly added predictor improves the model more than would be expected by chance; otherwise it decreases. It is therefore the preferred statistic when comparing models with different numbers of predictors, and it is never larger than the unadjusted $R^2$.
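scikit-learn does not expose Adjusted $R^2$ directly, but it is a one-line adjustment of `r2_score`; the helper below is an illustrative sketch rather than a library function.

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_predictors: int) -> float:
    """Adjusted R^2 for a model fitted with n_predictors independent variables."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
```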

Comparing Regression Metrics

Sensitivity to Outliers

MSE and RMSE square the residuals, so a few large errors can dominate these metrics; MAE weights all errors equally and is therefore more robust to outliers.

Units and Interpretability

MAE and RMSE are expressed in the same units as the dependent variable and are easy to communicate, MSE is in squared units, and $R^2$ and Adjusted $R^2$ are unitless proportions of explained variance.

Usage in Model Evaluation

MSE and RMSE are common choices when large errors are especially costly or when a smooth, differentiable loss is needed for optimization; MAE is preferred when robustness to outliers matters; $R^2$ and Adjusted $R^2$ are used to communicate how much of the variability in the target the model explains and to compare candidate models.

Example Calculation

Let's consider a small dataset:

| $i$ | $x_i$ | $y_i$ | $\hat{y}_i$ |
|-----|-------|-------|-------------|
| 1   | 1.0   | 2.0   | 2.5         |
| 2   | 2.0   | 4.0   | 3.8         |
| 3   | 3.0   | 6.0   | 5.9         |
| 4   | 4.0   | 8.0   | 8.2         |
| 5   | 5.0   | 10.0  | 10.1        |

Compute the residuals:

$$ \text{Residuals} = y_i - \hat{y}_i = -0.5, \; 0.2, \; 0.1, \; -0.2, \; -0.1 $$

Compute MAE:

$$ \text{MAE} = \frac{1}{5} \left( |2.0 - 2.5| + |4.0 - 3.8| + |6.0 - 5.9| + |8.0 - 8.2| + |10.0 - 10.1| \right) = \frac{1}{5} (0.5 + 0.2 + 0.1 + 0.2 + 0.1) = 0.22 $$

Compute MSE:

$$ \text{MSE} = \frac{1}{5} \left( (2.0 - 2.5)^2 + (4.0 - 3.8)^2 + (6.0 - 5.9)^2 + (8.0 - 8.2)^2 + (10.0 - 10.1)^2 \right) = \frac{1}{5} (0.25 + 0.04 + 0.01 + 0.04 + 0.01) = 0.07 $$

Compute RMSE:

$$ \text{RMSE} = \sqrt{ \text{MSE} } = \sqrt{0.07} \approx 0.265 $$

Compute $R^2$:

First, compute the total sum of squares:

$$ \bar{y} = \frac{1}{5} (2.0 + 4.0 + 6.0 + 8.0 + 10.0) = 6.0 $$

$$ \text{SS}_{\text{tot}} = \sum_{i=1}^{5} ( y_i - \bar{y} )^2 = (2.0 - 6.0)^2 + (4.0 - 6.0)^2 + (6.0 - 6.0)^2 + (8.0 - 6.0)^2 + (10.0 - 6.0)^2 = 16 + 4 + 0 + 4 + 16 = 40 $$

Compute the residual sum of squares:

$$ \text{SS}_{\text{res}} = \sum_{i=1}^{5} ( y_i - \hat{y}_i )^2 = 0.25 + 0.04 + 0.01 + 0.04 + 0.01 = 0.35 $$

Compute $R^2$:

$$ R^2 = 1 - \frac{ \text{SS}_{\text{res}} }{ \text{SS}_{\text{tot}} } = 1 - \frac{0.35}{40} = 1 - 0.00875 = 0.99125 $$

Interpretation:

An $R^2$ of approximately 0.991 indicates that about 99.1% of the variance in $y$ is explained by the model, consistent with the small residuals computed above.
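The numbers in this worked example can be reproduced with a few lines of NumPy and scikit-learn (a sketch, assuming both libraries are installed):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # observed values from the table
y_pred = np.array([2.5, 3.8, 5.9, 8.2, 10.1])   # predicted values from the table

print(mean_absolute_error(y_true, y_pred))           # ~0.22
print(mean_squared_error(y_true, y_pred))            # ~0.07
print(np.sqrt(mean_squared_error(y_true, y_pred)))   # ~0.265
print(r2_score(y_true, y_pred))                      # ~0.99125
```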

Negative $R^2$ Values

Although $R^2$ typically ranges from 0 to 1, it can be negative when the model is worse than simply predicting the mean of the observed data. This can happen if the residual sum of squares is greater than the total sum of squares.

Example:

Suppose that, for the same dataset ($\text{SS}_{\text{tot}} = 40$), a poorly specified model produces predictions with a residual sum of squares of $\text{SS}_{\text{res}} = 50$.

Compute $R^2$:

$$ R^2 = 1 - \frac{50}{40} = 1 - 1.25 = -0.25 $$

Interpretation:

A negative $R^2$ means the model fits the data worse than a trivial model that always predicts the mean $\bar{y}$; such a model has no explanatory value.
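A quick sketch showing how a constant mean predictor yields $R^2 = 0$ while a model that does worse than the mean yields a negative `r2_score` (the prediction vectors are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
mean_baseline = np.full_like(y_true, y_true.mean())   # always predict the mean
bad_model = np.array([8.0, 1.0, 9.0, 3.0, 12.0])      # worse than the baseline

print(r2_score(y_true, mean_baseline))   # 0.0 -> no better than predicting the mean
print(r2_score(y_true, bad_model))       # negative -> worse than predicting the mean
```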

Practical Considerations

Table of Contents

  1. Classification Metrics
    1. Confusion Matrix
    2. Key Metrics Derived from the Confusion Matrix
      1. Accuracy
      2. Precision
      3. Recall (Sensitivity or True Positive Rate)
      4. Specificity (True Negative Rate)
      5. F1 Score
    3. Intuitive Explanation of Precision and Recall
    4. Example Calculation
    5. Receiver Operating Characteristic (ROC) Curve and AUC
      1. Interpretation
    6. Precision-Recall Curve and AUC
  2. Regression Metrics
    1. Evaluating Regression Models
    2. Visualizing Regression Performance
    3. Regression Evaluation Metrics
      1. Mean Absolute Error (MAE)
      2. Mean Squared Error (MSE)
      3. Root Mean Squared Error (RMSE)
      4. Coefficient of Determination ($R^2$)
      5. Adjusted $R^2$
    4. Comparing Regression Metrics
      1. Sensitivity to Outliers
      2. Units and Interpretability
      3. Usage in Model Evaluation
    5. Example Calculation
    6. Negative $R^2$ Values
    7. Practical Considerations