Last modified: September 16, 2024


Correlation

Correlation is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It is a fundamental concept in statistics, enabling researchers and analysts to understand how one variable may predict or relate to another. The most commonly used correlation coefficients are the Pearson correlation coefficient and the Spearman rank correlation coefficient.

Understanding Correlation

Important Note: Correlation does not imply causation. A high correlation between two variables does not mean that one variable causes changes in the other.

Pearson Correlation Coefficient ($r$)

The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. It is sensitive to outliers and is a meaningful summary only when the relationship is approximately linear. Significance tests and confidence intervals for $r$ additionally assume that the variables are normally distributed.

Definition

The Pearson correlation coefficient $r$ is defined as:

$$ r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} $$

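Where $\text{Cov}(X, Y)$ is the covariance of $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are their standard deviations.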

Alternative Formula:

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

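Where $X_i$ and $Y_i$ are the individual observations, $\bar{X}$ and $\bar{Y}$ are the sample means, and $n$ is the number of observations.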
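The sum-of-deviations formula translates almost line by line into code. The following is a minimal sketch in plain Python; the function name `pearson_r` and its error handling are illustrative choices, not part of the original article.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation using the sum-of-deviations formula."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)

    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))
```

Because of floating-point rounding, a perfectly linear input may return a value a hair away from 1 or -1, so comparisons are best made with a small tolerance.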

Interpretation

General Guidelines:

| Correlation Strength | Range ($r$) |
| --- | --- |
| Strong Positive Correlation | $0.7 \leq r \leq 1.0$ |
| Moderate Positive Correlation | $0.3 \leq r < 0.7$ |
| Weak Positive Correlation | $0 < r < 0.3$ |
| No Correlation | $r = 0$ |
| Weak Negative Correlation | $-0.3 < r < 0$ |
| Moderate Negative Correlation | $-0.7 < r \leq -0.3$ |
| Strong Negative Correlation | $-1.0 \leq r \leq -0.7$ |
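The guideline ranges can be captured in a small lookup function. This is a sketch only: the name `describe_correlation` is hypothetical, and the thresholds are the rule-of-thumb cut-offs from the table, not universal standards.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the guideline label from the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    if r == 0:
        return "No Correlation"

    magnitude = abs(r)
    if magnitude >= 0.7:
        strength = "Strong"
    elif magnitude >= 0.3:
        strength = "Moderate"
    else:
        strength = "Weak"

    direction = "Positive" if r > 0 else "Negative"
    return f"{strength} {direction} Correlation"
```

For instance, `describe_correlation(0.75)` returns `"Strong Positive Correlation"`, and the boundary values 0.3 and 0.7 fall into the stronger category on both the positive and negative side, matching the inequalities in the table.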

Example: Calculating Pearson's $r$

Dataset

Consider the following data on the number of hours studied ($X$) and test scores ($Y$):

| Observation ($i$) | Hours Studied ($X_i$) | Test Score ($Y_i$) |
| --- | --- | --- |
| 1 | 1 | 50 |
| 2 | 2 | 60 |
| 3 | 3 | 70 |
| 4 | 4 | 80 |
| 5 | 5 | 90 |

Step 1: Calculate the Means

$$ \bar{X} = \frac{1}{5}(1 + 2 + 3 + 4 + 5) = \frac{15}{5} = 3 $$

$$ \bar{Y} = \frac{1}{5}(50 + 60 + 70 + 80 + 90) = \frac{350}{5} = 70 $$

Step 2: Calculate Deviations and Products

Compute $(X_i - \bar{X})$, $(Y_i - \bar{Y})$, and their products:

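| $i$ | $X_i$ | $Y_i$ | $X_i - \bar{X}$ | $Y_i - \bar{Y}$ | $(X_i - \bar{X})(Y_i - \bar{Y})$ | $(X_i - \bar{X})^2$ | $(Y_i - \bar{Y})^2$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 50 | -2 | -20 | 40 | 4 | 400 |
| 2 | 2 | 60 | -1 | -10 | 10 | 1 | 100 |
| 3 | 3 | 70 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 80 | 1 | 10 | 10 | 1 | 100 |
| 5 | 5 | 90 | 2 | 20 | 40 | 4 | 400 |
| Sum | | | | | 100 | 10 | 1000 |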

Step 3: Compute the Pearson Correlation Coefficient

$$ r = \frac{\sum_{i=1}^{5} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{5} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{5} (Y_i - \bar{Y})^2}} = \frac{100}{\sqrt{10} \times \sqrt{1000}} $$

Compute the denominators:

$$ \sqrt{10} \approx 3.1623, \quad \sqrt{1000} \approx 31.6228 $$

Compute $r$:

$$ r = \frac{100}{3.1623 \times 31.6228} \approx \frac{100}{100} = 1 $$

The result is exact, since $\sqrt{10} \times \sqrt{1000} = \sqrt{10000} = 100$.
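As a cross-check, the same value follows from the covariance form $r = \text{Cov}(X, Y) / (\sigma_X \sigma_Y)$. The sketch below uses Python's standard `statistics` module and assumes the sample (divide by $n - 1$) versions of the covariance and standard deviations; the $n - 1$ factors cancel, so the result matches the sum-of-deviations calculation. The variable names are illustrative.

```python
import statistics

hours = [1, 2, 3, 4, 5]        # X: hours studied
scores = [50, 60, 70, 80, 90]  # Y: test scores
n = len(hours)

mean_x = statistics.mean(hours)
mean_y = statistics.mean(scores)

# Sample covariance: sum of deviation products divided by (n - 1).
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / (n - 1)

# Sample standard deviations (statistics.stdev also divides by n - 1).
sd_x = statistics.stdev(hours)
sd_y = statistics.stdev(scores)

r = cov_xy / (sd_x * sd_y)
print(f"Cov = {cov_xy}, r = {r:.4f}")
```

Here the sample covariance is $100 / 4 = 25$, and the product of the two sample standard deviations is also 25 up to floating-point rounding, giving $r = 1$.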

Visualization


Pearson's $r = 1$ indicates a perfect positive linear relationship between hours studied and test scores. In this dataset, each additional hour studied corresponds to exactly 10 more points on the test.

Note on Initial Error Correction

An earlier version of this example incorrectly stated that Pearson's $r$ for this data is zero, indicating no linear correlation. The data lie exactly on a straight line, and the calculation above yields $r = 1$, a perfect positive linear correlation.

Spearman Rank Correlation Coefficient ($\rho$)

The Spearman rank correlation coefficient measures the strength and direction of the monotonic relationship between two ranked variables. It is a non-parametric measure, meaning it does not assume a specific distribution for the variables and is less sensitive to outliers.

Definition

The Spearman correlation coefficient $\rho$ is defined as:

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

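Where $d_i = R(X_i) - R(Y_i)$ is the difference between the ranks of the $i$-th observation and $n$ is the number of observations. This formula is exact only when there are no tied ranks.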

Calculation Steps

  1. Assign Ranks to the data points in $X$ and $Y$ separately.
  2. Compute the Differences of Ranks $d_i$.
  3. Square the Differences $d_i^2 $.
  4. Compute $\rho$ using the formula.
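The four steps above can be sketched as follows. This is a minimal illustration that assumes no tied values, which is the condition under which the $d_i^2$ formula is exact; the helper names `ranks` and `spearman_rho` are hypothetical.

```python
def ranks(values):
    """Step 1: rank values from 1 (smallest) to n. Assumes no ties."""
    if len(set(values)) != len(values):
        raise ValueError("tied values: the d_i^2 shortcut formula does not apply")
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]


def spearman_rho(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")

    rank_x, rank_y = ranks(xs), ranks(ys)
    # Steps 2 and 3: rank differences and their squares.
    d_squared = [(rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y)]
    # Step 4: plug into the formula.
    return 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))
```

With the made-up inputs `xs = [10, 20, 30]` and `ys = [1, 3, 2]`, the rank differences are 0, -1, and 1, so $\sum d_i^2 = 2$ and the function returns $1 - 12/24 = 0.5$.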

Example: Calculating Spearman's $\rho$

Using the same dataset:

| Observation ($i$) | Hours Studied ($X_i$) | Test Score ($Y_i$) |
| --- | --- | --- |
| 1 | 1 | 50 |
| 2 | 2 | 60 |
| 3 | 3 | 70 |
| 4 | 4 | 80 |
| 5 | 5 | 90 |

Step 1: Assign Ranks

Since the data is already ordered, the ranks correspond to the order of observations.

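| $i$ | $X_i$ | Rank $R(X_i)$ | $Y_i$ | Rank $R(Y_i)$ |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 50 | 1 |
| 2 | 2 | 2 | 60 | 2 |
| 3 | 3 | 3 | 70 | 3 |
| 4 | 4 | 4 | 80 | 4 |
| 5 | 5 | 5 | 90 | 5 |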

Step 2: Compute Differences of Ranks

Calculate $d_i = R(X_i) - R(Y_i)$:

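| $i$ | Rank $R(X_i)$ | Rank $R(Y_i)$ | $d_i$ | $d_i^2$ |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 0 | 0 |
| 2 | 2 | 2 | 0 | 0 |
| 3 | 3 | 3 | 0 | 0 |
| 4 | 4 | 4 | 0 | 0 |
| 5 | 5 | 5 | 0 | 0 |
| Sum | | | | 0 |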

Step 3: Compute Spearman's $\rho$

$$ \rho = 1 - \frac{6 \times 0}{5(5^2 - 1)} = 1 - 0 = 1 $$

Interpretation

Spearman's $\rho = 1 $ indicates a perfect positive monotonic relationship between hours studied and test scores.
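To see why "monotonic" is weaker than "linear", it can help to compare the two coefficients on data that always increase but not along a straight line. The sketch below uses made-up data ($y = x^3$) and relies on the fact that Spearman's $\rho$ equals Pearson's $r$ computed on the ranks; the helper names are illustrative and ties are assumed absent.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from sums of deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


def to_ranks(values):
    """Rank 1 = smallest value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # hypothetical data: strictly increasing, not linear

r = pearson_r(x, y)
rho = pearson_r(to_ranks(x), to_ranks(y))  # Spearman = Pearson on ranks
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

For this data Pearson's $r$ comes out at roughly 0.94, while Spearman's $\rho$ is 1, because the ranks of $x$ and $y$ agree exactly even though the points do not lie on a line.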

When to Use Spearman's $\rho$

Comparison of Pearson's $r$ and Spearman's $\rho$

Correlation of Two Random Variables

For two random variables $X$ and $Y$ with positive variances, the population correlation coefficient $\rho_{XY}$ is defined as:

$$ \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} $$

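Where $\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$ is the covariance of $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are their standard deviations.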

Properties

| Property | Description |
| --- | --- |
| Range | $-1 \leq \rho_{XY} \leq 1$ |
| Symmetry | $\rho_{XY} = \rho_{YX}$ |
| Dimensionless | The correlation coefficient is unitless. |
| Linearity | If $\rho_{XY} = \pm 1$, $Y$ is a perfect linear function of $X$. |
| Independence | If $X$ and $Y$ are independent, $\rho_{XY} = 0$. However, $\rho_{XY} = 0$ does not imply independence unless the variables are jointly normally distributed. |
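The last property is easy to check on a small discrete example. The sketch below assumes a hypothetical distribution in which $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$, so $Y$ is completely determined by $X$, yet the covariance (and hence $\rho_{XY}$) is zero. Exact fractions are used to avoid rounding; all names are illustrative.

```python
from fractions import Fraction

# Hypothetical joint pmf: X uniform on {-1, 0, 1}, Y = X**2.
pmf = {
    (-1, 1): Fraction(1, 3),
    (0, 0): Fraction(1, 3),
    (1, 1): Fraction(1, 3),
}


def expectation(pmf, f):
    """E[f(X, Y)] under a discrete joint pmf."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())


mean_x = expectation(pmf, lambda x, y: x)
mean_y = expectation(pmf, lambda x, y: y)
cov_xy = expectation(pmf, lambda x, y: (x - mean_x) * (y - mean_y))

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_x0 = sum(p for (x, _), p in pmf.items() if x == 0)
p_y0 = sum(p for (_, y), p in pmf.items() if y == 0)
p_joint = pmf.get((0, 0), Fraction(0))

print("Cov(X, Y) =", cov_xy)
print("P(X=0, Y=0) =", p_joint, "vs P(X=0) * P(Y=0) =", p_x0 * p_y0)
```

The covariance is exactly 0, but $P(X = 0, Y = 0) = 1/3$ while $P(X = 0) \cdot P(Y = 0) = 1/9$, so the variables are uncorrelated and still dependent.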

Interpretation

Example with Random Variables

Suppose $X$ and $Y$ are random variables with $\text{Cov}(X, Y) = 6$, $\sigma_X = 2$, and $\sigma_Y = 4$.

Compute the correlation coefficient:

$$ \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{6}{2 \times 4} = \frac{6}{8} = 0.75 $$

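Interpretation: $\rho_{XY} = 0.75$ indicates a strong positive linear relationship between $X$ and $Y$ under the general guidelines for interpreting correlation strength.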

Important Considerations

Correlation vs. Causation

Outliers

Non-linear Relationships

Assumptions of Pearson's $r$

  1. Linearity: the relationship between $X$ and $Y$ follows a straight-line pattern.
  2. Normality: both $X$ and $Y$ are normally distributed. This matters mainly for significance tests and confidence intervals on $r$.
  3. Homoscedasticity: the variance of $Y$ is constant across all values of $X$.

If these assumptions are violated, Pearson's $r$ may not be an appropriate measure of correlation.
