Last modified: March 30, 2020
Correlation is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It is a fundamental concept in statistics, enabling researchers and analysts to understand how one variable may predict or relate to another. The most commonly used correlation coefficients are the Pearson correlation coefficient and the Spearman rank correlation coefficient.
Understanding Correlation
- A positive correlation occurs when, as one variable increases, the other tends to rise as well.
- In contrast, a negative correlation happens when an increase in one variable results in the other decreasing.
- Lastly, a zero correlation indicates that there is no linear relationship between the variables.
Important Note: Correlation does not imply causation. A high correlation between two variables does not mean that one variable causes changes in the other.
Pearson Correlation Coefficient ($r$)
The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. It is sensitive to outliers and assumes that the relationship is linear and that both variables are normally distributed.
Definition
The Pearson correlation coefficient is defined as:

$$r = \frac{\text{Cov}(x, y)}{s_x s_y}$$

Where:
- $\text{Cov}(x, y)$ is the covariance between variables $x$ and $y$.
- $s_x$ and $s_y$ are the standard deviations of $x$ and $y$, respectively.

Alternative Formula:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Where:
- $n$ is the number of observations.
- $x_i$ and $y_i$ are the $i$-th observations of $x$ and $y$.
- $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$.
Interpretation
- $r = 1$: Perfect positive linear correlation.
- $r = -1$: Perfect negative linear correlation.
- $r = 0$: No linear correlation.
General Guidelines:
| Correlation Strength | Range ($r$) |
|---|---|
| Strong Positive Correlation | $0.7 \le r \le 1.0$ |
| Moderate Positive Correlation | $0.3 \le r < 0.7$ |
| Weak Positive Correlation | $0 < r < 0.3$ |
| No Correlation | $r = 0$ |
| Weak Negative Correlation | $-0.3 < r < 0$ |
| Moderate Negative Correlation | $-0.7 < r \le -0.3$ |
| Strong Negative Correlation | $-1.0 \le r \le -0.7$ |
Example: Calculating Pearson's $r$
Dataset
Consider the following data on the number of hours studied ($x$) and test scores ($y$):
| Observation ($i$) | Hours Studied ($x_i$) | Test Score ($y_i$) |
|---|---|---|
| 1 | 1 | 50 |
| 2 | 2 | 60 |
| 3 | 3 | 70 |
| 4 | 4 | 80 |
| 5 | 5 | 90 |
Step 1: Calculate the Means

$$\bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = 3, \qquad \bar{y} = \frac{50 + 60 + 70 + 80 + 90}{5} = 70$$
Step 2: Calculate Deviations and Products

Compute $x_i - \bar{x}$, $y_i - \bar{y}$, and their products:

| $i$ | $x_i$ | $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 50 | -2 | -20 | 40 | 4 | 400 |
| 2 | 2 | 60 | -1 | -10 | 10 | 1 | 100 |
| 3 | 3 | 70 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 80 | 1 | 10 | 10 | 1 | 100 |
| 5 | 5 | 90 | 2 | 20 | 40 | 4 | 400 |
| **Sum** | | | | | **100** | **10** | **1000** |
Step 3: Compute the Pearson Correlation Coefficient

Compute the denominators:

$$\sqrt{\sum (x_i - \bar{x})^2} = \sqrt{10}, \qquad \sqrt{\sum (y_i - \bar{y})^2} = \sqrt{1000}$$

Compute $r$:

$$r = \frac{100}{\sqrt{10} \cdot \sqrt{1000}} = \frac{100}{\sqrt{10000}} = \frac{100}{100} = 1$$
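The steps above can be sketched in Python, computing $r$ directly from the definition rather than through a library call:

```python
# Pearson's r for the worked example, computed directly from the definition.
x = [1, 2, 3, 4, 5]       # hours studied
y = [50, 60, 70, 80, 90]  # test scores

n = len(x)
mean_x = sum(x) / n  # 3.0
mean_y = sum(y) / n  # 70.0

# Numerator: sum of products of deviations (100 in the table above)
num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
# Denominator: square root of the product of the sums of squared deviations
den = (sum((xi - mean_x) ** 2 for xi in x)
       * sum((yi - mean_y) ** 2 for yi in y)) ** 0.5

r = num / den
print(r)  # 1.0
```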
Visualization
Pearson's $r = 1$ indicates a perfect positive linear relationship between hours studied and test scores. As the number of hours studied increases, the test score increases proportionally.
Spearman Rank Correlation Coefficient ($\rho$)
The Spearman rank correlation coefficient measures the strength and direction of the monotonic relationship between two ranked variables. It is a non-parametric measure, meaning it does not assume a specific distribution for the variables and is less sensitive to outliers.
Definition
The Spearman correlation coefficient is defined as:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

Where:
- $d_i = R(x_i) - R(y_i)$ is the difference between the ranks of $x_i$ and $y_i$.
- $R(x_i)$ and $R(y_i)$ are the ranks of $x_i$ and $y_i$, respectively.
- $n$ is the number of observations.
Calculation Steps
- Assign Ranks to the data points in $x$ and $y$ separately.
- Compute the Differences of Ranks $d_i = R(x_i) - R(y_i)$.
- Square the Differences to get $d_i^2$.
- Compute $\rho$ using the formula.
Example: Calculating Spearman's $\rho$
Using the same dataset:
| Observation ($i$) | Hours Studied ($x_i$) | Test Score ($y_i$) |
|---|---|---|
| 1 | 1 | 50 |
| 2 | 2 | 60 |
| 3 | 3 | 70 |
| 4 | 4 | 80 |
| 5 | 5 | 90 |
Step 1: Assign Ranks
Since the data is already ordered, the ranks correspond to the order of observations.
| $i$ | $x_i$ | Rank $R(x_i)$ | $y_i$ | Rank $R(y_i)$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 50 | 1 |
| 2 | 2 | 2 | 60 | 2 |
| 3 | 3 | 3 | 70 | 3 |
| 4 | 4 | 4 | 80 | 4 |
| 5 | 5 | 5 | 90 | 5 |
Step 2: Compute Differences of Ranks
Calculate $d_i = R(x_i) - R(y_i)$ and $d_i^2$:

| $i$ | $R(x_i)$ | $R(y_i)$ | $d_i$ | $d_i^2$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 |
| 2 | 2 | 2 | 0 | 0 |
| 3 | 3 | 3 | 0 | 0 |
| 4 | 4 | 4 | 0 | 0 |
| 5 | 5 | 5 | 0 | 0 |
| **Sum** | | | | **0** |
Step 3: Compute Spearman's $\rho$

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \cdot 0}{5(25 - 1)} = 1 - 0 = 1$$
Interpretation
Spearman's $\rho = 1$ indicates a perfect positive monotonic relationship between hours studied and test scores.
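The same steps can be sketched in Python. The small `ranks` helper below is an ad-hoc illustration that assumes no tied values, which holds for this dataset:

```python
# Spearman's rho for the worked example via the rank-difference formula.
x = [1, 2, 3, 4, 5]
y = [50, 60, 70, 80, 90]

def ranks(values):
    # Rank 1 goes to the smallest value; assumes no ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

n = len(x)
# Sum of squared rank differences (0 for this perfectly monotonic data)
d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(rho)  # 1.0
```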
When to Use Spearman's $\rho$
- When the data is ordinal.
- When the relationship between variables is monotonic but not necessarily linear.
- When there are outliers that might distort Pearson's $r$.
- When the variables are not normally distributed.
Comparison of Pearson's $r$ and Spearman's $\rho$
- Pearson's $r$ measures the strength of a linear relationship.
- Spearman's $\rho$ measures the strength of a monotonic relationship.
- Both coefficients range from $-1$ to $1$.
- Spearman's $\rho$ is less sensitive to outliers and does not require the assumption of normality.
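To see the difference concretely, here is a small sketch comparing both coefficients on a monotonic but non-linear relationship ($y = x^3$). The `pearson` and `spearman` helpers are ad-hoc implementations of the formulas above (assuming no tied ranks), not library functions:

```python
import math

def pearson(x, y):
    # Pearson's r from the deviation-product formula.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

def spearman(x, y):
    # Spearman's rho from squared rank differences (no ties assumed).
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 3 for xi in x]  # monotonic but curved

print(round(pearson(x, y), 3))  # high, but less than 1: not a straight line
print(spearman(x, y))           # 1.0: the relationship is perfectly monotonic
```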
Correlation of Two Random Variables
For two random variables $X$ and $Y$ with positive variances, the population correlation coefficient is defined as:

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

Where:
- $\text{Cov}(X, Y)$ is the covariance between $X$ and $Y$.
- $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$.
Properties
| Property | Description |
|---|---|
| Range | $-1 \le \rho_{XY} \le 1$ |
| Symmetry | $\rho_{XY} = \rho_{YX}$ |
| Dimensionless | The correlation coefficient is unitless. |
| Linearity | If $\lvert \rho_{XY} \rvert = 1$, $Y$ is a perfect linear function of $X$. |
| Independence | If $X$ and $Y$ are independent, $\rho_{XY} = 0$. However, $\rho_{XY} = 0$ does not imply independence unless the variables are jointly normally distributed. |
Interpretation
- $\rho_{XY} > 0$: Positive linear relationship.
- $\rho_{XY} < 0$: Negative linear relationship.
- $\rho_{XY} = 0$: No linear relationship.
Example with Random Variables
Suppose $X$ and $Y$ are random variables with, for example, the following properties:

$$\text{Cov}(X, Y) = 8, \qquad \sigma_X = 2, \qquad \sigma_Y = 5$$

Compute the correlation coefficient:

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{8}{2 \cdot 5} = 0.8$$

Interpretation:
- A correlation coefficient of $0.8$ indicates a strong positive linear relationship between $X$ and $Y$.
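A population correlation like this can be checked by simulation. This sketch assumes illustrative parameters ($\sigma_X = 2$, $\sigma_Y = 5$, $\mathrm{Cov}(X, Y) = 8$, hence $\rho = 0.8$) and constructs $Y$ as a linear function of $X$ plus independent noise chosen to match them:

```python
import math
import random

# Assumed illustrative parameters: sigma_X = 2, sigma_Y = 5, Cov(X, Y) = 8,
# giving a population correlation of rho = 8 / (2 * 5) = 0.8.
# Construction: Y = 2*X + E with X ~ N(0, 2^2) and E ~ N(0, 3^2) independent,
# so Cov(X, Y) = 2 * Var(X) = 8 and Var(Y) = 4 * 4 + 9 = 25, as required.
random.seed(0)
n = 200_000
x = [random.gauss(0, 2) for _ in range(n)]
y = [2 * xi + random.gauss(0, 3) for xi in x]

mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sd_x = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
sd_y = math.sqrt(sum((b - my) ** 2 for b in y) / n)

rho = cov / (sd_x * sd_y)
print(round(rho, 2))  # close to the population value 0.8
```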
Important Considerations
Correlation vs. Causation
- A strong correlation between two variables does not establish that one causes the other; a confounding (lurking) variable or coincidence may explain the association.
Outliers
- Pearson's $r$ is sensitive to outliers, which can distort the correlation coefficient.
- Spearman's $\rho$ is more robust to outliers because it operates on ranks.
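A short sketch of this effect, using ad-hoc implementations of both formulas (the helper names are illustrative): a single extreme outlier flips the sign of Pearson's $r$, while Spearman's $\rho$ remains positive because the ranks change only slightly:

```python
import math

def pearson(x, y):
    # Pearson's r from the deviation-product formula.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

def spearman(x, y):
    # Spearman's rho from squared rank differences (no ties assumed).
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = list(range(1, 11))                 # 1..10
y = [2 * xi for xi in x[:-1]] + [-50]  # linear trend, one extreme outlier

print(round(pearson(x, y), 2))   # negative: the single outlier dominates
print(round(spearman(x, y), 2))  # still positive: ranks barely change
```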
Non-linear Relationships
- Variables can have a strong non-linear relationship but a low Pearson correlation coefficient.
- In such cases, Spearman's $\rho$ may detect the monotonic relationship.
Assumptions of Pearson's $r$
- Linearity: the relationship between $x$ and $y$ follows a straight-line pattern.
- Normality: both $x$ and $y$ are approximately normally distributed.
- Homoscedasticity: the variance of $y$ remains constant across all values of $x$.

If these assumptions are violated, Pearson's $r$ may not be an appropriate measure of correlation.