Covariance is a fundamental statistical measure that quantifies how two random variables change together. Its sign indicates the direction of the linear relationship between the variables:
- A positive covariance implies that as one variable increases, the other tends to increase as well.
- A negative covariance suggests that as one variable increases, the other tends to decrease.
- A zero covariance indicates no linear relationship between the variables.
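Before the formal definition, a quick numerical illustration can make these three cases concrete. The sketch below is a minimal example in Python with NumPy (the data and coefficients are made up purely for illustration): it generates a positively related, a negatively related, and an unrelated pair of variables and prints their sample covariances.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)
noise = rng.normal(size=10_000)

y_pos = 2 * x + noise    # tends to rise with x      -> positive covariance
y_neg = -2 * x + noise   # tends to fall as x rises  -> negative covariance
y_none = noise           # unrelated to x            -> covariance near zero

for label, y in [("positive", y_pos), ("negative", y_neg), ("unrelated", y_none)]:
    print(label, np.cov(x, y)[0, 1])
```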
Definition
The covariance between two random variables $X$ and $Y$ is defined as the expected value (mean) of the product of their deviations from their respective means:
$$ \text{Cov}(X, Y) = \mathbb{E}\left[ (X - \mu_X)(Y - \mu_Y) \right] $$
Where:
- $\text{Cov}(X, Y)$ is the covariance between $X$ and $Y$.
- $\mathbb{E}$ denotes the expected value operator.
- $\mu_X = \mathbb{E}[X]$ is the mean of $X$.
- $\mu_Y = \mathbb{E}[Y]$ is the mean of $Y$.
Alternative Expression
By expanding the definition and applying the linearity properties of expectation, covariance can also be expressed as:
$$ \text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y] $$
Derivation:
- Start with the definition:
$$ \text{Cov}(X, Y) = \mathbb{E}\left[ (X - \mu_X)(Y - \mu_Y) \right] $$
- Expand the product inside the expectation:
$$ \text{Cov}(X, Y) = \mathbb{E}\left[ XY - X \mu_Y - \mu_X Y + \mu_X \mu_Y \right] $$
- Use the linearity of expectation:
$$ \text{Cov}(X, Y) = \mathbb{E}[XY] - \mu_Y \mathbb{E}[X] - \mu_X \mathbb{E}[Y] + \mu_X \mu_Y $$
- Recognize that $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$:
$$ \text{Cov}(X, Y) = \mathbb{E}[XY] - \mu_Y \mu_X - \mu_X \mu_Y + \mu_X \mu_Y = \mathbb{E}[XY] - \mu_X \mu_Y $$
Thus, we arrive at:
$$ \text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y] $$
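As a sanity check, the two forms can be compared numerically. The sketch below (Python with NumPy; the sample itself is arbitrary) evaluates both the deviation form and the shortcut form on the same data, and the results agree up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary sample where Y depends linearly on X plus noise.
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)

# Form 1: mean of the product of deviations from the means.
cov_deviation_form = np.mean((x - x.mean()) * (y - y.mean()))

# Form 2: E[XY] - E[X]E[Y].
cov_shortcut_form = np.mean(x * y) - x.mean() * y.mean()

print(cov_deviation_form, cov_shortcut_form)  # identical up to rounding
```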
Interpretation
- Positive Covariance ($\text{Cov}(X, Y) > 0$): Indicates that $X$ and $Y$ tend to increase or decrease together.
- Negative Covariance ($\text{Cov}(X, Y) < 0$): Indicates that when $X$ increases, $Y$ tends to decrease, and vice versa.
- Zero Covariance ($\text{Cov}(X, Y) = 0$): Suggests no linear relationship between $X$ and $Y$.
Important Note:
- If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$.
- However, a covariance of zero does not necessarily imply independence. Variables can be uncorrelated (zero covariance) but still dependent in a non-linear way.
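A standard illustration of this caveat: take $X$ symmetric around zero and let $Y = X^2$. Then $Y$ is completely determined by $X$, yet $\text{Cov}(X, Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\,\mathbb{E}[X^2] = 0$. A minimal sketch in Python/NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)

# X is symmetric around 0; Y is a deterministic, non-linear function of X.
x = rng.normal(size=100_000)
y = x ** 2

# The sample covariance is close to 0 even though Y depends entirely on X.
print(np.cov(x, y)[0, 1])
```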
Properties of Covariance
I. Symmetry:
$$ \text{Cov}(X, Y) = \text{Cov}(Y, X) $$
II. Linearity in Each Argument:
For constants $a$ and $b$, and random variables $X$, $Y$, and $Z$:
$$ \text{Cov}(aX + bY, Z) = a \text{Cov}(X, Z) + b \text{Cov}(Y, Z) $$
By symmetry, the same holds in the second argument.
III. Covariance with Itself (Variance Relation):
The covariance of a variable with itself is the variance of that variable:
$$ \text{Cov}(X, X) = \text{Var}(X) $$
IV. Scaling:
If $a$ and $b$ are constants:
$$ \text{Cov}(aX, bY) = ab \text{Cov}(X, Y) $$
V. Addition of Constants:
Adding a constant to a variable does not affect the covariance:
$$ \text{Cov}(X + c, Y) = \text{Cov}(X, Y) $$
VI. Relationship with Correlation:
Covariance is related to the correlation coefficient $\rho_{XY}$:
$$ \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} $$
Where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively.
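These properties are easy to verify numerically. The sketch below (Python/NumPy; the constants and data are arbitrary) checks symmetry, the variance relation, scaling, and the addition of a constant on a random sample.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5_000)
y = 0.5 * x + rng.normal(size=5_000)
a, b, c = 3.0, -2.0, 7.0

def cov(u, v):
    """Sample covariance (off-diagonal entry of the 2x2 covariance matrix)."""
    return np.cov(u, v)[0, 1]

print(np.isclose(cov(x, y), cov(y, x)))                   # I.  symmetry
print(np.isclose(cov(x, x), np.var(x, ddof=1)))           # III. Cov(X, X) = Var(X)
print(np.isclose(cov(a * x, b * y), a * b * cov(x, y)))   # IV. scaling
print(np.isclose(cov(x + c, y), cov(x, y)))               # V.  adding a constant
```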
Sample Covariance
When working with sample data, the sample covariance between two variables $X$ and $Y$ is calculated as:
$$ s_{XY} = \text{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) $$
Where:
- $n$ is the number of observations.
- $X_i$ and $Y_i$ are the $i$-th observations of variables $X$ and $Y$.
- $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$.
Note: The $n - 1$ denominator (Bessel's correction) makes the sample covariance an unbiased estimator of the population covariance.
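In practice this is usually a single library call, but the formula is straightforward to implement directly. Below is a minimal sketch in Python/NumPy (with arbitrary illustrative data) comparing a hand-rolled version against `np.cov`, which uses the same $n - 1$ denominator by default.

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance using the unbiased (n - 1) denominator."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

x = [2.1, 2.5, 3.6, 4.0]
y = [8.0, 10.0, 12.0, 14.0]

print(sample_cov(x, y))     # direct use of the formula above
print(np.cov(x, y)[0, 1])   # NumPy's estimate, identical up to rounding
```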
Example: Calculating Covariance Step by Step
Let's calculate the covariance between two variables $X$ and $Y$ using the following dataset:
| Observation ($i$) | $X_i$ | $Y_i$ |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 2 | 4 |
| 3 | 3 | 6 |
Step 1: Calculate the Sample Means
Compute the mean of $X$ and $Y$:
$$ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1 + 2 + 3}{3} = \frac{6}{3} = 2 $$
$$ \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i = \frac{2 + 4 + 6}{3} = \frac{12}{3} = 4 $$
Step 2: Compute the Deviations from the Mean
Calculate $(X_i - \bar{X})$ and $(Y_i - \bar{Y})$:
| $i$ | $X_i$ | $Y_i$ | $X_i - \bar{X}$ | $Y_i - \bar{Y}$ |
|---|---|---|---|---|
| 1 | 1 | 2 | $1 - 2 = -1$ | $2 - 4 = -2$ |
| 2 | 2 | 4 | $2 - 2 = 0$ | $4 - 4 = 0$ |
| 3 | 3 | 6 | $3 - 2 = 1$ | $6 - 4 = 2$ |
Step 3: Calculate the Product of Deviations
Compute $(X_i - \bar{X})(Y_i - \bar{Y})$:
| $i$ | $X_i - \bar{X}$ | $Y_i - \bar{Y}$ | $(X_i - \bar{X})(Y_i - \bar{Y})$ |
|---|---|---|---|
| 1 | $-1$ | $-2$ | $(-1)(-2) = 2$ |
| 2 | $0$ | $0$ | $(0)(0) = 0$ |
| 3 | $1$ | $2$ | $(1)(2) = 2$ |
Step 4: Sum the Products of Deviations
Compute the sum:
$$ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = 2 + 0 + 2 = 4 $$
Step 5: Calculate the Sample Covariance
Use the sample covariance formula:
$$ s_{XY} = \text{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) $$
Since $n = 3$:
$$ s_{XY} = \frac{1}{3 - 1} \times 4 = \frac{1}{2} \times 4 = 2 $$
Interpretation:
- The positive covariance of $2$ indicates that $X$ and $Y$ tend to increase together.
- Since the data points lie perfectly on the straight line $Y = 2X$, the relationship is a perfect positive linear one, which the correlation coefficient in Step 7 confirms.
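The same five steps can be reproduced in a few lines. A short sketch in Python/NumPy that mirrors the calculation above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

x_bar, y_bar = x.mean(), y.mean()    # Step 1: sample means (2 and 4)
dx, dy = x - x_bar, y - y_bar        # Step 2: deviations from the means
products = dx * dy                   # Step 3: products of deviations
total = products.sum()               # Step 4: sum of the products (4)
s_xy = total / (len(x) - 1)          # Step 5: sample covariance (2.0)

print(s_xy)                          # 2.0
print(np.cov(x, y)[0, 1])            # same value from NumPy
```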
Step 6: Calculate the Variances (Optional)
For completeness, calculate the variances of $X$ and $Y$:
Variance of $X$
$$ s_{XX} = \text{Var}(X) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2 $$
Compute $(X_i - \bar{X})^2$:

| $i$ | $X_i - \bar{X}$ | $(X_i - \bar{X})^2$ |
|---|---|---|
| 1 | $-1$ | $(-1)^2 = 1$ |
| 2 | $0$ | $(0)^2 = 0$ |
| 3 | $1$ | $(1)^2 = 1$ |
Sum:
$$ \sum_{i=1}^{n} (X_i - \bar{X})^2 = 1 + 0 + 1 = 2 $$
Compute variance:
$$ s_{XX} = \frac{1}{2} \times 2 = 1 $$
Variance of $Y$
Similarly, compute $(Y_i - \bar{Y})^2$:

| $i$ | $Y_i - \bar{Y}$ | $(Y_i - \bar{Y})^2$ |
|---|---|---|
| 1 | $-2$ | $(-2)^2 = 4$ |
| 2 | $0$ | $(0)^2 = 0$ |
| 3 | $2$ | $(2)^2 = 4$ |
Sum:
$$ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 4 + 0 + 4 = 8 $$
Compute variance:
$$ s_{YY} = \text{Var}(Y) = \frac{1}{2} \times 8 = 4 $$
Step 7: Calculate the Correlation Coefficient (Optional)
The correlation coefficient $r_{XY}$ standardizes the covariance, providing a dimensionless measure of the strength and direction of the linear relationship:
$$ r_{XY} = \frac{s_{XY}}{\sqrt{s_{XX} \times s_{YY}}} = \frac{2}{\sqrt{1 \times 4}} = \frac{2}{2} = 1 $$
Interpretation:
- A correlation coefficient of $1$ indicates a perfect positive linear relationship between $X$ and $Y$.
- This makes sense since $Y = 2X$ in the dataset.
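Both the variances and the correlation can be verified with library calls. A brief sketch in Python/NumPy for the same dataset:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

s_xy = np.cov(x, y)[0, 1]        # sample covariance: 2.0
s_xx = np.var(x, ddof=1)         # sample variance of X: 1.0
s_yy = np.var(y, ddof=1)         # sample variance of Y: 4.0

r_manual = s_xy / np.sqrt(s_xx * s_yy)   # 1.0, from the formula above
r_numpy = np.corrcoef(x, y)[0, 1]        # 1.0, from NumPy directly

print(r_manual, r_numpy)
```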
Limitations of Covariance
I. Scale Dependence:
- Covariance values depend on the units of measurement of the variables.
- For example, measuring height in meters vs. centimeters changes the covariance by a factor of 100 (see the sketch at the end of this section).
II. Comparison Difficulties:
- Because covariance is not standardized, comparing covariances across different datasets or variables with different scales is challenging.
- This is why the correlation coefficient, which standardizes covariance, is often used.
III. Not a Measure of Strength:
- The magnitude of covariance does not directly indicate the strength of the relationship.
- A large covariance could be due to large variances rather than a strong relationship.
IV. Linear Relationships Only:
- Covariance measures only linear relationships.
- It does not capture non-linear dependencies between variables.
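To make the scale-dependence point concrete, the sketch below (Python/NumPy, with made-up height-and-weight-style data) converts one variable from meters to centimeters: the covariance changes by a factor of 100 purely because of the unit change, while the correlation coefficient stays the same.

```python
import numpy as np

rng = np.random.default_rng(3)

height_m = rng.normal(loc=1.75, scale=0.10, size=1_000)              # heights in meters
weight_kg = 45 + 40 * height_m + rng.normal(scale=5.0, size=1_000)   # loosely related weights
height_cm = height_m * 100                                           # same data, different units

print(np.cov(height_m, weight_kg)[0, 1])        # covariance in meter-kilogram units
print(np.cov(height_cm, weight_kg)[0, 1])       # 100x larger, only because of the units
print(np.corrcoef(height_m, weight_kg)[0, 1])   # correlation is unit-free ...
print(np.corrcoef(height_cm, weight_kg)[0, 1])  # ... so it is unchanged
```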