Last modified: March 26, 2020
The Central Limit Theorem (CLT) is a fundamental result in statistics. It explains why the distribution of sample means approaches a normal distribution (the bell curve) as the sample size grows, regardless of the shape of the population distribution, provided the population has a finite variance.
Let $X_1, X_2, \ldots, X_n$ be a sequence of independent and identically distributed random variables, each with mean $\mu$ and finite variance $\sigma^2 > 0$.
As $n$ (the sample size) tends to infinity, the distribution of the standardized sum
$$\frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$
converges in distribution to the standard normal distribution. That is, for every real number $a$, as $n \to \infty$:
$$P\left(\frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}} \leq a\right) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2} \, dx$$
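The convergence statement can be checked numerically. The sketch below is only an illustration, not part of the theorem: it assumes an Exponential(1) population (so $\mu = 1$ and $\sigma = 1$), and the function names, the threshold $a = 1$, and the trial count are arbitrary choices made for this example.

```python
import math
import random
from statistics import NormalDist

def standardized_sum(n, rng, mu=1.0, sigma=1.0):
    """Draw n Exponential(1) values and return (sum - n*mu) / (sigma*sqrt(n))."""
    total = sum(rng.expovariate(1.0) for _ in range(n))
    return (total - n * mu) / (sigma * math.sqrt(n))

def empirical_cdf_at(a, n, trials=10_000, seed=0):
    """Fraction of simulated standardized sums that are <= a."""
    rng = random.Random(seed)
    hits = sum(standardized_sum(n, rng) <= a for _ in range(trials))
    return hits / trials

a = 1.0
target = NormalDist().cdf(a)  # standard normal CDF evaluated at a
for n in (1, 5, 30, 200):
    print(f"n={n:>3}  empirical={empirical_cdf_at(a, n):.4f}  normal={target:.4f}")
```

The empirical column is a Monte Carlo estimate, so it fluctuates with the seed and the number of trials.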
Key Points:
The CLT lets us approximate probabilities and percentages for large samples using the normal distribution, which underpins much of statistical analysis and inference. For example:
Key points regarding the CLT:
A histogram of many sample means will tend toward a bell-shaped curve as the size of each sample increases, reflecting the normal distribution predicted by the CLT.
The plot below shows the non-normal exponential distribution, which is right-skewed:
The second plot demonstrates the distribution of sample means, where the sample means form a bell-shaped (approximately normal) distribution, even though the original population is non-normal. This illustrates the Central Limit Theorem in action.
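The same comparison can be made numerically instead of visually. The following is a minimal sketch under assumed settings that are not taken from the plots: an Exponential(1) population, samples of size 40, and 5,000 repeated samples. It measures skewness, which is close to 0 for a symmetric, bell-shaped distribution and equals 2 for the exponential distribution.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score of the values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(42)

# Raw draws from the right-skewed exponential population.
population_draws = [rng.expovariate(1.0) for _ in range(20_000)]

# Means of many independent samples, each of size 40.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(40)])
    for _ in range(5_000)
]

print(f"skewness of raw draws:    {skewness(population_draws):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

The sample means should show much less skew than the raw draws, which is the numerical counterpart of the bell shape in a histogram.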
When standardizing a sample statistic, we use the following formula to calculate the z-score:
$$ z = \frac{{\text{statistic} - \text{expected value}}}{{\text{Standard Error (SE) of the statistic}}} $$
Let’s assume we are sampling incomes with the following population parameters: mean $\mu = 67{,}000$ and standard deviation $\sigma = 38{,}000$.
Step 1: Calculate the Standard Error of the Sample Mean
The standard error of the sample mean ($SE(\bar{x}_n)$) is calculated using the formula:
$$ SE(\bar{x}_n) = \frac{\sigma}{\sqrt{n}} $$
Where $\sigma$ is the population standard deviation and $n$ is the sample size.
For example, if we take a sample size of $n = 100$, we can calculate the standard error as follows:
$$ SE(\bar{x}_{100}) = \frac{38{,}000}{\sqrt{100}} = \frac{38{,}000}{10} = 3{,}800 $$
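As a quick check of this arithmetic, here is a small sketch; the helper name `standard_error` is made up for this example.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sigma / math.sqrt(n)

se_100 = standard_error(38_000, 100)
print(se_100)  # 3800.0

# Quadrupling the sample size halves the standard error.
se_400 = standard_error(38_000, 400)
print(se_400)  # 1900.0
```

Because $n$ sits under a square root, cutting the standard error in half requires four times as many observations.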
Step 2: Calculate the z-score
Once we have the standard error, we can calculate the z-score, which measures how far the sample statistic (e.g., sample mean) is from the expected value (population mean), in terms of standard errors. The z-score formula is:
$$ z = \frac{{\bar{x} - \mu}}{{SE(\bar{x}_n)}} $$
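Where $\bar{x}$ is the sample mean, $\mu$ is the population mean, and $SE(\bar{x}_n)$ is the standard error of the sample mean.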
For instance, if the sample mean $\bar{x}$ is 70,000, the z-score would be:
$$ z = \frac{70{,}000 - 67{,}000}{3{,}800} = \frac{3{,}000}{3{,}800} \approx 0.79 $$
This means the sample mean is 0.79 standard errors above the population mean.
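The z-score calculation can be sketched in a few lines. The function name `z_score` is hypothetical, and the tail probability at the end is an extra illustration that relies on the CLT's normal approximation for the sample mean.

```python
from statistics import NormalDist

def z_score(statistic, expected_value, standard_error):
    """Number of standard errors between a statistic and its expected value."""
    return (statistic - expected_value) / standard_error

z = z_score(70_000, 67_000, 3_800)
print(round(z, 2))  # 0.79

# Under the normal approximation, the chance of observing a sample mean
# at least this far above the population mean.
upper_tail = 1 - NormalDist().cdf(z)
print(round(upper_tail, 3))
```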
Consider a scenario where the heights of a certain plant species are normally distributed with a population mean $\mu = 15$ cm and a population standard deviation $\sigma = 3$ cm. We will analyze random samples of different sizes and calculate the probability that the sample mean falls between 14 cm and 16 cm.
Definitions:
We will calculate the probability that the sample mean is between 14 cm and 16 cm for various sample sizes.
Estimate the probability that the sample mean height of 16 plants lies between 14 cm and 16 cm.
I. Calculate the Standard Error of the Mean (SEM):
$$ \text{SEM} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{16}} = \frac{3}{4} = 0.75 $$
II. Calculate the Z-scores for 14 cm and 16 cm:
$$ Z_{14} = \frac{14 - 15}{0.75} = \frac{-1}{0.75} \approx -1.33 $$
$$ Z_{16} = \frac{16 - 15}{0.75} = \frac{1}{0.75} \approx 1.33 $$
III. Find the probabilities associated with the Z-scores:
Using standard normal distribution tables (or a calculator): $P(Z \leq 1.33) \approx 0.9082$ and $P(Z \leq -1.33) \approx 0.0918$.
IV. Calculate the probability that the sample mean lies between 14 and 16 cm:
$$ P(14 \leq \bar{X} \leq 16) = P(Z \leq 1.33) - P(Z \leq -1.33) $$
$$ P(14 \leq \bar{X} \leq 16) = 0.9082 - 0.0918 = 0.8164 $$
Thus, the probability is approximately 81.64%.
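Steps I through IV can be collapsed into one function. This is a sketch using Python's standard-library `statistics.NormalDist`; the function name `prob_mean_between` is an assumption of this example. Because the code does not round the z-scores to two decimals, its result differs slightly from the table-based 0.8164.

```python
import math
from statistics import NormalDist

def prob_mean_between(low, high, mu, sigma, n):
    """P(low <= sample mean <= high) when the sample mean is N(mu, sigma/sqrt(n))."""
    sem = sigma / math.sqrt(n)                    # Step I: standard error of the mean
    sampling_dist = NormalDist(mu=mu, sigma=sem)  # distribution of the sample mean
    # Steps II-IV: the CDF difference standardizes internally.
    return sampling_dist.cdf(high) - sampling_dist.cdf(low)

p16 = prob_mean_between(14, 16, mu=15, sigma=3, n=16)
print(f"{p16:.4f}")
```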
Estimate the probability that the sample mean height of 64 plants lies between 14 cm and 16 cm.
I. Calculate the Standard Error of the Mean (SEM):
$$ \text{SEM} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{64}} = \frac{3}{8} = 0.375 $$
II. Calculate the Z-scores for 14 cm and 16 cm:
$$ Z_{14} = \frac{14 - 15}{0.375} = \frac{-1}{0.375} \approx -2.67 $$
$$ Z_{16} = \frac{16 - 15}{0.375} = \frac{1}{0.375} \approx 2.67 $$
III. Find the probabilities associated with the Z-scores:
Using standard normal distribution tables (or a calculator): $P(Z \leq 2.67) \approx 0.9962$ and $P(Z \leq -2.67) \approx 0.0038$.
IV. Calculate the probability that the sample mean lies between 14 and 16 cm:
$$ P(14 \leq \bar{X} \leq 16) = P(Z \leq 2.67) - P(Z \leq -2.67) $$
$$ P(14 \leq \bar{X} \leq 16) = 0.9962 - 0.0038 = 0.9924 $$
Thus, the probability is approximately 99.24%.
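A simulation offers an independent check on this figure. The sketch below draws many samples of 64 heights from a normal population with $\mu = 15$ and $\sigma = 3$ and counts how often the sample mean lands between 14 and 16. The function name, trial count, and seed are arbitrary choices for this example, and the result is a Monte Carlo estimate that varies slightly from run to run.

```python
import random
import statistics

def simulate_fraction_in_range(n, low, high, mu=15.0, sigma=3.0,
                               trials=5_000, seed=1):
    """Fraction of simulated sample means that fall inside [low, high]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(trials):
        sample_mean = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if low <= sample_mean <= high:
            inside += 1
    return inside / trials

fraction_64 = simulate_fraction_in_range(n=64, low=14, high=16)
print(f"simulated fraction: {fraction_64:.4f}")
```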
Estimate the probability that the sample mean height of 144 plants lies between 14 cm and 16 cm.
I. Calculate the Standard Error of the Mean (SEM):
$$ \text{SEM} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{144}} = \frac{3}{12} = 0.25 $$
II. Calculate the Z-scores for 14 cm and 16 cm:
$$ Z_{14} = \frac{14 - 15}{0.25} = \frac{-1}{0.25} = -4 $$
$$ Z_{16} = \frac{16 - 15}{0.25} = \frac{1}{0.25} = 4 $$
III. Find the probabilities associated with the Z-scores:
Using standard normal distribution tables (or a calculator): $P(Z \leq 4) \approx 0.99997$ and $P(Z \leq -4) \approx 0.00003$.
IV. Calculate the probability that the sample mean lies between 14 and 16 cm:
$$ P(14 \leq \bar{X} \leq 16) = P(Z \leq 4) - P(Z \leq -4) $$
$$ P(14 \leq \bar{X} \leq 16) = 0.99997 - 0.00003 = 0.99994 $$
Thus, the probability is approximately 99.99%.
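To see how the probability tightens as the sample grows, the three cases can be computed side by side. This is a sketch with unrounded z-scores, so the values for $n = 16$ and $n = 64$ differ from the table-based ones in the last digit or two; the constant and variable names are choices made for this example.

```python
import math
from statistics import NormalDist

MU, SIGMA = 15.0, 3.0    # population mean and standard deviation (cm)
LOW, HIGH = 14.0, 16.0   # range of interest for the sample mean (cm)
standard_normal = NormalDist()

results = {}
for n in (16, 64, 144):
    sem = SIGMA / math.sqrt(n)
    z_low = (LOW - MU) / sem
    z_high = (HIGH - MU) / sem
    prob = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)
    results[n] = prob
    print(f"n={n:>3}  SEM={sem:.3f}  z=±{z_high:.2f}  P={prob:.5f}")
```

The standard error shrinks with $\sqrt{n}$, so the same 1 cm margin around the mean covers more standard errors and captures more of the sampling distribution.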
Estimate the probability for the same range if the population distribution is unknown.
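If the distribution of plant heights is unknown, the Central Limit Theorem still tells us that the sampling distribution of the sample mean is approximately normal, provided the population variance is finite and the sample size is sufficiently large ($n \geq 30$ is a common rule of thumb, and heavily skewed populations may need more). Under that rule, the calculations for $n = 64$ and $n = 144$ carry over as approximations, while the result for $n = 16$ should be treated with more caution.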
CLT Estimation:
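One way to get a feel for how well the approximation holds is to simulate a non-normal population with the same mean and standard deviation. The sketch below uses a hypothetical right-skewed population, 12 cm plus an exponential amount with mean 3 cm, chosen only because it has $\mu = 15$ and $\sigma = 3$. It is not a claim about real plant heights, and all names, trial counts, and seeds are assumptions of this example.

```python
import math
import random
import statistics
from statistics import NormalDist

MU, SIGMA = 15.0, 3.0

def draw_height(rng):
    # Hypothetical skewed population: 12 cm plus an exponential with mean 3 cm,
    # which has mean 15 cm and standard deviation 3 cm.
    return 12.0 + rng.expovariate(1.0 / 3.0)

def simulated_probability(n, trials=4_000, seed=7):
    """Monte Carlo estimate of P(14 <= sample mean <= 16) for the skewed population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = statistics.fmean([draw_height(rng) for _ in range(n)])
        hits += 14.0 <= mean <= 16.0
    return hits / trials

def normal_approximation(n):
    """CLT-based approximation of the same probability."""
    sampling_dist = NormalDist(MU, SIGMA / math.sqrt(n))
    return sampling_dist.cdf(16.0) - sampling_dist.cdf(14.0)

comparison = {n: (simulated_probability(n), normal_approximation(n))
              for n in (16, 64, 144)}
for n, (sim, approx) in comparison.items():
    print(f"n={n:>3}  simulated={sim:.4f}  normal approx={approx:.4f}")
```

For this particular population the two columns should agree closely at the larger sample sizes; a heavier-tailed or more skewed population could need a larger $n$ before the approximation is reliable.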