Last modified: September 21, 2024

Central Limit Theorem (CLT)

The Central Limit Theorem (CLT) is a fundamental result in statistics. It explains why the distribution of sample means approaches a normal distribution (the bell curve) as the sample size grows, regardless of the shape of the population's distribution, provided the population has a finite variance.

Mathematical Background

Formal Description

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables, each with finite mean $\mu$ and finite variance $\sigma^2 > 0$.

As $n$ (the sample size) tends to infinity, the distribution of the standardized sum

$$\frac{X_1 + X_2 + \ldots + X_n - n\mu}{\sigma\sqrt{n}}$$

converges in distribution to a standard normal distribution. Mathematically, this is expressed as:

$$P\left(\frac{X_1 + X_2 + \ldots + X_n - n\mu}{\sigma\sqrt{n}} \leq a\right) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2} \, dx$$
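The convergence above can be checked numerically. The following is a minimal sketch, assuming an Exponential(1) population (so $\mu = 1$ and $\sigma = 1$); the function names, the evaluation point $a = 0$, the number of trials, and the seed are illustrative choices, not part of the theorem.

```python
import math
import random
from statistics import NormalDist

def standardized_sum(n, rng, mu=1.0, sigma=1.0):
    """Draw n Exponential(1) values and return (sum - n*mu) / (sigma*sqrt(n))."""
    total = sum(rng.expovariate(1.0) for _ in range(n))
    return (total - n * mu) / (sigma * math.sqrt(n))

def empirical_cdf_at(a, n, trials=10_000, seed=0):
    """Fraction of simulated standardized sums that are <= a."""
    rng = random.Random(seed)
    hits = sum(standardized_sum(n, rng) <= a for _ in range(trials))
    return hits / trials

a = 0.0
target = NormalDist().cdf(a)  # right-hand side of the limit, Phi(a)
for n in (1, 5, 30, 100):
    print(f"n={n:>3}  simulated={empirical_cdf_at(a, n):.4f}  normal={target:.4f}")
```

For a skewed population like this one, the simulated probability should drift toward the normal value as $n$ grows, with a visible gap at small $n$.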

Key Points:

  1. "Large" is often taken to mean a sample size of 30 or more, but this is a rule of thumb: strongly skewed or heavy-tailed populations can need considerably larger samples.
  2. The observations within a sample should be independent of each other.
  3. It is the distribution of the sample mean (and of related statistics such as the sum or a percentage) that becomes approximately normal, not the distribution of the individual data points, which can remain skewed.

Implications and Applications

The Central Limit Theorem (CLT) enables us to approximate probabilities and percentages for large samples using the normal distribution. This has wide-ranging implications, particularly in statistical analysis and data inference.

Limitations

Visualization

A histogram of sample means tends toward a bell shape as the size of each sample grows, reflecting the normal distribution predicted by the CLT. Drawing more samples does not change that shape; it only makes the histogram smoother.

Data Generation

The plot below shows the non-normal exponential distribution, which is right-skewed:

[Plot: the right-skewed exponential distribution]
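Data of this kind can be generated with a short script. The sketch below is an assumption about how such data might be produced, not the code behind the original plot: the population mean of 2.0, the population size, and the seed are arbitrary. It reports summary statistics instead of drawing a histogram so that it needs only the standard library.

```python
import math
import random
import statistics

rng = random.Random(42)   # fixed seed, chosen arbitrarily
scale = 2.0               # assumed mean of the exponential population
population = [rng.expovariate(1 / scale) for _ in range(50_000)]

def skewness(values):
    """Sample skewness: the average cubed z-score."""
    m = statistics.fmean(values)
    s = math.sqrt(statistics.fmean((v - m) ** 2 for v in values))
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

# A right-skewed distribution has mean > median and positive skewness.
print("mean    :", round(statistics.fmean(population), 3))
print("median  :", round(statistics.median(population), 3))
print("skewness:", round(skewness(population), 3))
```

An exponential distribution has a theoretical skewness of 2, so the printed value should be clearly positive.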

Distribution of Sample Means

The second plot demonstrates the distribution of sample means, where the sample means form a bell-shaped (approximately normal) distribution, even though the original population is non-normal. This illustrates the Central Limit Theorem in action.

[Plot: distribution of the sample means]
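The same experiment can be sketched in code. This is an illustrative setup rather than the script used for the plot: the sample size of 40, the 5,000 repetitions, the population mean of 2.0, and the seed are all assumed values.

```python
import math
import random
import statistics

def sample_means(n, num_samples, scale=2.0, seed=7):
    """Means of num_samples independent samples of size n from an exponential population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1 / scale) for _ in range(n))
        for _ in range(num_samples)
    ]

def skewness(values):
    """Sample skewness: the average cubed z-score."""
    m = statistics.fmean(values)
    s = math.sqrt(statistics.fmean((v - m) ** 2 for v in values))
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

n = 40
means = sample_means(n=n, num_samples=5_000)

print("mean of sample means:", round(statistics.fmean(means), 3))   # near the population mean
print("sd of sample means  :", round(statistics.stdev(means), 3))   # near sigma / sqrt(n)
print("sigma / sqrt(n)     :", round(2.0 / math.sqrt(n), 3))
print("skewness            :", round(skewness(means), 3))           # far below the population's 2
```

The sample means should centre on the population mean with a spread close to $\sigma/\sqrt{n}$, and their skewness should be much smaller than that of the individual observations.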

Standardizing Using CLT

When standardizing a sample statistic, we use the following formula to calculate the z-score:

$$ z = \frac{{\text{statistic} - \text{expected value}}}{{\text{Standard Error (SE) of the statistic}}} $$

Step-by-Step Example

Let’s assume we are sampling incomes from a population with mean $\mu = 67,000$ and standard deviation $\sigma = 38,000$.

Step 1: Calculate the Standard Error of the Sample Mean

The standard error of the sample mean ($SE(\bar{x}_n)$) is calculated using the formula:

$$ SE(\bar{x}_n) = \frac{\sigma}{\sqrt{n}} $$

Where $\sigma$ is the population standard deviation and $n$ is the sample size.

For example, if we take a sample size of $n = 100$, we can calculate the standard error as follows:

$$ SE(\bar{x}_{100}) = \frac{38,000}{\sqrt{100}} = \frac{38,000}{10} = 3,800 $$
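The same calculation can be written as a small helper. This is a minimal sketch; the function name `standard_error` and the extra sample sizes 25 and 400 are illustrative additions that show how the standard error shrinks as $n$ grows.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sigma / math.sqrt(n)

sigma = 38_000  # population standard deviation from the example
for n in (25, 100, 400):
    print(n, standard_error(sigma, n))
```

Quadrupling the sample size halves the standard error, because $n$ enters through its square root.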

Step 2: Calculate the z-score

Once we have the standard error, we can calculate the z-score, which measures how far the sample statistic (e.g., sample mean) is from the expected value (population mean), in terms of standard errors. The z-score formula is:

$$ z = \frac{{\bar{x} - \mu}}{{SE(\bar{x}_n)}} $$

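Where $\bar{x}$ is the sample mean, $\mu$ is the population mean, and $SE(\bar{x}_n)$ is the standard error of the sample mean.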

For instance, if the sample mean $\bar{x}$ is 70,000, the z-score would be:

$$ z = \frac{{70,000 - 67,000}}{{3,800}} = \frac{3,000}{3,800} \approx 0.79 $$

This means the sample mean is about 0.79 standard errors above the population mean.
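A z-score is usually a step toward a probability. The sketch below repeats the calculation and then, as an illustrative extension that the example above does not perform, uses the normal approximation to estimate how often a sample mean of at least 70,000 would occur.

```python
from statistics import NormalDist

mu, sigma, n = 67_000, 38_000, 100   # population parameters and sample size from the example
x_bar = 70_000                       # observed sample mean

se = sigma / n ** 0.5                # 3800.0
z = (x_bar - mu) / se                # about 0.79

# Under the CLT's normal approximation, P(sample mean >= x_bar) = 1 - Phi(z).
p_at_least = 1 - NormalDist().cdf(z)

print("z =", round(z, 2))
print("P(sample mean >= 70,000) ~", round(p_at_least, 3))
```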

Example: Applying CLT

Consider a scenario where the heights of a certain plant species are normally distributed with a population mean $\mu = 15$ cm and a population standard deviation $\sigma = 3$ cm. Because the population itself is normal, the sample mean is exactly normally distributed for any sample size; the same calculation serves as the CLT approximation when the population is not normal (see Step 4).

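Definitions: $\mu$ is the population mean, $\sigma$ is the population standard deviation, $n$ is the sample size, and $\bar{X}$ is the sample mean.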

We will calculate the probability that the sample mean is between 14 cm and 16 cm for various sample sizes.

Step 1: Sample Size of 16 Plants

Estimate the probability that the sample mean height of 16 plants lies between 14 cm and 16 cm.

I. Calculate the Standard Error of the Mean (SEM):

$$ \text{SEM} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{16}} = \frac{3}{4} = 0.75 $$

II. Calculate the Z-scores for 14 cm and 16 cm:

$$ Z_{14} = \frac{14 - 15}{0.75} = \frac{-1}{0.75} \approx -1.33 $$

$$ Z_{16} = \frac{16 - 15}{0.75} = \frac{1}{0.75} \approx 1.33 $$

III. Find the probabilities associated with the Z-scores:

Using standard normal distribution tables (or a calculator), $P(Z \leq 1.33) \approx 0.9082$ and $P(Z \leq -1.33) \approx 0.0918$.

IV. Calculate the probability that the sample mean lies between 14 and 16 cm:

$$ P(14 \leq \bar{X} \leq 16) = P(Z \leq 1.33) - P(Z \leq -1.33) $$

$$ P(14 \leq \bar{X} \leq 16) = 0.9082 - 0.0918 = 0.8164 $$

Thus, the probability is approximately 81.64%.
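The table lookup can be cross-checked in code. This sketch uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative. It computes the probability twice: once with the z-scores rounded to two decimals, as a printed table requires, and once without rounding.

```python
from statistics import NormalDist

mu, sigma, n = 15.0, 3.0, 16
sem = sigma / n ** 0.5                         # 0.75, as in step I
sampling_dist = NormalDist(mu=mu, sigma=sem)   # distribution of the sample mean

# Without rounding: evaluate the sampling distribution's CDF directly.
p_unrounded = sampling_dist.cdf(16) - sampling_dist.cdf(14)

# With z-scores rounded to two decimals, as when reading a z-table.
z_hi = round((16 - mu) / sem, 2)               # 1.33
z_lo = round((14 - mu) / sem, 2)               # -1.33
p_table = NormalDist().cdf(z_hi) - NormalDist().cdf(z_lo)

print("table-style :", round(p_table, 4))
print("unrounded   :", round(p_unrounded, 4))
```

The two results differ slightly in the third decimal place; the gap comes only from rounding the z-scores, not from the method.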

Step 2: Sample Size of 64 Plants

Estimate the probability that the sample mean height of 64 plants lies between 14 cm and 16 cm.

I. Calculate the Standard Error of the Mean (SEM):

$$ \text{SEM} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{64}} = \frac{3}{8} = 0.375 $$

II. Calculate the Z-scores for 14 cm and 16 cm:

$$ Z_{14} = \frac{14 - 15}{0.375} = \frac{-1}{0.375} \approx -2.67 $$

$$ Z_{16} = \frac{16 - 15}{0.375} = \frac{1}{0.375} \approx 2.67 $$

III. Find the probabilities associated with the Z-scores:

Using standard normal distribution tables (or a calculator), $P(Z \leq 2.67) \approx 0.9962$ and $P(Z \leq -2.67) \approx 0.0038$.

IV. Calculate the probability that the sample mean lies between 14 and 16 cm:

$$ P(14 \leq \bar{X} \leq 16) = P(Z \leq 2.67) - P(Z \leq -2.67) $$

$$ P(14 \leq \bar{X} \leq 16) = 0.9962 - 0.0038 = 0.9924 $$

Thus, the probability is approximately 99.24%.

Step 3: Sample Size of 144 Plants

Estimate the probability that the sample mean height of 144 plants lies between 14 cm and 16 cm.

I. Calculate the Standard Error of the Mean (SEM):

$$ \text{SEM} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{144}} = \frac{3}{12} = 0.25 $$

II. Calculate the Z-scores for 14 cm and 16 cm:

$$ Z_{14} = \frac{14 - 15}{0.25} = \frac{-1}{0.25} = -4 $$

$$ Z_{16} = \frac{16 - 15}{0.25} = \frac{1}{0.25} = 4 $$

III. Find the probabilities associated with the Z-scores:

Using standard normal distribution tables (or a calculator), $P(Z \leq 4) \approx 0.99997$ and $P(Z \leq -4) \approx 0.00003$.

IV. Calculate the probability that the sample mean lies between 14 and 16 cm:

$$ P(14 \leq \bar{X} \leq 16) = P(Z \leq 4) - P(Z \leq -4) $$

$$ P(14 \leq \bar{X} \leq 16) = 0.99997 - 0.00003 = 0.99994 $$

Thus, the probability is approximately 99.99%.
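The three steps follow the same recipe, so they can be collected into one function and compared side by side. This is a minimal sketch; the helper name `prob_mean_between` is hypothetical, and the z-scores are left unrounded, so the output may differ from the table-based figures in the last digits.

```python
import math
from statistics import NormalDist

def prob_mean_between(low, high, mu, sigma, n):
    """P(low <= sample mean <= high) when the sample mean is N(mu, sigma^2 / n)."""
    sem = sigma / math.sqrt(n)
    std_normal = NormalDist()
    return std_normal.cdf((high - mu) / sem) - std_normal.cdf((low - mu) / sem)

for n in (16, 64, 144):
    sem = 3 / math.sqrt(n)
    p = prob_mean_between(14, 16, mu=15, sigma=3, n=n)
    print(f"n={n:>3}  SEM={sem:.3f}  P(14 <= mean <= 16) = {p:.5f}")
```

As $n$ grows the SEM shrinks, the interval from 14 cm to 16 cm spans more standard errors, and the probability approaches 1.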

Step 4: Unknown Population Distribution

Estimate the probability for the same range if the population distribution is unknown.

If the distribution of plant heights is unknown, the Central Limit Theorem still applies, provided the heights have a finite variance: the sampling distribution of the sample mean is approximately normal once the sample size is sufficiently large (a common rule of thumb is $n \geq 30$).

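CLT Estimation: for $n = 64$ and $n = 144$, the normal approximation gives the same values as in Steps 2 and 3 (about 99.24% and 99.99%), now as approximations rather than exact results. For $n = 16$, which is below the usual threshold of 30, the 81.64% figure from Step 1 is less reliable, and its accuracy depends on how far the population is from normal.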
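How well the approximation holds can be explored by simulation. The sketch below assumes a hypothetical non-normal population, a shifted exponential $12 + \text{Exp}(\text{mean } 3)$, chosen only because it is strongly right-skewed while having the same mean (15 cm) and standard deviation (3 cm) as the example; the number of trials and the seed are arbitrary.

```python
import math
import random
import statistics
from statistics import NormalDist

def simulated_probability(n, trials=4_000, seed=1):
    """Fraction of simulated sample means that land in [14, 16]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Hypothetical skewed population: 12 + Exp(mean 3) has mean 15 and sd 3.
        mean_height = statistics.fmean(12 + rng.expovariate(1 / 3) for _ in range(n))
        hits += 14 <= mean_height <= 16
    return hits / trials

def clt_probability(n, mu=15.0, sigma=3.0):
    """Normal-approximation value of P(14 <= sample mean <= 16)."""
    dist = NormalDist(mu=mu, sigma=sigma / math.sqrt(n))
    return dist.cdf(16) - dist.cdf(14)

for n in (16, 64):
    print(f"n={n:>2}  simulated={simulated_probability(n):.4f}  CLT={clt_probability(n):.4f}")
```

Other populations would give different agreement at small $n$; the comparison is an illustration for this one assumed distribution, not a general guarantee.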
