Last modified: August 05, 2018
A normal distribution (often referred to as the normal curve or Gaussian distribution) is a continuous probability distribution that is symmetric about the mean: most observations cluster around the central peak, and the density tapers off equally towards both tails. Many real-world datasets, such as human heights, IQ scores, and measurement errors, approximately follow this kind of distribution.
The probability density function (PDF) of the normal distribution is given by:
$$ f(x | \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}} e^{ -\frac{(x - \mu)^2}{2\sigma^2} } $$
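where $x$ is the value at which the density is evaluated, $\mu$ is the mean of the distribution, and $\sigma$ is the standard deviation (so $\sigma^2$ is the variance).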
This formula describes the shape of the curve mathematically. The mean $\mu$ determines the center of the distribution, and the standard deviation $\sigma$ determines the width of the bell curve.
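As a minimal sketch, the density formula above can be evaluated directly in Python. The function name `normal_pdf` and the parameters $\mu = 0$, $\sigma = 1$ are illustrative assumptions, not part of the original text; the standard library's `statistics.NormalDist` is used only as a cross-check.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Evaluate the normal density at x using the formula above."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Illustrative parameters (assumed for this sketch)
mu, sigma = 0.0, 1.0

peak = normal_pdf(mu, mu, sigma)          # density at the mean, the highest point
left = normal_pdf(mu - 1.0, mu, sigma)    # one unit below the mean
right = normal_pdf(mu + 1.0, mu, sigma)   # one unit above the mean

# Cross-check against the standard library implementation
library_peak = NormalDist(mu, sigma).pdf(mu)

print(f"peak={peak:.4f}, left={left:.4f}, right={right:.4f}")
```

The values at equal distances on either side of the mean match, which reflects the symmetry of the curve. Increasing `sigma` lowers the peak and widens the curve.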
The Empirical Rule provides a rough estimate for the spread of data in a normal distribution. It applies to any dataset that approximately follows the normal distribution.
Suppose we have a dataset where the heights of fathers are normally distributed with a mean of $\mu = 68.3$ inches and a standard deviation of $\sigma = 1.8$ inches.
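According to the Empirical Rule, about 68% of the fathers' heights fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three.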
These intervals can be visualized as follows:
$$
\begin{array}{lcc}
\text{Interval } (\mu \pm n\sigma) & \text{Range (inches)} & \text{Proportion of data contained} \\
\hline
\mu \pm \sigma & 66.5 \text{ to } 70.1 & \approx 68\% \\
\mu \pm 2\sigma & 64.7 \text{ to } 71.9 & \approx 95\% \\
\mu \pm 3\sigma & 62.9 \text{ to } 73.7 & \approx 99.7\%
\end{array}
$$
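The Empirical Rule percentages are rounded values of exact areas under the normal curve. The sketch below checks them for the fathers' heights ($\mu = 68.3$, $\sigma = 1.8$) using `statistics.NormalDist` from the Python standard library; the helper name `proportion_within` is an assumption made for illustration.

```python
from statistics import NormalDist


def proportion_within(n: int, mu: float, sigma: float) -> float:
    """Proportion of a normal distribution lying within n standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + n * sigma) - dist.cdf(mu - n * sigma)


mu, sigma = 68.3, 1.8  # fathers' heights in inches, as in the example

for n in (1, 2, 3):
    low, high = mu - n * sigma, mu + n * sigma
    share = proportion_within(n, mu, sigma)
    print(f"mu +/- {n} sigma: {low:.1f} to {high:.1f} inches, proportion {share:.4f}")
```

The computed proportions are roughly 0.683, 0.954, and 0.997, which the rule rounds to 68%, 95%, and 99.7%. They do not depend on the particular values of `mu` and `sigma`.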
To compare values from different normal distributions or to work with a standardized form of a dataset, we can convert raw data values to z-scores.
The z-score is a way of describing a value in terms of how many standard deviations it is away from the mean. The formula for the z-score is:
$$ z = \frac{x - \mu}{\sigma} $$
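where $x$ is the raw data value, $\mu$ is the mean of the distribution, and $\sigma$ is its standard deviation.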
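A z-score tells us how far a value lies from the mean and in which direction: a positive z-score means the value is above the mean, a negative z-score means it is below the mean, and a z-score of 0 means it equals the mean.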
Suppose a father is 71.9 inches tall. We want to find his z-score given that the mean height is 68.3 inches and the standard deviation is 1.8 inches.
$$ z = \frac{71.9 - 68.3}{1.8} = \frac{3.6}{1.8} = 2 $$
This means that a height of 71.9 inches is 2 standard deviations above the mean.
Conversely, if a father is 67.4 inches tall:
$$ z = \frac{67.4 - 68.3}{1.8} = \frac{-0.9}{1.8} = -0.5 $$
This means that a height of 67.4 inches is 0.5 standard deviations below the mean.
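The two calculations above can be reproduced with a small helper. This is only a sketch; the function name `z_score` is an assumption chosen for illustration.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


mu, sigma = 68.3, 1.8  # mean and standard deviation of fathers' heights (inches)

z_tall = z_score(71.9, mu, sigma)   # approximately 2: two standard deviations above the mean
z_short = z_score(67.4, mu, sigma)  # approximately -0.5: half a standard deviation below the mean

# Round for display, since floating-point arithmetic may leave tiny residues
print(round(z_tall, 2), round(z_short, 2))
```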
After converting all values in a normal distribution to z-scores, we obtain the standard normal distribution, which has a mean of 0 and a standard deviation of 1.
The standard normal distribution is often used in statistics because it allows for easy computation of probabilities and comparison across different datasets.
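To illustrate standardization, the sketch below converts a small sample to z-scores and checks that the result has a mean of 0 and a standard deviation of 1. The sample values are hypothetical, invented for this example. The z-scores here use the sample's own mean and population standard deviation, so these properties hold for any data, normal or not.

```python
from statistics import fmean, pstdev

# Hypothetical sample of heights in inches (invented for illustration)
sample = [66.1, 67.4, 68.3, 69.0, 71.9]

sample_mean = fmean(sample)
sample_sd = pstdev(sample)  # population standard deviation of the sample

# Standardize: subtract the mean, then divide by the standard deviation
z_scores = [(x - sample_mean) / sample_sd for x in sample]

z_mean = fmean(z_scores)  # approximately 0
z_sd = pstdev(z_scores)   # approximately 1

print(f"mean of z-scores: {z_mean:.6f}, standard deviation: {z_sd:.6f}")
```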
To find the proportion of data within a certain range, we can use z-scores to convert the raw data points and then look up the corresponding probabilities using a z-table or statistical software. The area under the normal curve between two z-scores represents the proportion of data that lies between those values.
To find the proportion of fathers with heights between 67.4 inches and 71.9 inches, we first compute the z-scores for these heights:
For 67.4 inches:
$$ z_{67.4} = \frac{67.4 - 68.3}{1.8} = -0.5 $$
For 71.9 inches:
$$ z_{71.9} = \frac{71.9 - 68.3}{1.8} = 2 $$
Next, using a z-table (or software), we find the area to the left of each z-score: about 0.3085 for $z = -0.5$ and about 0.9772 for $z = 2$.
Thus, the proportion of fathers with heights between 67.4 and 71.9 inches is:
$$ P(67.4 \leq \text{height} \leq 71.9) = 0.9772 - 0.3085 = 0.6687 \approx 66.87\% $$
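In software, the table lookup is replaced by the cumulative distribution function (CDF). The following sketch uses `statistics.NormalDist` from the Python standard library to repeat the calculation in two ways: on the original height scale, and on the standard normal scale via z-scores. Variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 68.3, 1.8  # fathers' heights in inches
heights = NormalDist(mu, sigma)

# Approach 1: work directly on the original scale
p_direct = heights.cdf(71.9) - heights.cdf(67.4)

# Approach 2: convert to z-scores, then use the standard normal distribution
standard_normal = NormalDist()  # mean 0, standard deviation 1
z_low = (67.4 - mu) / sigma
z_high = (71.9 - mu) / sigma
p_standard = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(f"direct: {p_direct:.4f}, via z-scores: {p_standard:.4f}")
```

Both approaches agree with the table-based result of about 0.6687.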
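The percentile of a value in a normal distribution tells us the percentage of the data that is less than or equal to that value. To find the value at a given percentile, we look up the z-score whose area to the left equals that percentage, then convert the z-score back to the original scale using $x = \mu + z\sigma$.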
Suppose we want to compute the 30th percentile of fathers' heights. Using a z-table, we find that the z-score corresponding to the 30th percentile is approximately $z = -0.52$.
To find the corresponding height, we use the z-score formula in reverse:
$$ \text{height} = \mu + z\sigma = 68.3 + (-0.52)(1.8) = 68.3 - 0.936 = 67.364 \text{ inches}. $$
Thus, the 30th percentile corresponds to a height of approximately 67.36 inches.
This process can be applied to any percentile by finding the appropriate z-score and converting it back to the original scale using the mean and standard deviation of the dataset.
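The reverse lookup can also be done in software with the inverse CDF. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` in place of a z-table (the helper name `height_at_percentile` is an assumption for illustration):

```python
from statistics import NormalDist

mu, sigma = 68.3, 1.8  # fathers' heights in inches


def height_at_percentile(p: float) -> float:
    """Height below which a fraction p (0 < p < 1) of fathers fall."""
    z = NormalDist().inv_cdf(p)  # z-score with area p to its left
    return mu + z * sigma        # convert back to the original scale


z_30 = NormalDist().inv_cdf(0.30)
height_30 = height_at_percentile(0.30)

print(f"z = {z_30:.4f}, height = {height_30:.2f} inches")
```

The software value of the z-score is slightly more precise than the two-decimal table value of $-0.52$, so the resulting height may differ from the hand calculation in the third decimal place; both round to about 67.36 inches.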