Last modified: March 23, 2026
Hypothesis testing is a tool in statistics that drives much of scientific research. It lets us draw conclusions about entire populations based on the information we collect from samples. You'll find it applied in many areas—from evaluating how well a new drug works in clinical trials to unraveling the mysteries of customer behavior in business analytics.
A hypothesis is a testable statement about a population (for example, about its mean) that may be either true or false; sample data is used to judge how plausible it is.
Inputs: sample data, a null hypothesis $H_0$, an alternative hypothesis $H_1$, and a significance level $\alpha$ (commonly 0.05).
Output: a p-value, and a decision to either reject or fail to reject $H_0$.
The p-value is the probability of observing data as extreme as, or more extreme than, the sample data, assuming the null hypothesis is true. A small p-value (typically ≤ $\alpha$) provides strong evidence against the null hypothesis.
Hypothesis testing is a structured process involving several steps:

1. State the null hypothesis $H_0$ and the alternative hypothesis $H_1$.
2. Choose a significance level $\alpha$.
3. Collect sample data and compute a test statistic.
4. Compute the p-value of the observed statistic under $H_0$.
5. Reject $H_0$ if the p-value is at most $\alpha$; otherwise fail to reject it.
Imagine two bags: Bag A with a mix of 5 white and 5 black marbles, and Bag B with only black marbles.
```
  Bag A        Bag B
  _____        _____
 / • O \      / • • \
 | O • |      | • • |    O = White Marble
 | • O |      | • • |    • = Black Marble
 | O • |      | • • |
 \_____/      \_____/
```
Suspecting you have Bag B, you decide to test this hypothesis:

- $H_0$: you are holding Bag A (half the marbles are white).
- $H_1$: you are holding Bag B (all marbles are black).
Drawing $n$ marbles, with replacement, and finding them all black leads to calculating a p-value for these hypotheses. Under $H_0$ (Bag A), the chance that any single draw is black is 0.5, so the probability of drawing $n$ black marbles in a row is $(0.5)^n$.
This probability is the p-value: the smaller it is, the stronger the evidence against $H_0$. As $n$ increases, an all-black sequence becomes increasingly improbable under Bag A, so the evidence that you actually hold Bag B (only black marbles) strengthens.
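This calculation can be sketched in a few lines of Python (the function name is our own, purely illustrative):

```python
def marble_p_value(n: int) -> float:
    """P(all n draws are black | Bag A), i.e. the p-value under H0.

    Assumes each draw is made with replacement, so draws are independent
    and each is black with probability 0.5 under Bag A.
    """
    return 0.5 ** n

# The p-value halves with every additional black marble drawn.
for n in range(1, 8):
    print(f"n = {n}: p-value = {marble_p_value(n):.4f}")
```

After five all-black draws the p-value is already about 0.031, below the conventional 0.05 threshold.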
When testing the population mean, hypothesis testing considers three possibilities, each with distinct null and alternative hypotheses:
I. Left-Tailed Test
II. Right-Tailed Test
III. Two-Tailed Test
The null hypothesis always assumes that the population mean $\mu$ equals a predetermined value $\mu_0$. The alternative hypothesis presents a contrary statement: the population mean $\mu$ is less than, greater than, or not equal to $\mu_0$.
Important Note: Left-tailed and right-tailed tests are typically used when the effect is expected to occur in only one direction or when only one-directional effects are relevant. In most research scenarios, a two-tailed test is preferred unless there's strong justification for a one-tailed test.
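The three tail types translate directly into how the p-value is computed from a z statistic. A minimal sketch using the standard library's `statistics.NormalDist` (the helper name `p_value` is ours, not from any library):

```python
from statistics import NormalDist

def p_value(z: float, tail: str) -> float:
    """p-value for a z statistic under a left-, right-, or two-tailed test."""
    cdf = NormalDist().cdf  # standard normal CDF
    if tail == "left":
        return cdf(z)                  # P(Z <= z)
    if tail == "right":
        return 1 - cdf(z)              # P(Z >= z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))   # P(|Z| >= |z|)
    raise ValueError("tail must be 'left', 'right', or 'two'")

# A z statistic of 1.96 is right at the two-tailed 5% boundary.
print(f"{p_value(1.96, 'two'):.4f}")
```

Note that the same z statistic yields half the p-value under a one-tailed test, which is why one-tailed tests require justification in advance.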
I. Testing the Effectiveness of a New Diet (Two-Tailed Test)
II. Evaluating Customer Service Efficiency (Left-Tailed Test)
III. Assessing the Impact of a New Teaching Method (Right-Tailed Test)
Once the data are collected and the sample statistic computed, the researcher computes the p-value.
The p-value is the probability of obtaining a result at least as extreme as the one observed, under the assumption that the null hypothesis is true.
By "at least as extreme," we mean a result at least as far from the value expected under $H_0$ as the one observed, in the direction (or directions) specified by the alternative hypothesis.
Selecting a suitable statistical test is critical in hypothesis testing, and several factors determine the appropriate choice:
The following table summarizes some common statistical tests and their applications:
| Test | Data Type | Number of Groups | Assumptions |
|------|-----------|------------------|-------------|
| T-Test | Interval/Ratio | Two | Normally distributed, independent samples |
| Paired T-Test | Interval/Ratio | Two | Normally distributed, dependent samples |
| One-way ANOVA | Interval/Ratio | More than Two | Normally distributed, independent samples |
| Two-way ANOVA | Interval/Ratio | More than Two | Normally distributed, independent samples |
| Chi-Square Test | Categorical | Two or more | Independent observations, sufficiently large expected counts |
| Pearson Correlation | Interval/Ratio | Two | Normally distributed, linear relationship |
| Spearman Correlation | Ordinal | Two | Non-parametric, monotonic relationship |
| Mann-Whitney U Test | Ordinal/Continuous | Two | Non-parametric, independent samples |
| Kruskal-Wallis H Test | Ordinal/Continuous | More than Two | Non-parametric, independent samples |
| Wilcoxon Signed-Rank Test | Ordinal/Continuous | Two | Non-parametric, dependent samples |
| Friedman Test | Ordinal/Continuous | More than Two | Non-parametric, dependent samples |
An agronomist suggests that a new fertilizer increases the average yield of a particular crop to more than 2 tons per hectare. To test this claim, a study is conducted where the new fertilizer is applied to randomly selected plots. The yield of 25 plots is measured, resulting in a mean yield of 2.1 tons per hectare and a standard deviation of 0.3 tons per hectare. Is the new fertilizer effective at increasing the average yield at a significance level of $\alpha = 0.05$?
Hypothesis Setup:

- $H_0$: $\mu = 2$ (the fertilizer does not change the average yield)
- $H_1$: $\mu > 2$ (the fertilizer increases the average yield; a right-tailed test)
Test Statistic:
For the test statistic we use a one-sample z-test, treating the sample standard deviation as a stand-in for the population value and assuming yields are approximately normal. (Strictly, with $n = 25$ and an unknown population standard deviation, a one-sample t-test with 24 degrees of freedom would be more appropriate; the z-test is used here for simplicity.)
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
where:

- $\bar{x} = 2.1$ is the sample mean,
- $\mu_0 = 2$ is the hypothesized population mean,
- $\sigma \approx s = 0.3$ is the (sample) standard deviation,
- $n = 25$ is the sample size.
Plugging in the values:
$$z = \frac{2.1 - 2}{0.3/\sqrt{25}}$$
$$z = \frac{0.1}{0.06}$$
$$z \approx 1.667$$
We look up the critical z-value for a right-tailed test at $\alpha = 0.05$, which is approximately 1.645. Since our calculated z-value of 1.667 exceeds 1.645, we reject the null hypothesis.
There is sufficient evidence at the $\alpha = 0.05$ significance level to support the claim that the new fertilizer increases the average yield of the crop to more than 2 tons per hectare. Note, however, that the result is borderline: a one-sample t-test with 24 degrees of freedom has a critical value of about 1.711, which 1.667 does not exceed, so the stricter test would fail to reject $H_0$.
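The calculation can be reproduced in a short sketch, using `statistics.NormalDist` from the Python standard library for the normal CDF and its inverse:

```python
from math import sqrt
from statistics import NormalDist

# Fertilizer example: H0: mu = 2 vs H1: mu > 2 (right-tailed z-test,
# treating the sample standard deviation as the population value).
x_bar, mu0, s, n, alpha = 2.1, 2.0, 0.3, 25, 0.05

z = (x_bar - mu0) / (s / sqrt(n))         # = 0.1 / 0.06 ≈ 1.667
z_crit = NormalDist().inv_cdf(1 - alpha)  # ≈ 1.645
p = 1 - NormalDist().cdf(z)               # ≈ 0.048

print(f"z = {z:.3f}, critical = {z_crit:.3f}, p = {p:.3f}")
print("Reject H0" if z > z_crit else "Fail to reject H0")
```

The p-value of about 0.048 sits just under 0.05, which makes the borderline nature of the result easy to see.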
A statistically significant result does not necessarily imply a practically meaningful one. Effect size quantifies the magnitude of the difference or relationship, independent of sample size, and helps researchers assess whether an observed effect is large enough to matter in practice.
Cohen's d is the most widely used effect size measure for comparing two means. It expresses the difference in units of the pooled standard deviation:
$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} $$
where $s_p$ is the pooled standard deviation:
$$ s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} $$
Conventional benchmarks for interpreting Cohen's d are:
| $\lvert d \rvert$ | Interpretation |
|-------------------|----------------|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |
Because the p-value depends on both the effect size and the sample size, a very large sample can produce a statistically significant p-value even when the effect is trivially small. Reporting the effect size alongside the p-value provides a more complete picture of the findings.
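A minimal sketch of the two formulas above, with hypothetical group statistics chosen purely for illustration:

```python
from math import sqrt

def cohens_d(m1: float, s1: float, n1: int,
             m2: float, s2: float, n2: int) -> float:
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

# Hypothetical numbers: two groups of 25 with equal spread.
# Pooled sd = 0.3, so d = 0.1 / 0.3 ≈ 0.33: a small-to-medium effect.
d = cohens_d(m1=2.1, s1=0.3, n1=25, m2=2.0, s2=0.3, n2=25)
print(f"d = {d:.2f}")
```

With equal group standard deviations the pooled value collapses to that common standard deviation, which makes the example easy to check by hand.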