Last modified: February 07, 2025


Multiple Comparisons

When conducting multiple hypothesis tests simultaneously, the likelihood of committing at least one Type I error (falsely rejecting a true null hypothesis) increases. This is known as the "multiple comparisons problem" or the "look-elsewhere effect". The methods to address this issue typically involve adjustments to the significance level or the p-values, and each has its advantages and disadvantages.
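To see how quickly the error probability grows, note that for $n$ independent tests each run at level $\alpha$, the probability of at least one false positive is $1 - (1 - \alpha)^n$. A minimal sketch (the test counts chosen here are illustrative):

```python
# Probability of at least one Type I error (false positive) across
# n independent tests, each run at significance level alpha:
#   FWER = 1 - (1 - alpha)**n
alpha = 0.05

for n in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** n
    print(f"n = {n:3d} tests -> P(at least one false positive) = {fwer:.3f}")
```

With 20 tests at $\alpha = 0.05$, the chance of at least one false positive is already about 64%, which motivates the corrections discussed below.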

Data Snooping and the Multiple Testing Fallacy

Multiple Comparisons Problem

Reproducibility and Replicability Crisis

Addressing the Multiple Testing Problem

False Discovery Rate (FDR)

Using a Validation Set to Avoid Data Snooping

Family-wise Error Rate (FWER)

The family-wise error rate (FWER) is the probability of making at least one Type I error among all the tests in a family. Controlling FWER maintains overall confidence in the results when conducting multiple tests.

Bonferroni Correction

The Bonferroni correction is a common method for controlling the FWER. It adjusts the significance level (α) by dividing it by the number of tests performed (n):

$$ \alpha_{\text{adjusted}} = \frac{\alpha}{n} $$

This adjusted significance level is then used to compute the critical values for each test. The Bonferroni correction is inherently conservative, making it more likely to commit Type II errors (failing to reject a false null hypothesis), especially when there are many tests or the tests are not independent.

Example: Bonferroni Correction

Suppose we are conducting 20 independent hypothesis tests, and the significance level for the family-wise error rate is $\alpha = 0.05$. To control for multiple comparisons using the Bonferroni correction, we adjust the significance level by dividing $\alpha$ by the number of tests.

The Bonferroni-adjusted significance level, $\alpha_{\text{adjusted}}$, is calculated as follows:

$$ \alpha_{\text{adjusted}} = \frac{\alpha}{n} = \frac{0.05}{20} = 0.0025 $$

where $\alpha$ is the desired family-wise significance level and $n$ is the number of tests.

Conclusion:

After applying the Bonferroni correction, we would reject the null hypothesis for an individual test only if its p-value is less than $0.0025$. This correction helps control the family-wise error rate, reducing the chance of Type I errors (false positives) across multiple tests. However, it also makes the test more conservative, increasing the likelihood of Type II errors (false negatives).
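The correction is straightforward to apply in code. A minimal sketch (the p-values below are illustrative, not taken from a real study):

```python
# Bonferroni correction: reject H0 for a test only when its p-value
# falls below alpha / n, where n is the number of tests.
alpha = 0.05
p_values = [0.001, 0.004, 0.012, 0.030, 0.050]  # illustrative values

n = len(p_values)
alpha_adjusted = alpha / n  # 0.05 / 5 = 0.01

for i, p in enumerate(p_values, start=1):
    decision = "reject H0" if p < alpha_adjusted else "fail to reject H0"
    print(f"test {i}: p = {p:.3f} -> {decision}")
```

Here only the first two tests survive the correction, even though four of the five raw p-values fall below the unadjusted $\alpha = 0.05$.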

False Discovery Rate (FDR)

In contrast to FWER, the false discovery rate (FDR) controls for the expected proportion of false positives among all rejected null hypotheses. FDR controlling procedures are generally more powerful than FWER controlling methods, making them particularly suitable for exploratory studies where the discovery of new findings is prioritized.

Benjamini-Hochberg Procedure

The Benjamini-Hochberg (BH) procedure is widely used for controlling the FDR. This method involves ordering the p-values from smallest to largest and then comparing each p-value to an adjusted significance level that depends on its rank (i) and the total number of tests (n):

$$ \alpha_{\text{adjusted}} = \frac{\alpha \times i}{n} $$

We reject the null hypothesis for all tests where the p-value is less than or equal to the adjusted significance level.
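The procedure can be sketched in a few lines: sort the p-values, find the largest rank $i$ whose p-value is at or below $\alpha \cdot i / n$, and reject all hypotheses up to that rank (the p-values below are the same six used in the worked example that follows):

```python
# Benjamini-Hochberg procedure: find the largest rank i such that
# p_(i) <= alpha * i / n, then reject H0 for all p-values up to that rank.
alpha = 0.05
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060]

n = len(p_values)
ranked = sorted(p_values)

# Largest rank whose p-value falls under its BH threshold.
cutoff = 0
for i, p in enumerate(ranked, start=1):
    if p <= alpha * i / n:
        cutoff = i

rejected = ranked[:cutoff]
print(f"reject H0 for p-values: {rejected}")  # [0.001, 0.008]
```

Note the "step-up" character of the procedure: all hypotheses below the largest qualifying rank are rejected, even if some intermediate p-value exceeds its own threshold.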

Example: Bonferroni-Holm Procedure

Suppose we are conducting six hypothesis tests, and the p-values obtained from these tests are:

$$ \{0.001, 0.008, 0.039, 0.041, 0.042, 0.06\} $$

We will apply the Bonferroni-Holm correction to control the family-wise error rate at $\alpha = 0.05$. The procedure requires us to compare each ordered p-value to a sequentially adjusted significance level.

Step-by-Step Procedure:

I. Order the p-values in ascending order:

$$ 0.001, 0.008, 0.039, 0.041, 0.042, 0.06 $$

II. Adjust the significance level for each test. The adjusted significance level for the $i$-th test is calculated as:

$$ \alpha_i = \frac{\alpha}{n - i + 1} $$

where $n$ is the total number of tests (in this case, $n = 6$) and $\alpha = 0.05$.

III. Compare each p-value to its adjusted $\alpha_i$:

For $p_1 = 0.001$:

$$ 0.001 < \frac{0.05}{6} \approx 0.00833 \quad \text{(Reject $H_0$)} $$

For $p_2 = 0.008$:

$$ 0.008 < \frac{0.05}{5} = 0.01 \quad \text{(Reject $H_0$)} $$

For $p_3 = 0.039$:

$$ 0.039 > \frac{0.05}{4} = 0.0125 \quad \text{(Fail to Reject $H_0$)} $$

Since we fail to reject $H_0$ at this step, there is no need to test further hypotheses, as the procedure stops here.


Using the Bonferroni-Holm procedure, we reject the null hypothesis for the first two tests but fail to reject for the remaining tests.
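The step-by-step procedure above can be sketched directly in code: compare the $i$-th smallest p-value to $\alpha / (n - i + 1)$ and stop at the first failure.

```python
# Bonferroni-Holm (step-down) procedure: compare the i-th smallest p-value
# to alpha / (n - i + 1); stop at the first failure, since all subsequent
# tests automatically fail to reject.
alpha = 0.05
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06]

n = len(p_values)
rejections = []
for i, p in enumerate(sorted(p_values), start=1):
    threshold = alpha / (n - i + 1)
    if p < threshold:
        rejections.append(p)
    else:
        break  # step-down: once one test fails, stop testing

print(f"reject H0 for p-values: {rejections}")  # [0.001, 0.008]
```

As in the worked example, only the first two hypotheses are rejected; the procedure stops at $p_3 = 0.039 > 0.0125$.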

This method helps reduce the probability of Type I errors (false positives) when conducting multiple comparisons. However, as with all statistical procedures, there is a trade-off, potentially increasing the risk of Type II errors (false negatives). Researchers must balance these risks depending on the context and goals of their study.
