Last modified: February 05, 2025


Chi-Square Tests and Categorical Data Analysis

The chi-square ($\chi^2$) test is a statistical method used to determine if there is a significant difference between expected and observed frequencies in one or more categories. It helps assess whether any observed deviations could be due to chance.

Types of Chi-Square Tests:

  1. The goodness-of-fit test determines whether an observed frequency distribution aligns with an expected distribution.
  2. The test of homogeneity assesses if different populations share the same distribution of a single categorical variable.
  3. The test of independence evaluates whether two categorical variables are independent within a single population.

Categorical Data

Categorical data involves variables that represent groupings or categories rather than numerical values. These categories are usually qualitative and can be nominal (no inherent order) or ordinal (with a logical order). Each data point falls into one and only one category.

Contingency Tables

A contingency table (also known as a cross-tabulation or crosstab) is a matrix used to display the frequency distribution of variables. It allows us to analyze the relationship between two or more categorical variables.

Example: The Titanic survival data organized into a 2×4 contingency table:

| | First Class | Second Class | Third Class | Crew | Total |
| --- | --- | --- | --- | --- | --- |
| Survived | a | b | c | d | S |
| Died | e | f | g | h | D |
| Total | 325 | 285 | 706 | 885 | 2,201 |

Here, a through h represent the observed counts in each cell, S and D are the row totals for survivors and deaths, and the bottom row gives the column totals.
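As a minimal sketch of how such a table can be tallied from raw records, the snippet below counts (row, column) pairs with Python's standard library. The `records` list is made-up illustration data rather than the real Titanic counts, and the helper name `build_contingency_table` is hypothetical.

```python
from collections import Counter


def build_contingency_table(records, row_labels, col_labels):
    """Count (row, column) pairs and return a nested list of cell counts."""
    counts = Counter(records)
    return [[counts[(r, c)] for c in col_labels] for r in row_labels]


# Hypothetical records: one (outcome, group) pair per individual.
records = [
    ("Survived", "First Class"), ("Died", "Crew"), ("Died", "Third Class"),
    ("Survived", "Crew"), ("Died", "Crew"), ("Survived", "First Class"),
]
row_labels = ["Survived", "Died"]
col_labels = ["First Class", "Second Class", "Third Class", "Crew"]

table = build_contingency_table(records, row_labels, col_labels)
row_totals = [sum(row) for row in table]          # margin on the right
col_totals = [sum(col) for col in zip(*table)]    # margin at the bottom
```

The margins computed at the end are the row and column totals that the expected-count formulas later in the article rely on.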

1. Testing Goodness-of-Fit

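Hypotheses

Null hypothesis ($H_0$): the observed frequencies follow the expected distribution.
Alternative hypothesis ($H_A$): at least one category's proportion differs from its expected value.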

Example: M&M Color Distribution

Suppose we want to test if the color distribution of M&Ms has changed since 2008.

2008 Expected Color Distribution:

| Color | Percentage (%) |
| --- | --- |
| Blue | 24 |
| Orange | 20 |
| Green | 16 |
| Yellow | 14 |
| Red | 13 |
| Brown | 13 |

Observed Counts: From a sample of 410 M&Ms, we record the number of each color.

| Color | Count |
| --- | --- |
| Blue | 105 |
| Orange | 91 |
| Green | 70 |
| Yellow | 50 |
| Red | 45 |
| Brown | 49 |

Calculating Expected Counts

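For each color, calculate the expected count ($E_i$) from the sample size $N$ and the expected proportion $P_i$:

$$E_i = N \times P_i$$

Example for Blue M&Ms:

$$E_{\text{blue}} = 410 \times 0.24 = 98.4$$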
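The same calculation can be repeated for every color. The sketch below uses the 2008 percentages and the sample size of 410 given above; the variable names are illustrative only.

```python
# Expected 2008 proportions and sample size, taken from the tables above.
expected_proportions = {
    "Blue": 0.24, "Orange": 0.20, "Green": 0.16,
    "Yellow": 0.14, "Red": 0.13, "Brown": 0.13,
}
n = 410

# E_i = N * P_i for every color.
expected_counts = {color: n * p for color, p in expected_proportions.items()}

for color, count in expected_counts.items():
    print(f"{color}: {count:.1f}")
```

Because the proportions sum to 1, the expected counts sum to the sample size, which is a quick sanity check on the inputs.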

Computing the Chi-Square Statistic

The chi-square statistic is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

Calculate $\chi^2$ by summing over all colors.
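A short sketch of that sum for the M&M data, assuming the observed counts and 2008 proportions listed earlier; `chi_square_statistic` is a hypothetical helper name.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed counts and 2008 proportions in the order:
# Blue, Orange, Green, Yellow, Red, Brown.
observed = [105, 91, 70, 50, 45, 49]
proportions = [0.24, 0.20, 0.16, 0.14, 0.13, 0.13]
n = sum(observed)  # 410

expected = [n * p for p in proportions]
chi2 = chi_square_statistic(observed, expected)
print(f"chi-square = {chi2:.2f}")
```

For these counts the statistic comes out at about 4.32.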

Degrees of Freedom

$$\text{df} = k - 1$$

$$\text{df} = 6 - 1 = 5$$

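Decision Rule

Reject $H_0$ if the computed $\chi^2$ exceeds the critical value of the chi-square distribution with 5 degrees of freedom or, equivalently, if the p-value is below the chosen significance level $\alpha$. For example, at $\alpha = 0.05$ the critical value for df = 5 is 11.070.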
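The comparison can be sketched in code. The example below assumes a significance level of 0.05 (not stated in the article) and uses the M&M statistic of roughly 4.32 computed from the counts above. `chi2_sf` is a hypothetical standard-library helper for the upper-tail probability; in practice a statistics package would normally supply this function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1)
    with z = x / 2, starting from Q(1/2, z) = erfc(sqrt(z)) for odd df
    or Q(1, z) = exp(-z) for even df.
    """
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


alpha = 0.05             # assumed significance level
chi2_stat, df = 4.32, 5  # rounded M&M statistic; df = k - 1
critical_value = 11.070  # tabulated critical value for df = 5, alpha = 0.05

p_value = chi2_sf(chi2_stat, df)
reject_h0 = chi2_stat > critical_value  # same decision as p_value < alpha
print(f"p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

The statistic is well below the critical value and the p-value is around 0.5, so under these assumptions the null hypothesis is not rejected.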

Interpretation

Visualization

[Figure: plot for the M&M goodness-of-fit example]

Analysis Results:

This suggests that based on the sample of 410 M&Ms, the observed color distribution does not significantly differ from the expected 2008 distribution.

2. Testing Homogeneity

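Hypotheses

Null hypothesis ($H_0$): all populations have the same distribution of the categorical variable.
Alternative hypothesis ($H_A$): at least one population's distribution differs.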

Example: Titanic Survival by Ticket Class

We want to test whether survival rates are the same across ticket classes.

Data Summary:

| | Survived | Died | Total |
| --- | --- | --- | --- |
| First Class | 203 | 122 | 325 |
| Second Class | 118 | 167 | 285 |
| Third Class | 178 | 528 | 706 |
| Crew | 212 | 673 | 885 |
| Total | 711 | 1,490 | 2,201 |

Calculating Expected Counts

Expected count for each cell:

$$E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}}$$

Example for First Class Survivors:

$$E_{11} = \frac{325 \times 711}{2{,}201} \approx 105.0$$
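A minimal sketch of this formula applied to every cell of the Titanic table; `expected_counts` is a hypothetical helper name, and the table is stored as a nested list of observed counts.

```python
def expected_counts(table):
    """Return E_ij = row_total_i * column_total_j / grand_total for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Rows: First, Second, Third, Crew; columns: Survived, Died.
titanic = [[203, 122], [118, 167], [178, 528], [212, 673]]
expected = expected_counts(titanic)
print(f"E11 = {expected[0][0]:.1f}")
```

The expected table keeps the same row and column totals as the observed table, so its entries also sum to 2,201.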

Computing the Chi-Square Statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Calculate $\chi^2$ by summing over all 8 cells.

Degrees of Freedom

$$\text{df} = (r - 1) \times (c - 1)$$

$$\text{df} = (4 - 1) \times (2 - 1) = 3 \times 1 = 3$$
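Putting the statistic and the degrees of freedom together for the Titanic table gives the sketch below. The function name `chi_square_table_test` is hypothetical, and the critical value assumes a significance level of 0.05, which the article does not state.

```python
def chi_square_table_test(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: First, Second, Third, Crew; columns: Survived, Died.
titanic = [[203, 122], [118, 167], [178, 528], [212, 673]]
chi2, df = chi_square_table_test(titanic)
critical_value = 7.815  # tabulated value for df = 3, assumed alpha = 0.05
print(f"chi-square = {chi2:.1f}, df = {df}, reject H0: {chi2 > critical_value}")
```

The statistic is roughly 190 on 3 degrees of freedom, far above the critical value, so the hypothesis of equal survival distributions across classes is rejected under this assumption.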

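Decision Rule

Reject $H_0$ if the computed $\chi^2$ exceeds the critical value for df = 3 or, equivalently, if the p-value is below the chosen significance level $\alpha$. For example, at $\alpha = 0.05$ the critical value is 7.815.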

Interpretation

Visualization

[Figure: observed and expected counts of survival and death for each class]

Analysis Results:

This suggests that there is a significant difference in survival rates among the different ticket classes (First Class, Second Class, Third Class, and Crew) on the Titanic. The plot compares observed and expected counts for survival and death in each class, highlighting the differences between them.

3. Testing Independence

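Hypotheses

Null hypothesis ($H_0$): the two categorical variables are independent.
Alternative hypothesis ($H_A$): the two categorical variables are associated.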

Example: Gender and Voting Preference

Suppose we survey individuals to see if gender is associated with voting preference.

Data Summary:

| | Liberal | Conservative | Total |
| --- | --- | --- | --- |
| Male | 40 | 60 | 100 |
| Female | 70 | 30 | 100 |
| Total | 110 | 90 | 200 |

Calculating Expected Counts

$$E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}}$$

Example for Male Liberals:

$$E_{11} = \frac{100 \times 110}{200} = 55$$

Computing the Chi-Square Statistic

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Calculate $\chi^2$ by summing over all 4 cells.

Degrees of Freedom

$$\text{df} = (2 - 1) \times (2 - 1) = 1 \times 1 = 1$$

Yates' Correction for Continuity (Optional)

For a 2×2 table, apply Yates' correction to adjust for continuity:

$$\chi^2 = \sum \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$$
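To see how much the correction changes the result, the sketch below computes the statistic for the gender and voting table with and without it. The helper `chi_square_2x2` is hypothetical. Clamping the corrected difference at zero is an added safeguard that is not part of the formula above, and it has no effect on this data.

```python
def chi_square_2x2(table, yates=False):
    """Chi-square statistic for a 2x2 table, optionally with Yates' continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(table[i][j] - expected)
            if yates:
                # Assumption: never let the correction push the difference below zero.
                diff = max(diff - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    return chi2


# Rows: Male, Female; columns: Liberal, Conservative.
votes = [[40, 60], [70, 30]]
uncorrected = chi_square_2x2(votes)
corrected = chi_square_2x2(votes, yates=True)
print(f"uncorrected = {uncorrected:.2f}, Yates-corrected = {corrected:.2f}")
```

The uncorrected statistic is about 18.18 and the corrected one about 16.99. The correction always shrinks the statistic, making the test slightly more conservative.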

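Decision Rule

Reject $H_0$ if the computed $\chi^2$ exceeds the critical value for df = 1 or, equivalently, if the p-value is below the chosen significance level $\alpha$. For example, at $\alpha = 0.05$ the critical value is 3.841.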

Interpretation

Visualization

[Figure: observed and expected counts of Liberal and Conservative preferences by gender]

Analysis Results:

This result suggests that gender is significantly associated with voting preference in the observed data. The plot compares observed and expected counts for "Liberal" and "Conservative" preferences across genders.

Comparing Homogeneity and Independence Tests

Although both tests use the chi-square statistic and similar computations, they differ in their applications and interpretations.

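Chi-Square Test of Homogeneity

This test draws a separate sample from each of several populations and asks whether they share the same distribution of a single categorical variable.

Chi-Square Test of Independence

This test draws one sample from a single population, records two categorical variables for each individual, and asks whether those variables are independent.

Key Differences

The population focus differs: homogeneity compares two or more populations, whereas independence concerns a single population.

Regarding the research question: homogeneity asks whether the groups have the same distribution of one variable, whereas independence asks whether two variables are associated.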

Assumptions and Conditions

For chi-square tests to be valid, several conditions must be met:

  1. The data should come from a random sample, so that the observed counts are representative of the population.
  2. The expected frequency in each cell should be at least 5.
  3. Observations must be independent: each individual contributes to exactly one cell.
  4. The variables must be categorical, and the table must contain counts rather than percentages or averages.
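The expected-frequency condition is easy to check before running a test. The sketch below flags cells whose expected count falls below the threshold of 5 from the list above; the helper `cells_below_minimum` and the `sparse` table are hypothetical illustrations.

```python
def cells_below_minimum(table, minimum=5.0):
    """Return (row, column, expected) for each cell whose expected count is below `minimum`."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            expected = r * c / grand_total
            if expected < minimum:
                flagged.append((i, j, expected))
    return flagged


votes = [[40, 60], [70, 30]]  # gender example: expected counts are 55 and 45
sparse = [[3, 7], [2, 38]]    # made-up table with small expected counts

print(cells_below_minimum(votes))
print(cells_below_minimum(sparse))
```

An empty result means the condition holds. When cells are flagged, common remedies include merging sparse categories or using an exact test instead.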

Practical Application Steps

  1. State the hypotheses by defining $H_0$ and $H_A$.
  2. Collect data and organize observed frequencies into a contingency table.
  3. Calculate expected counts using the appropriate formulas based on the test.
  4. Compute the chi-square statistic by applying the chi-square formula.
  5. Determine degrees of freedom based on the dimensions of the table.
  6. Find the critical value or p-value using chi-square distribution tables or statistical software.
  7. Make a decision by comparing $\chi^2_{\text{calculated}}$ with $\chi^2_{\text{critical}}$.
  8. Interpret the results and draw conclusions in the context of the research question.
