8_1_categorical_data_ninell

Uploaded by

Sophia Lindholm
Chi-Square, Likelihood,

Effect sizes, and


Categorical Data Analysis

Lecture 14
Empirical Methods 2 & Theory of Science
29.10.2024

Last time
• Recap Outliers
• Problems with Null Hypothesis
Significance Testing (NHST)
• Effect sizes (Cohen’s d)
• Correlation (Pearson’s r)

Today :)
• “Families” of statistical relationships and
their associated tests and metrics
• Statistical Analysis of Categorical Variables
• Chi-Square Test
• Categorical effect sizes: Odds Ratio

Recap: Why Are There Different Statistical Tests?


Key Point: Different scenarios require different tests because of assumptions
about the data and what we are trying to compare.
• Assumptions: Tests make different assumptions about data (e.g.,
normality, equal variances, paired or independent samples).
• Data Types: Different tests are needed for continuous vs. categorical data,
or parametric vs. non-parametric situations.
• Number or Size of Groups: Tests vary depending on whether you're
comparing two groups (t-tests) or multiple groups (ANOVA).

Families of statistical relationships

Data Type (Outcome)    Categorical Predictor              Continuous Predictor
                       (two, three groups)                (many “groups”)
Categorical Outcome    Chi-square test, Odds Ratio,       NHST, Logistic regression,
                       Fisher's exact                     discriminant analysis
Continuous Outcome     Cohen’s d, t-test, ANOVA           Regression, Correlation
                                                          (Pearson/Spearman)

Predictor (Independent Variable):
The variable you manipulate or consider as the cause or influencer. For example,
in an experiment studying the effect of study hours on test scores, "study hours"
would be the predictor.

Outcome (Dependent Variable):
The variable that changes in response to the predictor. In the same example, "test
scores" would be the outcome, as they depend on the amount of study time.

• Means of groups (value), t-test (significance of means difference), Cohen’s d (effect size)
• r (degree of relatedness), linear regression (best line), r² (effect of best line)
• Proportion of groups, chi-square (significance of groups), odds ratio (effect of grouping)

What is Categorical Data? Discuss!



What is Categorical Data?


• Categorical Designations Are All Around Us
> Many social labels, e.g. rich/poor, lucky/unlucky,
sympathetic/hostile are categorical in nature.
> Categorical analyses help us explore relationships
between these types of labels.
> The tests are different, but the logic behind them is not new: it is
an adaptation of the tests you have already met :)
• Characteristics of Categorical Variables
> Categorical variables have a limited set of values, unlike
continuous variables that can vary widely.
> Because categorical data doesn’t vary continuously, it
lacks a clear measure of variance.

Chi-Square Test
• Chi-Square Test: Origins and Applications
> Developed by Karl Pearson around 1900, the chi-square test examines
whether observed frequencies differ from expected values.
> This is the same Pearson known for Pearson’s r.
> He is also a controversial figure, known for promoting eugenics & scientific racism.
• Primary Uses of the Chi-Square Test
> Goodness of Fit: To determine if observed frequencies match an
expected distribution, e.g. determine if the color distribution of M&Ms in a
bag matches the company’s claimed proportions.
> Test of Independence: To assess whether two categorical variables are
independent or associated, e.g. examine if there is a relationship between
sex (male, female) and preference for a type of drink (tea, coffee)

Calculation
Formula: $\chi^2 = \sum \frac{(O - E)^2}{E}$
where
O = Observed value.
E = Expected value.
How do we calculate E?

1. Expected vs. Observed Frequencies


A meme creator wants to check if the distribution of the types of memes shared on
social media matches their expected preferences based on previous data.
We usually calculate the expected frequencies ourselves!
> E.g. for 200 memes, we do: 200 x expected preference = expected frequency
> E.g. Funny: 200 x 0.5 = 100, Relatable: 200 x 0.3 = 60, and so on.

Meme Type       Expected Pref.   Observed Freq.   Expected Freq.
Funny           50%              120              100
Relatable       30%              50               60
Political       10%              20               20
Inspirational   10%              10               20

1. Expected vs. Observed Frequencies


Now, we need to do the actual chi-square calculation:

Meme Type       Expected Pref.   Observed Freq.   Expected Freq.   (O-E)   (O-E)²   (O-E)² / E
Funny           50%              120              100              20      400      4
Relatable       30%              50               60               -10     100      1.67
Political       10%              20               20               0       0        0
Inspirational   10%              10               20               -10     100      5

1. Add the values (because of ∑, the sum): 𝜒² = 4 + 1.67 + 0 + 5 = 10.67
2. Determine degrees of freedom (df) = number of outcomes - 1 = 4 - 1 = 3
3. Look up the critical value for 3 df at our significance level (e.g. 0.05) in a Chi-Square
Table (google!). If your calculated 𝜒² exceeds this critical value, you reject the null hypothesis,
suggesting that the observed distribution of meme types does not match the expected
distribution. If it is lower, you fail to reject the null hypothesis :) (The critical value is 7.81,
and 10.67 > 7.81, so we reject.)
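The three steps for the meme example can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `chi_square_gof` is hypothetical, and the critical value is hard-coded from a chi-square table rather than computed. Working directly from the Observed and Expected columns, the per-category terms are 4, 100/60 ≈ 1.67, 0 and 100/20 = 5.

```python
# Goodness-of-fit chi-square for the meme example (counts taken from the table).
observed = {"Funny": 120, "Relatable": 50, "Political": 20, "Inspirational": 10}
expected_pref = {"Funny": 0.5, "Relatable": 0.3, "Political": 0.1, "Inspirational": 0.1}


def chi_square_gof(observed, expected_pref):
    """Return (chi-square statistic, degrees of freedom, per-category terms).

    Hypothetical helper for illustration only.
    """
    total = sum(observed.values())
    terms = {}
    for category, o in observed.items():
        e = total * expected_pref[category]  # expected frequency
        terms[category] = (o - e) ** 2 / e   # (O - E)^2 / E
    return sum(terms.values()), len(observed) - 1, terms


chi2, df, terms = chi_square_gof(observed, expected_pref)

# Assumption: value looked up in a chi-square table for df = 3, alpha = 0.05.
CRITICAL_VALUE = 7.815

print(f"chi2 = {chi2:.2f}, df = {df}")
print("reject H0" if chi2 > CRITICAL_VALUE else "fail to reject H0")
```

Keeping the per-category terms makes it easy to see which categories drive the result: here "Funny" and "Inspirational" contribute almost all of the statistic.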

The Chi-Square Distribution


• The 𝜒² distribution only takes non-negative values because it is a
sum of squared differences.
• The degrees of freedom (df) determine the “center” of the
distribution: its mean equals df.
• The shape of the 𝜒² curve depends on the degrees
of freedom, similar to other distributions.
• For large degrees of freedom, the 𝜒² distribution
starts to look more like the normal distribution.
• This is because the statistic is a sum of many independent squared terms,
and such sums tend toward a normal shape (central limit theorem).
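The points about the "center" and the normal shape can be written out as a short derivation. This is a sketch that goes slightly beyond the slides: it uses the standard definition of the 𝜒² distribution as a sum of squared standard normal variables, and the symbols Z_i, k and X are introduced here for illustration.

```latex
% Assumption: Z_1, ..., Z_k are independent standard normal variables, k = df.
\[
  X = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k ,
  \qquad \mathbb{E}[X] = k ,
  \qquad \operatorname{Var}(X) = 2k
\]
% For large k the standardized sum is approximately normal (central limit theorem).
\[
  \frac{X - k}{\sqrt{2k}} \;\approx\; \mathcal{N}(0, 1)
\]
```

So the distribution is centered on df, spreads out as df grows, and becomes more symmetric for large df.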
Requirements for Using Chi-Square
• Observations must be independent (e.g., Student A shouldn't
know or influence Student B’s survey responses).
• No expected cell frequency should be zero in the contingency table.
• At least 80% of expected cell frequencies should be greater than five
(some recommend ten as a minimum for accuracy).
• The total number of observations should ideally exceed 50
(at minimum, more than 20) to ensure reliable results.
Why? When you look at the distribution (previous slide), you see
that at low df, small changes in 𝜒² already lead to large changes in p.
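These rules of thumb are easy to check before running a test. The helper below is a hypothetical sketch: the function name and its default thresholds simply mirror the bullets above, and applying the rules to expected (rather than observed) counts is an assumption based on the usual textbook formulation. It does not check independence of observations, which depends on the study design.

```python
def check_chi_square_requirements(expected, min_count=5, min_share=0.8, min_total=20):
    """Check rule-of-thumb requirements on a table of expected counts.

    `expected` is a list of rows (use a single row for goodness of fit).
    Hypothetical helper; thresholds follow the bullets in the slide.
    """
    cells = [e for row in expected for e in row]
    share_large = sum(e > min_count for e in cells) / len(cells)
    return {
        "no_zero_cells": all(e > 0 for e in cells),
        "enough_large_cells": share_large >= min_share,
        # Expected counts add up to the total number of observations.
        "enough_observations": sum(cells) > min_total,
    }


# Expected frequencies from the meme example (goodness of fit, one row).
result = check_chi_square_requirements([[100, 60, 20, 20]])
print(result)
```

If a check fails, Fisher's exact test (listed in the families table) is the usual fallback for small 2x2 tables.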

2. Test for Independence


Are two categorical variables related?
Examples:
> Are people who enjoy outdoor activities also likely to prefer eco-friendly products?
> Do people who like cheese also tend to like tomatoes?
> Are people with certain political beliefs also more likely to hold specific social views?
Procedure:
> Make contingency table to see observed frequencies for each combination of categories.
> Calculate expected values for each cell as if the variables were independent.
> Compare observed values to expected values to see if the differences are statistically significant.

2. Test for Independence — Calculation of Expected Value


A survey was conducted among two age groups (teens & adults) to see if there
is a relationship between meme preference (Image / Video) and age group.

Observed:
          Prefer Image   Prefer Video   Total
Teens     30             20             50
Adults    10             40             50
Total     40             60             100

$E_{i,j} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Overall Total}}$

Expected teens to prefer image:
= 50 (teens) x 40 (image) / 100 (overall) = 20

Expected:
          Prefer Image   Prefer Video
Teens     20             30
Adults    20             30

And then you can do the same as before (calculate 𝜒², get df, look up in a table) again :)
For a contingency table, df = (number of rows - 1) x (number of columns - 1), so df = 1 here.
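The procedure for the teens/adults survey can be sketched as follows. The function name `chi_square_independence` is hypothetical, and the closed-form p-value via `math.erfc` is an addition that is only valid for df = 1; for other df you would still use a table or a statistics library.

```python
import math


def chi_square_independence(table):
    """Chi-square test of independence for a table given as a list of rows.

    Returns (chi2, df, expected), where
    expected[i][j] = row_total[i] * col_total[j] / overall_total.
    Hypothetical helper for illustration only.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Rows: teens, adults; columns: prefer image, prefer video.
observed = [[30, 20], [10, 40]]
chi2, df, expected = chi_square_independence(observed)

print(expected)  # expected counts if age group and preference were independent
print(f"chi2 = {chi2:.2f}, df = {df}")

# For df = 1 only: P(chi2_1 > x) = erfc(sqrt(x / 2)).
if df == 1:
    p_value = math.erfc(math.sqrt(chi2 / 2))
    print(f"p = {p_value:.5f}")
```

The same function works for larger tables (e.g. three age groups), where df grows accordingly.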

Exercise: Food at KUA :)


We want to know if people who like the products at Wicked Rabbit (veggie)
also like those at Folkekøkken (omni).
We sample 270 random customers and want to know if there is a relationship
at the alpha = .05 level.

         Veg+   Veg-   Total
Omni+    122    32     154
Omni-    73     43     116
Total    195    75     270

Calculate 𝜒²!



Exercise: Food at KUA :)


1. Calculate expected values (in parentheses in the table).
2. Calculate chi-square.
3. Calculate the degrees of freedom.
4. Determine critical value in a table.

         Veg+           Veg-          Total
Omni+    122 (111.22)   32 (42.78)    154
Omni-    73 (83.78)     43 (32.22)    116
Total    195            75            270

Solution:
The chi-square statistic is 8.7514. With df = (2 - 1) x (2 - 1) = 1, the critical value is 3.841.
8.7514 > 3.841 so we reject the null hypothesis.
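For reference, the chi-square sum for the KUA table can be written out term by term. This is a worked sketch using the expected values shown in parentheses; intermediate values are rounded, so the last digit may differ slightly from a full-precision calculation. For a 2x2 table, df = (2 - 1) x (2 - 1) = 1, and the critical value at alpha = .05 is 3.841.

```latex
% In a 2x2 table every cell has the same |O - E|; here 122 - 111.22 = 10.78.
\[
  \chi^2
  = \frac{(122 - 111.22)^2}{111.22}
  + \frac{(32 - 42.78)^2}{42.78}
  + \frac{(73 - 83.78)^2}{83.78}
  + \frac{(43 - 32.22)^2}{32.22}
\]
\[
  \chi^2 \approx 1.04 + 2.72 + 1.39 + 3.60 \approx 8.75
\]
```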
Effect Sizes of Categorical Variables — Odds Ratio
Definition: The odds ratio (OR) is a measure of the strength of
association between two categorical variables, commonly in the context of a
2x2 contingency table. It’s often used in studies looking at the association
between an exposure and an outcome, such as in medical or social science
research.

         Veg+   Veg-
Omni+    122    32
Omni-    73     43

Odds (Veg+ among Omni+) = 122 / 32 = 3.81
Odds (Veg+ among Omni-) = 73 / 43 = 1.70
Odds ratio = (122 / 32) / (73 / 43) ≈ 2.25

> The odds of liking the omni food are about 2.25 times higher if you also like the veggie one
(and vice versa, because the odds ratio is symmetric). Note that this is a ratio of odds, not of probabilities.

Effect Sizes of Categorical Variables — Odds Ratio


Alternatively, you calculate (A x D) / (B x C).
> (122 x 43) / (32 x 73) ≈ 2.25.

         Veg+   Veg-
Omni+    A      B
Omni-    C      D
Interpreting Odds Ratios
● OR = 1: There is no association between exposure and outcome
(odds are the same in both groups).
● OR > 1: There is a positive association between exposure and
outcome (exposure is associated with higher odds of the outcome).
● OR < 1: There is a negative association between exposure and
outcome (exposure is associated with lower odds of the outcome).
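Both ways of computing the odds ratio can be checked against each other in a few lines. This is a minimal sketch using the KUA counts; the function names `odds_ratio` and `interpret` are hypothetical, and the arguments follow the A, B, C, D cell layout (A and B in the Omni+ row, C and D in the Omni- row).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table [[a, b], [c, d]], computed two equivalent ways."""
    via_odds = (a / b) / (c / d)   # ratio of the two row odds
    via_cross = (a * d) / (b * c)  # cross-product form (A x D) / (B x C)
    return via_odds, via_cross


def interpret(odds_ratio_value):
    """Map an odds ratio to the three cases listed above."""
    if odds_ratio_value > 1:
        return "positive association"
    if odds_ratio_value < 1:
        return "negative association"
    return "no association"


# KUA exercise: rows Omni+ / Omni-, columns Veg+ / Veg-.
via_odds, via_cross = odds_ratio(122, 32, 73, 43)
print(f"{via_odds:.2f} {via_cross:.2f}")  # prints "2.25 2.25"
print(interpret(via_cross))
```

The cross-product form avoids rounding the intermediate odds, which is why it gives 2.25 rather than the 2.24 obtained from 3.81 / 1.70.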

Thanks! :)
