Skip to main content Skip to navigation

Lab session 4

Simpson’s paradox (the first exposition of which was actually due to G. U. Yule) is not really a paradox at all, just a manifestation of the fact that the effect of a variable depends crucially on what other variables are controlled for.

The name ‘Simpson’s paradox’ usually relates to discrete variables, and hence data in contingency tables (tables of counts). The ‘ecological fallacy’ phenomenon that we saw in the lecture is the same thing as Simpson’s paradox.

The file berkeley.txt contains data on admissions to UC Berkeley, for 6 academic departments in Fall 1973. [A famous, real dataset for teaching purposes!]

Let’s read the data into R:

berkeley <- read.table("berkeley.txt", header = TRUE)
attach(berkeley)      
##  that's so that R will look for variables here automatically 
##  without berkeley$...  

Let’s first just take a look at the data:

berkeley  

Now let’s fit a binomial logistic regression in which the only predictor is Sex:

admissions <- cbind(Admitted, NotAdmitted) ## the binomial response counts
model0 <- glm(admissions ~ 1, family = binomial)
model1 <- update(model0, . ~ . + Sex)
anova(model0, model1)
summary(model1)  

You should find that the apparent effect is positive for males, and clearly significant. On aggregate, 30% of females and 45% of males were admitted: e.g.,

prop.table(colSums(admissions[Sex == "Male", ]))  

But notice the widely differing relative application rates of males and females to the six different departments. Could it be that females tend to apply to ‘difficult’ departments? Answer this by estimating the sex effect controlling for (i.e., within) department:

model2 <- update(model1, . ~ . + Department)
summary(model2)  

This model still shows some lack of fit (note the fairly high deviance, relative to the residual degrees of freedom), but it now shows a slight negative effect of being male (though not significantly different from zero: see the standard errors by using summary(model2)).

Lack of fit is evident only in Department I, where females in fact have a rather high rate of admission (look at residuals(model2)). With Department I omitted, we achieve a simple model that fits the data well, and in which there is no significant sex effect:

model3 <- update(model2, subset = (Department != "I")) 
summary(model3)  
A simple ‘graphical’ representation of this model is
Sex — Department — Admission

i.e., admission is independent of sex, within any given department.