“Am I looking at the same data?”

Wed, 03 Mar 2021 00:00:00 +0000

Introduction

In data analysis, numerical results often assure us that they should make logical sense, which usually is apparent to us. However, there are some cases where this observation may no longer hold. Sometimes we observe certain trends from the data in one perspective, and those trends change disappear or reverse when we look at the data from another perspective. One such instance is the Simpson paradox, which does deserve appreciation and further investigation from us, especially those who pride themselves in their love and skill for data analysis. Indeed, the Simpson paradox has many real-world examples from why overall US median wage has risen when median wage for individual income groups has declined over the same period to why gender bias in admission only presents in overall data but not in department-wise data.

Let’s start with a simple example.

A simple example on university admission data

We have the hypothetical admission data of a university that received about 2000 applications for two departments A, B, which I imported using R (R Core Team 2020), tidyverse (Wickham et al. 2019), and scale package to a dataframe called admit_df. admit_df is printed as below.

gender	admit	count	dept
Male	Admitted	800	A
Male	Rejected	135	A
Male	Admitted	42	B
Male	Rejected	58	B
Female	Admitted	602	A
Female	Rejected	98	A
Female	Admitted	154	B
Female	Rejected	146	B

Let’s look at the admission status profile by gender and department. As dept, gender, admit are all categorical features, we can use the group_by(), summarize() and mutate on admit_df to calculate the proportion of Admitted and Rejected for the respective gender by department. I also used ggplot in R to create the corresponding visualization. This visualization and subsequent visualization are not displayed in this blog post for better readability. The detail code can be found here.

separate_dept <- admit_df %>%
group_by(dept, gender, admit) %>%
summarize(count = sum(count)) %>%
mutate(prop = count / sum(count))

As can be seen, for this hypothetical university, both department A and department B seem to favor female candidates with the admission rate for females is higher than that for males for both departments. Assuming that the university has only 2 departments A, B, it seems sensible to conclude that the admission rate for females at this university is higher than that of males. Or does it? Let’s check this by plotting a similar plot for the aggregate data as follows.

aggregate_dept <- admit_df %>%
group_by(gender, admit) %>%
summarize(count = sum(count)) %>%
mutate(prop = count / sum(count))

Interestingly, under the aggregate data, the admission rate for females ((75.6%)) is lower than that for males ((81.4%)). What is going on? Am I looking at the same data? Has there been a mistake? However, this turns out to be one of the classic cases of Simpson paradox.

Simpson paradox

Simpson paradox, or Yule-Simpson effect, is a phenomenon in statistical studies whereby trends that appear in different groups disappear or reverse when we aggregate data or vice versa. The paradox is named after Edward H. Simpson, who mentioned it in his 1951 paper (Simpson 1951). However, Simpson was not the first to discover the concept, Karle Pearson et al. and Udny Yule has described the phenomenon earlier in 1899 (Pearson, Lee, and Bramley-Moore 1899) and 1903 (Yule 1900) respectively, hence Yule in Yule-Simpson. The term Simpson paradox became more popular when Colin R.Blyth mentioned it in his 1972 paper (Blyth 1972).

Indeed, in the context of our example, the bias toward female applicants during the admission process observed at individual departments not only disappears but also reverses for the aggregate data where the admission rate for male applicants is higher. Why is that so?

Making sense of the paradox

What we have not considered is that the breakdown of applicants by departments are different between gender. To obtain this information, I wrangled admit_df using group_by() on gender and dept instead of gender and admit as below.

aggregate_admit <- admit_df %>%
group_by(gender, dept) %>%
summarize(count = sum(count)) %>%
mutate(prop = count / sum(count))

From Figure 3 above, the proportion of male candidates applying for department A is at (90.3%), which is much higher than that of female candidates. This translates to more male applicants for department A than female as the number of female and male applicants in the data are comparable. With the high admission rate of (85.6%) in Figure 1, male applicants for department A there would be accepted at a high number, which outweighs the lower number of accepted male applicants for department B.

Thus, dept can be thought of as the hidden or confounding variable that affects the relationship between gender and admit.

Also, mathematically, we can use the Law of total probability to express this in one expression as below, where we can see how the admission rates in Figure 2 are linked or reconciled with those in Figure 1 and 3.

Further discussion and Takeaways

Now, let’s imagine you only have data for Figure 2 and do not have access to department data yet. After reading this post, would you trust the trend of a male preference in admission in the school above? I hope not. Rather, before we put our faith in any analytical or statistical findings, we should first try to understand and think critically about data, from how it was collected or sampled to whether it is representative of the population. If data is observational, what could be potential unobserved variables that could affect the relationship you are trying to understand in the data (i.e confounding variables or confounders). In the example above, it was dept. However, was dept the only confounder in this case? How many more should we find? The answer on how many confounders are considered sufficient is very much application-based and arbitrary. One way to circumvent this is to conduct randomized studies to cancel out the effect of confounders.

All these considerations barely touch the surface of the complexity and subtlety present in a real-world problem. In this post, I have abstracted certain complexity away to facilitate better illustration of the concept. However, I hope the post still demonstrates to you the fascinating findings from the Simpson paradox, which shows that statistical associations are not necessarily immutable and may vary depending on the set of controlled variables (Carlson 2019). It may not be intuitive at times, but when you develop a sober and critical sense of viewing the data and can identify this paradox at work and avoid the common intuitive pitfalls, it could be quite rewarding.

References

Inspiration from Lab1 assignment of module DSCI_554 of Dr. Gilberto Alexi Rodriguez Arelis

Inspiration from (Grigg 2018)

Blyth, Colin R. 1972. “On Simpson’s Paradox and the Sure-Thing Principle.” Journal of the American Statistical Association 67 (338): 364–66.

Carlson, B. W. 2019. “Simpson’s Paradox.” Edited by Encyclopedia Britannica. https://towardsdatascience.com/simpsons-paradox-and-interpreting-data-6a0443516765.

Grigg, Tom. 2018. “Simpson’s Paradox and Interpreting Data.” https://towardsdatascience.com/simpsons-paradox-and-interpreting-data-6a0443516765.

Pearson, Karl, Alice Lee, and Leslie Bramley-Moore. 1899. “VI. Mathematical Contributions to the Theory of Evolution.-VI. Genetic (Reproductive) Selection: Inheritance of Fertility in Man, and of Fecundity in Thoroughbred Racehorses.” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, no. 192: 257–330.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Simpson, Edward H. 1951. “The Interpretation of Interaction in Contingency Tables.” Journal of the Royal Statistical Society: Series B (Methodological) 13 (2): 238–41.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Yule, George Udny. 1900. “VII. On the Association of Attributes in Statistics: With Illustrations from the Material of the Childhood Society, &c.” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 194 (252-261): 257–319.