<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Posts | Alexandra (Alex) Truong Hai Yen</title><link>https://alextruong.com/post/</link><atom:link href="https://alextruong.com/post/index.xml" rel="self" type="application/rss+xml"/><description>Posts</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>© 2021 Alex Truong. Powered by the Academic theme for Hugo</copyright><image><url>https://alextruong.com/media/icon_hu21da36d80af7d2e68188ec400a83d9e9_58812_512x512_fill_lanczos_center_2.png</url><title>Posts</title><link>https://alextruong.com/post/</link></image><item><title>“Am I looking at the same data?”</title><link>https://alextruong.com/post/simpson/</link><pubDate>Wed, 03 Mar 2021 00:00:00 +0000</pubDate><guid>https://alextruong.com/post/simpson/</guid><description>&lt;!-- &lt;blockquote class="twitter-tweet">&lt;p lang="fr" dir="ltr">THE SIMPSONS PARADOX&lt;br>&lt;br>H/T &lt;a href="https://twitter.com/fMRI_guy?ref_src=twsrc%5Etfw">@fMRI_guy&lt;/a> &lt;a href="https://t.co/dUbMuwWUDH">pic.twitter.com/dUbMuwWUDH&lt;/a>&lt;/p>&amp;mdash; RJ Andrews (@infowetrust) &lt;a href="https://twitter.com/infowetrust/status/984536880199876608?ref_src=twsrc%5Etfw">April 12, 2018&lt;/a>&lt;/blockquote> &lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8">&lt;/script> -->
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In data analysis, we usually expect numerical results to make logical
sense, and most of the time they do. However, there are cases where this
expectation no longer holds. Sometimes we observe a trend in the data from
one perspective, and that trend disappears or reverses when we look at the
data from another perspective. One such instance is the Simpson paradox,
which deserves appreciation and further investigation, especially from
those of us who pride ourselves on our love of and skill in data analysis.
The Simpson paradox has many real-world examples, from &lt;a href="https://www.nytimes.com/2013/04/27/business/economy/wage-disparity-continues-to-grow.html?_r=2&amp;amp;" target="_blank" rel="noopener">why the overall US
median wage has risen when the median wage for individual income groups has
declined over the same
period&lt;/a>
to &lt;a href="https://medium.com/@dexter.shawn/how-uc-berkeley-almost-got-sued-because-of-lying-data-aaa5d641f571" target="_blank" rel="noopener">why gender bias in admission appears only in overall data but not
in department-wise
data&lt;/a>.&lt;/p>
&lt;p>Let’s start with a simple example.&lt;/p>
&lt;h2 id="a-simple-example-on-university-admission-data">A simple example on university admission data&lt;/h2>
&lt;p>We have hypothetical admission data for a university that received
about 2000 applications for two departments, A and B. I imported the data
into a dataframe called &lt;code>admit_df&lt;/code> using R (R Core Team 2020),
&lt;code>tidyverse&lt;/code> (Wickham et al. 2019), and the &lt;code>scales&lt;/code>
package. &lt;code>admit_df&lt;/code> is printed below.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">gender&lt;/th>
&lt;th style="text-align:center">admit&lt;/th>
&lt;th style="text-align:center">count&lt;/th>
&lt;th style="text-align:center">dept&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">Male&lt;/td>
&lt;td style="text-align:center">Admitted&lt;/td>
&lt;td style="text-align:center">800&lt;/td>
&lt;td style="text-align:center">A&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">Male&lt;/td>
&lt;td style="text-align:center">Rejected&lt;/td>
&lt;td style="text-align:center">135&lt;/td>
&lt;td style="text-align:center">A&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">Male&lt;/td>
&lt;td style="text-align:center">Admitted&lt;/td>
&lt;td style="text-align:center">42&lt;/td>
&lt;td style="text-align:center">B&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">Male&lt;/td>
&lt;td style="text-align:center">Rejected&lt;/td>
&lt;td style="text-align:center">58&lt;/td>
&lt;td style="text-align:center">B&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">Female&lt;/td>
&lt;td style="text-align:center">Admitted&lt;/td>
&lt;td style="text-align:center">602&lt;/td>
&lt;td style="text-align:center">A&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">Female&lt;/td>
&lt;td style="text-align:center">Rejected&lt;/td>
&lt;td style="text-align:center">98&lt;/td>
&lt;td style="text-align:center">A&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">Female&lt;/td>
&lt;td style="text-align:center">Admitted&lt;/td>
&lt;td style="text-align:center">154&lt;/td>
&lt;td style="text-align:center">B&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">Female&lt;/td>
&lt;td style="text-align:center">Rejected&lt;/td>
&lt;td style="text-align:center">146&lt;/td>
&lt;td style="text-align:center">B&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
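&lt;p>The import code itself is not shown in this post. As a minimal sketch, assuming the data is typed in by hand rather than read from a file, the table above could be recreated with &lt;code>tribble()&lt;/code> from the &lt;code>tibble&lt;/code> package (part of the tidyverse). This is only one possible way to build &lt;code>admit_df&lt;/code>, not necessarily how it was originally imported.&lt;/p>
&lt;pre>&lt;code class="language-r">library(tibble)

# Hypothetical counts copied row by row from the table above.
admit_df &amp;lt;- tribble(
  ~gender,  ~admit,     ~count, ~dept,
  "Male",   "Admitted",    800, "A",
  "Male",   "Rejected",    135, "A",
  "Male",   "Admitted",     42, "B",
  "Male",   "Rejected",     58, "B",
  "Female", "Admitted",    602, "A",
  "Female", "Rejected",     98, "A",
  "Female", "Admitted",    154, "B",
  "Female", "Rejected",    146, "B"
)

# Total number of applications in the table: 2035
sum(admit_df$count)
&lt;/code>&lt;/pre>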
&lt;p>Let’s look at the admission status profile by gender and department. As
&lt;code>dept&lt;/code>, &lt;code>gender&lt;/code>, and &lt;code>admit&lt;/code> are all categorical features, we can use
&lt;code>group_by()&lt;/code>, &lt;code>summarize()&lt;/code>, and &lt;code>mutate()&lt;/code> on &lt;code>admit_df&lt;/code> to calculate the
proportion of &lt;strong>Admitted&lt;/strong> and &lt;strong>Rejected&lt;/strong> applicants for each gender within each
department. I also used &lt;code>ggplot&lt;/code> in R to create the corresponding
visualization. The plotting code for this and subsequent visualizations is
omitted from this blog post for better readability. The detailed code can
be found
&lt;a href="https://github.ubc.ca/MDS-2020-21/DSCI_542_lab2_haiyen/blob/master/blog.Rmd" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
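&lt;pre>&lt;code class="language-r"># Proportion admitted/rejected within each department-gender pair
separate_dept &amp;lt;- admit_df %&amp;gt;%
  group_by(dept, gender, admit) %&amp;gt;%
  # drop_last keeps the (dept, gender) grouping for the mutate() below
  summarize(count = sum(count), .groups = "drop_last") %&amp;gt;%
  mutate(prop = count / sum(count))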
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="blog_files/figure-gfm/separate%20dept-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;p>As can be seen, for this hypothetical university, both department A and
department B seem to favor female candidates: the admission rate for
females is higher than that for males in both departments. Assuming
that the university has only two departments, A and B, it seems sensible to
conclude that the admission rate for females at this university is
higher than that for males. Or is it? Let’s check by making a
similar plot for the aggregate data as follows.&lt;/p>
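&lt;pre>&lt;code class="language-r"># Proportion admitted/rejected within each gender, departments pooled
aggregate_dept &amp;lt;- admit_df %&amp;gt;%
  group_by(gender, admit) %&amp;gt;%
  # drop_last keeps the gender grouping for the mutate() below
  summarize(count = sum(count), .groups = "drop_last") %&amp;gt;%
  mutate(prop = count / sum(count))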
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="blog_files/figure-gfm/unnamed-chunk-3-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;p>Interestingly, in the aggregate data, the admission rate for females
(75.6%) is lower than that for males (81.4%). What is going
on? Am I looking at the same data? Has there been a mistake? There has
not: this turns out to be a classic case of the Simpson paradox.&lt;/p>
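&lt;p>As a quick sanity check that nothing has gone wrong, the sketch below recomputes the admission rates directly from the counts in the table, once per department and once pooled. It re-enters the data so that it runs on its own; the names &lt;code>by_dept&lt;/code> and &lt;code>overall&lt;/code> are illustrative and not part of the original analysis.&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)

# Counts re-entered from the table above, in the same row order.
admit_df &amp;lt;- data.frame(
  gender = rep(c("Male", "Female"), each = 4),
  dept   = rep(c("A", "A", "B", "B"), times = 2),
  admit  = rep(c("Admitted", "Rejected"), times = 4),
  count  = c(800, 135, 42, 58, 602, 98, 154, 146)
)

# Admission rate for each gender within each department.
by_dept &amp;lt;- admit_df %&amp;gt;%
  group_by(dept, gender) %&amp;gt;%
  summarize(rate = sum(count[admit == "Admitted"]) / sum(count),
            .groups = "drop")

# Admission rate for each gender with both departments pooled.
overall &amp;lt;- admit_df %&amp;gt;%
  group_by(gender) %&amp;gt;%
  summarize(rate = sum(count[admit == "Admitted"]) / sum(count),
            .groups = "drop")

by_dept  # females have the higher rate in both A and B
overall  # males have the higher pooled rate
&lt;/code>&lt;/pre>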
&lt;h2 id="simpson-paradox">Simpson paradox&lt;/h2>
&lt;p>&lt;strong>Simpson paradox&lt;/strong>, or &lt;strong>Yule-Simpson effect&lt;/strong>, is a phenomenon in
statistical studies whereby trends that appear in different groups
disappear or reverse when we aggregate the data, or vice versa. The paradox
is named after Edward H. Simpson, who described it in his 1951 paper
(Simpson 1951). However, Simpson was not the first to discover the
concept: Karl Pearson et al. and Udny Yule had described the phenomenon
earlier, in 1899 (Pearson, Lee, and Bramley-Moore 1899) and 1903 (Yule
1900) respectively, hence the Yule in Yule-Simpson. The term Simpson paradox
became more popular when Colin R. Blyth used it in his 1972 paper
(Blyth 1972).&lt;/p>
&lt;p>Indeed, in the context of our example, the bias toward female applicants
during the admission process observed at individual departments not only
disappears but also reverses for the aggregate data where the admission
rate for male applicants is higher. Why is that so?&lt;/p>
&lt;h2 id="making-sense-of-the-paradox">Making sense of the paradox&lt;/h2>
&lt;p>What we have not considered is that the breakdown of applicants by
department differs between genders. To obtain this information, I
wrangled &lt;code>admit_df&lt;/code> using &lt;code>group_by()&lt;/code> on &lt;code>gender&lt;/code> and &lt;code>dept&lt;/code> instead of
&lt;code>gender&lt;/code> and &lt;code>admit&lt;/code>, as below.&lt;/p>
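&lt;pre>&lt;code class="language-r"># Share of each gender's applicants going to each department
aggregate_admit &amp;lt;- admit_df %&amp;gt;%
  group_by(gender, dept) %&amp;gt;%
  # drop_last keeps the gender grouping for the mutate() below
  summarize(count = sum(count), .groups = "drop_last") %&amp;gt;%
  mutate(prop = count / sum(count))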
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="blog_files/figure-gfm/unnamed-chunk-5-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;p>From Figure 3 above, 90.3% of male candidates apply to
department A, which is much higher than the 70.0% of female
candidates who do so. As the numbers of female and male applicants in the
data are comparable, this translates to more male than female applicants
for department A. With the high admission rate of 85.6% in Figure 1, a
large number of male applicants are accepted into department A,
which outweighs the lower number of male applicants accepted into
department B. Female applicants, by contrast, apply in greater proportion
to department B, where the admission rate is much lower for both genders,
and this pulls their aggregate admission rate down.&lt;/p>
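&lt;p>The weighting argument above can be checked numerically. The sketch below, which re-enters the table so that it runs on its own, multiplies each department’s admission rate by the share of that gender’s applicants who applied there and adds up the contributions. The column names &lt;code>share&lt;/code>, &lt;code>rate&lt;/code>, and &lt;code>contribution&lt;/code> are illustrative choices, not names from the original analysis.&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)

# Counts re-entered from the table above, in the same row order.
admit_df &amp;lt;- data.frame(
  gender = rep(c("Male", "Female"), each = 4),
  dept   = rep(c("A", "A", "B", "B"), times = 2),
  admit  = rep(c("Admitted", "Rejected"), times = 4),
  count  = c(800, 135, 42, 58, 602, 98, 154, 146)
)

decomposition &amp;lt;- admit_df %&amp;gt;%
  group_by(gender, dept) %&amp;gt;%
  summarize(applicants = sum(count),
            admitted   = sum(count[admit == "Admitted"]),
            .groups    = "drop_last") %&amp;gt;%  # still grouped by gender
  mutate(share        = applicants / sum(applicants),  # where each gender applies
         rate         = admitted / applicants,         # admission rate in that dept
         contribution = share * rate)

# Summing the contributions per gender gives the aggregate admission rates.
overall &amp;lt;- decomposition %&amp;gt;%
  summarize(overall_rate = sum(contribution))

decomposition
overall
&lt;/code>&lt;/pre>
&lt;p>The male aggregate rate is dominated by the department A term because of its large &lt;code>share&lt;/code>, while the female aggregate rate gives more weight to the lower department B rate.&lt;/p>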
&lt;p>Thus, &lt;code>dept&lt;/code> can be thought of as the hidden or confounding variable
that affects the relationship between &lt;code>gender&lt;/code> and &lt;code>admit&lt;/code>.&lt;/p>
&lt;p>Also, mathematically, we can use the &lt;a href="https://www.wikiwand.com/en/Law_of_total_probability" target="_blank" rel="noopener">Law of total
probability&lt;/a> to
write this as a single expression, shown below, in which the
admission rates in Figure 2 are reconciled with those in
Figures 1 and 3.&lt;/p>
&lt;p>&lt;img src="blog_files/figure-gfm/simpson_latex.png" alt="">&lt;!-- -->&lt;/p>
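&lt;p>For readers who cannot see the image, here is a plain LaTeX sketch of the same idea, with the counts from the table plugged in. The notation is my own assumption and may differ from the image: &lt;code>g&lt;/code> stands for a gender, and &lt;code>A&lt;/code> and &lt;code>B&lt;/code> for the two departments. The &lt;code>\text&lt;/code> command assumes the &lt;code>amsmath&lt;/code> package.&lt;/p>
&lt;pre>&lt;code class="language-latex">% Law of total probability, conditioning on the department applied to
\[
P(\text{Admitted} \mid g)
  = P(\text{Admitted} \mid g, A)\,P(A \mid g)
  + P(\text{Admitted} \mid g, B)\,P(B \mid g)
\]

% Female applicants: 700 applied to A, 300 to B
\[
\frac{602}{700}\cdot\frac{700}{1000} + \frac{154}{300}\cdot\frac{300}{1000}
  = \frac{756}{1000} = 75.6\%
\]

% Male applicants: 935 applied to A, 100 to B
\[
\frac{800}{935}\cdot\frac{935}{1035} + \frac{42}{100}\cdot\frac{100}{1035}
  = \frac{842}{1035} \approx 81.4\%
\]
&lt;/code>&lt;/pre>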
&lt;h2 id="further-discussion-and-takeaways">Further discussion and Takeaways&lt;/h2>
&lt;p>Now, let’s imagine you only have the data for Figure 2 and do not have
access to the department data yet. After reading this post, would you trust
the apparent preference for male applicants in admission at the school
above? I hope not. Rather, before we put our faith in any analytical or
statistical findings, we should first try to understand and think critically
about the data, from how it was collected or sampled to whether it is
representative of the population. If the data is observational, what
potential unobserved variables could affect the relationship you
are trying to understand (i.e., confounding variables or
confounders)? In the example above, it was &lt;code>dept&lt;/code>. However, was &lt;code>dept&lt;/code>
the only confounder in this case? How many more should we look for? The
answer to how many confounders are sufficient depends very much on the
application and is somewhat arbitrary. One way to circumvent this is to
conduct randomized studies to cancel out the effect of confounders.&lt;/p>
&lt;p>All these considerations barely scratch the surface of the complexity and
subtlety present in a real-world problem. In this post, I have
abstracted some of that complexity away to better illustrate
the concept. However, I hope the post still conveys the
fascinating lesson of the Simpson paradox:
statistical associations are not necessarily immutable and may vary
depending on the set of controlled variables (Carlson 2019). The paradox may
not be intuitive at times, but once you develop a sober and critical way
of viewing data, being able to identify the paradox at work and avoid the
common intuitive pitfalls can be quite rewarding.&lt;/p>
&lt;h3 id="references">References&lt;/h3>
&lt;p>Inspiration from
&lt;a href="https://github.ubc.ca/MDS-2020-21/DSCI_554_exper-causal-inf_students/blob/master/release/lab1/lab1.Rmd" target="_blank" rel="noopener">Lab1&lt;/a>
assignment of module DSCI_554 of Dr. Gilberto Alexi Rodriguez Arelis&lt;/p>
&lt;p>Inspiration from (Grigg 2018)&lt;/p>
&lt;div id="refs" class="references hanging-indent">
&lt;div id="ref-blyth1972simpson">
&lt;p>Blyth, Colin R. 1972. “On Simpson’s Paradox and the Sure-Thing
Principle.” &lt;em>Journal of the American Statistical Association&lt;/em> 67 (338):
364–66.&lt;/p>
&lt;/div>
&lt;div id="ref-Britannica">
&lt;p>Carlson, B. W. 2019. “Simpson’s Paradox.” Edited by Encyclopedia
Britannica.
&lt;a href="https://www.britannica.com/topic/Simpsons-paradox">https://www.britannica.com/topic/Simpsons-paradox&lt;/a>.&lt;/p>
&lt;/div>
&lt;div id="ref-TDS">
&lt;p>Grigg, Tom. 2018. “Simpson’s Paradox and Interpreting Data.”
&lt;a href="https://towardsdatascience.com/simpsons-paradox-and-interpreting-data-6a0443516765">https://towardsdatascience.com/simpsons-paradox-and-interpreting-data-6a0443516765&lt;/a>.&lt;/p>
&lt;/div>
&lt;div id="ref-pearson1899vi">
&lt;p>Pearson, Karl, Alice Lee, and Leslie Bramley-Moore. 1899. “VI.
Mathematical Contributions to the Theory of Evolution.-VI. Genetic
(Reproductive) Selection: Inheritance of Fertility in Man, and of
Fecundity in Thoroughbred Racehorses.” &lt;em>Philosophical Transactions of
the Royal Society of London. Series A, Containing Papers of a
Mathematical or Physical Character&lt;/em>, no. 192: 257–330.&lt;/p>
&lt;/div>
&lt;div id="ref-R">
&lt;p>R Core Team. 2020. &lt;em>R: A Language and Environment for Statistical
Computing&lt;/em>. Vienna, Austria: R Foundation for Statistical Computing.
&lt;a href="https://www.R-project.org/">https://www.R-project.org/&lt;/a>.&lt;/p>
&lt;/div>
&lt;div id="ref-simpson1951interpretation">
&lt;p>Simpson, Edward H. 1951. “The Interpretation of Interaction in
Contingency Tables.” &lt;em>Journal of the Royal Statistical Society: Series B
(Methodological)&lt;/em> 13 (2): 238–41.&lt;/p>
&lt;/div>
&lt;div id="ref-tidyverse">
&lt;p>Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy
D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019.
“Welcome to the tidyverse.” &lt;em>Journal of Open Source Software&lt;/em> 4 (43):
1686. &lt;a href="https://doi.org/10.21105/joss.01686">https://doi.org/10.21105/joss.01686&lt;/a>.&lt;/p>
&lt;/div>
&lt;div id="ref-yule1900vii">
&lt;p>Yule, George Udny. 1900. “VII. On the Association of Attributes in
Statistics: With Illustrations from the Material of the Childhood
Society, &amp;amp;c.” &lt;em>Philosophical Transactions of the Royal Society of
London. Series A, Containing Papers of a Mathematical or Physical
Character&lt;/em> 194 (252-261): 257–319.&lt;/p>
&lt;/div>
&lt;/div></description></item></channel></rss>