Lecture 12
June 1, 2025
Cite your sources! Seriously!
You must explicitly cite the usage of any external online resources. Failure to do so is academic dishonesty.
Now that you only have 24 hours…
Some common ways people misrepresent data, either intentionally or unintentionally, include:
Claiming causality where it’s not in the scope of inference of the underlying study
Distorting axes and scales to make the data tell a different story
Visualizing spatial areas instead of human density for issues that depend on and affect humans
Omitting uncertainty in reporting (sketched below)
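A minimal sketch of that last point, using made-up poll numbers: the same two estimates shown without and then with their margins of error.

library(tidyverse)

# Hypothetical poll estimates and margins of error
est <- tibble(
  candidate = c("A", "B"),
  support   = c(0.48, 0.52),
  moe       = c(0.03, 0.03)
)

# Uncertainty omitted: B appears clearly ahead
ggplot(est, aes(x = candidate, y = support)) +
  geom_point(size = 3)

# Uncertainty shown: the intervals overlap substantially
ggplot(est, aes(x = candidate, y = support)) +
  geom_pointrange(aes(ymin = support - moe, ymax = support + moe))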
Correlation does not imply causation.
How plausible is the statement in the title of this article?
What does “research shows” mean?
Moore, Steven C., et al. “Association of leisure-time physical activity with risk of 26 types of cancer in 1.44 million adults.” JAMA internal medicine 176.6 (2016): 816-825.
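A minimal sketch, with simulated data and hypothetical variable names, of how a lurking variable can produce a strong association like the one reported in such studies without any direct causal link:

library(tidyverse)

set.seed(1)
n <- 1000
# Hypothetical lurking variable, e.g., overall health
confounder <- rnorm(n)
# Two outcomes that both depend on the confounder but not on each other
activity    <- confounder + rnorm(n, sd = 0.5)
cancer_risk <- -confounder + rnorm(n, sd = 0.5)

cor(activity, cancer_risk)   # strongly negative, with no direct causal link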
Axes and scales
What is the difference between these two pictures? Which presents a better way to represent these data?
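The two pictures aren't reproduced here, but the idea can be sketched with made-up numbers: the same two values plotted once with a truncated y-axis and once with a zero baseline.

library(tidyverse)

rates <- tibble(
  group = c("Before", "After"),
  value = c(3.05, 3.10)   # hypothetical values
)

p <- ggplot(rates, aes(x = group, y = value)) +
  geom_col()

p + coord_cartesian(ylim = c(3.00, 3.12))   # truncated axis: the change looks dramatic
p + coord_cartesian(ylim = c(0, 3.5))       # zero baseline: the change looks modest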
What is wrong with this image?
# Packages: tidyverse (tibble, ggplot2), lubridate for ymd(), scales for label_dollar()
library(tidyverse)
library(lubridate)
library(scales)

# Three observations: one year ago, one week ago, and the current price
df <- tibble(
  date = ymd(c("2019-11-01", "2020-10-25", "2020-11-01")),
  cost = c(3.17, 3.51, 3.57)
)

ggplot(df, aes(x = date, y = cost, group = 1)) +
  geom_point() +
  geom_line() +
  geom_label(aes(label = cost), hjust = -0.25) +
  labs(
    title = "Cost of gas",
    subtitle = "National average",
    x = NULL, y = NULL,
    caption = "Source: AAA Fuel Gauge Report"
  ) +
  # Label the three dates as "Last year", "Last week", "Current"
  scale_x_continuous(
    breaks = ymd(c("2019-11-01", "2020-10-25", "2020-11-01")),
    labels = c("Last year", "Last week", "Current"),
    guide = guide_axis(angle = 90),
    limits = ymd(c("2019-11-01", "2020-11-29")),
    minor_breaks = ymd(c("2019-11-01", "2020-10-25", "2020-11-01"))
  ) +
  scale_y_continuous(labels = label_dollar())
What is wrong with this image?
What is wrong with this picture? How would you correct it?
# tidyverse and scales (for label_number()) are loaded above
pp <- tibble(
  year = c(2006, 2006, 2013, 2013),
  service = c("Abortion", "Cancer", "Abortion", "Cancer"),
  n = c(289750, 2007371, 327000, 935573)
)

ggplot(pp, aes(x = year, y = n, color = service)) +
  geom_point(size = 2) +
  geom_line(linewidth = 1) +
  # Label each point with its count, and with its year in gray
  geom_text(aes(label = n), nudge_y = 100000) +
  geom_text(
    aes(label = year),
    nudge_y = 200000,
    color = "darkgray"
  ) +
  labs(
    title = "Services provided by Planned Parenthood",
    caption = "Source: Planned Parenthood",
    x = NULL,
    y = NULL
  ) +
  scale_x_continuous(breaks = c(2006, 2013)) +
  scale_y_continuous(labels = label_number(big.mark = ",")) +
  scale_color_manual(values = c("darkred", "hotpink")) +
  # Annotate the lines directly instead of using a legend
  annotate(
    geom = "text",
    label = "Abortions",
    x = 2009.5,
    y = 400000,
    color = "darkred"
  ) +
  annotate(
    geom = "text",
    label = "Cancer screening\nand prevention services",
    x = 2010.5,
    y = 1600000,
    color = "hotpink"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Do you recognize this map? What does it show?
On December 19, 2014, the front page of Spanish national newspaper El País read “Catalan public opinion swings toward ‘no’ for independence, says survey”.
Popular referendum on 2018’s Senate Bill 10:
YES: replace cash bail with “risk assessment”.
NO: keep the cash bail system.
If passed, each county would be empowered to develop a tool that predicts the risk of a suspect reoffending before trial.
Judges would consult this prediction to make bail decisions.
Something we will study after the midterm:
Above the line means high risk, and high risk means no bail. Is this progress?
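Here is a minimal sketch, with simulated data and entirely hypothetical variable names, of the kind of model such a tool is built on: a classifier whose predicted probability is cut at a threshold (the "line"), above which the recommendation is no bail.

library(tidyverse)

set.seed(12)
# Simulated defendants (hypothetical predictors and outcome)
defendants <- tibble(
  prior_arrests = rpois(500, lambda = 2),
  age           = sample(18:70, 500, replace = TRUE)
) |>
  mutate(
    reoffended = rbinom(500, 1, plogis(-1 + 0.4 * prior_arrests - 0.02 * age))
  )

# Fit a logistic regression risk model
fit <- glm(reoffended ~ prior_arrests + age, data = defendants, family = binomial)

# Convert predicted probabilities into a bail decision with an arbitrary cutoff
defendants |>
  mutate(
    risk_score = predict(fit, type = "response"),
    decision   = if_else(risk_score > 0.5, "no bail", "bail")   # "above the line"
  ) |>
  count(decision)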
2016 ProPublica article on an algorithm used to rate defendants' risk of future crime:
In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.
The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
White defendants were mislabeled as low risk more often than black defendants.
What is common among the defendants who were assigned a high/low risk score for reoffending?
How can an algorithm that doesn’t use race as input data be racist?
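One answer is proxies. A minimal sketch with simulated data (all variable names hypothetical): race is never an input, but an input correlated with race, here neighborhood, combined with a training label that reflects uneven policing rather than underlying behavior, leads the model to flag one group far more often.

library(tidyverse)

set.seed(99)
n <- 5000
sim <- tibble(
  group        = sample(c("A", "B"), n, replace = TRUE),
  # Neighborhood is strongly correlated with group: a proxy for race
  neighborhood = if_else(group == "A", rbinom(n, 1, 0.8), rbinom(n, 1, 0.2)),
  # Same underlying behavior in both groups...
  offended     = rbinom(n, 1, 0.3),
  # ...but offenses in the heavily policed neighborhood are recorded more
  # often, and the recorded label is what the model is trained on
  rearrested   = offended * rbinom(n, 1, if_else(neighborhood == 1, 0.9, 0.5))
)

# Race/group is never given to the model
fit <- glm(rearrested ~ neighborhood, data = sim, family = binomial)

# Yet the share flagged as high risk differs sharply by group
sim |>
  mutate(flagged = predict(fit, type = "response") > 0.2) |>
  group_by(group) |>
  summarize(share_flagged = mean(flagged))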
Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records (Imai and Khanna, 2016)
The accompanying open-source R package, wru, makes probabilistic estimates of individual-level race/ethnicity given voter file information.
Was the publication of this model ethical? Does the open-source nature of the code affect your answer? Is it ethical to use this software? Does your answer change depending on the intended use?
library(wru)

# Predict race/ethnicity probabilities from surnames alone, using the example
# voter file (voters) that ships with the wru package
predict_race(voter.file = voters, surname.only = TRUE) |>
  select(surname, pred.whi, pred.bla, pred.his, pred.asi, pred.oth) |>
  slice(1:5)
surname pred.whi pred.bla pred.his pred.asi pred.oth
1 Khanna 0.045110474 0.003067623 0.0068522723 0.86041191 0.084557725
2 Imai 0.052645440 0.001334812 0.0558160072 0.71937658 0.170827160
3 Rivera 0.043285692 0.008204605 0.9136195794 0.02431688 0.010573240
4 Fifield 0.895405704 0.001911388 0.0337464844 0.01107932 0.057857101
5 Zhou 0.006572555 0.001298962 0.0005388581 0.98236559 0.009224032
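As a sketch of how such output is often used downstream (reusing the voters example and the tidyverse functions loaded earlier), the probabilities can be collapsed to a single most-likely category per person, which is typically the form in which they feed into decisions:

# Keep only the highest-probability category for each record
predict_race(voter.file = voters, surname.only = TRUE) |>
  mutate(id = row_number()) |>
  select(id, surname, starts_with("pred.")) |>
  pivot_longer(starts_with("pred."), names_to = "category", values_to = "prob") |>
  group_by(id, surname) |>
  slice_max(prob, n = 1, with_ties = FALSE) |>
  ungroup() |>
  select(surname, category, prob)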
The cash bail system was retained:
Choice | Votes     | Percent
-------|-----------|--------
Yes    | 7,232,380 | 43.59%
No     | 9,358,226 | 56.41%
Armies of stats PhDs go to work on these models. They (generally) have no training in the ethics of what they’re doing.
Data + model to predict the timing of the menstrual cycle:
A perfect microcosm of the themes of our course…
…but what if you learned they were selling your data?
Every time we use apps, websites, and devices, our data is being collected and used or sold to others.
More importantly, law enforcement, financial institutions, and governments make decisions based on these data that directly affect people's lives.
What pieces of data have you left on the internet today?
Think through everything you've logged into, clicked on, or checked in to, actively or automatically, that might be tracking you.
Do you know where that data is stored? Who it can be accessed by? Whether it’s shared with others?
What are you OK with sharing?
Have you ever thought about why you’re seeing an ad on Google? Try to figure out if you have ad personalization on and how your ads are personalized.
Which of the following are you OK with your browsing history being used for?
Suppose you create a profile on a social media site and share your personal information on your profile. Who else gets to use that data?
2006: AOL released a file with millions of “anonymous” search queries from users over 3 months; data was intended for research
New York Times used search queries to identify users
User #4417749:
“numb fingers”
“60 single men”
“dog that urinates on everything.”
“landscapers in Lilburn, Ga”
In 2016, researchers published data of 70,000 OkCupid users—including usernames, political leanings, drug usage, and intimate sexual details
Researchers didn't release the real names and pictures of OkCupid users, but their identities could easily be uncovered from the details provided, e.g., their usernames.
"Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form."
(Emil Kirkegaard and Julius Daugbjerg Bjerrekær, the researchers who released the data)
When analyzing data that individuals willingly shared publicly on a given platform (e.g., social media), how do you make sure you don't violate their reasonable expectations of privacy?
Augmenting doctors’ diagnostic capacity so that they make fewer mistakes, treat more people, and focus on other aspects of care:
AlphaFold2: “predicting 3D structures [of proteins] (\(y\)) directly from the primary amino acid sequence (\(x\)).”
“researchers can now better understand antibiotic resistance and create images of enzymes that can decompose plastic.”
At some point during your data science learning journey, you will learn tools that can be used unethically.
You might also be tempted to use your knowledge in a way that is ethically questionable, whether because of business goals, the pursuit of further knowledge, or because your boss told you to.
How do you train yourself to make the right decisions (or reduce the likelihood of accidentally making the wrong decisions) at those points?
Calling Bullshit: The Art of Skepticism in a Data-Driven World, by Carl Bergstrom and Jevin West
Invisible Women: Data Bias in a World Designed for Men, by Caroline Criado Perez
Machine Bias, by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner (ProPublica, 2016)
Ethics and Data Science, by Mike Loukides, Hilary Mason, and DJ Patil (free Kindle download)
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, by Cathy O'Neil
Algorithms of Oppression: How Search Engines Reinforce Racism, by Safiya Umoja Noble
How AI discriminates and what that means for your Google habit: A conversation with UCLA internet studies scholar Safiya Noble, by Julia Busiek