Data Ethics
Lecture 12
Announcements
- Office hours today from 3:30 - 7:30
- Me: 3:30 - 5:30
- Katie: 5:30 - 7:30
- If you were planning on coming to Wednesday’s office hours, let me know!
Conduct Reminder
Cite your sources! Seriously!
Policy:
You must explicitly cite the usage of any external online resources. Failure to do so is academic dishonesty.
Tomorrow!!!
Today’s advice
Now that you only have 24 hours…
- Study the Kahoot and the practice problems
- Look over the lab problems/solutions
- Spend some serious time with your cheat sheet
- Review notes/readings for topics you feel unsure about
Project
Project goal:
- Work with a team on a full data analysis project
- Today: how to use git collaboratively!
Misrepresentation
Misrepresenting data science results
Some common ways people do this, either intentionally or unintentionally, include:
- Claiming causality where it's not in the scope of inference of the underlying study
- Distorting axes and scales to make the data tell a different story (see the sketch after this list)
- Visualizing spatial areas instead of human density for issues that depend on and affect humans
- Omitting uncertainty in reporting
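A quick illustration of the axis-distortion point: the plots below draw the same two made-up numbers, once with a truncated y-axis and once with a zero-based one. This is a minimal sketch; the data are invented, and the patchwork package (not used elsewhere in these slides) is assumed for the side-by-side layout.

library(ggplot2)
library(patchwork)  # assumed here only to place the two plots side by side

# Two made-up values that differ by about 2%
df <- data.frame(group = c("A", "B"), value = c(98, 100))

base <- ggplot(df, aes(x = group, y = value)) +
  geom_col()

# Zooming the y-axis to 97-100 makes B look roughly three times as tall as A
truncated <- base +
  coord_cartesian(ylim = c(97, 100)) +
  labs(title = "Axis starts at 97")

# With the axis starting at zero, the bars look (correctly) almost identical
honest <- base +
  labs(title = "Axis starts at 0")

truncated + honest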
Causality
My favorite statistics lesson:
Correlation does not imply causation.
Causality: Ice Cream and Drowning
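The classic version of this trap: ice cream sales and drownings rise and fall together, not because one causes the other but because hot weather drives both. A minimal simulation with made-up numbers shows how a confounder manufactures correlation:

# Simulate a year of days where temperature drives both variables
set.seed(12)
n <- 365
temperature <- runif(n, min = 0, max = 35)                # daily high, Celsius
ice_cream   <- 50 + 10 * temperature + rnorm(n, sd = 40)  # daily sales
drownings   <- 0.1 * temperature + rnorm(n, sd = 0.5)     # daily incidents

# Strong positive correlation between sales and drownings...
cor(ice_cream, drownings)

# ...which vanishes once we adjust for the confounder: the ice_cream
# coefficient is near zero after temperature enters the model
summary(lm(drownings ~ ice_cream + temperature))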
Causality - TIME coverage
How plausible is the statement in the title of this article?
Alice Park. Exercise Can Lower Risk of Some Cancers By 20%. Time Magazine. 16 May 2016.
Causality - LA Times coverage
What does “research shows” mean?
Melissa Healy. Exercising drives down risk for 13 cancers, research shows. Los Angeles Times. 16 May 2016.
Causality - Original study
Moore, Steven C., et al. "Association of leisure-time physical activity with risk of 26 types of cancer in 1.44 million adults." JAMA Internal Medicine 176.6 (2016): 816-825.
- Volunteers were asked about their physical activity level over the preceding year.
- Half exercised less than about 150 minutes per week, half exercised more.
- Compared to the bottom 10% of exercisers, the top 10% had lower rates of esophageal, liver, lung, endometrial, colon, and breast cancer.
- Researchers found no association between exercising and 13 other cancers (e.g. pancreatic, ovarian, and brain).
Carl Bergstrom and Jevin West. Calling Bullshit: The art of skepticism in a data-driven world.
Random House, 2020.
Sharon Begley. “Does exercise prevent cancer?”. StatNews. 16 May 2016.
Axes and scales
Remember this?
Axes and scales - Tax cuts
What is the difference between these two pictures? Which presents a better way to represent these data?
Christopher Ingraham. “You’ve been reading charts wrong. Here’s how a pro does it.”. The Washington Post. 14 October 2019.
Axes and scales - Cost of gas
What is wrong with this image?
Axes and scales - Cost of gas
library(tidyverse)  # ggplot2, tibble, etc.
library(lubridate)  # ymd() for parsing dates
library(scales)     # label_dollar()

# The three values shown in the original graphic
df <- tibble(
  date = ymd(c("2019-11-01", "2020-10-25", "2020-11-01")),
  cost = c(3.17, 3.51, 3.57)
)

ggplot(df, aes(x = date, y = cost, group = 1)) +
  geom_point() +
  geom_line() +
  geom_label(aes(label = cost), hjust = -0.25) +
  labs(
    title = "Cost of gas",
    subtitle = "National average",
    x = NULL, y = NULL,
    caption = "Source: AAA Fuel Gauge Report"
  ) +
  # A true date axis spaces the three points honestly: "Last week" and
  # "Current" sit almost on top of each other, unlike in the original chart
  scale_x_date(
    breaks = ymd(c("2019-11-01", "2020-10-25", "2020-11-01")),
    labels = c("Last year", "Last week", "Current"),
    guide = guide_axis(angle = 90),
    limits = ymd(c("2019-11-01", "2020-11-29")),
    minor_breaks = ymd(c("2019-11-01", "2020-10-25", "2020-11-01"))
  ) +
  scale_y_continuous(labels = label_dollar())
Axes and scales - COVID in GA
What is wrong with this image?
Georgia Department of Public Health. 11 May 2020.
Axes and scales - COVID in GA
Lucy D’Agostino McGowan. Graph detective. Live Free or Dichotomize. 17 May 2020.
Axes and scales - PP services
What is wrong with this picture? How would you correct it?
Timothy B. Lee. Whatever you think of Planned Parenthood, this is a terrible and dishonest chart. Vox. 29 September 2015.
Axes and scales - PP services
library(tidyverse)  # ggplot2, tibble, etc.
library(scales)     # label_number()

# The four endpoints behind the misleading chart, on a common scale
pp <- tibble(
  year    = c(2006, 2006, 2013, 2013),
  service = c("Abortion", "Cancer", "Abortion", "Cancer"),
  n       = c(289750, 2007371, 327000, 935573)
)

ggplot(pp, aes(x = year, y = n, color = service)) +
  geom_point(size = 2) +
  geom_line(linewidth = 1) +
  geom_text(aes(label = n), nudge_y = 100000) +
  geom_text(
    aes(label = year),
    nudge_y = 200000,
    color = "darkgray"
  ) +
  labs(
    title = "Services provided by Planned Parenthood",
    caption = "Source: Planned Parenthood",
    x = NULL,
    y = NULL
  ) +
  scale_x_continuous(breaks = c(2006, 2013)) +
  # Unlike the original, the y-axis is labeled and shared by both series
  scale_y_continuous(labels = label_number(big.mark = ",")) +
  scale_color_manual(values = c("darkred", "hotpink")) +
  # Direct labels in place of a legend
  annotate(
    geom = "text",
    label = "Abortions",
    x = 2009.5,
    y = 400000,
    color = "darkred"
  ) +
  annotate(
    geom = "text",
    label = "Cancer screening\nand prevention services",
    x = 2010.5,
    y = 1600000,
    color = "hotpink"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Maps and areas
Maps and areas - Voting map
Do you recognize this map? What does it show?
Lazaro Gamio. “Election maps are telling you big lies about small things”. The Washington Post. 1 Nov 2016.
Maps and areas - Voting percentages
Alberto Cairo. Visual Trumpery talk.
Uncertainty
Uncertainty - Catalan independence
On December 19, 2014, the front page of Spanish national newspaper El País read “Catalan public opinion swings toward ‘no’ for independence, says survey”.
Alberto Cairo. The truthful art: Data, charts, and maps for communication. New Riders, 2016.
Uncertainty - Catalan independence
Alberto Cairo. "Uncertainty and Graphicacy: How Should Statisticians, Journalists, and Designers Reveal Uncertainty in Graphics for Public Consumption?", Power from Statistics: Data, Information and Knowledge, 2017.
Algorithmic bias
California Proposition 25 (2020)
Popular referendum on 2018’s Senate Bill 10:
- YES: replace cash bail with "risk assessment".
  - Supporters: Democratic Party, Governor Gavin Newsom, League of Women Voters of California, California Medical Association, Democracy for America (progressive PAC), etc.
- NO: keep the cash bail system.
  - Supporters: Republican Party, American Bail Coalition, ACLU of Southern California, NAACP, California Asian Pacific Chamber of Commerce, etc.
If passed, each county would be empowered to develop a tool that predicts the risk of a suspect reoffending before trial.
Judges would consult this prediction to make bail decisions.
What might “risk assessment” look like?
Something we will study after the midterm:
Above the line means "high risk," which means no bail. Is this progress?
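To make the idea concrete, here is a minimal sketch of how such a tool could be built: a logistic regression trained on past cases, with predicted probabilities cut at a threshold. Every variable name and number below is invented for illustration; this is not the model any county actually uses.

library(dplyr)

# Hypothetical training data: past defendants and whether they reoffended
set.seed(25)
cases <- tibble(
  prior_arrests = rpois(500, lambda = 2),
  age           = runif(500, min = 18, max = 70),
  reoffended    = rbinom(500, 1, prob = plogis(-1.5 + 0.4 * prior_arrests))
)

# Fit a logistic regression risk model
fit <- glm(reoffended ~ prior_arrests + age, data = cases, family = binomial)

# Score each case and threshold it: "above the line" means no bail
cases <- cases |>
  mutate(
    risk_score = predict(fit, type = "response"),
    decision   = if_else(risk_score > 0.5, "no bail", "bail")
  )

count(cases, decision)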
What happens when we try “predictive policing”?
2016 ProPublica article on an algorithm used for rating a defendant's risk of future crime:
In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.
The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
White defendants were mislabeled as low risk more often than black defendants.
Source: ProPublica
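The kind of audit behind these findings is easy to sketch: tabulate predictions against outcomes within each group and compare error rates. The counts below are made up for illustration (the real analysis used ProPublica's COMPAS data), but they reproduce the qualitative pattern described above.

library(dplyr)

# Hypothetical counts of (group, prediction, outcome) combinations
audit <- tibble(
  group          = rep(c("Black", "White"), each = 4),
  predicted_high = rep(c(TRUE, TRUE, FALSE, FALSE), times = 2),
  reoffended     = rep(c(TRUE, FALSE, TRUE, FALSE), times = 2),
  n              = c(300, 200, 100, 400, 200, 100, 150, 550)
)

audit |>
  group_by(group) |>
  summarize(
    # False positive rate: flagged high risk among those who did NOT reoffend
    fpr = sum(n[predicted_high & !reoffended]) / sum(n[!reoffended]),
    # False negative rate: labeled low risk among those who DID reoffend
    fnr = sum(n[!predicted_high & reoffended]) / sum(n[reoffended])
  )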
Notice anything?
What is common among the defendants who were assigned a high/low risk score for reoffending?
“But race wasn’t in my model”
How can an algorithm that doesn’t use race as input data be racist?
Predicting ethnicity
Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records (Imai and Khanna, 2016)
Open-source package that makes probabilistic estimates of individual-level race/ethnicity given voter file information.
wru package
Was the publication of this model ethical? Does the open-source nature of the code affect your answer? Is it ethical to use this software? Does your answer change depending on the intended use?
library(wru)    # predicts race/ethnicity from surname (and, optionally, location)
library(dplyr)  # select(), slice()

# voters is an example voter file included with the wru package
predict_race(voter.file = voters, surname.only = TRUE) |>
  select(surname, pred.whi, pred.bla, pred.his, pred.asi, pred.oth) |>
  slice(1:5)
surname pred.whi pred.bla pred.his pred.asi pred.oth
1 Khanna 0.045110474 0.003067623 0.0068522723 0.86041191 0.084557725
2 Imai 0.052645440 0.001334812 0.0558160072 0.71937658 0.170827160
3 Rivera 0.043285692 0.008204605 0.9136195794 0.02431688 0.010573240
4 Fifield 0.895405704 0.001911388 0.0337464844 0.01107932 0.057857101
5 Zhou 0.006572555 0.001298962 0.0005388581 0.98236559 0.009224032
California Prop 25 did not pass
The cash bail system was retained:
| Choice | Votes     | Percent |
|--------|-----------|---------|
| Yes    | 7,232,380 | 43.59%  |
| No     | 9,358,226 | 56.41%  |
- Reasonable people can debate whether this outcome is good or bad.
- Every Californian was invited to decide whether statistics and data science should be deployed to make decisions with major social consequences. They opted out.
- This vote was held in the pre-ChatGPT era. What would the outcome be today? Is the case for YES stronger or weaker?
Another algorithmic decision…
Armies of stats PhDs go to work on these models. They (generally) have no training in the ethics of what they’re doing.
A success story?
Data + model to predict the timing of the menstrual cycle:
A perfect microcosm of the themes of our course…
…but what if you learned they were selling your data?
Data privacy
“Your” data
Every time we use apps, websites, and devices, our data is being collected and used or sold to others.
More importantly, law enforcement agencies, financial institutions, and governments make decisions based on these data that directly affect people's lives.
Privacy of your data
What pieces of data have you left on the internet today?
Think through everything you've logged into, clicked on, or checked in to, either actively or automatically, that might be tracking you.
Do you know where that data is stored? Who it can be accessed by? Whether it’s shared with others?
What does Google think/know about you?
Have you ever thought about why you’re seeing an ad on Google? Try to figure out if you have ad personalization on and how your ads are personalized.
Your browsing history
Which of the following are you OK with your browsing history being used for?
- Serving you targeted ads
- Scoring you as a candidate for a job
- Predicting your race/ethnicity for voting purposes
Who else gets to use your data?
Suppose you create a profile on a social media site and share your personal information on your profile. Who else gets to use that data?
- Companies the social media company has a connection to?
- Companies the social media company sells your data to?
- Researchers?
AOL search data leak
2006: AOL released a file with millions of “anonymous” search queries from users over 3 months; data was intended for research
New York Times used search queries to identify users
- User #4417749:
  - "numb fingers"
  - "60 single men"
  - "dog that urinates on everything"
  - "landscapers in Lilburn, Ga"
AOL search data leak
Michael Barbaro and Tom Zeller Jr. A Face Is Exposed for AOL Searcher No. 4417749. New York Times. 9 August 2006.
OkCupid data breach
In 2016, researchers published data on 70,000 OkCupid users, including usernames, political leanings, drug usage, and intimate sexual details
Researchers didn't release the real names and pictures of OkCupid users, but their identities could easily be uncovered from the details provided, e.g. usernames
OkCupid data breach
"Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form."
Researchers Emil Kirkegaard and Julius Daugbjerg Bjerrekær
Data privacy
In analysis of data that individuals willingly shared publicly on a given platform (e.g. social media), how do you make sure you don’t violate reasonable expectations of privacy?
Some good news?
Faster, more accurate cancer screening?
Augmenting doctors’ diagnostic capacity so that they make fewer mistakes, treat more people, and focus on other aspects of care:
The 2024 Nobel Prize in Chemistry
AlphaFold2: “predicting 3D structures [of proteins] (\(y\)) directly from the primary amino acid sequence (\(x\)).”
“researchers can now better understand antibiotic resistance and create images of enzymes that can decompose plastic.”
Parting thoughts
At some point during your data science learning journey you will learn tools that can be used unethically
You might also be tempted to use your knowledge in a way that is ethically questionable either because of business goals or for the pursuit of further knowledge (or because your boss told you to do so)
How do you train yourself to make the right decisions (or reduce the likelihood of accidentally making the wrong decisions) at those points?
Further reading
How Charts Lie
How Charts Lie: Getting Smarter about Visual Information
by Alberto Cairo
Calling Bullshit
Calling Bullshit: The Art of Skepticism in a Data-Driven World
by Carl Bergstrom and Jevin West
Invisible Women
Invisible Women: Data Bias in a World Designed for Men
by Caroline Criado Perez
Machine Bias
by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner
Ethics and Data Science
by Mike Loukides, Hilary Mason, DJ Patil
(Free Kindle download)
Weapons of Math Destruction
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy
by Cathy O’Neil
Algorithms of Oppression
Algorithms of Oppression: How Search Engines Reinforce Racism
by Safiya Umoja Noble
And more recently…
How AI discriminates and what that means for your Google habit
A conversation with UCLA internet studies scholar Safiya Noble
by Julia Busiek