ggplot(bechdel, aes(x=budget_2013,y=gross_2013,color=binary,size=roi)) +
geom_point(alpha = 0.5) + facet_wrap(~clean_test) Grammar of Data Transformation
Lecture 4
Announcements/Reminders
Lab is due tomorrow night!!
Come to office hours and/or post on Ed for help!
Lab: Code Style (Do these things!!!)
Which of these pieces of code is easier to read?
ggplot(bechdel, aes(x = budget_2013, y = gross_2013,
color = binary, size = roi)) +
geom_point(alpha = 0.5) +
facet_wrap(~clean_test) Lab: Code Style (Do these things!!!)
Code should follow the tidyverse style:
there should be spaces before and line breaks after each
+when building aggplotthere should also be spaces before and line breaks after each
|>in a data transformation pipelinecode should be properly indented
spaces around
=signs and spaces after commas
All code should be visible in the PDF output (should not run off the page)! Use line breaks to prevent this.
Survey Responses
-
Concerns: first time coding + the quick pace of the course
-
Best way to succeed? Come to class/labs and ask for help!!!
-
Grading:
- Is the midterm take-home individual or collaborative?
Individual!!
- Is the class curved or absolute?
We will only curve up – your absolute grade is the minimum you can get
- Is the midterm take-home individual or collaborative?
- What is statistics/this class???
Outline
Last Time: Learned about more types of plots
Today: Grammar of ‘data wrangling’
AE-02: Any questions?
Data Transformation
dplyr
Primary package in the tidyverse for data wrangling and transformation
What is data transformation?
Making new variables (example: garage in AE03)
Renaming/reordering your data frame
Summarizing information about your variables
And more!
The pipe
The pipe,
|>, is an operator (a tool) for passing information from one process to another.Use
|>to pass the output of the previous line of code as the first input of the next line of code.When reading code “in English”, say “and then” whenever you see a pipe.
Row Operations
Row Operations
-
slice(): chooses rows based on location
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
Row Operations
-
filter():chooses rows based on column values
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
Row Operations
-
arrange(): changes the order of the rows
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
Row Operations
-
sample_n(): take a random subset of the rows
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
slice()
Display the first five rows of bechdel:
# A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
slice()
Display the first five rows of bechdel:
# A tibble: 5 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a Sla… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
filter()
Keep only the rows ofbechdel that pass the test:
bechdel# A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
filter()
Keep only the rows ofbechdel that pass the test:
# A tibble: 753 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
2 About Time 2013 102648667 12000000 8.55 PASS ok
3 Admission 2013 36014634 13000000 2.77 PASS ok
4 American Hust… 2013 397915817 40000000 9.95 PASS ok
5 August: Osage… 2013 87609748 25000000 3.50 PASS ok
6 Beautiful Cre… 2013 75392809 50000000 1.51 PASS ok
7 Blue Jasmine 2013 101793664 18000000 5.66 PASS ok
8 Carrie 2013 120268278 30000000 4.01 PASS ok
9 Despicable Me… 2013 1338831390 76000000 17.6 PASS ok
10 Elysium 2013 379242208 120000000 3.16 PASS ok
# ℹ 743 more rows
filter()
Keep only the movies from before 2000
# A tibble: 337 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 10 Things I H… 1999 137877156 18180006 7.58 PASS ok
2 8MM 1999 185774868 55938481 3.32 FAIL notalk
3 American Beau… 1999 680094591 20976930 32.4 PASS ok
4 American Pie 1999 470616170 16781544 28.0 FAIL men
5 Analyze This 1999 396843410 41953861 9.46 FAIL notalk
6 Anna and the … 1999 109782424 104884652 1.05 FAIL men
7 Anywhere But … 1999 52172744 32164627 1.62 FAIL dubious
8 Austin Powers… 1999 722127642 48946171 14.8 FAIL notalk
9 Being John Ma… 1999 77252870 18180006 4.25 PASS ok
10 Black and Whi… 1999 14659560 13984620 1.05 PASS ok
# ℹ 327 more rows
filter()
Often (but not always), looks like:
filter(variable [logical operator] value)or
Some logical operators
| operator | definition |
|---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
More logical operators
| operator | definition |
|---|---|
x & y |
is x AND y? |
x | y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? (only makes sense if x is TRUE or FALSE) |
filter()
Keep only the movies from before 2000 AND that pass the test
# A tibble: 147 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 10 Things I … 1999 137877156 18180006 7.58 PASS ok
2 American Bea… 1999 680094591 20976930 32.4 PASS ok
3 Being John M… 1999 77252870 18180006 4.25 PASS ok
4 Black and Wh… 1999 14659560 13984620 1.05 PASS ok
5 Boys Don't C… 1999 45144602 2796924 16.1 PASS ok
6 But I'm a Ch… 1999 6761310 1678154 4.03 PASS ok
7 Carrie 2: Th… 1999 49674054 29367703 1.69 PASS ok
8 Cruel Intent… 1999 159471926 15383082 10.4 PASS ok
9 Dick 1999 17555926 18180006 0.966 PASS ok
10 Drop Dead Go… 1999 29567426 13984620 2.11 PASS ok
# ℹ 137 more rows
sample_n()
Display five random rows ofbechdel:
bechdel # A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
sample_n()
Display five random rows ofbechdel:
# A tibble: 5 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Crash 2004 193003880 9006109 21.4 PASS ok
2 Wrath of the T… 2012 391285385 152195785 2.57 FAIL notalk
3 The Visitor 2007 32135415 4494174 7.15 FAIL dubious
4 Take the Lead 2006 115732846 34673890 3.34 PASS ok
5 21 grams 2003 96116670 25321739 3.80 FAIL dubious
Column Operations
Column operations
-
select(): changes whether or not a column is included.
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
Column operations
-
rename(): changes the name of columns.
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
Column operations
-
mutate(): changes the values of columns and creates new columns.
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
select()
Keep only the title and test status.
bechdel# A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
select()
Keep only the title and test status.
mutate()
Create a new variable for the budget in millions
bechdel# A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
mutate()
Keep only the title and test status.
# A tibble: 1,615 × 8
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
# ℹ 1 more variable: budget_million <dbl>
mutate()
Generally, looks like:
mutate(new_variable_name = function(existing_variable))rename()
Rename clean_test to test_result
bechdel# A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
rename()
Rename clean_test to test_result
# A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary test_result
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a … 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day … 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
rename()
Generally, looks like:
select(new_variable_name = old_variable_name)Groups of rows
Groups of rows
-
count(): count unique values of one or more variables.
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
Groups of rows
-
group_by(): group separately for each value of a variable
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
Groups of rows
-
summarize(): collapses a group into a single row.
| Variable1 | Variable2 | Variable3 |
|---|---|---|
| 1 | A | Yes |
| 2 | B | Yes |
| 5 | A | No |
| 2 | B | No |
| 1 | A | No |
| 4 | B | No |
count()
Count how many movies pass or fail the Bechdel test.
bechdel# A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
count()
Count how many movies pass or fail the Bechdel test.
summarize()
Compute average budget
bechdel # A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
summarize()
Compute average budget
summarize()
Generally, looks like:
summarize(resut_variable_name = function(existing_variable))group_by()
Group by movies passing or failing the test
bechdel# A tibble: 1,615 × 7
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
group_by()
Group by movies passing or failing the test
bechdel |>
group_by(binary)# A tibble: 1,615 × 7
# Groups: binary [2]
title year gross_2013 budget_2013 roi binary clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 21 & Over 2013 67878146 13000000 5.22 FAIL notalk
2 Dredd 3D 2012 55078343 45658735 1.21 PASS ok
3 12 Years a S… 2013 211714070 20000000 10.6 FAIL notalk
4 2 Guns 2013 208105475 61000000 3.41 FAIL notalk
5 42 2013 190040426 40000000 4.75 FAIL men
6 47 Ronin 2013 184166317 225000000 0.819 FAIL men
7 A Good Day t… 2013 371598396 92000000 4.04 FAIL notalk
8 About Time 2013 102648667 12000000 8.55 PASS ok
9 Admission 2013 36014634 13000000 2.77 PASS ok
10 After Earth 2013 304895295 130000000 2.35 FAIL notalk
# ℹ 1,605 more rows
group_by() + summarize()
Group by movies passing/failing and compute within-group average budget
bechdel |>
group_by(binary) |>
summarize(mean_budget = mean(budget_2013))# A tibble: 2 × 2
binary mean_budget
<chr> <dbl>
1 FAIL 65877024.
2 PASS 46913086.
