Grammar of Data Transformation

Lecture 4

Published

May 19, 2025

Announcements/Reminders

  • Lab is due tomorrow night!!

  • Come to office hours and/or post on Ed for help!

Lab: Code Style (Do these things!!!)

Which of these pieces of code is easier to read?

ggplot(bechdel, aes(x=budget_2013,y=gross_2013,color=binary,size=roi)) +
geom_point(alpha = 0.5) + facet_wrap(~clean_test) 



ggplot(bechdel, aes(x = budget_2013, y = gross_2013,
                    color = binary, size = roi)) +
  geom_point(alpha = 0.5) + 
  facet_wrap(~clean_test) 

Lab: Code Style (Do these things!!!)

Code should follow the tidyverse style:

  • there should be spaces before and line breaks after each + when building a ggplot

  • there should also be spaces before and line breaks after each |> in a data transformation pipeline

  • code should be properly indented

  • spaces around = signs and spaces after commas

All code should be visible in the PDF output (should not run off the page)! Use line breaks to prevent this.

Survey Responses

  • Concerns: first time coding + the quick pace of the course
  • Best way to succeed? Come to class/labs and ask for help!!!
  • Grading:
    • Is the midterm take-home individual or collaborative?
      Individual!!
    • Is the class curved or absolute?
      We will only curve up – your absolute grade is the minimum you can get
  • What is statistics/this class???

Outline

  • Last Time: Learned about more types of plots

  • Today: Grammar of ‘data wrangling’

AE-02: Any questions?

Data Transformation

dplyr

Primary package in the tidyverse for data wrangling and transformation

What is data transformation?

  • Making new variables (example: garage in AE03)

  • Renaming/reordering your data frame

  • Summarizing information about your variables

  • And more!

The pipe

  • The pipe, |>, is an operator (a tool) for passing information from one process to another.

  • Use |> to pass the output of the previous line of code as the first input of the next line of code.

  • When reading code “in English”, say “and then” whenever you see a pipe.

Row Operations

Row Operations

  • slice(): chooses rows based on location


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

Row Operations

  • filter():chooses rows based on column values


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

Row Operations

  • arrange(): changes the order of the rows


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

Row Operations

  • sample_n(): take a random subset of the rows


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

slice()

Display the first five rows of bechdel:

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

slice()

Display the first five rows of bechdel:

bechdel |>
  slice(1:5)
# A tibble: 5 × 7
  title            year gross_2013 budget_2013   roi binary clean_test
  <chr>           <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
1 21 & Over        2013   67878146    13000000  5.22 FAIL   notalk    
2 Dredd 3D         2012   55078343    45658735  1.21 PASS   ok        
3 12 Years a Sla…  2013  211714070    20000000 10.6  FAIL   notalk    
4 2 Guns           2013  208105475    61000000  3.41 FAIL   notalk    
5 42               2013  190040426    40000000  4.75 FAIL   men       

filter()

Keep only the rows ofbechdel that pass the test:

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

filter()

Keep only the rows ofbechdel that pass the test:

bechdel |>
  filter(binary == "PASS")
# A tibble: 753 × 7
   title           year gross_2013 budget_2013   roi binary clean_test
   <chr>          <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 Dredd 3D        2012   55078343    45658735  1.21 PASS   ok        
 2 About Time      2013  102648667    12000000  8.55 PASS   ok        
 3 Admission       2013   36014634    13000000  2.77 PASS   ok        
 4 American Hust…  2013  397915817    40000000  9.95 PASS   ok        
 5 August: Osage…  2013   87609748    25000000  3.50 PASS   ok        
 6 Beautiful Cre…  2013   75392809    50000000  1.51 PASS   ok        
 7 Blue Jasmine    2013  101793664    18000000  5.66 PASS   ok        
 8 Carrie          2013  120268278    30000000  4.01 PASS   ok        
 9 Despicable Me…  2013 1338831390    76000000 17.6  PASS   ok        
10 Elysium         2013  379242208   120000000  3.16 PASS   ok        
# ℹ 743 more rows

filter()

Keep only the movies from before 2000

bechdel |>
  filter(year < 2000)
# A tibble: 337 × 7
   title           year gross_2013 budget_2013   roi binary clean_test
   <chr>          <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 10 Things I H…  1999  137877156    18180006  7.58 PASS   ok        
 2 8MM             1999  185774868    55938481  3.32 FAIL   notalk    
 3 American Beau…  1999  680094591    20976930 32.4  PASS   ok        
 4 American Pie    1999  470616170    16781544 28.0  FAIL   men       
 5 Analyze This    1999  396843410    41953861  9.46 FAIL   notalk    
 6 Anna and the …  1999  109782424   104884652  1.05 FAIL   men       
 7 Anywhere But …  1999   52172744    32164627  1.62 FAIL   dubious   
 8 Austin Powers…  1999  722127642    48946171 14.8  FAIL   notalk    
 9 Being John Ma…  1999   77252870    18180006  4.25 PASS   ok        
10 Black and Whi…  1999   14659560    13984620  1.05 PASS   ok        
# ℹ 327 more rows

filter()

Often (but not always), looks like:

filter(variable [logical operator] value)

or

filter(!is.na(variable))

Some logical operators

operator definition
< is less than?
<= is less than or equal to?
> is greater than?
>= is greater than or equal to?
== is exactly equal to?
!= is not equal to?

More logical operators

operator definition
x & y is x AND y?
x | y is x OR y?
is.na(x) is x NA?
!is.na(x) is x not NA?
x %in% y is x in y?
!(x %in% y) is x not in y?
!x is not x? (only makes sense if x is TRUE or FALSE)

filter()

Keep only the movies from before 2000 AND that pass the test

bechdel |>
  filter(year < 2000 & binary == "PASS")
# A tibble: 147 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 10 Things I …  1999  137877156    18180006  7.58  PASS   ok        
 2 American Bea…  1999  680094591    20976930 32.4   PASS   ok        
 3 Being John M…  1999   77252870    18180006  4.25  PASS   ok        
 4 Black and Wh…  1999   14659560    13984620  1.05  PASS   ok        
 5 Boys Don't C…  1999   45144602     2796924 16.1   PASS   ok        
 6 But I'm a Ch…  1999    6761310     1678154  4.03  PASS   ok        
 7 Carrie 2: Th…  1999   49674054    29367703  1.69  PASS   ok        
 8 Cruel Intent…  1999  159471926    15383082 10.4   PASS   ok        
 9 Dick           1999   17555926    18180006  0.966 PASS   ok        
10 Drop Dead Go…  1999   29567426    13984620  2.11  PASS   ok        
# ℹ 137 more rows

sample_n()

Display five random rows ofbechdel:

bechdel 
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

sample_n()

Display five random rows ofbechdel:

bechdel |>
  sample_n(5)
# A tibble: 5 × 7
  title            year gross_2013 budget_2013   roi binary clean_test
  <chr>           <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
1 Crash            2004  193003880     9006109 21.4  PASS   ok        
2 Wrath of the T…  2012  391285385   152195785  2.57 FAIL   notalk    
3 The Visitor      2007   32135415     4494174  7.15 FAIL   dubious   
4 Take the Lead    2006  115732846    34673890  3.34 PASS   ok        
5 21 grams         2003   96116670    25321739  3.80 FAIL   dubious   

Column Operations

Column operations

  • select(): changes whether or not a column is included.


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

Column operations

  • rename(): changes the name of columns.


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

Column operations

  • mutate(): changes the values of columns and creates new columns.


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

select()

Keep only the title and test status.

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

select()

Keep only the title and test status.

bechdel |>
  select(title, clean_test)
# A tibble: 1,615 × 2
   title                  clean_test
   <chr>                  <chr>     
 1 21 & Over              notalk    
 2 Dredd 3D               ok        
 3 12 Years a Slave       notalk    
 4 2 Guns                 notalk    
 5 42                     men       
 6 47 Ronin               men       
 7 A Good Day to Die Hard notalk    
 8 About Time             ok        
 9 Admission              ok        
10 After Earth            notalk    
# ℹ 1,605 more rows

mutate()

Create a new variable for the budget in millions

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

mutate()

Keep only the title and test status.

bechdel |>
  mutate(budget_million = budget_2013/1000000)
# A tibble: 1,615 × 8
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows
# ℹ 1 more variable: budget_million <dbl>

mutate()

Generally, looks like:

mutate(new_variable_name = function(existing_variable))

rename()

Rename clean_test to test_result

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

rename()

Rename clean_test to test_result

bechdel |>
  rename(test_result = clean_test)
# A tibble: 1,615 × 7
   title         year gross_2013 budget_2013    roi binary test_result
   <chr>        <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>      
 1 21 & Over     2013   67878146    13000000  5.22  FAIL   notalk     
 2 Dredd 3D      2012   55078343    45658735  1.21  PASS   ok         
 3 12 Years a …  2013  211714070    20000000 10.6   FAIL   notalk     
 4 2 Guns        2013  208105475    61000000  3.41  FAIL   notalk     
 5 42            2013  190040426    40000000  4.75  FAIL   men        
 6 47 Ronin      2013  184166317   225000000  0.819 FAIL   men        
 7 A Good Day …  2013  371598396    92000000  4.04  FAIL   notalk     
 8 About Time    2013  102648667    12000000  8.55  PASS   ok         
 9 Admission     2013   36014634    13000000  2.77  PASS   ok         
10 After Earth   2013  304895295   130000000  2.35  FAIL   notalk     
# ℹ 1,605 more rows

rename()

Generally, looks like:

select(new_variable_name = old_variable_name)

Groups of rows

Groups of rows

  • count(): count unique values of one or more variables.


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

Groups of rows

  • group_by(): group separately for each value of a variable


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

Groups of rows

  • summarize(): collapses a group into a single row.


Variable1 Variable2 Variable3
1 A Yes
2 B Yes
5 A No
2 B No
1 A No
4 B No

count()

Count how many movies pass or fail the Bechdel test.

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

count()

Count how many movies pass or fail the Bechdel test.

bechdel |>
  count(binary)
# A tibble: 2 × 2
  binary     n
  <chr>  <int>
1 FAIL     862
2 PASS     753

summarize()

Compute average budget

bechdel 
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

summarize()

Compute average budget

bechdel |>
  summarize(mean_budget = mean(budget_2013))
# A tibble: 1 × 1
  mean_budget
        <dbl>
1   57035015.

summarize()

Generally, looks like:

summarize(resut_variable_name = function(existing_variable))

group_by()

Group by movies passing or failing the test

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

group_by()

Group by movies passing or failing the test

bechdel |>
  group_by(binary)
# A tibble: 1,615 × 7
# Groups:   binary [2]
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

group_by() + summarize()

Group by movies passing/failing and compute within-group average budget

bechdel |>
  group_by(binary) |>
  summarize(mean_budget = mean(budget_2013))
# A tibble: 2 × 2
  binary mean_budget
  <chr>        <dbl>
1 FAIL     65877024.
2 PASS     46913086.

Where can I find all of this??!

AE 04