More Data Transformation

Lecture 5

Published: May 19, 2025

Announcements/Reminders

  • Lab is due TONIGHT at 11:59PM

  • Come to office hours and/or post on Ed for help!

  • Office hours are tonight from 5:30-7:30PM in Old Chem 203.

  • AEs need to be pushed by the END OF CLASS (10:45AM)!!

  • Make sure you are in the right repository.

Lab: Narrative

  • Boxplots, histograms, density plots:

    • Center: Give an idea of a ‘typical value’ (median, most common range, etc.);

    • Spread: Does the data have a lot of variability? Or just a little? (IQR, outliers)

    • Shape: Is there skew? Which direction? Is it unimodal or multimodal?

Lab: Narrative

  • Scatter Plots

    • Is there a relationship between the two variables? Is it positive or negative?

    • If there is a relationship, does it seem to be strong or weak?

    • Is the relationship linear or nonlinear?

Lab: Advice

  • Be Specific in your narrative. Don’t just say “the spread is small and the center is low”. What does that mean??? Give numbers, units (if available), etc.

  • Be Specific in your plot labels.

  • Answer all parts of the question!! Statements like “Compare…” mean to do so in the written answer!

  • Written answers should be text outside of code chunks, not comments

  • Make sure you can render to PDF early!!!

Outline

  • Last Time: Started learning about data transformation!

  • Today:

    • Review from last time + finish AE04

    • More about the pipe

    • Transformation + plotting

Quick Review: Row operations

  • slice(): chooses rows based on location
  • filter(): chooses rows based on column values
  • arrange(): changes the order of the rows
  • sample_n(): take a random subset of the rows
X1 X2 X3
1 a yes
3 b no
5 a yes
7 b yes
9 a yes
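
As a quick refresher, the four row operations can be tried on the example table above. This is a minimal sketch: the table is typed in by hand as a tibble, and the names `dat`, `first_two`, `only_a`, `descending`, and `random_3` are made up for illustration.

```r
library(tidyverse)

# The example table from the slide, typed in as a tibble
dat <- tibble(
  X1 = c(1, 3, 5, 7, 9),
  X2 = c("a", "b", "a", "b", "a"),
  X3 = c("yes", "no", "yes", "yes", "yes")
)

first_two  <- dat |> slice(1:2)           # rows 1 and 2, chosen by position
only_a     <- dat |> filter(X2 == "a")    # rows where X2 equals "a"
descending <- dat |> arrange(desc(X1))    # reorder rows from largest X1 to smallest
random_3   <- dat |> sample_n(3)          # 3 rows picked at random (differs between runs)
```

Note that `slice()` and `sample_n()` ignore the values in the columns, while `filter()` and `arrange()` depend on them. Newer versions of dplyr also offer `slice_sample(n = 3)` as a replacement for `sample_n(3)`.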

Quick Review: Column operations

  • select(): changes whether or not a column is included
  • rename(): changes the name of columns
  • mutate(): changes the values of columns and creates new columns
X1 X2 X3
1 a yes
3 b no
5 a yes
7 b yes
9 a yes
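
The column operations can be sketched the same way on the example table. The tibble is typed in by hand, and the new names (`dat`, `group`, `X1_doubled`, and the result objects) are assumptions chosen only for this illustration.

```r
library(tidyverse)

# The example table from the slide, typed in as a tibble
dat <- tibble(
  X1 = c(1, 3, 5, 7, 9),
  X2 = c("a", "b", "a", "b", "a"),
  X3 = c("yes", "no", "yes", "yes", "yes")
)

two_cols <- dat |> select(X1, X3)               # keep only X1 and X3
renamed  <- dat |> rename(group = X2)           # syntax is new_name = old_name
doubled  <- dat |> mutate(X1_doubled = X1 * 2)  # add a new column computed from X1
```

None of these change the number of rows; they only change which columns exist, what they are called, or what they contain.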

Quick Review: Groups of rows

  • summarize(): collapses a group into a single row
  • count(): count unique values of one or more variables
  • group_by(): perform calculations separately for each value of a variable
X1 X2 X3
1 a yes
3 b no
5 a yes
7 b yes
9 a yes
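
A small sketch of the grouped operations on the example table, again typed in by hand. The object names and the column name `mean_X1` are illustrative assumptions.

```r
library(tidyverse)

# The example table from the slide, typed in as a tibble
dat <- tibble(
  X1 = c(1, 3, 5, 7, 9),
  X2 = c("a", "b", "a", "b", "a"),
  X3 = c("yes", "no", "yes", "yes", "yes")
)

# count(): one row per distinct value of X2, with the number of rows in column n
counts <- dat |> count(X2)

# group_by() + summarize(): one row per value of X3, holding that group's mean of X1
group_means <- dat |>
  group_by(X3) |>
  summarize(mean_X1 = mean(X1))
```

Here `counts` has 3 rows for "a" and 2 for "b", and `group_means` has one row for "no" (mean 3) and one for "yes" (mean 5.5).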

AE-04

Recap: Group by, summarize, mutate

What does group by do here?

bechdel |>
  group_by(binary) |>
  summarise(mean_roi = mean(roi, na.rm = TRUE), 
            mean_budget = mean(budget_2013)) 
# A tibble: 2 × 3
  binary mean_roi mean_budget
  <chr>     <dbl>       <dbl>
1 FAIL       8.36   65877024.
2 PASS       7.99   46913086.

Recap: Group by, summarize, mutate

What changes when group_by() is commented out?

bechdel |>
  #group_by(binary) |>
  summarise(mean_roi = mean(roi, na.rm = TRUE), 
            mean_budget = mean(budget_2013)) 
# A tibble: 1 × 2
  mean_roi mean_budget
     <dbl>       <dbl>
1     8.19   57035015.

Recap: Group by, summarize, mutate

What if I change summarize to mutate?

bechdel |>
  group_by(binary) |>
  mutate(mean_roi = mean(roi, na.rm = TRUE),
         mean_budget = mean(budget_2013))
# A tibble: 1,615 × 9
# Groups:   binary [2]
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows
# ℹ 2 more variables: mean_roi <dbl>, mean_budget <dbl>

Recap: Group by, summarize, mutate

What if I change summarize to mutate?

bechdel |>
  group_by(binary) |>
  mutate(mean_roi = mean(roi, na.rm = TRUE),
         mean_budget = mean(budget_2013)) |>
  select(title, binary, mean_roi, mean_budget)
# A tibble: 1,615 × 4
# Groups:   binary [2]
   title                  binary mean_roi mean_budget
   <chr>                  <chr>     <dbl>       <dbl>
 1 21 & Over              FAIL       8.36   65877024.
 2 Dredd 3D               PASS       7.99   46913086.
 3 12 Years a Slave       FAIL       8.36   65877024.
 4 2 Guns                 FAIL       8.36   65877024.
 5 42                     FAIL       8.36   65877024.
 6 47 Ronin               FAIL       8.36   65877024.
 7 A Good Day to Die Hard FAIL       8.36   65877024.
 8 About Time             PASS       7.99   46913086.
 9 Admission              PASS       7.99   46913086.
10 After Earth            FAIL       8.36   65877024.
# ℹ 1,605 more rows
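
One reason to use mutate() after group_by() is to compare each row with its own group. The sketch below uses a small hand-typed tibble rather than the bechdel data; `dat`, `collapsed`, `with_diff`, and `diff_from_group_mean` are hypothetical names.

```r
library(tidyverse)

dat <- tibble(
  X1 = c(1, 3, 5, 7, 9),
  X3 = c("yes", "no", "yes", "yes", "yes")
)

# summarize(): collapses to one row per group
collapsed <- dat |>
  group_by(X3) |>
  summarize(mean_X1 = mean(X1))

# mutate(): keeps all 5 rows and repeats the group mean on every row of the group,
# so each value can be compared with the mean of its own group
with_diff <- dat |>
  group_by(X3) |>
  mutate(
    mean_X1 = mean(X1),
    diff_from_group_mean = X1 - mean_X1
  ) |>
  ungroup()
```

`collapsed` has 2 rows while `with_diff` still has 5, which mirrors the 2-row and 1,615-row results above.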

Recap: Group by, summarize, mutate

You can group by more than one variable!

bechdel |>
  group_by(binary, year) |>
  summarise(mean_roi = mean(roi, na.rm = TRUE), 
            mean_budget = mean(budget_2013)) 
`summarise()` has grouped output by 'binary'. You can override using
the `.groups` argument.
# A tibble: 48 × 4
# Groups:   binary [2]
   binary  year mean_roi mean_budget
   <chr>  <dbl>    <dbl>       <dbl>
 1 FAIL    1990    13.1    73101991.
 2 FAIL    1991     6.71   74299353 
 3 FAIL    1992    46.3    38883983.
 4 FAIL    1993     6.36   49099374.
 5 FAIL    1994    23.7    57812022.
 6 FAIL    1995     5.69   68251510.
 7 FAIL    1996     3.82   68600475.
 8 FAIL    1997    16.5    73257554.
 9 FAIL    1998     6.95   53083540.
10 FAIL    1999     5.12   72804781.
# ℹ 38 more rows

Recap: Group by, summarize, mutate

You can remove the grouping with ungroup()

bechdel |>
  group_by(binary, year) |>
  summarise(mean_roi = mean(roi, na.rm = TRUE), 
            mean_budget = mean(budget_2013)) |>
  ungroup()
`summarise()` has grouped output by 'binary'. You can override using
the `.groups` argument.
# A tibble: 48 × 4
   binary  year mean_roi mean_budget
   <chr>  <dbl>    <dbl>       <dbl>
 1 FAIL    1990    13.1    73101991.
 2 FAIL    1991     6.71   74299353 
 3 FAIL    1992    46.3    38883983.
 4 FAIL    1993     6.36   49099374.
 5 FAIL    1994    23.7    57812022.
 6 FAIL    1995     5.69   68251510.
 7 FAIL    1996     3.82   68600475.
 8 FAIL    1997    16.5    73257554.
 9 FAIL    1998     6.95   53083540.
10 FAIL    1999     5.12   72804781.
# ℹ 38 more rows
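
The message above mentions the `.groups` argument. As a sketch of what it does, the example below uses a small hand-typed tibble (not the bechdel data); `dat`, `kept`, and `dropped` are hypothetical names.

```r
library(tidyverse)

dat <- tibble(
  X1 = c(1, 3, 5, 7, 9),
  X2 = c("a", "b", "a", "b", "a"),
  X3 = c("yes", "no", "yes", "yes", "yes")
)

# Default behavior: the last grouping variable (X3) is dropped, the result
# stays grouped by X2, and summarize() prints the message seen above
kept <- dat |>
  group_by(X2, X3) |>
  summarize(mean_X1 = mean(X1))

# .groups = "drop" returns an ungrouped result and silences the message,
# so a separate ungroup() step is not needed
dropped <- dat |>
  group_by(X2, X3) |>
  summarize(mean_X1 = mean(X1), .groups = "drop")

group_vars(kept)     # "X2"
group_vars(dropped)  # character(0)
```

Both results contain the same rows and values; they differ only in whether later steps will still be computed separately for each value of X2.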

More about the pipe

  • The pipe operator (|>) passes what comes before it to the function that comes after it, as that function's first argument.
sum(1, 2)
[1] 3


1 |> 
  sum(2)
[1] 3
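
The same rule applies when several steps are chained: each pipe hands the result so far to the next function as its first argument. The sketch below uses only base R (the native pipe needs R 4.1 or later); the object names are made up for illustration.

```r
values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# The piped version is read top to bottom: take values, then sqrt(), then sum()
piped <- values |>
  sqrt() |>
  sum()

nested  # 6
piped   # 6

# Extra arguments go after the piped-in first argument:
# this is the same as mean(c(1, NA, 3), na.rm = TRUE)
mean_without_na <- c(1, NA, 3) |>
  mean(na.rm = TRUE)
mean_without_na  # 2
```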

Pipe + ggplot() !!

ggplot(bechdel, aes(x = budget_2013)) +
  geom_boxplot()

Pipe + ggplot() !!

bechdel |>
  ggplot(aes(x = budget_2013)) +
  geom_boxplot()

Why is this useful?

  • We can do data transformation immediately followed by a plot!

  • We usually pipe the data into ggplot(), even when we are only plotting.

Plot + data transform

bechdel
# A tibble: 1,615 × 7
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

Plot + data transform

bechdel |>
  mutate(budget_in_millions = budget_2013/1000000)
# A tibble: 1,615 × 8
   title          year gross_2013 budget_2013    roi binary clean_test
   <chr>         <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over      2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D       2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a S…  2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns         2013  208105475    61000000  3.41  FAIL   notalk    
 5 42             2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin       2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day t…  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time     2013  102648667    12000000  8.55  PASS   ok        
 9 Admission      2013   36014634    13000000  2.77  PASS   ok        
10 After Earth    2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows
# ℹ 1 more variable: budget_in_millions <dbl>

Plot + data transform

bechdel |>
  mutate(budget_in_millions = budget_2013/1000000) |>
  ggplot(aes(x = budget_in_millions))
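
The pipeline above maps budget_in_millions to the x-axis but does not add a geom yet, so on its own it draws an empty panel. A possible completed version is sketched below. To keep it self-contained it uses a stand-in tibble, `movies`, holding a few rows copied from the bechdel output shown earlier; the name `movies`, the choice of geom, and the axis label are assumptions for illustration.

```r
library(tidyverse)

# Stand-in for bechdel: a few rows copied from the output shown earlier
movies <- tibble(
  title = c("21 & Over", "Dredd 3D", "2 Guns", "47 Ronin", "About Time"),
  budget_2013 = c(13000000, 45658735, 61000000, 225000000, 12000000)
)

p <- movies |>
  mutate(budget_in_millions = budget_2013 / 1000000) |>  # transform first
  ggplot(aes(x = budget_in_millions)) +                  # then hand the result to ggplot()
  geom_boxplot() +                                       # the geom draws the data
  labs(x = "Budget (in millions)")

p
```

Note the switch from `|>` to `+` once ggplot() is called: the pipe connects data steps, while `+` adds layers to a plot.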

Exploratory Data Analysis

What is exploratory data analysis (EDA)??

  • Basically everything we have done so far

  • Making plots and computing summary statistics (proportions, means, IQR, etc.) to help explore the data
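
As a small illustration of the summary statistics mentioned above, the sketch below computes proportions, means, and IQRs. It uses a stand-in tibble, `movies`, containing only the ten rows (with rounded roi values) printed in the bechdel output earlier, so the numbers will not match the full data set; all object names are hypothetical.

```r
library(tidyverse)

# Stand-in data: the ten rows displayed earlier, not the full bechdel data
movies <- tibble(
  roi = c(5.22, 1.21, 10.6, 3.41, 4.75, 0.819, 4.04, 8.55, 2.77, 2.35),
  binary = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL",
             "FAIL", "FAIL", "PASS", "PASS", "FAIL")
)

# Proportion of movies in each category
proportions <- movies |>
  count(binary) |>
  mutate(prop = n / sum(n))

# Center (mean, median) and spread (IQR) of roi within each category
roi_summary <- movies |>
  group_by(binary) |>
  summarize(
    mean_roi = mean(roi),
    median_roi = median(roi),
    iqr_roi = IQR(roi)
  )
```

In these ten rows, 70% of the movies are labelled FAIL and 30% PASS.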

AE 05

Assignment

Let’s make a tiny data frame to use as an example:

library(tidyverse)
df <- tibble(x = c(1, 2, 3, 4, 5), y = c("a", "a", "b", "c", "c"))
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 a    
3     3 b    
4     4 c    
5     5 c    

Assignment

Do something and show me

df |>
  mutate(x = x * 2)
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     2 a    
2     4 a    
3     6 b    
4     8 c    
5    10 c    
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 a    
3     3 b    
4     4 c    
5     5 c    

Do something and save result

df <- df |>
  mutate(x = x * 2)
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     2 a    
2     4 a    
3     6 b    
4     8 c    
5    10 c    

Assignment

Do something, save result, overwriting original

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df <- df |>
  mutate(x = x * 2)
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     2 a    
2     4 a    
3     6 b    
4     8 c    
5    10 c    

Do something, save result, not overwriting original

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df_new <- df |>
  mutate(x = x * 2)
df_new
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     2 a    
2     4 a    
3     6 b    
4     8 c    
5    10 c    

Assignment

Do something, save result, overwriting original when you shouldn’t

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df <- df |>
  group_by(y) |>
  summarize(mean_x = mean(x))
df
# A tibble: 3 × 2
  y     mean_x
  <chr>  <dbl>
1 a        1.5
2 b        3  
3 c        4.5

Do something, save result under a new name, keeping the original

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df_summary <- df |>
  group_by(y) |>
  summarize(mean_x = mean(x))
df_summary
# A tibble: 3 × 2
  y     mean_x
  <chr>  <dbl>
1 a        1.5
2 b        3  
3 c        4.5
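
One practical benefit of keeping the original is that both objects remain available for later steps. As a sketch, the group means can be joined back onto the original rows; the name `df_with_means` is made up for this example.

```r
library(tidyverse)

df <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c("a", "a", "b", "c", "c")
)

# Saved under a new name, so df still holds all 5 original rows
df_summary <- df |>
  group_by(y) |>
  summarize(mean_x = mean(x))

# Both objects are still usable, e.g. to attach each group's mean to its rows
df_with_means <- df |>
  left_join(df_summary, by = "y")

nrow(df)          # 5
nrow(df_summary)  # 3
```

If `df` had been overwritten with the summary, the original `x` values would be gone and this join would not be possible without recreating the data.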

Assignment

Do something, save result, overwriting original data frame

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df <- df |>
  mutate(z = x + 2)
df
# A tibble: 5 × 3
      x y         z
  <dbl> <chr> <dbl>
1     1 a         3
2     2 a         4
3     3 b         5
4     4 c         6
5     5 c         7

Do something, save result, overwriting original column

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df <- df |>
  mutate(x = x + 2)
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     3 a    
2     4 a    
3     5 b    
4     6 c    
5     7 c
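
A practical difference between the two cases shows up when the same line is run more than once, which happens easily when re-running code chunks interactively. The sketch below runs each version twice; `df_add` and `df_over` are hypothetical names.

```r
library(tidyverse)

df <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c("a", "a", "b", "c", "c")
)

# Adding a new column: running the step twice gives the same z both times,
# because x itself never changes
df_add <- df
df_add <- df_add |> mutate(z = x + 2)
df_add <- df_add |> mutate(z = x + 2)
df_add$z   # 3 4 5 6 7

# Overwriting a column: each run adds 2 again, so the result depends on
# how many times the line has been executed
df_over <- df
df_over <- df_over |> mutate(x = x + 2)
df_over <- df_over |> mutate(x = x + 2)
df_over$x  # 5 6 7 8 9
```

If a column ends up with unexpected values after overwriting, re-creating the data frame from the start and running the code once, top to bottom, restores a known state.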