Midterm Review

Lecture 11

Published

May 30, 2025

Announcements

Office hours today from 1:00-3:00.
Be in class and lab on Monday!!!!

Review

group_by/summarize/mutate
joins
pivots

Computing summary stats

With summarize: condenses information into smaller result table
With mutate: adds a new column to the data
With groupby: results for each provided group

Sample Data

df <- tibble(
    C1 = c("A", "A", "A", "B", "B"),
    C2 = c("X", "Y", "X", "X", "Y"),
    C3 = 1:5
)

df

# A tibble: 5 × 3
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

Mutate vs. Summarize

Compute the mean of C3.

df

# A tibble: 5 × 3
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

df |>
  mutate(mean_3 = mean(C3))

# A tibble: 5 × 4
  C1    C2       C3 mean_3
  <chr> <chr> <int>  <dbl>
1 A     X         1      3
2 A     Y         2      3
3 A     X         3      3
4 B     X         4      3
5 B     Y         5      3

Mutate vs. Summarize

Compute the mean of C3.

df

# A tibble: 5 × 3
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

df |>
  mutate(mean_3 = mean(C3))

# A tibble: 5 × 4
  C1    C2       C3 mean_3
  <chr> <chr> <int>  <dbl>
1 A     X         1      3
2 A     Y         2      3
3 A     X         3      3
4 B     X         4      3
5 B     Y         5      3

df |>
  summarise(mean_3 = mean(C3))

# A tibble: 1 × 1
  mean_3
   <dbl>
1      3

Grouping

Group by C1.

df

# A tibble: 5 × 3
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

df |>
  group_by(C1)

# A tibble: 5 × 3
# Groups:   C1 [2]
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

Grouping

Group by C1 and C2.

df

# A tibble: 5 × 3
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

df |>
  group_by(C1, C2)

# A tibble: 5 × 3
# Groups:   C1, C2 [4]
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

Mutate vs. Summarize (with grouping)

Compute the mean of C3 by value of C1.

df

# A tibble: 5 × 3
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

df |>
  group_by(C1) |>
  mutate(mean_3 = mean(C3))

# A tibble: 5 × 4
# Groups:   C1 [2]
  C1    C2       C3 mean_3
  <chr> <chr> <int>  <dbl>
1 A     X         1    2  
2 A     Y         2    2  
3 A     X         3    2  
4 B     X         4    4.5
5 B     Y         5    4.5

Mutate vs. Summarize (with grouping)

Compute the mean of C3 by value of C1.

df

# A tibble: 5 × 3
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

df |>
  group_by(C1) |>
  mutate(mean_3 = mean(C3))

# A tibble: 5 × 4
# Groups:   C1 [2]
  C1    C2       C3 mean_3
  <chr> <chr> <int>  <dbl>
1 A     X         1    2  
2 A     Y         2    2  
3 A     X         3    2  
4 B     X         4    4.5
5 B     Y         5    4.5

df |>
  group_by(C1) |>
  summarise(mean_3 = mean(C3))

# A tibble: 2 × 2
  C1    mean_3
  <chr>  <dbl>
1 A        2  
2 B        4.5

Mutate vs. Summarize (with grouping)

Compute the mean of C3 by value of C1 and C2.

df

# A tibble: 5 × 3
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

df |>
  group_by(C1, C2) |>
  mutate(mean_3 = mean(C3))

# A tibble: 5 × 4
# Groups:   C1, C2 [4]
  C1    C2       C3 mean_3
  <chr> <chr> <int>  <dbl>
1 A     X         1      2
2 A     Y         2      2
3 A     X         3      2
4 B     X         4      4
5 B     Y         5      5

Mutate vs. Summarize (with grouping)

Compute the mean of C3 by value of C1 and C2.

df

# A tibble: 5 × 3
  C1    C2       C3
  <chr> <chr> <int>
1 A     X         1
2 A     Y         2
3 A     X         3
4 B     X         4
5 B     Y         5

df |>
  group_by(C1, C2) |>
  mutate(mean_3 = mean(C3))

# A tibble: 5 × 4
# Groups:   C1, C2 [4]
  C1    C2       C3 mean_3
  <chr> <chr> <int>  <dbl>
1 A     X         1      2
2 A     Y         2      2
3 A     X         3      2
4 B     X         4      4
5 B     Y         5      5

df |>
  group_by(C1, C2) |>
  summarise(mean_3 = mean(C3))

`summarise()` has grouped output by 'C1'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   C1 [2]
  C1    C2    mean_3
  <chr> <chr>  <dbl>
1 A     X          2
2 A     Y          2
3 B     X          4
4 B     Y          5

More about summarize and grouping

df |>
  group_by(C1, C2) |>
  summarise(mean_3 = mean(C3))

`summarise()` has grouped output by 'C1'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   C1 [2]
  C1    C2    mean_3
  <chr> <chr>  <dbl>
1 A     X          2
2 A     Y          2
3 B     X          4
4 B     Y          5

df |>
  group_by(C1, C2) |>
  summarise(mean_3 = mean(C3), 
            .groups = "drop")

# A tibble: 4 × 3
  C1    C2    mean_3
  <chr> <chr>  <dbl>
1 A     X          2
2 A     Y          2
3 B     X          4
4 B     Y          5

More about summarize and grouping

df |>
  group_by(C1, C2) |>
  summarise(mean_3 = mean(C3)) |>
  summarise(mean_again = mean(mean_3))

`summarise()` has grouped output by 'C1'. You can override using the
`.groups` argument.

# A tibble: 2 × 2
  C1    mean_again
  <chr>      <dbl>
1 A            2  
2 B            4.5

df |>
  group_by(C1, C2) |>
  summarise(mean_3 = mean(C3), 
            .groups = "drop")|>
  summarise(mean_again = mean(mean_3))

# A tibble: 1 × 1
  mean_again
       <dbl>
1       3.25

Joins

Joining: Example Data

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

Joining: Left Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

left_join(df_X, df_Y)

id	X	Y
1	X1	Y1
2	X2	Y2
3	X3	NA

Joining: Left Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

left_join(df_X, df_Y)

id	X	Y
1	X1	Y1
2	X2	Y2
3	X3	NA

Joining: Right Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

right_join(df_X, df_Y)

id	X	Y
1	X1	Y1
2	X2	Y2
4	NA	Y4

Joining: Right Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

right_join(df_X, df_Y)

id	X	Y
1	X1	Y1
2	X2	Y2
4	NA	Y4

Joining: Full Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

full_join(df_X, df_Y)

id	X	Y
1	X1	Y1
2	X2	Y2
3	X3	NA
4	NA	Y4

Joining: Full Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

full_join(df_X, df_Y)

id	X	Y
1	X1	Y1
2	X2	Y2
3	X3	NA
4	NA	Y4

Joining: Inner Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

inner_join(df_X, df_Y)

id	X	Y
1	X1	Y1
2	X2	Y2

Joining: Inner Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

inner_join(df_X, df_Y)

id	X	Y
1	X1	Y1
2	X2	Y2

Joining: Semi Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

semi_join(df_X, df_Y)

id	X
1	X1
2	X2

Joining: Semi Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

semi_join(df_X, df_Y)

id	X
1	X1
2	X2

Joining: Anti Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

anti_join(df_X, df_Y)

id	X
3	X3

Joining: Anti Join

df_X

id	X
1	X1
2	X2
3	X3

df_Y

id	Y
1	Y1
2	Y2
4	Y4

anti_join(df_X, df_Y)

id	X
3	X3

Summary of Join Types

Try it!

df_contacts

person	phone
Marie	1
Katie	2
John	3
John	4

df_texts

phone	texts
1	100
2	700
4	400
5	50

Remember…

Remember: sometimes we need to specify which columns we are joining with.

Review: Pivot Longer

How do we go from this…

# A tibble: 4 × 6
  degree `2011` `2012` `2013` `2014` `2015`
  <fct>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 AB2        NA      1     NA     NA      4
2 AB          2      2      4      1      3
3 BS2         2      6      1     NA      5
4 BS          5      9      4     13     10

…to this?

# A tibble: 56 × 3
  degree  year     n
  <fct>  <dbl> <dbl>
1 AB2     2011    NA
2 AB2     2012     1
3 AB2     2013    NA
4 AB2     2014    NA
5 AB2     2015     4
# ℹ 51 more rows

Reverse It: Pivot Wider

How do we get back to this…

# A tibble: 4 × 6
  degree `2011` `2012` `2013` `2014` `2015`
  <fct>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 AB2        NA      1     NA     NA      4
2 AB          2      2      4      1      3
3 BS2         2      6      1     NA      5
4 BS          5      9      4     13     10

… from this?

# A tibble: 56 × 3
  degree  year     n
  <fct>  <dbl> <dbl>
1 AB2     2011    NA
2 AB2     2012     1
3 AB2     2013    NA
4 AB2     2014    NA
5 AB2     2015     4
# ℹ 51 more rows

AE

Kahoot! See website for questions.