Midterm + More Practice!

Lecture 10

Published

May 29, 2025

Announcements

  • I will not hold office hours on Tuesday, June 3rd: this is when you have your take home!
  • Replacement midterm review office hours: June 2nd (3:30 - 5:30)
  • Be in class and lab on Monday!!!!
  • Tomorrow: Review game!

Midterm Exam 1

Worth 20% of your final grade; consists of two parts:

  • In-class: worth 70% of the Midterm 1 grade;

  • Take-home: worth 30% of the Midterm 1 grade.

Material

Everything we have done so far:

  • plotting data with ggplot and interpreting plots
  • computing and understanding summary statistics
  • transforming data (row/column/grouping operations)
  • pivoting & joining data
  • data types/classes
  • importing data
  • Monday’s class: data science ethics

In-class

  • All multiple choice

  • You get both sides of one 8.5” x 11” note sheet that you and only you created (written, typed, iPad, etc)

Important

If you have testing accommodations, make sure I get proper documentation from SDAO and make appointments in the Testing Center ASAP. The appointment should overlap substantially with our class time if possible.

What should I put on my cheat sheet?

  • description of common functions;
  • examples of function usage;
  • description of different visualizations: how to interpret, and what to use when;
  • doodles;
  • cute words of affirmation.

Warning

Don’t waste space on the details of any specific applications or datasets we’ve seen (penguins, Bechdel, gerrymandering, midwest, etc). Anything we want you to know about a particular application will be introduced from scratch within the exam.

Example in-class question

Which command can replace a pre-existing column in a data frame with a new and improved version of itself?

  1. group_by
  2. summarize
  3. pivot_wider
  4. geom_replace
  5. mutate
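
To check your answer, here is a minimal sketch of overwriting a column in place. The data frame `scores` and its columns are made up for illustration and are not from any course dataset.

```r
library(dplyr)

# Hypothetical data frame, invented for this illustration
scores <- tibble(
  student = c("a", "b", "c"),
  score = c(0.5, 0.8, 1)
)

# Reusing an existing column name on the left-hand side of `=`
# overwrites that column instead of adding a new one
scores_pct <- scores |>
  mutate(score = score * 100)

scores_pct
```

The result still has two columns; only the contents of `score` change.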

Example in-class question

df
# A tibble: 6 × 2
      x y    
  <dbl> <chr>
1     1 Marie
2     2 Marie
3     3 Katie
4     4 Mary 
5     5 Mary 
6     6 Mary 
df |>
  group_by(y) |>
  summarize(xbar = mean(x))

How many rows will this output have?

  1. 1
  2. 2
  3. 3
  4. 6
  5. 11
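
One way to check this kind of question is to rebuild the data frame and count rows. The sketch below types `df` in by hand; the name `result` is only for illustration.

```r
library(dplyr)

# Recreate the data frame from the question
df <- tibble(
  x = c(1, 2, 3, 4, 5, 6),
  y = c("Marie", "Marie", "Katie", "Mary", "Mary", "Mary")
)

# summarize() returns one row per group, so the number of rows
# equals the number of distinct values of y
result <- df |>
  group_by(y) |>
  summarize(xbar = mean(x))

nrow(result)
n_distinct(df$y)
```

Both calls at the end return the same number, which is the idea to remember: after `group_by()`, `summarize()` collapses each group to a single row.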

Example in-class question

Which box plot is visualizing the same data as the histogram?

Example in-class question

What code could have been used to produce df_result? Select all that apply.

df_X

state year
LA 2025
NC 2025
LA 2024

df_Y

state region
LA south
NC south
CA west

df_result

state year region
LA 2025 south
NC 2025 south
LA 2024 south


  1. left_join(df_X, df_Y)
  2. right_join(df_X, df_Y)
  3. full_join(df_X, df_Y)
  4. anti_join(df_Y, df_X)
  5. right_join(df_Y, df_X)
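
If you want to test the options yourself, the sketch below rebuilds the three data frames by hand and compares each candidate with `df_result`. The helper `same_rows()` is hypothetical (not a dplyr function) and deliberately ignores row order and column order.

```r
library(dplyr)

df_X <- tibble(
  state = c("LA", "NC", "LA"),
  year  = c(2025, 2025, 2024)
)
df_Y <- tibble(
  state  = c("LA", "NC", "CA"),
  region = c("south", "south", "west")
)
df_result <- tibble(
  state  = c("LA", "NC", "LA"),
  year   = c(2025, 2025, 2024),
  region = c("south", "south", "south")
)

# Hypothetical helper: TRUE if `candidate` holds the same rows as df_result
# once column order and row order are ignored
same_rows <- function(candidate) {
  if (!setequal(names(candidate), names(df_result))) return(FALSE)
  tidy <- function(d) {
    as.data.frame(arrange(select(d, state, year, region), state, year))
  }
  a <- tidy(candidate)
  b <- tidy(df_result)
  nrow(a) == nrow(b) && isTRUE(all.equal(a, b))
}

candidates <- list(
  left_XY  = left_join(df_X, df_Y, by = "state"),
  right_XY = right_join(df_X, df_Y, by = "state"),
  full_XY  = full_join(df_X, df_Y, by = "state"),
  anti_YX  = anti_join(df_Y, df_X, by = "state"),
  right_YX = right_join(df_Y, df_X, by = "state")
)

sapply(candidates, same_rows)
```

Note that `right_join(df_Y, df_X)` returns its columns in the order `state`, `region`, `year`, so whether it counts as producing `df_result` depends on whether column order matters for the question.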

Take-home

  • It will be just like a lab, but shorter;
  • Completely open-resource, but citation policies apply;
  • Absolutely no collaboration of any kind;
  • Seek help by posting privately on Ed;
  • Submit your final PDF to Gradescope in the usual way.

Reminder: conduct policies

  • Uncited use of outside resources or inappropriate collaboration will result in a zero and be referred to the conduct office;
  • If a conduct violation of any kind is discovered, your final letter grade in the course will be permanently reduced (A- down to B+, B+ down to B, etc);
  • If folks share solutions, all students involved will be penalized equally, the sharer the same as the recipient.

Things you can do to study

  • Practice problems: released Thursday February 13;
  • Attend class tomorrow: review game
  • Old labs: correct parts where you lost points;
  • Old AEs: complete tasks we didn’t get to and compare with key;
  • Code along: watch these videos specifically;
  • Textbook: odd-numbered exercises in the back of IMS Chs. 1, 4, 5, 6

Let’s Practice!

Today’s Goals:

  • Goal 1: Practice data transformation and working with characters - sales data from yesterday
  • Goal 2: ✨Beautify✨ the plot from AE-08: plotting + factors
  • Goal 3: Practice pivoting with AE-08 data

Goal 1: Transform Sales Data

Yesterday: read an Excel file with non-tidy data

Goal 1: Transform Sales Data

Goal: tidy up the data

String Functions

We’ve seen lots of functions that deal with numeric data (mean, median, sum, etc.) - what about characters?

  • stringr is a tidyverse package with lots of functions for dealing with character strings

  • today: str_detect in stringr

String Functions

  • str_detect() checks whether some characters appear as a substring of a larger string, returning TRUE or FALSE for each string

  • useful in cases when you need to check some condition, for example:

    • in a filter()

    • in an if_else() or case_when()

example: which classes in a list are in the stats department?

classes <- c("sta199", "dance122", "math185", "sta240", "pubpol202")
str_detect(classes, "sta")
[1]  TRUE FALSE FALSE  TRUE FALSE

String Functions

General form:

str_detect(character_var, "word_to_detect")
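
As a minimal sketch of the two uses mentioned earlier, the example below puts `str_detect()` inside `filter()` and inside `if_else()`. The tibble `classes` reuses the course codes from the example; the column names `course` and `dept` are assumptions made for illustration.

```r
library(dplyr)
library(stringr)

classes <- tibble(
  course = c("sta199", "dance122", "math185", "sta240", "pubpol202")
)

# In filter(): keep only the rows whose course code contains "sta"
sta_only <- classes |>
  filter(str_detect(course, "sta"))

# In if_else(): label every row based on the same condition
labeled <- classes |>
  mutate(dept = if_else(str_detect(course, "sta"), "stats", "other"))

sta_only
labeled
```

The pattern is case sensitive and is treated as a regular expression, so `"STA"` would not match any of these course codes.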

Let’s get started!

Open up yesterday’s AE file (AE-09).

Let’s get started!

library(readxl) # provides read_excel()

sales_raw <- read_excel(
  "data/sales.xlsx",
  skip = 3,
  col_names = c("id", "n")
)
# A tibble: 9 × 2
  id      n    
  <chr>   <chr>
1 Brand 1 n    
2 1234    8    
3 8721    2    
4 1822    3    
5 Brand 2 n    
6 3333    1    
7 2156    3    
8 3987    6    
9 3216    5    

Create Brand Column

sales_raw 
# A tibble: 9 × 2
  id      n    
  <chr>   <chr>
1 Brand 1 n    
2 1234    8    
3 8721    2    
4 1822    3    
5 Brand 2 n    
6 3333    1    
7 2156    3    
8 3987    6    
9 3216    5    

Create Brand Column

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand")
  )
# A tibble: 9 × 3
  id      n     is_brand_name
  <chr>   <chr> <lgl>        
1 Brand 1 n     TRUE         
2 1234    8     FALSE        
3 8721    2     FALSE        
4 1822    3     FALSE        
5 Brand 2 n     TRUE         
6 3333    1     FALSE        
7 2156    3     FALSE        
8 3987    6     FALSE        
9 3216    5     FALSE        

Create Brand Column

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  )
# A tibble: 9 × 4
  id      n     is_brand_name brand  
  <chr>   <chr> <lgl>         <chr>  
1 Brand 1 n     TRUE          Brand 1
2 1234    8     FALSE         <NA>   
3 8721    2     FALSE         <NA>   
4 1822    3     FALSE         <NA>   
5 Brand 2 n     TRUE          Brand 2
6 3333    1     FALSE         <NA>   
7 2156    3     FALSE         <NA>   
8 3987    6     FALSE         <NA>   
9 3216    5     FALSE         <NA>   

Create Brand Column

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand)
# A tibble: 9 × 4
  id      n     is_brand_name brand  
  <chr>   <chr> <lgl>         <chr>  
1 Brand 1 n     TRUE          Brand 1
2 1234    8     FALSE         Brand 1
3 8721    2     FALSE         Brand 1
4 1822    3     FALSE         Brand 1
5 Brand 2 n     TRUE          Brand 2
6 3333    1     FALSE         Brand 2
7 2156    3     FALSE         Brand 2
8 3987    6     FALSE         Brand 2
9 3216    5     FALSE         Brand 2

Keep Needed Rows

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  filter(!is_brand_name)
# A tibble: 7 × 4
  id    n     is_brand_name brand  
  <chr> <chr> <lgl>         <chr>  
1 1234  8     FALSE         Brand 1
2 8721  2     FALSE         Brand 1
3 1822  3     FALSE         Brand 1
4 3333  1     FALSE         Brand 2
5 2156  3     FALSE         Brand 2
6 3987  6     FALSE         Brand 2
7 3216  5     FALSE         Brand 2

Keep Needed Columns

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  filter(!is_brand_name) |>
  select(brand, id, n)
# A tibble: 7 × 3
  brand   id    n    
  <chr>   <chr> <chr>
1 Brand 1 1234  8    
2 Brand 1 8721  2    
3 Brand 1 1822  3    
4 Brand 2 3333  1    
5 Brand 2 2156  3    
6 Brand 2 3987  6    
7 Brand 2 3216  5    
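
Notice that `n` is still a character column (`<chr>`) in the output above. A possible next step, sketched here as an assumption rather than as part of the AE, is to convert it to a number so it can be summarized. To keep the sketch self-contained, `sales_raw` is typed in by hand instead of being read from the Excel file, and the names `sales` and `brand_totals` are made up.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Stand-in for the result of read_excel(), typed in by hand
sales_raw <- tribble(
  ~id,       ~n,
  "Brand 1", "n",
  "1234",    "8",
  "8721",    "2",
  "1822",    "3",
  "Brand 2", "n",
  "3333",    "1",
  "2156",    "3",
  "3987",    "6",
  "3216",    "5"
)

sales <- sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    # NA_character_ keeps both branches of if_else() the same type
    brand = if_else(is_brand_name, id, NA_character_)
  ) |>
  fill(brand) |>
  filter(!is_brand_name) |>
  select(brand, id, n) |>
  # Safe to convert now: the rows holding the text "n" were filtered out
  mutate(n = as.numeric(n))

# With n numeric, grouped summaries work
brand_totals <- sales |>
  group_by(brand) |>
  summarize(total = sum(n))

brand_totals
```

Converting after the `filter()` step matters: calling `as.numeric()` while the header rows are still present would turn their `"n"` entries into `NA` with a warning.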

Goal 2: Beautify AE-08 Plot

Data:

durham_climate 
# A tibble: 12 × 4
   month     avg_high_f avg_low_f precip
   <chr>          <dbl>     <dbl>  <dbl>
 1 January           49        28   4.45
 2 February          53        29   3.7 
 3 March             62        37   4.69
 4 April             71        46   3.43
 5 May               79        56   4.61
 6 June              85        65   4.02
 7 July              89        70   3.94
 8 August            87        68   4.37
 9 September         81        60   4.37
10 October           71        47   3.7 
11 November          62        37   3.39
12 December          53        30   3.43

Goal 2: Beautify AE-08 Plot

Original Plot:

Goal 2: Beautify AE-08 Plot

Releveling Months:

Goal 2: Beautify AE-08 Plot

Goal:

Goal 2: Beautify AE-08 Plot

# A tibble: 12 × 4
   month     avg_high_f avg_low_f precip
   <fct>          <dbl>     <dbl>  <dbl>
 1 January           49        28   4.45
 2 February          53        29   3.7 
 3 March             62        37   4.69
 4 April             71        46   3.43
 5 May               79        56   4.61
 6 June              85        65   4.02
 7 July              89        70   3.94
 8 August            87        68   4.37
 9 September         81        60   4.37
10 October           71        47   3.7 
11 November          62        37   3.39
12 December          53        30   3.43
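
The `month` column is now a factor (`<fct>`) instead of a character column. The AE's exact releveling code is not shown here, so the following is only a sketch of one way to do it, using base R's built-in `month.name` vector. The tibble `climate` holds just three rows of the data, typed in by hand.

```r
library(dplyr)

# First three rows of the climate data, typed in by hand for illustration
climate <- tibble(
  month = c("January", "February", "March"),
  avg_high_f = c(49, 53, 62)
)

# As a character column, month sorts alphabetically
sort(climate$month)

# As a factor with levels in calendar order, month sorts (and plots)
# in calendar order instead
climate_fct <- climate |>
  mutate(month = factor(month, levels = month.name))

levels(climate_fct$month)
sort(climate_fct$month)
```

The forcats package offers helpers such as `fct_relevel()` for the same job; which one the AE uses is not confirmed by this printout.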

The Code…

Take a look at the printout! What does each highlighted portion do?

The Code…

Go ahead and pull today’s AE - mess around with the code.

Goal 3: High/Low lines

# A tibble: 3 × 5
  month    avg_high_f avg_low_f precip season
  <fct>         <dbl>     <dbl>  <dbl> <fct> 
1 January          49        28   4.45 Winter
2 February         53        29   3.7  Winter
3 March            62        37   4.69 Spring
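
The printout has a new `season` column. The code that created it is not shown, so the sketch below is a guess at one approach using `case_when()`. Only the Winter and Spring labels are confirmed by the printout; the Summer and Fall boundaries and the level order are assumptions.

```r
library(dplyr)

climate <- tibble(
  month = factor(month.name, levels = month.name)
)

climate_seasons <- climate |>
  mutate(
    # Assumed boundaries: December-February is Winter, and so on
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      TRUE ~ "Fall"
    ),
    # Assumed level order, so seasons do not sort alphabetically
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))
  )

climate_seasons
```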

Pivot!!!

# A tibble: 5 × 5
  month    avg_high_f avg_low_f precip season
  <fct>         <dbl>     <dbl>  <dbl> <fct> 
1 January          49        28   4.45 Winter
2 February         53        29   3.7  Winter
3 March            62        37   4.69 Spring
4 April            71        46   3.43 Spring
5 May              79        56   4.61 Spring


# A tibble: 5 × 5
  month    precip season temp_type   temp
  <fct>     <dbl> <fct>  <chr>      <dbl>
1 January    4.45 Winter avg_high_f    49
2 January    4.45 Winter avg_low_f     28
3 February   3.7  Winter avg_high_f    53
4 February   3.7  Winter avg_low_f     29
5 March      4.69 Spring avg_high_f    62
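
The second printout is the long version of the first: the two temperature columns have been stacked into a `temp_type` column and a `temp` column. A minimal sketch of a `pivot_longer()` call that produces this shape is below, using three rows typed in by hand. The names `climate_wide` and `climate_long` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# First three rows of the wide data, typed in by hand
climate_wide <- tibble(
  month = factor(c("January", "February", "March"), levels = month.name),
  avg_high_f = c(49, 53, 62),
  avg_low_f = c(28, 29, 37),
  precip = c(4.45, 3.7, 4.69),
  season = factor(c("Winter", "Winter", "Spring"))
)

climate_long <- climate_wide |>
  pivot_longer(
    cols = c(avg_high_f, avg_low_f), # columns to stack
    names_to = "temp_type",          # old column names go here
    values_to = "temp"               # old cell values go here
  )

climate_long
```

Each month now takes two rows, one per temperature type, while `precip` and `season` are repeated on both rows.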

Pivot!!!

Add your pivot code to today’s AE. Check out the plotting code! What is going on?
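
The AE's plotting code is not reproduced on this slide. As a rough sketch of why the long format helps, the example below draws one line per temperature type by mapping `temp_type` to color. The data frame `temps_long` is typed in by hand, and the labels are assumptions.

```r
library(ggplot2)

# Long-format data for three months, typed in by hand
temps_long <- data.frame(
  month = factor(
    rep(c("January", "February", "March"), each = 2),
    levels = month.name
  ),
  temp_type = rep(c("avg_high_f", "avg_low_f"), times = 3),
  temp = c(49, 28, 53, 29, 62, 37)
)

# group = temp_type tells geom_line() which points to connect,
# which is needed because month is a discrete (factor) x variable
p <- ggplot(
  temps_long,
  aes(x = month, y = temp, color = temp_type, group = temp_type)
) +
  geom_line() +
  labs(x = "Month", y = "Average temperature (F)", color = NULL)

p
```

With the wide data you would need a separate `geom_line()` layer for each temperature column; after pivoting, a single layer and a single color mapping cover both.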