Midterm + More Practice!

Lecture 10

Published

May 29, 2025

Announcements

I will not hold office hours on Tuesday, June 3rd: this is when you have your take-home!
Replacement midterm review office hours: June 2nd (3:30 - 5:30)
Be in class and lab on Monday!!!!
Tomorrow: Review game!

Midterm Exam 1

Worth 20% of your final grade; consists of two parts:

In-class: worth 70% of the Midterm 1 grade;
Take-home: worth 30% of the Midterm 1 grade.

Material

Everything we have done so far:

plotting data with ggplot and interpreting plots
computing and understanding summary statistics
transforming data (row/column/grouping operations)
pivoting & joining data
data types/classes
importing data
Monday’s class: data science ethics

In-class

All multiple choice
You get both sides of one 8.5” x 11” note sheet that you and only you created (written, typed, iPad, etc)

Important

If you have testing accommodations, make sure I get proper documentation from SDAO and make appointments in the Testing Center ASAP. The appointment should overlap substantially with our class time if possible.

What should I put on my cheat sheet?

description of common functions;
examples of function usage;
description of different visualizations: how to interpret, and what to use when;
doodles;
cute words of affirmation.

Warning

Don’t waste space on the details of any specific applications or datasets we’ve seen (penguins, Bechdel, gerrymandering, midwest, etc). Anything we want you to know about a particular application will be introduced from scratch within the exam.

Example in-class question

Which command can replace a pre-existing column in a data frame with a new and improved version of itself?

group_by
summarize
pivot_wider
geom_replace
mutate

Example in-class question

df

# A tibble: 6 × 2
      x y    
  <dbl> <chr>
1     1 Marie
2     2 Marie
3     3 Katie
4     4 Mary 
5     5 Mary 
6     6 Mary

df |>
  group_by(y) |>
  summarize(xbar = mean(x))

How many rows will this output have?
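
Once you have committed to an answer, you can check it yourself. The sketch below rebuilds df by hand (the tibble() call is an assumption; the slide only shows the printed data) and runs the same pipeline.

```r
library(dplyr)
library(tibble)

# Rebuild the data frame printed on the slide
df <- tibble(
  x = c(1, 2, 3, 4, 5, 6),
  y = c("Marie", "Marie", "Katie", "Mary", "Mary", "Mary")
)

# summarize() after group_by() returns one row per group
df_summary <- df |>
  group_by(y) |>
  summarize(xbar = mean(x))

# Count the rows of the result
nrow(df_summary)
```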

Example in-class question

Which box plot is visualizing the same data as the histogram?

Example in-class question

What code could have been used to produce df_result? Select all that apply.

df_X

state	year
LA	2025
NC	2025
LA	2024

df_Y

state	region
LA	south
NC	south
CA	west

df_result

state	year	region
LA	2025	south
NC	2025	south
LA	2024	south

left_join(df_X, df_Y)
right_join(df_X, df_Y)
full_join(df_X, df_Y)
anti_join(df_Y, df_X)
right_join(df_Y, df_X)
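
To test your picks after answering, a sketch along these lines rebuilds both tables and runs every candidate. The tribble() calls and the candidates list are assumptions for illustration, and by = "state" is spelled out only to silence the message dplyr prints when it guesses the join column.

```r
library(dplyr)
library(tibble)

# Rebuild the two input tables shown on the slide
df_X <- tribble(
  ~state, ~year,
  "LA",   2025,
  "NC",   2025,
  "LA",   2024
)

df_Y <- tribble(
  ~state, ~region,
  "LA",   "south",
  "NC",   "south",
  "CA",   "west"
)

# Run every answer choice, joining on the shared column state
candidates <- list(
  left_xy  = left_join(df_X, df_Y, by = "state"),
  right_xy = right_join(df_X, df_Y, by = "state"),
  full_xy  = full_join(df_X, df_Y, by = "state"),
  anti_yx  = anti_join(df_Y, df_X, by = "state"),
  right_yx = right_join(df_Y, df_X, by = "state")
)

# Compare dimensions and column order against df_result
lapply(candidates, dim)
lapply(candidates, names)
```

Print any single candidate (for example candidates$left_xy) to compare it with df_result row by row.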

Take-home

It will be just like a lab, but shorter;
Completely open-resource, but citation policies apply;
Absolutely no collaboration of any kind;
Seek help by posting privately on Ed;
Submit your final PDF to Gradescope in the usual way.

Reminder: conduct policies

Uncited use of outside resources or inappropriate collaboration will result in a zero and be referred to the conduct office;
If a conduct violation of any kind is discovered, your final letter grade in the course will be permanently reduced (A- down to B+, B+ down to B, etc);
If folks share solutions, all students involved will be penalized equally, the sharer the same as the recipient.

Things you can do to study

Practice problems: released Thursday February 13;
Attend class tomorrow: review game;
Old labs: correct parts where you lost points;
Old AEs: complete tasks we didn’t get to and compare with key;
Code along: watch these videos specifically;
Textbook: odd-numbered exercises in the back of IMS Chs. 1, 4, 5, 6.

Let’s Practice!

Today’s Goals:

Goal 1: Practice data transformation and working with characters - sales data from yesterday
Goal 2: ✨Beautify✨ the plot from AE-08: plotting + factors
Goal 3: Practice pivoting with AE-08 data

Goal 1: Transform Sales Data

Yesterday: read an Excel file with non-tidy data

Goal 1: Transform Sales Data

Goal: tidy up the data

String Functions

We’ve seen lots of functions that deal with numeric data (mean, median, sum, etc.) - what about characters?

stringr is a tidyverse package with lots of functions for dealing with character strings
today: str_detect in stringr

String Functions

str_detect() checks whether a pattern appears anywhere inside each string and returns TRUE or FALSE for each one (matching is case-sensitive)
useful in cases when you need to check some condition, for example:
- in a filter()
- in an if_else() or case_when()

example: which classes in a list are in the stats department?

classes <- c("sta199", "dance122", "math185", "sta240", "pubpol202")
str_detect(classes, "sta")

[1]  TRUE FALSE FALSE  TRUE FALSE

String Functions

General form:

str_detect(character_var, "word_to_detect")
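
The two use cases mentioned earlier, filter() and if_else(), might look like the sketch below. The courses tibble and the dept column are made up for illustration; only the class codes come from the earlier example.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical one-column data frame built from the class codes above
courses <- tibble(
  class = c("sta199", "dance122", "math185", "sta240", "pubpol202")
)

# In filter(): keep only the rows whose class contains "sta"
sta_only <- courses |>
  filter(str_detect(class, "sta"))

# In if_else(): label every row using the same check
labeled <- courses |>
  mutate(dept = if_else(str_detect(class, "sta"), "stats", "other"))

sta_only
labeled
```

Because matching is case-sensitive, a code typed as "STA199" would not be detected by the pattern "sta".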

Let’s get started!

Open up yesterday’s AE file (AE-09).

Let’s get started!

library(readxl)

# skip the first 3 rows of the sheet and supply our own column names
sales_raw <- read_excel(
  "data/sales.xlsx",
  skip = 3,
  col_names = c("id", "n")
)

# A tibble: 9 × 2
  id      n    
  <chr>   <chr>
1 Brand 1 n    
2 1234    8    
3 8721    2    
4 1822    3    
5 Brand 2 n    
6 3333    1    
7 2156    3    
8 3987    6    
9 3216    5
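
If you don't have the Excel file handy, you can follow along by typing the same nine rows in by hand. This tribble() version is a stand-in for the read_excel() call, not part of the AE; note that both columns are character, just like in the printout.

```r
library(tibble)

# Hand-typed copy of the printed sales_raw (both columns are <chr>)
sales_raw <- tribble(
  ~id,       ~n,
  "Brand 1", "n",
  "1234",    "8",
  "8721",    "2",
  "1822",    "3",
  "Brand 2", "n",
  "3333",    "1",
  "2156",    "3",
  "3987",    "6",
  "3216",    "5"
)

sales_raw
```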

Create Brand Column

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand")
  )

# A tibble: 9 × 3
  id      n     is_brand_name
  <chr>   <chr> <lgl>        
1 Brand 1 n     TRUE         
2 1234    8     FALSE        
3 8721    2     FALSE        
4 1822    3     FALSE        
5 Brand 2 n     TRUE         
6 3333    1     FALSE        
7 2156    3     FALSE        
8 3987    6     FALSE        
9 3216    5     FALSE

Create Brand Column

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  )

# A tibble: 9 × 4
  id      n     is_brand_name brand  
  <chr>   <chr> <lgl>         <chr>  
1 Brand 1 n     TRUE          Brand 1
2 1234    8     FALSE         <NA>   
3 8721    2     FALSE         <NA>   
4 1822    3     FALSE         <NA>   
5 Brand 2 n     TRUE          Brand 2
6 3333    1     FALSE         <NA>   
7 2156    3     FALSE         <NA>   
8 3987    6     FALSE         <NA>   
9 3216    5     FALSE         <NA>

Create Brand Column

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand)

# A tibble: 9 × 4
  id      n     is_brand_name brand  
  <chr>   <chr> <lgl>         <chr>  
1 Brand 1 n     TRUE          Brand 1
2 1234    8     FALSE         Brand 1
3 8721    2     FALSE         Brand 1
4 1822    3     FALSE         Brand 1
5 Brand 2 n     TRUE          Brand 2
6 3333    1     FALSE         Brand 2
7 2156    3     FALSE         Brand 2
8 3987    6     FALSE         Brand 2
9 3216    5     FALSE         Brand 2
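
What did fill() just do? By default it fills downward: each NA is replaced by the most recent non-missing value above it. A toy sketch (the toy tibble is invented for illustration):

```r
library(tidyr)
library(tibble)

# A column with brand names followed by gaps, like the brand column above
toy <- tibble(
  brand = c("Brand 1", NA, NA, "Brand 2", NA)
)

# Default .direction is "down": carry the last non-missing value forward
filled <- toy |>
  fill(brand)

filled
```

fill() also accepts .direction = "up" for cases where the label sits below the rows it describes.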

Keep Needed Rows

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  filter(!is_brand_name)

# A tibble: 7 × 4
  id    n     is_brand_name brand  
  <chr> <chr> <lgl>         <chr>  
1 1234  8     FALSE         Brand 1
2 8721  2     FALSE         Brand 1
3 1822  3     FALSE         Brand 1
4 3333  1     FALSE         Brand 2
5 2156  3     FALSE         Brand 2
6 3987  6     FALSE         Brand 2
7 3216  5     FALSE         Brand 2

Keep Needed Columns

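sales_raw |>
  mutate(
    # flag the header rows that hold a brand name instead of an id
    is_brand_name = str_detect(id, "Brand"),
    # copy the brand name on header rows; leave NA everywhere else
    brand = if_else(is_brand_name, id, NA)
  ) |>
  # carry each brand name down into the NA rows below it
  fill(brand) |>
  # drop the header rows now that brand is its own column
  filter(!is_brand_name) |>
  # keep only the tidy columns, brand first
  select(brand, id, n)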

# A tibble: 7 × 3
  brand   id    n    
  <chr>   <chr> <chr>
1 Brand 1 1234  8    
2 Brand 1 8721  2    
3 Brand 1 1822  3    
4 Brand 2 3333  1    
5 Brand 2 2156  3    
6 Brand 2 3987  6    
7 Brand 2 3216  5
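
One thing to notice: n is still a character column (<chr>), so you can't add it up yet. A possible next step, not part of the slides, is to convert it with as.numeric() and then summarize by brand. The sales tibble below is a hand-typed copy of the tidy output, and brand_totals is a name made up for this sketch.

```r
library(dplyr)
library(tibble)

# Hand-typed copy of the tidy result above (n is still character)
sales <- tribble(
  ~brand,    ~id,    ~n,
  "Brand 1", "1234", "8",
  "Brand 1", "8721", "2",
  "Brand 1", "1822", "3",
  "Brand 2", "3333", "1",
  "Brand 2", "2156", "3",
  "Brand 2", "3987", "6",
  "Brand 2", "3216", "5"
)

# Convert n from character to numeric so arithmetic works
sales_numeric <- sales |>
  mutate(n = as.numeric(n))

# Total units per brand
brand_totals <- sales_numeric |>
  group_by(brand) |>
  summarize(total = sum(n))

brand_totals
```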

Goal 2: Beautify AE-08 Plot

Data:

durham_climate

# A tibble: 12 × 4
   month     avg_high_f avg_low_f precip
   <chr>          <dbl>     <dbl>  <dbl>
 1 January           49        28   4.45
 2 February          53        29   3.7 
 3 March             62        37   4.69
 4 April             71        46   3.43
 5 May               79        56   4.61
 6 June              85        65   4.02
 7 July              89        70   3.94
 8 August            87        68   4.37
 9 September         81        60   4.37
10 October           71        47   3.7 
11 November          62        37   3.39
12 December          53        30   3.43

Goal 2: Beautify AE-08 Plot

Original Plot:

Goal 2: Beautify AE-08 Plot

Releveling Months:

Goal 2: Beautify AE-08 Plot

Goal:

Goal 2: Beautify AE-08 Plot

# A tibble: 12 × 4
   month     avg_high_f avg_low_f precip
   <fct>          <dbl>     <dbl>  <dbl>
 1 January           49        28   4.45
 2 February          53        29   3.7 
 3 March             62        37   4.69
 4 April             71        46   3.43
 5 May               79        56   4.61
 6 June              85        65   4.02
 7 July              89        70   3.94
 8 August            87        68   4.37
 9 September         81        60   4.37
10 October           71        47   3.7 
11 November          62        37   3.39
12 December          53        30   3.43
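
In this printout month is a factor (<fct>) instead of a character column, which is what lets the months plot in calendar order rather than alphabetical order. One way to get there is sketched below with base R's factor() and the built-in month.name vector; the AE's own code may use a different function (for example forcats::fct_relevel()). The three-row durham_small tibble is a deliberately shuffled subset made up for illustration.

```r
library(dplyr)
library(tibble)

# Small, deliberately out-of-order subset of the climate data
durham_small <- tibble(
  month = c("March", "January", "February"),
  avg_high_f = c(62, 49, 53)
)

# month.name is built into R: "January", "February", ..., "December"
releveled <- durham_small |>
  mutate(month = factor(month, levels = month.name))

# Levels follow the calendar, so sorting (and plotting) does too
levels(releveled$month)
arrange(releveled, month)
```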

The Code…

Take a look at the printout! What does each highlighted portion do?

The Code…

Go ahead and pull today’s AE - mess around with the code.

Goal 3: High/Low lines

# A tibble: 3 × 5
  month    avg_high_f avg_low_f precip season
  <fct>         <dbl>     <dbl>  <dbl> <fct> 
1 January          49        28   4.45 Winter
2 February         53        29   3.7  Winter
3 March            62        37   4.69 Spring

Pivot!!!

# A tibble: 5 × 5
  month    avg_high_f avg_low_f precip season
  <fct>         <dbl>     <dbl>  <dbl> <fct> 
1 January          49        28   4.45 Winter
2 February         53        29   3.7  Winter
3 March            62        37   4.69 Spring
4 April            71        46   3.43 Spring
5 May              79        56   4.61 Spring

# A tibble: 5 × 5
  month    precip season temp_type   temp
  <fct>     <dbl> <fct>  <chr>      <dbl>
1 January    4.45 Winter avg_high_f    49
2 January    4.45 Winter avg_low_f     28
3 February   3.7  Winter avg_high_f    53
4 February   3.7  Winter avg_low_f     29
5 March      4.69 Spring avg_high_f    62

Pivot!!!

Add your pivot code to today’s AE. Check out the plotting code! What is going on?
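
Try it yourself first! If you get stuck, here is one possible sketch of the pivot, using a hand-typed three-row copy of the data (durham_wide and durham_long are names made up here, and month and season are plain character columns rather than factors to keep the example short).

```r
library(tidyr)
library(tibble)

# First three rows of the wide data, typed in by hand
durham_wide <- tribble(
  ~month,     ~avg_high_f, ~avg_low_f, ~precip, ~season,
  "January",  49,          28,         4.45,    "Winter",
  "February", 53,          29,         3.70,    "Winter",
  "March",    62,          37,         4.69,    "Spring"
)

# Stack the two temperature columns into name/value pairs
durham_long <- durham_wide |>
  pivot_longer(
    cols = c(avg_high_f, avg_low_f),
    names_to = "temp_type",
    values_to = "temp"
  )

durham_long
```

Each month now has two rows, one per temperature type, so a single color = temp_type mapping in ggplot can draw both the high and the low line.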