Data Importing

Lecture 9

Published

May 28, 2025

Midterm Upcoming

Practice + some review + many more details tomorrow in class
Initial practice questions are posted!

Lab Check In

Lab 1 Grades

Please read comments on ALL questions. Some mistakes we did not penalize for, but could in future problems.
In the first few questions, the point adjustment feature was used for partial credit.
Regrade requests: submit within a week if you believe a grading mistake was made.
Code style was the biggest common issue.
- Even though most questions had code style points, we did not double penalize on this lab.
- There were some minor problems we did not take off for this time but left comments about.

Code Style

spaces before and line breaks after each + when building a ggplot,
spaces before and line breaks after each |> in a data transformation pipeline,
code should be properly indented
- this should match the automatic indentation when you hit enter
- to fix: highlight code, click code option on top of RStudio, select reindent lines
there should be spaces around = signs and spaces after commas
use |> and not %>%
use <- and not = for saving a data frame

Lab 2 clarifications

Question 7c: No code needed! Use results 7a and 7b to answer. You can delete the code chunk from the template!
Question 10b: ‘filter your reshaped dataset from question 10’ should be from question 9
Question 2c: ‘How is the output different from the one in part (a)?’ should say from part (b)

Lab 2 Workflow question

Select pages on gradescope when you submit
At least three commits to your lab 2 repo

These should hopefully be free points!!!

Lab 2 Question 5

What is a ‘typical value’ of population density?

Lab 2 Question 5

# A tibble: 1 × 2
   mean   med
  <dbl> <dbl>
1 3098. 1156.

Let’s zoom out for a second

Data science and statistical thinking

Before Midterm 1…

Data science: the real-world art of transforming messy, imperfect, incomplete data into knowledge;

After Midterm 1…

Statistics: the mathematical discipline of quantifying our uncertainty about that knowledge.

Data science

Data science

Collection: we won’t seriously study this; data importing coming today

for us: package data (library()), data importing (read_csv), and webscraping (eventually)
but really: domain-specific issues of measurement, survey design, experimental design, etc

Data science

Collection: we won’t seriously study this; data importing coming today

Preparation: cleaning, wrangling, and otherwise tidying the data so we can actually work with it.

keywords: mutate, fct_relevel, pivot_*, *_join

Data science

Collection: we won’t seriously study this; data importing coming today
Preparation: cleaning, wrangling, and otherwise tidying the data so we can actually work with it.

Analysis: finally transform the data into knowledge…

pictures: ggplot, geom_*, etc
numerical summaries: summarize, group_by, count, mean, median, sd, quantile, IQR, cor, etc
The pictures and the summaries need to work together!

Reading data into R

Package data

When data is in a pack, such as tidyverse, loading the pacakge gets our dataset
Most often, this is not the case

Reading in rectangular data

Reading rectangular data

Using readr: (in tidyverse)
- Most commonly: read_csv() - file saved as .csv
- Maybe also: read_tsv(), read_delim(), etc - other file formats

. . .

Using readxl: read_excel() - .xls or .xlsx

. . .

Using googlesheets4: read_sheet() – We haven’t covered this in the videos, but might be useful for your projects

Using read_csv()

Generally, the format is:

r_df_name <- read_csv("path_to_file_name.csv")

Path to file

Where is durham-climate.csv?

use / to separate folder(s) + file names; file path in quotes
Answer:

Why not include `ae-mneubrander`?

Where is durham-climate.csv?

We can also write files!

This allows us to save data for later usage, sharing outside of R, etc.

Using write_csv():

write_csv(r_df_name, "path_to_file.csv")

Application exercise

Goal 1.1: Reading and writing CSV files

Read a CSV file with tidy data
Split it into subsets based on features of the data
Write out subsets as CSV files

Goal 1.2: Practice - Case When

case_when() is similar to if_else(), but allows multiple cases
case_when() is often used in mutate() to make a new column

df |>
  mutate(new_var = case_when(
    condition_1 ~ result_1,
    condition_2 ~ result_2,
    condition_3 ~ result_3,
    ...,
    .default = default_result
  ))

Age gap in Hollywood relationships

Goal 2.1: Reading Excel files & non-tidy data

Read an Excel file with non-tidy data
Tidy it up!

Goal 2.2: String Functions

We’ve seen lots of functions that deal with numeric data (mean, median, sum, etc.) - what about characters?

stringr is a tidyverse package with lots of functions for dealing with character strings
today: str_detect in stringr

Goal 2.2: String Functions

str_detect() identifies if some characters are a substring of a larger string
useful in cases when you need to check some condition, for example:
- in a filter()
- in an if_else() or case_when()

Goal 2.2: String Functions

str_detect() identifies if some characters are a substring of a larger string
useful in cases when you need to check some condition, for example:
- in a filter()
- in an if_else() or case_when()

example: which classes in a list are in the stats department?

classes <- c("sta199", "dance122", "math185", "sta240", "pubpol202")
str_detect(classes, "sta")

[1]  TRUE FALSE FALSE  TRUE FALSE

Goal 2.2: String Functions

General form:

str_detect(character_var, "word_to_detect")

Sales data

. . .

Are these data tidy? Why or why not?

Sales data

What “data moves” do we need to go from the original, non-tidy data to this, tidy one?

Midterm Upcoming

Lab Check In

Lab 1 Grades

Code Style

Lab 2 clarifications

Lab 2 Workflow question

Lab 2 Question 5

Lab 2 Question 5

Let’s zoom out for a second

Data science and statistical thinking

Data science

Data science

Data science

Data science

Reading data into R

Package data

Reading in rectangular data

Reading rectangular data

Using read_csv()

Path to file

Why not include ae-mneubrander?

We can also write files!

Application exercise

Goal 1.1: Reading and writing CSV files

Goal 1.2: Practice - Case When

Age gap in Hollywood relationships

Goal 2.1: Reading Excel files & non-tidy data

Goal 2.2: String Functions

Goal 2.2: String Functions

Goal 2.2: String Functions

Goal 2.2: String Functions

Sales data

Sales data

Why not include `ae-mneubrander`?