AE 09: Importing Data

Getting started

Packages

We will use the following two packages in this application exercise.

tidyverse: For data import, wrangling, and visualization.
readxl: For importing data from Excel.

library(tidyverse)
library(readxl)

Part 1: Hollywood relationships

Load the data from age-gaps.csv in your data and assign it to age_gaps. Confirm that this new object appears in your Environment tab. Click on the name of the object in your Environment tab to pop open the data in the data viewer.

age_gaps <- read_csv("data/age-gaps.csv")

Rows: 1155 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): movie_name, director, actor_1_name, actor_2_name, character_1_gend...
dbl  (5): release_year, age_difference, couple_number, actor_1_age, actor_2_age
date (2): actor_1_birthdate, actor_2_birthdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Create a subset of the data frame for heterosexual relationships on screen. Save this into a new pipeline.

age_gaps_mw <- age_gaps |> 
  filter(character_1_gender != character_2_gender )

Split the data for heterosexual relationships into three – where woman is older, where man is older, where they are the same age. Save these subsets as three appropriately named data frames. Confirm that these new objects appear in your Environment tab and that the sum of the number of observations in the two new data frames add to the number of observations in the original data frame.

Note

If you are stuck, here is an idea on a process to get started:

Use mutate to create a variable that tells which case the row has: older woman, older man, or same age
filter the data frame based on the previously created variable

age_gaps_mw <- age_gaps_mw |>
  mutate(
    older = case_when(
      character_1_gender == "woman" & actor_1_age > actor_2_age ~ "woman older",
      character_2_gender == "woman" & actor_2_age > actor_1_age ~ "woman older",
      character_1_gender == "man"   & actor_1_age > actor_2_age ~ "man older",
      character_2_gender == "man"   & actor_2_age > actor_1_age ~ "man older",
      actor_1_age == actor_2_age ~ "same age"
    )
  )

man_older <- age_gaps_mw |> filter(older == "man older")
woman_older <- age_gaps_mw |> filter(older == "woman older")
same_age <- age_gaps_mw |> filter(older == "same age")

(nrow(man_older) + nrow(woman_older) + nrow(same_age)) == nrow(age_gaps_mw)

[1] TRUE

Write the three new datasets you created as .csv files in the data folder:

write_csv(woman_older, "data/woman-older.csv")
# do the same thing for the two other

s## Part 2: Sales

Sales data are stored in an Excel file that looks like the following:

Read in the Excel file called sales.xlsx from the data-raw/ folder such that it looks like the following.

Fill in the blanks in the following code to accomplish this.

sales_raw <- read_excel(
  "data/sales.xlsx",
  skip = 3,
  col_names = c("id", "n")
  )

Tip

The skip and col_names attributes are very useful for reading in messy data!

skip tells R how many rows at the top of the file the function should ignore
col_names tells R what to name the columns it imports

Stretch goal: Manipulate the sales data such such that it looks like the following.

sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  )|>
  fill(brand)|>
  filter(!is_brand_name)|>
  select(brand, id, n)

# A tibble: 7 × 3
  brand   id    n    
  <chr>   <chr> <chr>
1 Brand 1 1234  8    
2 Brand 1 8721  2    
3 Brand 1 1822  3    
4 Brand 2 3333  1    
5 Brand 2 2156  3    
6 Brand 2 3987  6    
7 Brand 2 3216  5

Why should we bother with writing code for reading the data in by skipping columns and assigning variable names as well as cleaning it up in multiple steps instead of opening the Excel file and editing the data in there to prepare it for a clean import?