Lab 3

Everything So Far

Lab
Due Sun Jun 1 11:59 PM

Introduction

In this lab, you’ll review topics you’ve worked with in previous labs and practice new topics we have learned since the last lab.

This lab assumes you’ve completed Lab 1 and Lab 2 and doesn’t repeat setup and overview content from those labs. If you haven’t done those yet, you should review them before starting with this one. The same ideas apply.

Code Guidelines

As we’ve discussed in lecture, your plots should include an informative title, axes and legends should have human-readable labels, and aesthetic choices should be carefully considered.

Titles that run off the page are NOT human readable. Long titles can be split into multiple lines by adding “\n” into the string.
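As a minimal sketch of splitting a long title, the example below uses the built-in mtcars data. The title text is made up for illustration.

```r
library(ggplot2)

# "\n" inside the string starts a new line in the rendered title
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Heavier cars tend to get fewer miles per gallon\n(mtcars data)",
    x = "Weight (1000 lbs)",
    y = "Miles / gallon"
  )

p
```

The title renders on two lines, so it no longer risks running off the page.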

Additionally, code should follow the tidyverse style. Particularly,

  • there should be spaces before and line breaks after each + when building a ggplot,

  • there should also be spaces before and line breaks after each |> in a data transformation pipeline,

  • code should be properly indented,

  • there should be spaces around = signs and spaces after commas,

  • the pipe |> should be used instead of %>%, and

  • <- should be used instead of = when saving a data frame.
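The style points above might look like this in practice. This is only a sketch on the built-in mtcars data; the heavy_cars and kpl names and the approximate conversion factor are made up for illustration.

```r
library(tidyverse)

# <- for assignment and a line break after each |>
heavy_cars <- mtcars |>
  filter(wt > 3) |>
  mutate(kpl = mpg * 0.425)

# a space before and a line break after each +,
# with continuation lines indented
ggplot(heavy_cars, aes(x = wt, y = kpl)) +
  geom_point() +
  labs(
    title = "Fuel efficiency of heavier cars",
    x = "Weight (1000 lbs)",
    y = "Kilometers / liter"
  )
```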

Furthermore, all code should be visible in the PDF output, i.e., it should not run off the page. Long lines that run off the page should be split across multiple lines with line breaks.

As you complete the lab and other assignments in this course, remember to develop a sound workflow for reproducible data analysis. This assignment will periodically remind you to render, commit, and push your changes to GitHub.

Getting started

Log in to RStudio, clone your lab-3 repo from GitHub, open your lab-3.qmd document, and get started!

Step 1: Log in to RStudio

  • Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.
  • Click STA198-199 under My reservations to log into your container. You should now see the RStudio environment.

Step 2: Clone the repo & start a new RStudio project

  • Go to the course organization at github.com/sta199-su25 on GitHub. Click on the repo with the prefix lab-3. It contains the starter documents you need to complete the lab.

  • Click on the green CODE button and select Use SSH. This might already be selected by default; if it is, you’ll see the text Clone with SSH. Click on the clipboard icon to copy the repo URL.

  • In RStudio, go to File > New Project > Version Control > Git.

  • Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.

  • Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.

  • Click lab-3.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.

Step 3: Update the YAML

In lab-3.qmd, update the author field to your name, render your document and examine the changes. Then, in the Git pane, click on Diff to view your changes, add a commit message (e.g., “Added author name”), and click Commit. Then, push the changes to your GitHub repository, and in your browser confirm that these changes have indeed propagated to your repository.

Important

If you run into any issues with the first steps outlined above, flag someone for help before proceeding.

Packages

In this lab, we will work with the

  • tidyverse package for doing data analysis in a “tidy” way,
  • readxl package for reading in Excel files,
  • janitor package for cleaning up variable names, and
  • palmerpenguins and datasauRus packages for some datasets

Part 1: datasauRus

The data frame you will be working with in this part is called datasaurus_dozen and it’s in the datasauRus package. This single data frame contains 13 datasets, designed to show us why data visualization is important and how summary statistics alone can be misleading. The different datasets are marked by the dataset variable, as shown in Figure 1.

Figure 1: The datasaurus_dozen data frame stacks 13 datasets on top of each other. This figure shows the first three datasets.

Note

If it’s confusing that the data frame is called datasaurus_dozen when it contains 13 datasets, you’re not alone! Have you heard of a baker’s dozen?

Here is a peek at the dataset:

datasaurus_dozen 
# A tibble: 1,846 × 3
   dataset     x     y
   <chr>   <dbl> <dbl>
 1 dino     55.4  97.2
 2 dino     51.5  96.0
 3 dino     46.2  94.5
 4 dino     42.8  91.4
 5 dino     40.8  88.3
 6 dino     38.7  84.9
 7 dino     35.6  79.9
 8 dino     33.1  77.6
 9 dino     29.0  74.5
10 dino     26.2  71.4
# ℹ 1,836 more rows

Question 1

  1. In a single pipeline, calculate the mean of x, mean of y, standard deviation of x, standard deviation of y, and the correlation between x and y for each level of the dataset variable. Make sure the results are visible for all datasets. Then, in 1-2 sentences, comment on how these summary statistics compare across groups (datasets).

Note

  • For a reminder of what standard deviation is, refer to May 19th’s textbook reading: IMS chapter 5. This concept is fair game for the midterm. Math details are not needed, but you should understand the big picture idea.

  • Correlation is a new term. We will learn more details about this after the midterm, but the idea is that it measures the strength and direction of a linear relationship. Values range from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship). Values near 0 indicate little to no linear trend. This concept will not be tested on the midterm.
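To get a feel for these two statistics, here is a small sketch on made-up vectors (not the lab data), using base R's sd() and cor().

```r
x <- c(1, 2, 3, 4, 5)

# standard deviation: roughly, the typical distance of values from their mean
sd(x)

# y rises in lockstep with x: perfect positive linear relationship
cor(x, 2 * x + 1)

# y falls in lockstep with x: perfect negative linear relationship
cor(x, -3 * x)

# a symmetric U shape: a clear pattern, but no linear trend
cor(x, (x - 3)^2)
```

The three correlations are 1, -1, and 0. The last one shows that a correlation of 0 rules out a linear trend, not a relationship altogether.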

Note

The datasauRus data is stored in a tibble: this is a special class of data frame used in the tidyverse. All tibbles are data frames (with some extra features!), but not vice versa. By default, tibbles only print out 10 rows (whereas other data frames print out all rows, up to a large limit).

The result of the summarize should have 13 rows. To display all rows, add print(n = 13) as the last step of your pipeline.
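As a sketch of this printing behavior on a different dataset, the example below converts the built-in mtcars data to a tibble. The cars_tbl name is made up for illustration.

```r
library(tidyverse)

# mtcars has 32 rows; as a tibble, only the first 10 print by default
cars_tbl <- as_tibble(mtcars)
cars_tbl

# print(n = ...) as the last pipeline step controls how many rows are shown
cars_tbl |>
  select(mpg, wt) |>
  print(n = 15)
```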

  2. Create a scatterplot of y versus x and color and facet it by dataset. Then, in 1-2 sentences, briefly discuss how these plots compare across groups (datasets). How does your response in this question compare to your response to the previous question and what does this say about using visualizations and summary statistics when getting to know a dataset?

Tip

When you both color and facet by the same variable, you’ll end up with a redundant legend. Turn off the legend by adding show.legend = FALSE to the geom creating the legend.
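Here is a sketch of this tip on a different dataset, the penguins data from the palmerpenguins package. The variable choices and labels are for illustration only.

```r
library(ggplot2)
library(palmerpenguins)

# species is mapped to color AND used for faceting, so the legend is
# redundant; show.legend = FALSE goes inside the geom that would draw it
p <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~species) +
  labs(
    title = "Body mass vs. flipper length by species",
    x = "Flipper length (mm)",
    y = "Body mass (g)"
  )

p
```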

Part 2: mtcars

In this part, you’ll work with one of the most basic and overused datasets in R: mtcars. The data in this dataset come from the 1974 Motor Trend US magazine (so, yes, they’re old!) and provide information on fuel efficiency and other car characteristics.

Question 2

Since this data set is used in many code examples, it’s not unexpected that some analyses of the data are good and some not so much.

Tip

For both parts of this question, you should review the data dictionary that is in the documentation for the dataset which you can find at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html or by typing ?mtcars in your Console.

a. You come across the following visualization of these data.

ggplot(mtcars, aes(x = wt, y = mpg, color = am)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles / gallon"
  )

First, determine what is wrong with this visualization and describe it in one sentence. Then, fix it and improve the visualization. As part of your improvement, make sure your legend

  • is on top of the plot,
  • is informative, and
  • lists levels in the order they appear in the plot (that is, the level with more points on the left should be on the left of the legend)

b. Update your plot from part (a) further, this time using different shaped points for cars with V-shaped and straight engines. The legend should again be informative, but this time, on the right of the plot.

Question 3

Your task is to make your plot from Question 2b as ugly and as ineffective as possible. Like, seriously. I want something that is apocalyptically heinous. Change colors, axes, fonts, themes, or anything else you can think of. You can search online for other themes, fonts, etc. that you want to tweak. The sky is truly the limit. I want Voldemort, Cruella de Vil, and Regina George all nodding in approval at how horrendously evil your plot is.

You must make at least 5 updates to the plot, and your answer must include:

  • a list of the at least 5 updates you’ve made to your plot from Question 2b, and

  • 1-2 sentence explanation of why the plot you created is ugly (to you, at least) and ineffective.

Tip

The tidyverse documentation on themes may give you ideas about how you can alter the various features of the plot.

Part 3: Olympians

For this part, you’ll work with data from the rosters of Team USA from the 2020 and 2024 Olympics. The data come from https://www.teamusa.com and the rosters for the two games are in a single Excel file (team-usa.xlsx in your data folder), across two separate spreadsheets within that file. Figure 2 shows screenshots of these spreadsheets.

Team USA 2020

Team USA 2024
Figure 2: Excel file with two sheets for rosters of Team USA 2020 and 2024.

Your goal is to answer questions about athletes who competed in both games and athletes who competed in only one of the games.

Question 4

Read in and save the data from the Excel document as two separate data frames: team_usa_2020 and team_usa_2024. Then, display the first 5 rows of each.

Tip

Note that the data is stored in two sheets in the same Excel document. The read_excel function has a sheet argument that allows you to specify which sheet you want to read from.
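The sheet argument can be sketched with a demo workbook that ships with readxl, standing in for your own file. The cars_sheet and second_sheet names are made up for illustration.

```r
library(readxl)

# readxl_example() returns the path to a demo workbook bundled with readxl;
# it stands in for your own Excel file here
path <- readxl_example("datasets.xlsx")

# list the sheet names in the workbook
excel_sheets(path)

# pick a sheet by name...
cars_sheet <- read_excel(path, sheet = "mtcars")

# ...or by position
second_sheet <- read_excel(path, sheet = 2)

# show the first 5 rows
head(cars_sheet, 5)
```

Calling excel_sheets() first is a quick way to check the exact sheet names before reading.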

Question 5

  1. Now, we want to clean up the column names of the data (for instance, remove spaces to make them easier to deal with). Check out the documentation for the clean_names() function from the janitor package at https://sfirke.github.io/janitor/reference/clean_names.html.

    Use this function (with default settings) to “clean” the variable names of team_usa_2020 and team_usa_2024, saving the results under the original data frame names.

  2. In each data set, save a new variable that:

    • paste()s together the first_name and last_name variables with a space in between and

    • is the first column of the data frame (hint: check out relocate()).
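The functions mentioned in this question can be sketched on a small made-up roster (not the lab data). The roster and full_name names are hypothetical.

```r
library(tidyverse)
library(janitor)

# made-up roster with awkward column names (not the lab data)
roster <- tibble(
  `First Name` = c("Ada", "Grace"),
  `Last Name` = c("Lovelace", "Hopper"),
  `Hometown State` = c("NC", "NY")
)

roster <- roster |>
  # default settings give snake_case names, e.g. first_name
  clean_names() |>
  # paste() separates its arguments with a space by default
  mutate(full_name = paste(first_name, last_name)) |>
  # relocate() moves the selected column to the front by default
  relocate(full_name)

names(roster)
```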

Question 6

  1. Using an appropriate *_join() function, determine how many athletes participated in both Olympic Games. Make sure to show the result of the join in a way that makes the resulting number clear (for instance, the number of rows in the joined data frame) AND state the result in a sentence.

Important

Your answer to this question, based on the data frames you created, should be 0, even if it doesn’t make sense in context of actual Olympic athletes. We will investigate this in further parts.

  2. If you have even a passing knowledge of the Olympic Games, you might know that some athletes participated in both the 2020 and 2024 games, e.g., Simone Biles and Katie Ledecky.

    The reason why athlete names didn’t match across the two data frames is that in one data frame, names are in UPPER CASE, and in the other, they’re in Title Case. Update the 2020 data frame to make name all upper case. Display the first 10 rows of team_usa_2020 with upper case names.

Important

Your answer must use the str_to_upper() function.

  3. Let’s try that question again: How many athletes participated in both Olympic Games?

  4. How many athletes participated in the 2020 Olympic Games but not the 2024 Olympic Games? How many athletes participated in the 2024 Olympic Games but not the 2020 Olympic Games? Use appropriate joins to arrive at these answers. Make sure to show the result of the join in a way that makes the resulting numbers clear AND state them in a sentence.
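The ideas in this question can be sketched on two tiny made-up rosters (not the lab data). All data frame names below are hypothetical.

```r
library(tidyverse)

# made-up rosters (not the lab data); matching names differ only in case
games_a <- tibble(name = c("Ada Lovelace", "Grace Hopper", "Alan Turing"))
games_b <- tibble(name = c("ADA LOVELACE", "GRACE HOPPER", "EDSGER DIJKSTRA"))

# joins match strings exactly, so mismatched case yields no matches
no_match <- inner_join(games_a, games_b, by = "name")
nrow(no_match)

# standardize the case, then join again
games_a <- games_a |>
  mutate(name = str_to_upper(name))

# rows with a match in both data frames
in_both <- inner_join(games_a, games_b, by = "name")

# rows in the first data frame with no match in the second
only_a <- anti_join(games_a, games_b, by = "name")
only_b <- anti_join(games_b, games_a, by = "name")

nrow(in_both)
nrow(only_a)
nrow(only_b)
```

In this toy example the first join returns 0 rows. After upper-casing, 2 names appear in both rosters and 1 name is unique to each.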

Question 7

Now, let’s focus on the 2024 Olympic data in your team_usa_2024 data frame.

  1. States of interest: Choose 2-5 states of interest and save them in a vector called states_of_interest. For example, if I chose Alabama, Louisiana, and North Carolina, this would be:

    states_of_interest <- c("Alabama", "Louisiana", "North Carolina")

    Then, filter the team_usa_2024 data frame to only include Olympians from these states, saving the results as team_usa_2024_soi.

  2. In a single pipeline, use team_usa_2024_soi to determine how many Olympians from these states went to the 2024 Olympics, saving the results in data frame state_counts. Then, in a separate pipeline, display the results in increasing order of number of Olympians.

  3. Below are two different approaches to creating a bar chart representing the same information as in the previous part.

    The first uses geom_bar(), as we have seen before. The second uses an alternative function for creating bar charts: geom_col().

    Run both blocks of code. Discuss what is different between the two pieces of code and the difference between what geom_bar() and geom_col() need to produce the same plot.

# geom_bar:
team_usa_2024_soi |>
  ggplot(aes(x = hometown_state)) +
  geom_bar()

# geom_col:
state_counts |>
  ggplot(aes(x = hometown_state, y = n)) +
  geom_col()

  4. Create a visualization that compares the proportions of male/female Olympians across your states of interest. In 1-2 sentences, briefly discuss the results. Do any of them surprise you?