Meet the Toolkit

Lecture 1

May 15, 2025

Reminders

  • If you have not yet finished the Getting to Know You survey, please do so ASAP!

  • Make your appointments in the Testing Center now!

  • Any questions about the syllabus??

Today: Course toolkit

Ask so many questions!!!

Course toolkit

Course operation

  • Materials: sta199-f24.github.io
  • Submission: Gradescope
  • Discussion: Ed Discussion
  • Gradebook: Canvas

Doing data science

  • Computing:
    • R
    • RStudio
    • tidyverse
    • Quarto
  • Version control and collaboration:
    • Git
    • GitHub

Learning goals

By the end of the course, you will be able to…

  • gain insight from data
  • gain insight from data, reproducibly
  • gain insight from data, reproducibly, using modern programming tools and techniques
  • gain insight from data, reproducibly and collaboratively, using modern programming tools and techniques
  • gain insight from data, reproducibly (with literate programming and version control) and collaboratively, using modern programming tools and techniques

Reproducible data analysis

Reproducibility checklist

How do we make sure a data analysis is “reproducible”?

Short-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit for reproducibility

- Scriptability \(\rightarrow\) R

- Literate programming (code, narrative, output in one place) \(\rightarrow\) Quarto

- Version control \(\rightarrow\) Git / GitHub

An Analogy to English

  • R
  • RStudio
  • Quarto
  • Git
  • GitHub

R and RStudio

What are R and RStudio?

R logo

  • R is an open-source statistical programming language
  • R is also an environment for statistical computing and graphics
  • Packages make R easily extensible

RStudio logo

  • RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
  • RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

R vs. RStudio: Another Analogy

On the left: a car engine. On the right: a car dashboard. The engine is labelled R. The dashboard is labelled RStudio.

Tour: R + RStudio

Option 1:

Sit back and enjoy the show!

Option 2:

Go to your container and launch RStudio.

Tour recap: R + RStudio

A short list (for now) of R essentials

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
do_this(to_this)
do_that(to_this, to_that, with_those)


  • Packages are installed with the install.packages() function (once per computer) and loaded with the library() function (once per session):
install.packages("package_name")
library(package_name)
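
To make the two patterns above concrete, here is a minimal sketch using functions that ship with R. The package name in the commented-out install line is only an illustration, and stats is used for library() because it comes with every R installation, so the example runs without downloading anything.

```r
# Functions: a verb, followed by what it applies to in parentheses
root <- sqrt(16)                        # one argument
rounded <- round(3.14159, digits = 2)   # two arguments, the second one named

root
rounded

# Packages: install once (commented out here because it needs internet access)
# install.packages("palmerpenguins")    # illustrative package name

# ...then load with library() at the start of every session
library(stats)   # stats ships with R, so nothing needs to be installed first
```

Note that the package name is quoted inside install.packages() but unquoted inside library().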

R essentials (continued)

Data frames: like the spreadsheets of R

  • Each row of a data frame is an observation
  • Each column of a data frame is a variable
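
As a small sketch of the row/column idea, the data frame below is built by hand. The name pets and all of its values are made up for illustration; in the course, data frames will usually come from packages or files instead.

```r
# A made-up data frame: every value here is invented for illustration
pets <- data.frame(
  name = c("Biscuit", "Mango", "Pepper"),
  species = c("dog", "cat", "cat"),
  age_years = c(3, 7, 2)
)

pets         # print the whole data frame

nrow(pets)   # number of observations (rows)
ncol(pets)   # number of variables (columns)
```

Each row describes one pet (an observation), and each column records one characteristic of the pets (a variable).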

R essentials (continued)

  • Use the question mark ? to get help with objects (like data frames and functions):
?function_name
  • Use the dollar sign $ to access columns
dataframe$column

Note

Generally, you need to use the $ to tell R where to find that column.
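
Here is a brief sketch of both symbols, using mtcars, a data frame that is built into R. The choice of mtcars and its mpg column is just for illustration.

```r
# Getting help: run this line in the Console to open the help page for mean()
# ?mean

# Accessing a column: dataframe$column
mpg_values <- mtcars$mpg   # the mpg column of the built-in mtcars data frame
mpg_values

# The column lives inside the data frame, so the $ tells R where to look
mean(mtcars$mpg)
```

Typing mpg on its own would give an "object not found" error in a fresh session, because mpg only exists as a column of mtcars.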

R essentials (continued)

  • Use the arrow <- or equals sign = to save objects
x = some_thing
y <- some_other_thing

Note

Check your environment pane for the saved object!
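
A minimal sketch of saving objects follows; the names x and y and their values are placeholders chosen for illustration.

```r
# Both = and <- save a value under a name
x = 5
y <- x * 2

# Typing the name prints the saved value
x
y

# ls() lists saved objects, much like the Environment pane in RStudio
ls()
```

Both operators work for saving objects; whichever one you pick, using it consistently makes code easier to read.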

R essentials (continued)

  • Look out for warning and error messages!!!
  • These are essential for figuring out where your code is going wrong.

Note

If you have trouble understanding what a message is saying, there is a high chance someone has explained the message online.
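
To see the difference between the two kinds of messages, here is a small sketch. The inputs "abc" and "ten" are arbitrary examples, and tryCatch() appears only so that the script keeps running after the error; it is not something you need to know yet.

```r
# A warning: R finishes the job but tells you something looks off
result <- as.numeric("abc")   # "abc" is not a number, so R warns and returns NA
result

# An error: R cannot finish the job and stops
# tryCatch() is used here only to capture the error message as text
msg <- tryCatch(
  log("ten"),
  error = function(e) conditionMessage(e)
)
msg
```

Copying the text of a message like the one stored in msg into a search engine is often the fastest way to find an explanation.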

R packages

  • Packages: Fundamental units of reproducible R code, including reusable R functions, the documentation that describes how to use them, and sample data

  • As of 27 August 2024, there are 21,168 R packages available on CRAN (the Comprehensive R Archive Network)

  • We’re going to work with a small (but important) subset of these!

tidyverse

Hex logos for dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr

tidyverse.org

  • The tidyverse is a collection of R packages designed for data science
  • All packages share an underlying philosophy and a common grammar

Quarto

What is Quarto?

  • Fully reproducible reports – each time you render, the analysis is run from the beginning
  • Code goes in chunks; narrative (normal text) goes outside of chunks
  • A visual editor for a familiar, Google Docs-like editing experience

Tour: Quarto

Option 1:

Sit back and enjoy the show!

Option 2:

Go to RStudio and open the document ae-01-meet-the-penguins.qmd.

Tour recap: Quarto

RStudio IDE with a Quarto document, source code on the left and output on the right. Annotated to show the YAML, a link, a header, and a code chunk.

How will we use Quarto?

  • Every application exercise, lab, project, etc. is a Quarto document
  • You’ll always have a template Quarto document to start with
  • The amount of scaffolding in the template will decrease over the semester

Git and GitHub

What are Git and GitHub?

Git logo

  • Git is a version control system – like “Track Changes” features from Microsoft Word, on steroids
  • It’s not the only version control system, but it’s a very popular one

GitHub logo

  • GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better

  • We will use GitHub as a platform for web hosting and collaboration (and as our course management system!)

Versioning - done badly

Versioning - done better

Versioning - done even better

with human readable messages

How will we use Git and GitHub?

Git and GitHub tips

  • Git has many commands, and very few people know them all. 99% of the time you will use git to commit, push, and pull:
    • commit: tell git to record a snapshot of the changes you’ve made - use a message!!
    • push: send your committed changes to the repository (folder) on GitHub
    • pull: get changes from the repository (folder) on GitHub
  • There is a great resource for working with git and R: happygitwithr.com. Some of the content in there is beyond the scope of this course, but it’s a good place to look for help.

Tour: Git + GitHub

Option 1:

Sit back and enjoy the show!

Option 2:

Go to the course GitHub organization and clone ae-your_github_name repo to your container.

Tour recap: Git + GitHub

  • Find your application repo, which will always be named using the naming convention assignment_title-your_github_name

  • Click on the green “Code” button, make sure SSH is selected, copy the repo URL

Tour recap: Git + GitHub

Once we made changes to our Quarto document, we

  • went to the Git pane in RStudio

  • staged our changes by clicking the checkboxes next to the relevant files

  • committed our changes with an informative commit message

  • pushed our changes to our application exercise repos

  • confirmed on GitHub that we could see our changes pushed from RStudio

Let’s Practice: AE01