AE 07: Joining

Goal

Let’s suppose we were interested in examining the total populations of continents. However, these are the two data sets we have available:

world-pop-22.csv: contains data from The World Bank with countries’ populations
continents.csv: contains data mapping countries to continents

To do analysis of population by continent, we need to join!

Load packages and read in data

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

population <- read_csv("data/world-pop-2022.csv")

Rows: 217 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (2): year, population

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

continents <- read_csv("data/continents.csv")

Rows: 285 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): entity, code, continent
dbl (1): year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Take a look at the data:

population

# A tibble: 217 × 3
   country              year population
   <chr>               <dbl>      <dbl>
 1 Afghanistan          2022    41129. 
 2 Albania              2022     2778. 
 3 Algeria              2022    44903. 
 4 American Samoa       2022       44.3
 5 Andorra              2022       79.8
 6 Angola               2022    35589. 
 7 Antigua and Barbuda  2022       93.8
 8 Argentina            2022    46235. 
 9 Armenia              2022     2780. 
10 Aruba                2022      106. 
# ℹ 207 more rows

continents

# A tibble: 285 × 4
   entity                code      year continent    
   <chr>                 <chr>    <dbl> <chr>        
 1 Abkhazia              OWID_ABK  2015 Asia         
 2 Afghanistan           AFG       2015 Asia         
 3 Akrotiri and Dhekelia OWID_AKD  2015 Asia         
 4 Aland Islands         ALA       2015 Europe       
 5 Albania               ALB       2015 Europe       
 6 Algeria               DZA       2015 Africa       
 7 American Samoa        ASM       2015 Oceania      
 8 Andorra               AND       2015 Europe       
 9 Angola                AGO       2015 Africa       
10 Anguilla              AIA       2015 North America
# ℹ 275 more rows

Question 1: Join Concept

We want to know what continent all of the variables in the population data frame are in.

What type of join should we use?
Which variable in each data frame should we use?

Question 2: Implement the Join

Join the two data frames and name assign the joined data frame to a new data frame population_continents .

population_continents <- population |>
  left_join(continents, by = join_by(country == entity))

How does that look? Take a look at your new data frame!

Question 3: What went wrong?

It might not be obvious, but something is a little weird about this. Go ahead and filter the resulting data frame to see if any of the continent values are NA.

population_continents |>
  filter(is.na(continent))

# A tibble: 6 × 6
  country                   year.x population code  year.y continent
  <chr>                      <dbl>      <dbl> <chr>  <dbl> <chr>    
1 Congo, Dem. Rep.            2022     99010. <NA>      NA <NA>     
2 Congo, Rep.                 2022      5970. <NA>      NA <NA>     
3 Hong Kong SAR, China        2022      7346. <NA>      NA <NA>     
4 Korea, Dem. People's Rep.   2022     26069. <NA>      NA <NA>     
5 Korea, Rep.                 2022     51628. <NA>      NA <NA>     
6 Kyrgyz Republic             2022      6975. <NA>      NA <NA>

There are! This means that there were no rows in the continents data frame with those countries. That seems a little weird. Take a scroll through the continents data frame.

Do you see what the cause of this is??

Question 4: Let’s fix this!

So, countries have to be spelled the exact same way in each data set. I’m going to show you code that renames the missing countries in the population data set to match the spelling in contintents. Then, we will re-run the join: we are no longer missing these values!

population_new_spellings <- population |>
  mutate(country = case_when(
    country == "Congo, Dem. Rep." ~ "Democratic Republic of Congo",
    country == "Congo, Rep." ~ "Congo",
    country == "Hong Kong SAR, China" ~ "Hong Kong",
    country == "Korea, Dem. People's Rep." ~ "North Korea",
    country == "Korea, Rep." ~ "South Korea",
    country == "Kyrgyz Republic" ~ "Kyrgyzstan",
    .default = country
    )
  )

population_continent_new = population_new_spellings |>
  left_join(continents, by = join_by(country == entity))

Are there any NAs left? Run the code to see!

population_continent_new |>
  filter(is.na(continent))

# A tibble: 0 × 6
# ℹ 6 variables: country <chr>, year.x <dbl>, population <dbl>, code <chr>,
#   year.y <dbl>, continent <chr>

The Bigger Picture

How does what we just did fit into the bigger picture of the class?

You are almost never going to be joining data sets just for fun! Joining data sets is a way to enable investigating more interesting patterns in data.
Now, you could create plots or compute summary statistics to investigate the relationship between continent and total population.

For extra practice, try making a bar plot of total population in each country.

To see something new, try making a ‘lollipop chart’ of population by country.