Rows: 217 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (2): year, population
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 285 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): entity, code, continent
dbl (1): year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a look at the data:
population
# A tibble: 217 × 3
   country              year population
   <chr>               <dbl>      <dbl>
 1 Afghanistan          2022    41129.
 2 Albania              2022     2778.
 3 Algeria              2022    44903.
 4 American Samoa       2022       44.3
 5 Andorra              2022       79.8
 6 Angola               2022    35589.
 7 Antigua and Barbuda  2022       93.8
 8 Argentina            2022    46235.
 9 Armenia              2022     2780.
10 Aruba                2022      106.
# ℹ 207 more rows
continents
# A tibble: 285 × 4
   entity                code      year continent
   <chr>                 <chr>    <dbl> <chr>
 1 Abkhazia              OWID_ABK  2015 Asia
 2 Afghanistan           AFG       2015 Asia
 3 Akrotiri and Dhekelia OWID_AKD  2015 Asia
 4 Aland Islands         ALA       2015 Europe
 5 Albania               ALB       2015 Europe
 6 Algeria               DZA       2015 Africa
 7 American Samoa        ASM       2015 Oceania
 8 Andorra               AND       2015 Europe
 9 Angola                AGO       2015 Africa
10 Anguilla              AIA       2015 North America
# ℹ 275 more rows
Question 1: Join Concept
We want to know which continent each country in the population data frame is in.
What type of join should we use?
Which variable in each data frame should we use?
Question 2: Implement the Join
Join the two data frames and assign the result to a new data frame named population_continents.
population_continents <- population |>
  left_join(continents, by = join_by(country == entity))
How does that look? Take a look at your new data frame!
Question 3: What went wrong?
It might not be obvious, but something is a little weird about this. Go ahead and filter the resulting data frame to see if any of the continent values are NA.
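One way to run this check is sketched below. Because the CSV files are not reproduced here, the sketch builds two small toy data frames from a few rows of the printouts above (population values are rounded as printed, so they are approximations). The name missing_continent is a hypothetical helper introduced for illustration.

```r
library(dplyr)

# Toy subsets of the two data frames (assumption: values copied from the
# rounded printouts above, not read from the original CSV files).
population <- tibble(
  country    = c("Afghanistan", "Albania", "Congo, Dem. Rep."),
  year       = c(2022, 2022, 2022),
  population = c(41129, 2778, 99010)
)

continents <- tibble(
  entity    = c("Afghanistan", "Albania", "Algeria"),
  code      = c("AFG", "ALB", "DZA"),
  year      = c(2015, 2015, 2015),
  continent = c("Asia", "Europe", "Africa")
)

# Same join as in Question 2. Both inputs have a `year` column that is not
# part of the join key, so dplyr keeps both and names them year.x and year.y.
population_continents <- population |>
  left_join(continents, by = join_by(country == entity))

# Keep only the rows where the join found no matching continent.
missing_continent <- population_continents |>
  filter(is.na(continent))

missing_continent
```

On the full data, the same filter() call applied to population_continents should return the unmatched rows.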
# A tibble: 6 × 6
  country                   year.x population code  year.y continent
  <chr>                      <dbl>      <dbl> <chr>  <dbl> <chr>
1 Congo, Dem. Rep.            2022     99010. <NA>      NA <NA>
2 Congo, Rep.                 2022      5970. <NA>      NA <NA>
3 Hong Kong SAR, China        2022      7346. <NA>      NA <NA>
4 Korea, Dem. People's Rep.   2022     26069. <NA>      NA <NA>
5 Korea, Rep.                 2022     51628. <NA>      NA <NA>
6 Kyrgyz Republic             2022      6975. <NA>      NA <NA>
There are! This means that no row in the continents data frame has an entity value that exactly matches those country names. That seems a little weird. Take a scroll through the continents data frame.
Do you see what the cause of this is?
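If scrolling is tedious, an anti_join() can list the unmatched keys directly, and a pattern search can surface likely counterparts. The sketch below uses toy data again. The entity spellings "Democratic Republic of Congo" and "South Korea" are assumptions about how the continents data frame names these countries, so check them against the real data. The names unmatched and congo_candidates are hypothetical.

```r
library(dplyr)

# Toy subset of population (values rounded from the printouts above).
population <- tibble(
  country    = c("Albania", "Congo, Dem. Rep.", "Korea, Rep."),
  year       = 2022,
  population = c(2778, 99010, 51628)
)

# Toy subset of continents. ASSUMPTION: the second and third entity
# spellings are guesses at how the real continents data names them.
continents <- tibble(
  entity    = c("Albania", "Democratic Republic of Congo", "South Korea"),
  code      = c("ALB", "COD", "KOR"),
  year      = 2015,
  continent = c("Europe", "Africa", "Asia")
)

# anti_join() keeps the rows of population that have no match in continents.
unmatched <- population |>
  anti_join(continents, by = join_by(country == entity))

# Search continents for entities that contain a distinctive word.
congo_candidates <- continents |>
  filter(grepl("Congo", entity))

unmatched
congo_candidates
```

Comparing the two results side by side shows whether the same country appears under different spellings in each data frame.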
Question 4: Let’s fix this!
So, country names have to be spelled exactly the same way in both data sets. I’m going to show you code that renames the mismatched countries in the population data set to match the spelling in continents. Then we will re-run the join, and these values will no longer be missing!
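A minimal sketch of what such a fix could look like follows; it is not necessarily the exact code used in this lab. It recodes names with case_when() and re-runs the join on toy data. The replacement spellings ("Democratic Republic of Congo", "South Korea", "Kyrgyzstan") are assumptions about the continents data frame, and population_fixed is a hypothetical name.

```r
library(dplyr)

# Toy subset of population (values rounded from the printouts above).
population <- tibble(
  country    = c("Albania", "Congo, Dem. Rep.", "Korea, Rep.", "Kyrgyz Republic"),
  year       = 2022,
  population = c(2778, 99010, 51628, 6975)
)

# Toy subset of continents. ASSUMPTION: the last three entity spellings
# are guesses; confirm them in the real continents data frame.
continents <- tibble(
  entity    = c("Albania", "Democratic Republic of Congo", "South Korea", "Kyrgyzstan"),
  code      = c("ALB", "COD", "KOR", "KGZ"),
  year      = 2015,
  continent = c("Europe", "Africa", "Asia", "Asia")
)

# Rename the mismatched countries; every other name is left unchanged.
population_fixed <- population |>
  mutate(country = case_when(
    country == "Congo, Dem. Rep." ~ "Democratic Republic of Congo",
    country == "Korea, Rep."      ~ "South Korea",
    country == "Kyrgyz Republic"  ~ "Kyrgyzstan",
    .default = country
  ))

# Re-run the join with the corrected names.
population_continents <- population_fixed |>
  left_join(continents, by = join_by(country == entity))

# Count the rows that still have no continent.
sum(is.na(population_continents$continent))
```

The remaining mismatches from Question 3 would be handled the same way, with one case_when() line per country.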