Overview of the wcde package

The wcde package lets R users download data from the Wittgenstein Centre for Demography and Human Capital Data Explorer. It also contains a number of helper functions for working with education-specific demographic data.

Installation

You can install the released version of wcde from CRAN with:

install.packages("wcde")

Install the development version with:

library(devtools)
install_github("guyabel/wcde", ref = "main")

Getting data into R

The get_wcde() function downloads data from the Wittgenstein Centre Human Capital Data Explorer. It takes three main user inputs:

indicator: a short code for the indicator of interest
scenario: a number referring to an SSP narrative, by default 2 is used (for SSP2)
country_code (or country_name): corresponding to the country of interest

library(wcde)
# download education specific tfr data
get_wcde(indicator = "etfr",
         country_name = c("Brazil", "Albania"))
#> # A tibble: 192 × 6
#>    scenario name    country_code education          period     etfr
#>       <dbl> <chr>          <dbl> <chr>              <chr>     <dbl>
#>  1        2 Brazil            76 No Education       2020-2025  2.16
#>  2        2 Albania            8 No Education       2020-2025  2.31
#>  3        2 Brazil            76 Incomplete Primary 2020-2025  2.16
#>  4        2 Albania            8 Incomplete Primary 2020-2025  2.51
#>  5        2 Brazil            76 Primary            2020-2025  2.16
#>  6        2 Albania            8 Primary            2020-2025  2.17
#>  7        2 Brazil            76 Lower Secondary    2020-2025  1.71
#>  8        2 Albania            8 Lower Secondary    2020-2025  1.88
#>  9        2 Brazil            76 Upper Secondary    2020-2025  1.30
#> 10        2 Albania            8 Upper Secondary    2020-2025  1.61
#> # … with 182 more rows

# download education specific survivorship rates
get_wcde(indicator = "eassr",
         country_name = c("Niger", "Korea"))
#> # A tibble: 6,912 × 8
#>    scenario name              country_code age    sex   education   period eassr
#>       <dbl> <chr>                    <dbl> <chr>  <chr> <chr>       <chr>  <dbl>
#>  1        2 Niger                      562 15--19 Male  No Educati… 2020-… 0.987
#>  2        2 Republic of Korea          410 15--19 Male  No Educati… 2020-… 0.999
#>  3        2 Niger                      562 15--19 Male  Incomplete… 2020-… 0.987
#>  4        2 Republic of Korea          410 15--19 Male  Incomplete… 2020-… 0.999
#>  5        2 Niger                      562 15--19 Male  Primary     2020-… 0.989
#>  6        2 Republic of Korea          410 15--19 Male  Primary     2020-… 0.999
#>  7        2 Niger                      562 15--19 Male  Lower Seco… 2020-… 0.990
#>  8        2 Republic of Korea          410 15--19 Male  Lower Seco… 2020-… 0.999
#>  9        2 Niger                      562 15--19 Male  Upper Seco… 2020-… 0.992
#> 10        2 Republic of Korea          410 15--19 Male  Upper Seco… 2020-… 0.999
#> # … with 6,902 more rows

Indicator codes

The indicator input must match the short code from the indicator table. The find_indicator() function can be used to look up short codes (given in the first column) from the wic_indicators data frame:

find_indicator(x = "tfr")
#> # A tibble: 2 × 6
#>   indicator description                       `wcde-v3`  wcde-…¹ wcde-…² defin…³
#>   <chr>     <chr>                             <chr>      <chr>   <chr>   <chr>  
#> 1 etfr      Total Fertility Rate by Education projectio… projec… projec… The av…
#> 2 tfr       Total Fertility Rate              projectio… past-a… past-a… The av…
#> # … with abbreviated variable names ¹`wcde-v2`, ²`wcde-v1`, ³definition_latest
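
The same table can also be searched directly with dplyr, for example to match on the description column only. The sketch below is an illustration under stated assumptions: ind_demo is a made-up two-row stand-in copied from the output above so that the code runs without the package. With wcde loaded, wic_indicators would take its place.

```r
library(dplyr)

# Stand-in for wic_indicators: two rows copied from the output above.
# (ind_demo is a hypothetical name; use wic_indicators when wcde is loaded.)
ind_demo <- tibble(
  indicator   = c("etfr", "tfr"),
  description = c("Total Fertility Rate by Education", "Total Fertility Rate")
)

# Case-insensitive match on the description column, returning the short codes
edu_codes <- ind_demo %>%
  filter(grepl("education", description, ignore.case = TRUE)) %>%
  pull(indicator)
edu_codes
```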

Temporal coverage

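By default, get_wcde() returns data for all available years or periods. The filter() function in dplyr can be used to select specific years or periods. Note that the first example below returns an empty tibble, which suggests that the 2015-2020 period is not covered for e0 in the default version: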

library(tidyverse)
get_wcde(indicator = "e0",
         country_name = c("Japan", "Australia")) %>%
  filter(period == "2015-2020")
#> # A tibble: 0 × 6
#> # … with 6 variables: scenario <dbl>, name <chr>, country_code <dbl>,
#> #   sex <chr>, period <chr>, e0 <dbl>

get_wcde(indicator = "sexratio",
         country_name = c("China", "South Korea")) %>%
  filter(year == 2020)
#> # A tibble: 44 × 6
#>    scenario name              country_code age     year sexratio
#>       <dbl> <chr>                    <dbl> <chr>  <dbl>    <dbl>
#>  1        2 China                      156 All     2020    1.05 
#>  2        2 Republic of Korea          410 All     2020    0.999
#>  3        2 China                      156 0--4    2020    1.14 
#>  4        2 Republic of Korea          410 0--4    2020    1.05 
#>  5        2 China                      156 5--9    2020    1.16 
#>  6        2 Republic of Korea          410 5--9    2020    1.05 
#>  7        2 China                      156 10--14  2020    1.17 
#>  8        2 Republic of Korea          410 10--14  2020    1.07 
#>  9        2 China                      156 15--19  2020    1.17 
#> 10        2 Republic of Korea          410 15--19  2020    1.08 
#> # … with 34 more rows
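
One way to see which periods a download covers before filtering is distinct(). The sketch below is illustrative only: tfr_demo is a hypothetical stand-in built from the Austria rows shown in the country names section of this document, so it runs without a network connection. A real call to get_wcde() would replace it.

```r
library(dplyr)

# Stand-in for a get_wcde(indicator = "tfr", ...) download.
# (tfr_demo is a hypothetical name; values are copied from this document.)
tfr_demo <- tibble(
  name   = "Austria",
  period = c("2020-2025", "2025-2030", "2030-2035"),
  tfr    = c(1.45, 1.48, 1.51)
)

# List the periods that are actually present in the data
available <- tfr_demo %>%
  distinct(period) %>%
  pull(period)
available

# Filtering on a period outside that coverage gives zero rows, not an error
n_missing <- tfr_demo %>%
  filter(period == "2015-2020") %>%
  nrow()
n_missing
```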

Past data are only available for selected indicators. These can be identified using the version columns of wic_indicators:

wic_indicators %>%
  filter(`wcde-v2` == "past-available") %>%
  select(1:2)
#> # A tibble: 28 × 2
#>    indicator description                                     
#>    <chr>     <chr>                                           
#>  1 asfr      Age-Specific Fertility Rate                     
#>  2 assr      Age-Specific Survival Ratio                     
#>  3 bmys      Mean Years of Schooling by Broad Age            
#>  4 bpop      Population Size by Broad Age (000's)            
#>  5 bprop     Educational Attainment Distribution by Broad Age
#>  6 cbr       Crude Birth Rate                                
#>  7 cdr       Crude Death Rate                                
#>  8 e0        Life Expectancy at Birth                        
#>  9 epop      Population Size by Education (000's)            
#> 10 ggapedu15 Gender Gap in Educational Attainment (15+)      
#> # … with 18 more rows

The filter() function can also be used to restrict an indicator to specific age, sex or education groups:

get_wcde(indicator = "sexratio",
         country_name = c("China", "South Korea")) %>%
  filter(year == 2020,
         age == "All")
#> # A tibble: 2 × 6
#>   scenario name              country_code age    year sexratio
#>      <dbl> <chr>                    <dbl> <chr> <dbl>    <dbl>
#> 1        2 China                      156 All    2020    1.05 
#> 2        2 Republic of Korea          410 All    2020    0.999
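
The same pattern applies to education groups, and the filtered result can be reshaped to compare countries side by side. This is a sketch rather than part of the package: etfr_demo is a hypothetical stand-in holding the etfr rows for Brazil and Albania printed earlier in this document, and pivot_wider() comes from tidyr.

```r
library(dplyr)
library(tidyr)

# Stand-in for get_wcde(indicator = "etfr", country_name = c("Brazil", "Albania")).
# (etfr_demo is a hypothetical name; values are copied from this document.)
etfr_demo <- tibble(
  name      = rep(c("Brazil", "Albania"), times = 5),
  education = rep(c("No Education", "Incomplete Primary", "Primary",
                    "Lower Secondary", "Upper Secondary"), each = 2),
  period    = "2020-2025",
  etfr      = c(2.16, 2.31, 2.16, 2.51, 2.16, 2.17, 1.71, 1.88, 1.30, 1.61)
)

# Keep two education groups, then spread countries into columns
etfr_wide <- etfr_demo %>%
  filter(education %in% c("No Education", "Upper Secondary")) %>%
  pivot_wider(names_from = name, values_from = etfr)
etfr_wide
```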

Country names and codes

Country names are guessed using the countrycode package, so abbreviations and names in other languages can be matched:

get_wcde(indicator = "tfr",
         country_name = c("U.A.E", "Espania", "Österreich"))
#> # A tibble: 48 × 5
#>    scenario name                 country_code period      tfr
#>       <dbl> <chr>                       <dbl> <chr>     <dbl>
#>  1        2 United Arab Emirates          784 2020-2025  1.35
#>  2        2 Spain                         724 2020-2025  1.19
#>  3        2 Austria                        40 2020-2025  1.45
#>  4        2 United Arab Emirates          784 2025-2030  1.39
#>  5        2 Spain                         724 2025-2030  1.25
#>  6        2 Austria                        40 2025-2030  1.48
#>  7        2 United Arab Emirates          784 2030-2035  1.41
#>  8        2 Spain                         724 2030-2035  1.32
#>  9        2 Austria                        40 2030-2035  1.51
#> 10        2 United Arab Emirates          784 2035-2040  1.44
#> # … with 38 more rows

The get_wcde() function also accepts ISO numeric country codes via the country_code argument:

get_wcde(indicator = "etfr", country_code = c(44, 100))
#> # A tibble: 192 × 6
#>    scenario name     country_code education          period     etfr
#>       <dbl> <chr>           <dbl> <chr>              <chr>     <dbl>
#>  1        2 Bahamas            44 No Education       2020-2025  2.16
#>  2        2 Bulgaria          100 No Education       2020-2025  1.86
#>  3        2 Bahamas            44 Incomplete Primary 2020-2025  2.16
#>  4        2 Bulgaria          100 Incomplete Primary 2020-2025  1.86
#>  5        2 Bahamas            44 Primary            2020-2025  2.16
#>  6        2 Bulgaria          100 Primary            2020-2025  1.86
#>  7        2 Bahamas            44 Lower Secondary    2020-2025  1.71
#>  8        2 Bulgaria          100 Lower Secondary    2020-2025  1.86
#>  9        2 Bahamas            44 Upper Secondary    2020-2025  1.43
#> 10        2 Bulgaria          100 Upper Secondary    2020-2025  1.51
#> # … with 182 more rows

A full list of available countries and region aggregates, and their codes, can be found in the wic_locations data frame.

wic_locations
#> # A tibble: 232 × 8
#>    name                       isono conti…¹ region dim   wcde-…² wcde-…³ wcde-…⁴
#>    <chr>                      <dbl> <chr>   <chr>  <chr> <lgl>   <lgl>   <lgl>  
#>  1 World                        900 NA      NA     area  TRUE    TRUE    TRUE   
#>  2 Africa                       903 NA      NA     area  TRUE    TRUE    TRUE   
#>  3 Asia                         935 NA      NA     area  TRUE    TRUE    TRUE   
#>  4 Europe                       908 NA      NA     area  TRUE    TRUE    TRUE   
#>  5 Latin America and the Car…   904 NA      NA     area  TRUE    TRUE    TRUE   
#>  6 Northern America             905 NA      NA     area  TRUE    TRUE    TRUE   
#>  7 Oceania                      909 NA      NA     area  TRUE    TRUE    TRUE   
#>  8 Afghanistan                    4 Asia    South… coun… TRUE    TRUE    TRUE   
#>  9 Albania                        8 Europe  South… coun… TRUE    TRUE    TRUE   
#> 10 Algeria                       12 Africa  North… coun… TRUE    TRUE    TRUE   
#> # … with 222 more rows, and abbreviated variable names ¹continent, ²`wcde-v3`,
#> #   ³`wcde-v2`, ⁴`wcde-v1`
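
Codes can be looked up by name and then passed to country_code. The sketch below assumes only the name, isono and continent columns shown above. loc_demo is a hypothetical stand-in with a few rows copied from that output, so the example runs without the package. With wcde loaded, wic_locations would be used instead.

```r
library(dplyr)

# Stand-in for wic_locations: rows copied from the output above.
# (loc_demo is a hypothetical name.)
loc_demo <- tibble(
  name      = c("World", "Africa", "Afghanistan", "Albania", "Algeria"),
  isono     = c(900, 903, 4, 8, 12),
  continent = c(NA, NA, "Asia", "Europe", "Africa")
)

# Look up the numeric codes for two countries by name
codes <- loc_demo %>%
  filter(name %in% c("Albania", "Algeria")) %>%
  pull(isono)
codes

# The codes could then be used in a download, for example:
# get_wcde(indicator = "tfr", country_code = codes)
```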

Scenarios

By default, get_wcde() returns data for the Medium (SSP2) scenario. Results for other SSP scenarios can be returned by passing a different value (or multiple values) to the scenario argument in get_wcde():

get_wcde(indicator = "growth",
         country_name = c("India", "China"),
         scenario = c(1:3, 22, 23)) %>%
  filter(period == "2095-2100")
#> # A tibble: 10 × 5
#>    scenario name  country_code period    growth
#>       <dbl> <chr>        <dbl> <chr>      <dbl>
#>  1        1 India          356 2095-2100 -1.05 
#>  2        1 China          156 2095-2100 -1.11 
#>  3        2 India          356 2095-2100 -0.545
#>  4        2 China          156 2095-2100 -1.03 
#>  5        3 India          356 2095-2100  0.170
#>  6        3 China          156 2095-2100 -0.428
#>  7       22 India          356 2095-2100 -0.545
#>  8       22 China          156 2095-2100 -1.03 
#>  9       23 India          356 2095-2100 -0.545
#> 10       23 China          156 2095-2100 -1.03
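
To compare scenarios directly, the long output can be spread so that each scenario has its own column. The following is a sketch under stated assumptions: growth_demo is a hypothetical stand-in holding the SSP1 to SSP3 rows printed above, and the ssp prefix and the ssp3_minus_ssp1 column name are chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Stand-in for the filtered download above (scenarios 1 to 3 only).
# (growth_demo is a hypothetical name; values are copied from the output.)
growth_demo <- tibble(
  scenario = c(1, 1, 2, 2, 3, 3),
  name     = rep(c("India", "China"), times = 3),
  period   = "2095-2100",
  growth   = c(-1.05, -1.11, -0.545, -1.03, 0.170, -0.428)
)

# One column per scenario, plus the gap between SSP3 and SSP1
growth_wide <- growth_demo %>%
  pivot_wider(names_from = scenario, values_from = growth,
              names_prefix = "ssp") %>%
  mutate(ssp3_minus_ssp1 = ssp3 - ssp1)
growth_wide
```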

Set include_scenario_names = TRUE to include columns with the full names and abbreviations of the scenarios:

get_wcde(indicator = "tfr",
         country_name = c("Kenya", "Nigeria", "Algeria"),
         scenario = 1:3,
         include_scenario_names = TRUE) %>%
  filter(period == "2045-2050")
#> # A tibble: 9 × 7
#>   scenario scenario_name              scenario_abb name    countr…¹ period   tfr
#>      <dbl> <chr>                      <chr>        <chr>      <dbl> <chr>  <dbl>
#> 1        1 Rapid Development (SSP1)   SSP1         Kenya        404 2045-…  1.62
#> 2        1 Rapid Development (SSP1)   SSP1         Nigeria      566 2045-…  2.62
#> 3        1 Rapid Development (SSP1)   SSP1         Algeria       12 2045-…  1.52
#> 4        2 Medium (SSP2)              SSP2         Kenya        404 2045-…  2.32
#> 5        2 Medium (SSP2)              SSP2         Nigeria      566 2045-…  3.75
#> 6        2 Medium (SSP2)              SSP2         Algeria       12 2045-…  2.04
#> 7        3 Stalled Development (SSP3) SSP3         Kenya        404 2045-…  3.02
#> 8        3 Stalled Development (SSP3) SSP3         Nigeria      566 2045-…  4.83
#> 9        3 Stalled Development (SSP3) SSP3         Algeria       12 2045-…  2.66
#> # … with abbreviated variable name ¹country_code

Additional details of the pathways for each scenario numeric code can be found in the wic_scenarios object. Further background and links to the corresponding literature are provided in the Data Explorer.

wic_scenarios
#> # A tibble: 9 × 6
#>   scenario_name                          scena…¹ scena…² wcde-…³ wcde-…⁴ wcde-…⁵
#>   <chr>                                    <dbl> <chr>   <lgl>   <lgl>   <lgl>  
#> 1 Rapid Development (SSP1)                     1 SSP1    TRUE    TRUE    TRUE   
#> 2 Medium (SSP2)                                2 SSP2    TRUE    TRUE    TRUE   
#> 3 Stalled Development (SSP3)                   3 SSP3    TRUE    TRUE    TRUE   
#> 4 Inequality (SSP4)                            4 SSP4    TRUE    FALSE   TRUE   
#> 5 Conventional Development (SSP5)              5 SSP5    TRUE    FALSE   TRUE   
#> 6 Medium - Zero Migration (SSP2-ZM)           22 SSP2-ZM TRUE    TRUE    FALSE  
#> 7 Medium - Double Migration (SSP2-DM)         23 SSP2-DM TRUE    TRUE    FALSE  
#> 8 Medium - Constant Enrolment Rate (SSP…      20 SSP2-C… FALSE   FALSE   TRUE   
#> 9 Medium - Fast Track Education (SSP2-F…      21 SSP2-FT FALSE   FALSE   TRUE   
#> # … with abbreviated variable names ¹scenario, ²scenario_abb, ³`wcde-v3`,
#> #   ⁴`wcde-v2`, ⁵`wcde-v1`

All countries data

Data for all countries can be obtained by not setting country_name or country_code:

get_wcde(indicator = "mage")
#> # A tibble: 3,876 × 5
#>    scenario name                     country_code  year  mage
#>       <dbl> <chr>                           <dbl> <dbl> <dbl>
#>  1        2 Bulgaria                          100  2020  40.1
#>  2        2 Myanmar                           104  2020  24.6
#>  3        2 Burundi                           108  2020  11.5
#>  4        2 Belarus                           112  2020  35.9
#>  5        2 Cambodia                          116  2020  22.0
#>  6        2 Algeria                            12  2020  23.5
#>  7        2 Cameroon                          120  2020  13.5
#>  8        2 Canada                            124  2020  35.9
#>  9        2 Cape Verde                        132  2020  21.8
#> 10        2 Central African Republic          140  2020  10.7
#> # … with 3,866 more rows

Multiple indicators

The get_wcde() function needs to be called multiple times to download multiple indicators. This can be done using the map() function in purrr:

mi <- tibble(ind = c("odr", "nirate", "ggapedu25")) %>%
  mutate(d = map(.x = ind, .f = ~get_wcde(indicator = .x)))
mi
#> # A tibble: 3 × 2
#>   ind       d                    
#>   <chr>     <list>               
#> 1 odr       <tibble [3,876 × 5]> 
#> 2 nirate    <tibble [3,648 × 5]> 
#> 3 ggapedu25 <tibble [23,256 × 6]>

mi %>%
  filter(ind == "odr") %>%
  select(-ind) %>%
  unnest(cols = d)
#> # A tibble: 3,876 × 5
#>    scenario name                     country_code  year    odr
#>       <dbl> <chr>                           <dbl> <dbl>  <dbl>
#>  1        2 Bulgaria                          100  2020 0.347 
#>  2        2 Myanmar                           104  2020 0.0930
#>  3        2 Burundi                           108  2020 0.0486
#>  4        2 Belarus                           112  2020 0.246 
#>  5        2 Cambodia                          116  2020 0.0790
#>  6        2 Algeria                            12  2020 0.0937
#>  7        2 Cameroon                          120  2020 0.0505
#>  8        2 Canada                            124  2020 0.268 
#>  9        2 Cape Verde                        132  2020 0.0792
#> 10        2 Central African Republic          140  2020 0.0501
#> # … with 3,866 more rows

mi %>%
  filter(ind == "nirate") %>%
  select(-ind) %>%
  unnest(cols = d)
#> # A tibble: 3,648 × 5
#>    scenario name                     country_code period    nirate
#>       <dbl> <chr>                           <dbl> <chr>      <dbl>
#>  1        2 Bulgaria                          100 2020-2025 -10.7 
#>  2        2 Myanmar                           104 2020-2025   7.46
#>  3        2 Burundi                           108 2020-2025  28.0 
#>  4        2 Belarus                           112 2020-2025  -5.95
#>  5        2 Cambodia                          116 2020-2025  12.8 
#>  6        2 Algeria                            12 2020-2025  17.3 
#>  7        2 Cameroon                          120 2020-2025  27.0 
#>  8        2 Canada                            124 2020-2025   1.58
#>  9        2 Cape Verde                        132 2020-2025  11.8 
#> 10        2 Central African Republic          140 2020-2025  33.4 
#> # … with 3,638 more rows

mi %>%
  filter(ind == "ggapedu25") %>%
  select(-ind) %>%
  unnest(cols = d)
#> # A tibble: 23,256 × 6
#>    scenario name                     country_code  year education    ggapedu25
#>       <dbl> <chr>                           <dbl> <dbl> <chr>            <dbl>
#>  1        2 Bulgaria                          100  2020 No Education -4.63e- 3
#>  2        2 Myanmar                           104  2020 No Education -4.30e- 2
#>  3        2 Burundi                           108  2020 No Education  1.47e- 1
#>  4        2 Belarus                           112  2020 No Education -5.76e- 4
#>  5        2 Cambodia                          116  2020 No Education -1.19e- 1
#>  6        2 Algeria                            12  2020 No Education -1.63e- 1
#>  7        2 Cameroon                          120  2020 No Education -1.02e- 1
#>  8        2 Canada                            124  2020 No Education  1.36e-20
#>  9        2 Cape Verde                        132  2020 No Education  2.61e- 2
#> 10        2 Central African Republic          140  2020 No Education -3.13e- 1
#> # … with 23,246 more rows
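
Indicators that share the same dimensions (here scenario, country and year) can also be merged into a single table with a join. This is one possible approach and not something the package prescribes. mage_demo and odr_demo are hypothetical stand-ins holding the first three rows of the mage and odr outputs printed in this document.

```r
library(dplyr)

# Stand-ins for two downloads with identical key columns.
# (mage_demo and odr_demo are hypothetical names; values are copied
# from the outputs shown in this document.)
mage_demo <- tibble(
  scenario     = 2,
  name         = c("Bulgaria", "Myanmar", "Burundi"),
  country_code = c(100, 104, 108),
  year         = 2020,
  mage         = c(40.1, 24.6, 11.5)
)
odr_demo <- tibble(
  scenario     = 2,
  name         = c("Bulgaria", "Myanmar", "Burundi"),
  country_code = c(100, 104, 108),
  year         = 2020,
  odr          = c(0.347, 0.0930, 0.0486)
)

# Join on the shared keys so each row carries both indicators
both <- inner_join(mage_demo, odr_demo,
                   by = c("scenario", "name", "country_code", "year"))
both
```

Indicators reported by period (such as nirate) and by year (such as odr) do not share a time column, so they would need to be aligned before joining.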

Previous versions

Previous versions of projections from the Wittgenstein Centre for Demography are available using the version argument in get_wcde(). Set version to "wcde-v1", "wcde-v2" or "wcde-v3" (the default since 2024).

get_wcde(indicator = "etfr",
         country_name = c("Brazil", "Albania"),
         version = "wcde-v2")
#> # A tibble: 204 × 6
#>    scenario name    country_code education          period     etfr
#>       <dbl> <chr>          <dbl> <chr>              <chr>     <dbl>
#>  1        2 Brazil            76 No Education       2015-2020  2.47
#>  2        2 Albania            8 No Education       2015-2020  1.88
#>  3        2 Brazil            76 Incomplete Primary 2015-2020  2.47
#>  4        2 Albania            8 Incomplete Primary 2015-2020  1.88
#>  5        2 Brazil            76 Primary            2015-2020  2.47
#>  6        2 Albania            8 Primary            2015-2020  1.88
#>  7        2 Brazil            76 Lower Secondary    2015-2020  1.89
#>  8        2 Albania            8 Lower Secondary    2015-2020  1.9 
#>  9        2 Brazil            76 Upper Secondary    2015-2020  1.37
#> 10        2 Albania            8 Upper Secondary    2015-2020  1.57
#> # … with 194 more rows

Note that not all indicators and scenarios are available in all versions. See the wic_indicators and wic_scenarios objects, described above, for further details.
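
Availability can be checked before downloading by filtering on the logical version columns. The sketch below uses scen_demo, a hypothetical stand-in that copies the scenario codes and version flags from the wic_scenarios output printed earlier. With wcde loaded, wic_scenarios would be used directly.

```r
library(dplyr)

# Stand-in for wic_scenarios: codes and version flags copied from this document.
# (scen_demo is a hypothetical name.)
scen_demo <- tibble(
  scenario  = c(1, 2, 3, 4, 5, 22, 23, 20, 21),
  `wcde-v3` = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  `wcde-v2` = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  `wcde-v1` = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Scenario codes that can be requested with version = "wcde-v2"
v2_scenarios <- scen_demo %>%
  filter(`wcde-v2`) %>%
  pull(scenario)
v2_scenarios
```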

Server

If you have trouble connecting to the IIASA server, you can try alternative hosts using the server argument in get_wcde(), which can be set to "iiasa" (default), "github" or "1&1".

get_wcde(indicator = "etfr",
         country_name = c("Brazil", "Albania"), 
         version = "wcde-v2", server = "github")
#> # A tibble: 204 × 6
#>    scenario name    country_code education          period     etfr
#>       <dbl> <chr>          <dbl> <chr>              <chr>     <dbl>
#>  1        2 Brazil            76 No Education       2015-2020  2.47
#>  2        2 Albania            8 No Education       2015-2020  1.88
#>  3        2 Brazil            76 Incomplete Primary 2015-2020  2.47
#>  4        2 Albania            8 Incomplete Primary 2015-2020  1.88
#>  5        2 Brazil            76 Primary            2015-2020  2.47
#>  6        2 Albania            8 Primary            2015-2020  1.88
#>  7        2 Brazil            76 Lower Secondary    2015-2020  1.89
#>  8        2 Albania            8 Lower Secondary    2015-2020  1.9 
#>  9        2 Brazil            76 Upper Secondary    2015-2020  1.37
#> 10        2 Albania            8 Upper Secondary    2015-2020  1.57
#> # … with 194 more rows

You may also set server = "search-available" to search through the three possible data locations and download the data from wherever it is available.
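
If you also want to record which host answered, the same idea can be written by hand. The helper below is a hypothetical sketch and not part of wcde. It assumes that a failed download either raises an error or returns NULL, which may not hold for every failure mode. try_servers and its fetch argument are names invented for illustration.

```r
# Hypothetical helper (not part of wcde): call `fetch` with each server name
# in turn and return the first result that is neither an error nor NULL.
try_servers <- function(fetch, servers = c("iiasa", "github", "1&1")) {
  for (s in servers) {
    result <- tryCatch(fetch(s), error = function(e) NULL)
    if (!is.null(result)) {
      return(list(server = s, data = result))
    }
  }
  stop("no data returned by any of: ", paste(servers, collapse = ", "))
}

# Intended use (needs the wcde package and a network connection):
# out <- try_servers(function(s) {
#   get_wcde(indicator = "etfr", country_name = "Brazil", server = s)
# })
# out$server  # which host answered
```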

Working with population data

Population data for a range of age-sex-educational attainment combinations can be obtained by setting indicator = "pop" in get_wcde() and specifying the pop_age, pop_sex and pop_edu arguments. By default, each of the three population breakdown arguments is set to "total":

get_wcde(indicator = "pop", country_name = "India")
#> # A tibble: 17 × 5
#>    scenario name  country_code  year      pop
#>       <dbl> <chr>        <dbl> <dbl>    <dbl>
#>  1        2 India          356  2020 1389966.
#>  2        2 India          356  2025 1445480.
#>  3        2 India          356  2030 1501725.
#>  4        2 India          356  2035 1548067.
#>  5        2 India          356  2040 1583687.
#>  6        2 India          356  2045 1607695.
#>  7        2 India          356  2050 1620358.
#>  8        2 India          356  2055 1625062.
#>  9        2 India          356  2060 1622572.
#> 10        2 India          356  2065 1612143.
#> 11        2 India          356  2070 1594676.
#> 12        2 India          356  2075 1570024.
#> 13        2 India          356  2080 1539493.
#> 14        2 India          356  2085 1504981.
#> 15        2 India          356  2090 1468261.
#> 16        2 India          356  2095 1430167.
#> 17        2 India          356  2100 1391608.

The pop_age argument can be set to "all" to get population data broken down into five-year age groups. The pop_sex argument can be set to "both" to get population data broken down into female and male groups. The pop_edu argument can be set to "four", "six" or "eight" to get population data broken down into education categorizations with different levels of detail:

get_wcde(indicator = "pop", country_code = 900, pop_edu = "four")
#> # A tibble: 85 × 6
#>    scenario name  country_code  year education           pop
#>       <dbl> <fct>        <dbl> <dbl> <fct>             <dbl>
#>  1        2 World          900  2020 Under 15       2012336.
#>  2        2 World          900  2020 No Education    756762.
#>  3        2 World          900  2020 Primary        1208824.
#>  4        2 World          900  2020 Secondary      2883491.
#>  5        2 World          900  2020 Post Secondary  943560.
#>  6        2 World          900  2025 Under 15       2002922.
#>  7        2 World          900  2025 No Education    724867.
#>  8        2 World          900  2025 Primary        1212577.
#>  9        2 World          900  2025 Secondary      3114657.
#> 10        2 World          900  2025 Post Secondary 1096623.
#> # … with 75 more rows
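
Education shares follow from these totals with a grouped mutate(). The sketch below is an illustration: pop_demo is a hypothetical stand-in holding the 2020 and 2025 World rows printed above, and share is a column name chosen here. The Under 15 group is kept in the denominator, so shares are of the total population.

```r
library(dplyr)

# Stand-in for get_wcde(indicator = "pop", country_code = 900, pop_edu = "four").
# (pop_demo is a hypothetical name; values are copied from the output above.)
pop_demo <- tibble(
  year      = rep(c(2020, 2025), each = 5),
  education = rep(c("Under 15", "No Education", "Primary",
                    "Secondary", "Post Secondary"), times = 2),
  pop       = c(2012336, 756762, 1208824, 2883491, 943560,
                2002922, 724867, 1212577, 3114657, 1096623)
)

# Share of each education group within each year
pop_share <- pop_demo %>%
  group_by(year) %>%
  mutate(share = pop / sum(pop)) %>%
  ungroup()
pop_share
```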

The population breakdown arguments can be used in combination to provide further breakdowns, for example sex- and education-specific population totals:

get_wcde(indicator = "pop", country_code = 900, pop_edu = "six", pop_sex = "both")
#> # A tibble: 238 × 7
#>    scenario name  country_code  year sex    education               pop
#>       <dbl> <fct>        <dbl> <dbl> <fct>  <fct>                 <dbl>
#>  1        2 World          900  2020 Male   Under 15           1037900.
#>  2        2 World          900  2020 Male   No Education        308168.
#>  3        2 World          900  2020 Male   Incomplete Primary  197055.
#>  4        2 World          900  2020 Male   Primary             426676.
#>  5        2 World          900  2020 Male   Lower Secondary     623289.
#>  6        2 World          900  2020 Male   Upper Secondary     848609.
#>  7        2 World          900  2020 Male   Post Secondary      484476.
#>  8        2 World          900  2020 Female Under 15            974436.
#>  9        2 World          900  2020 Female No Education        448594.
#> 10        2 World          900  2020 Female Incomplete Primary  186376.
#> # … with 228 more rows

The full age-sex-education specific data can also be obtained by setting indicator = "epop" in get_wcde().

Population pyramids

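Create population pyramids by setting male population values to their negative equivalents, so that the male and female columns diverge from the y-axis. The code below also divides by 1,000 so that values are in millions rather than thousands: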

w <- get_wcde(indicator = "pop", country_code = 900,
              pop_age = "all", pop_sex = "both", pop_edu = "four",
              version = "wcde-v2")
w
#> # A tibble: 6,510 × 8
#>    scenario name  country_code  year age   sex    education          pop
#>       <dbl> <fct>        <dbl> <int> <fct> <fct>  <fct>            <dbl>
#>  1        2 World          900  1950 0--4  Male   Under 15       172362.
#>  2        2 World          900  1950 0--4  Male   No Education        0 
#>  3        2 World          900  1950 0--4  Male   Primary             0 
#>  4        2 World          900  1950 0--4  Male   Secondary           0 
#>  5        2 World          900  1950 0--4  Male   Post Secondary      0 
#>  6        2 World          900  1950 0--4  Female Under 15       166026.
#>  7        2 World          900  1950 0--4  Female No Education        0 
#>  8        2 World          900  1950 0--4  Female Primary             0 
#>  9        2 World          900  1950 0--4  Female Secondary           0 
#> 10        2 World          900  1950 0--4  Female Post Secondary      0 
#> # … with 6,500 more rows

w <- w %>%
  mutate(pop_pm = ifelse(test = sex == "Male", yes = -pop, no = pop),
         pop_pm = pop_pm/1e3)
w
#> # A tibble: 6,510 × 9
#>    scenario name  country_code  year age   sex    education          pop pop_pm
#>       <dbl> <fct>        <dbl> <int> <fct> <fct>  <fct>            <dbl>  <dbl>
#>  1        2 World          900  1950 0--4  Male   Under 15       172362.  -172.
#>  2        2 World          900  1950 0--4  Male   No Education        0      0 
#>  3        2 World          900  1950 0--4  Male   Primary             0      0 
#>  4        2 World          900  1950 0--4  Male   Secondary           0      0 
#>  5        2 World          900  1950 0--4  Male   Post Secondary      0      0 
#>  6        2 World          900  1950 0--4  Female Under 15       166026.   166.
#>  7        2 World          900  1950 0--4  Female No Education        0      0 
#>  8        2 World          900  1950 0--4  Female Primary             0      0 
#>  9        2 World          900  1950 0--4  Female Secondary           0      0 
#> 10        2 World          900  1950 0--4  Female Post Secondary      0      0 
#> # … with 6,500 more rows

Standard plot

Use standard ggplot code to create the population pyramid, with:

scale_x_symmetric() from the lemon package to give the male and female sides equal x-axis ranges
fill colours set to the wic_col4 object in the wcde package, which contains the names of the colours used in the Wittgenstein Centre Human Capital Data Explorer

Note that wic_col6 and wic_col8 objects also exist for equivalent plots of population data with the corresponding number of education categories.

library(lemon)

w %>%
  filter(year == 2020) %>%
  ggplot(mapping = aes(x = pop_pm, y = age, fill = fct_rev(education))) +
  geom_col() +
  geom_vline(xintercept = 0, colour = "black") +
  scale_x_symmetric(labels = abs) +
  scale_fill_manual(values = wic_col4, name = "Education") +
  labs(x = "Population (millions)", y = "Age") +
  theme_bw()
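
If the lemon package is not available, symmetric limits can be computed by hand and passed to scale_x_continuous(). Because geom_col() stacks the education groups, the limits need to come from the stacked total of each bar and not from individual rows. The helper below is a hypothetical sketch (symmetric_limits is not part of wcde or lemon), and pyr_demo is a stand-in built from a few of the 1950 rows printed above.

```r
library(dplyr)

# Stand-in for the pyramid data `w`, including the signed pop_pm column.
# (pyr_demo is a hypothetical name; values are copied from the output above.)
pyr_demo <- tibble(
  year      = 1950,
  age       = "0--4",
  sex       = rep(c("Male", "Female"), each = 2),
  education = rep(c("Under 15", "No Education"), times = 2),
  pop       = c(172362, 0, 166026, 0)
) %>%
  mutate(pop_pm = if_else(sex == "Male", -pop, pop) / 1e3)

# Hypothetical helper: symmetric x limits based on the stacked bar totals
symmetric_limits <- function(d) {
  totals <- d %>%
    group_by(year, age, sex) %>%
    summarise(total = sum(pop_pm), .groups = "drop")
  m <- max(abs(totals$total))
  c(-m, m)
}

lims <- symmetric_limits(pyr_demo)
lims

# In the plot, this would replace scale_x_symmetric(labels = abs):
# scale_x_continuous(limits = lims, labels = abs)
```

When plotting a single year, the helper should be applied to the data after the year filter so that the limits match the bars being drawn.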

Sex label position

Add male and female labels on the x-axis by:

creating a facet plot with the strips at the bottom, with transparent backgrounds and no space between the panels
setting the x-axis to have zero expansion beyond the values in the data, allowing the two sides of the pyramid to meet
adding a geom_blank() to allow for equal x-axes and additional space at the end of the largest columns

w <- w %>%
  mutate(pop_max = ifelse(sex == "Male", -max(pop/1e3), max(pop/1e3)))

w %>%
  filter(year == 2020) %>%
  ggplot(mapping = aes(x = pop_pm, y = age, fill = fct_rev(education))) +
  geom_col() +
  geom_vline(xintercept = 0, colour = "black") +
  scale_x_continuous(labels = abs, expand = c(0, 0)) +
  scale_fill_manual(values = wic_col4, name = "Education") +
  labs(x = "Population (millions)", y = "Age") +
  facet_wrap(facets = "sex", scales = "free_x", strip.position = "bottom") +
  geom_blank(mapping = aes(x = pop_max * 1.1)) +
  theme(panel.spacing.x = unit(0, "pt"),
        strip.placement = "outside",
        strip.background = element_rect(fill = "transparent"),
        strip.text.x = element_text(margin = margin(b = 0, t = 0)))

Animate

Animate the pyramid through the past data and projection periods using the transition_time() function in the gganimate package:

library(gganimate)

ggplot(data = w,
       mapping = aes(x = pop_pm, y = age, fill = fct_rev(education))) +
  geom_col() +
  geom_vline(xintercept = 0, colour = "black") +
  scale_x_continuous(labels = abs, expand = c(0, 0)) +
  scale_fill_manual(values = wic_col4, name = "Education") +
  facet_wrap(facets = "sex", scales = "free_x", strip.position = "bottom") +
  geom_blank(mapping = aes(x = pop_max * 1.1)) +
  theme(panel.spacing.x = unit(0, "pt"),
        strip.placement = "outside",
        strip.background = element_rect(fill = "transparent"),
        strip.text.x = element_text(margin = margin(b = 0, t = 0))) +
  transition_time(time = year) +
  labs(x = "Population (millions)", y = "Age",
       title = 'SSP2 World Population {round(frame_time)}')

Guy J. Abel, Samir K.C., Michaela Potancokova, Claudia Reiter, Andrea Tamburini and Dilek Yildiz