Demographic microsimulations in R using SOCSIM: Modelling population and kinship dynamics

A Member Initiated Meeting at the 2023 Meeting of the Population Association of America (PAA); New Orleans, Apr 12, 2023

1. Setup
2. Running a simulation and retrieving the input demographic rates
3. What can you do with simulation output? Example: estimates of bereavement
4. Learn more
References

Introduction: What is SOCSIM and how does it work?

Socsim was originally developed for Unix at UC Berkeley (E. Hammel et al. 1976), where it has been maintained for decades. The current release of rsocsim aims to be Operating System-agnostic and, for the most part, back-compatible with the original Socsim distribution. For more on the original SOCSIM, see https://lab.demog.berkeley.edu/socsim/ and especially Mason (2016).

The following description of SOCSIM was adapted from the Supplementary Materials of Alburez‐Gutierrez, Mason, and Zagheni (2021). SOCSIM is an open source and extensible demographic microsimulation program. It is written in the C programming language and relies heavily on arrays of linked lists to keep track of kinship relationships and to store information about simulated individuals. The simulator takes as input initial population files and monthly age-specific fertility and mortality demographic rates. The individual is the unit of analysis of the simulator. Each person is subject to a set of rates, expressed as monthly probabilities of events, given certain demographic characteristics, like age and sex. Every month, each individual faces the risk of experiencing a number of events, including childbirth, death, and marriage. The selection of the event and the waiting time until the event occurs are determined stochastically using a competing risk model. Some other constraints are included in the simulation program in order to draw events only for individuals that are eligible for the events (e.g. to allow for a minimum interval of time between births from the same mother, to avoid social taboos such as incest, etc.). Each event for which the individual is at risk is modeled as a piece-wise exponential distribution. The waiting time until each event occurs is randomly generated according to the associated demographic rates. The individual’s next event is the one with the shortest waiting time.

At the end of the simulation, population files that contain a list of everyone who has ever lived in the population are created. In these files, each individual is an observation in a rectangular data file with records of demographic characteristics for the individual, and identification numbers for key kinship relations. SOCSIM models “closed” populations: Individuals may enter and exit the simulation only by (simulated) birth and death.

SOCSIM has been used extensively in social science research to study, among other things, dynamics of kin availability (E. A. Hammel 2005; Kenneth W. Wachter 1997; Verdery and Margolis 2017), generational overlap (Margolis and Verdery 2019; Alburez‐Gutierrez, Mason, and Zagheni 2021), and kin loss (Zagheni 2011; Verdery et al. 2020; Snyder et al. 2022).

1. Setup

We recommend that you go through point 1 of this tutorial before the start of the workshop. If you have any questions/difficulties, please get in touch with the workshop coordinator (Diego Alburez).

1.1. Installation

Install the development version of rsocsim from GitHub (https://github.com/MPIDR/rsocsim). We may have made changes to the rsocsim package ahead of this workshop. To be on the safe side, if you had already installed the package, please uninstall it and and install it again.

# remove.packages("rsocsim")
# install.packages("devtools")
devtools::install_github("MPIDR/rsocsim")

Let’s see if everything is working fine. rsocsim has a simple built-in simulation that you can run to see if the package was installed correctly. For now, let’s run the code without focusing on the technical details:

library(rsocsim)

# create a new folder for all the files related to a simulation.
# this will be in your home- or user-directory:
folder = rsocsim::create_simulation_folder()

# create a new supervisory-file. supervisory-files tell socsim what
# to simulate. create_sup_file will create a very basic supervisory file
# and it copies some rate-files that will also be needed into the 
# simulation folder:
supfile = rsocsim::create_sup_file(folder)

# Choose a random-number seed:
seed = 300

# Start the simulation:
rsocsim::socsim(folder,supfile,seed)

## [1] "Start run1simulationwithfile"
## [1] "C:/Users/calderonbernal/socsim/socsim_sim_8792"
## [1] "300"
## Start socsim
## start socsim main. MAXUYEARS: 200; MAXUMONTHS: 2400
## ratefile: socsim.sup
## 
## v18a!-command-line argv[0]: zerothArgument| argv[1]: socsim.sup| argv[2]: 300| argv[3]: 1
## random_number seed: 300| command-line argv[1]: socsim.sup| argv[2]: 300
## compatibility_mode: 1| command-line argv[3]: 1
## Socsim Version: STANDARD-UNENHANCED-VERSION
## initialize_segment_vars
## initialize_segment_vars done
## 18b - loading -.sup-file: socsim.sup
## 18b-load.cpp->load . socsim.sup
## #load.cpp->load 4. socsim.sup
## 18b-load.cpp->load . SWEfert2022
## #load.cpp->load 4. SWEfert2022
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## 18b-load.cpp->load . SWEmort2022
## #load.cpp->load 4. SWEmort2022
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## Incomplete rate set, will add rate till MAXUYEARS (death-->1.0; others:0.0) 12
## ------------4marriage_eval == DISTRIBUTION . socsim.sup
## ------------6------------7
##  output file names:
##  init_new.opop|init_new.omar|init_new.opox|sim_results_socsim.sup_300_/result.pyr|sim_results_socsim.sup_300_/result.stat|init_new.otx|
## fix pop pointers..
## Starting month is 601
## Initial size of pop 8000  (living: 8000)
## ------------aa3s------------aa32New events generated for all living persons
## ------------b1month:  700 PopLive:  9414 Brths:  16 Dths:   0 Mrgs: 11 Dvs:  0 Mq: 3728 Fq:0 ti1: 0.3 ti2: 0.007000 0.5037
## month:  800 PopLive: 10926 Brths:  12 Dths:   1 Mrgs:  6 Dvs:  0 Mq: 3890 Fq:0 ti1: 0.4 ti2: 0.008000 0.5287
## month:  900 PopLive: 12260 Brths:  14 Dths:   0 Mrgs:  4 Dvs:  0 Mq: 4031 Fq:0 ti1: 0.6 ti2: 0.006000 0.3693
## month: 1000 PopLive: 13397 Brths:   9 Dths:   2 Mrgs:  4 Dvs:  0 Mq: 4134 Fq:0 ti1: 0.7 ti2: 0.006000 0.3511
## month: 1100 PopLive: 14172 Brths:  16 Dths:   6 Mrgs:  6 Dvs:  0 Mq: 4135 Fq:0 ti1: 0.9 ti2: 0.008000 0.4679
## month: 1200 PopLive: 14518 Brths:  13 Dths:  11 Mrgs:  6 Dvs:  0 Mq: 4000 Fq:0 ti1: 1.0 ti2: 0.006000 0.3750
## month: 1300 PopLive: 14323 Brths:  14 Dths:  20 Mrgs:  4 Dvs:  0 Mq: 3891 Fq:0 ti1: 1.2 ti2: 0.008000 0.5284
## month: 1400 PopLive: 13816 Brths:  13 Dths:  15 Mrgs:  4 Dvs:  0 Mq: 3746 Fq:0 ti1: 1.4 ti2: 0.008000 0.5701
## month: 1500 PopLive: 13330 Brths:  11 Dths:  11 Mrgs:  5 Dvs:  0 Mq: 3679 Fq:0 ti1: 1.6 ti2: 0.009000 0.6649
## month: 1600 PopLive: 12944 Brths:  10 Dths:  15 Mrgs:  4 Dvs:  0 Mq: 3593 Fq:0 ti1: 1.7 ti2: 0.008000 0.6197
## month: 1700 PopLive: 12525 Brths:  10 Dths:  20 Mrgs:  5 Dvs:  0 Mq: 3436 Fq:0 ti1: 1.9 ti2: 0.007000 0.5929
## month: 1800 PopLive: 12009 Brths:  10 Dths:  16 Mrgs:  7 Dvs:  0 Mq: 3275 Fq:0 ti1: 2.0 ti2: 0.008000 0.7459
## 
## 
## Socsim Main Done
## Socsim Done.
## [1] "restore previous working dir: U:/SOCSIM/rsocsim_workshop_paa"

## [1] 1

You should see a long output in the console, at the end of which is the message “Socsim done”. Ignore the two warning messages (‘can’t open transition history file. Hope that’s OK’). They are expected.

For more details on the package, the SOCSIM simulator, its history and applications, see rsocsim’s website: https://mpidr.github.io/rsocsim/.

1.2. Install other packages

We will need these other for this workshop (if you don’t have them already, please install them):

# devtools::install_github("MPIDR/rsocsim")
library(rsocsim)
# install.packages("tidyverse")
library(tidyverse) #For data wrangling
# install.packages("ggh4x")
library(ggh4x) #for extended facet plotting
# install.packages("data.table") #for large datasets
library(data.table)

2. Running a simulation and retrieving the input demographic rates

Instructor: Liliana P. Calderón-Bernal, https://twitter.com/lp_calderonb

Objectives

Use data from the Human Fertility Database (HFD) and Human Mortality Database (HMD) to create a simulation from scratch
Run a simple SOCSIM simulation for the USA
Estimate age-specific fertility and mortality rates using the SOCSIM-generated microdata
Check that the ‘output’ SOCSIM rates match the ‘input’ HFD/HMD rates

2.1. The simulation: supervisory and rate files

To provide an example of rsocsim usage, we will run a SOCSIM microsimulation for the United States of America (USA) over the period 1933-2020. This microsimulation uses as input: age-specific fertility rates by calendar year and year of age downloaded from the Human Fertility Database (HFD), and age-specific probabilities of death by sex downloaded from the Human Mortality Database (HMD). To run the simulation, the original HFD and HMD data must be first converted to monthly rates/probabilities and SOCSIM format as described in Socsim Oversimplified (c.f. Mason 2016 for details). The rates corresponding to each year are specified in the supervisory file socsim_USA.sup, provided in this repository, which will be used to run the simulation. The first segment of the simulation runs for 100 years (1200 months) to produce a stable age structure, based on HFD rates and HMD probabilities for 1933. Each of the following segments runs for one year (12 months) with the corresponding fertility and mortality files.

2.2. Write the input rate files for a SOCSIM simulation for the USA, 1933-2020

The structure of the rate files is essential to run SOCSIM. According to Socsim Oversimplified (c.f. Mason 2016, 26–28), a rate block is a complete set of age speciﬁc rates governing a demographic event for people of a particular sex, group and marital status. In the rate files, the order is always event (birth, death, marriage), group (1, ..), sex (F or M), and marital status (married, single, divorced, widowed). For the birth rates, this can be followed by number indicating parity. Each subsequent line contains a one month rate or probability and the age interval over which it holds (years and months of the upper age bound). The interval includes the upper age bound in previous line and ends just before the upper age bound in current line. The first two numbers are years and months of the upper age bound, which are added together, e.g. upper age bound of “1 0” or “0 12” refers to the first year of life [0,1).

Let’s load some functions we prepared to write SOCSIM rate files from HFD and HMD.

# Load functions to write SOCSIM rate files
source("functions.R")

2.2.1. Fertility files for SOCSIM

For the input Age-Specific Female Fertility Rates, we keep the whole age range included in HFD [12-55], but limit the open-ended age intervals 12- and 55+ to one year, i.e. [12-13) and [55-56). Additionally, SOCSIM requires a final age category with an upper bound corresponding to the maximum length of life, that we fixed to [56-110], to be consistent with HMD data.

To convert the HFD annual fertility rates to ‘SOCSIM’ monthly fertility rates, we assume that they are constant over the year and divide them by 12 months. The are identical for all marital status, but are specified for single and married women in the rate files. Other marital status (divorced, widowed, cohabiting) follow SOCSIM’s rate default rules.

For this exercise, we will use Period Age-specific fertility rates all birth orders combined, by calendar year and age for the desired country (saved in Input/asfrRR.txt), downloaded from the HFD website.

The function below converts HFD data to SOCSIM format following the procedure explained above. To run it, type the name of the country as used in HFD (here, “USA”).

write_socsim_rates_HFD(Country = "USA")

2.2.2. Mortality files for SOCSIM

For the input Age-Specific Probabilities of Death, we keep the whole age range included in HMD [0-110+] but limit the open-ended age interval 110+ to one year, i.e. [110-111). Originally, SOCSIM had an upper bound of 100 years, but in rsocsim the maximum age has been extended to 200 years.

We use the probabilities of death (qx) by sex from from HMD period life tables, that had been already smoothed for ages 80 and older, at which observed death rates M(x) display considerable random variation (For details, see the HMD Methods Protocol Wilmoth et al. 2021, 34–36).

To convert the annual HMD probabilities to monthly ‘SOCSIM’ probabilities, we assume that the probability of dying is constant over the year and use the formula $1-(1-_nq_x)^{1/n}$ proposed by Kenneth Watcher (Kenneth W. Wachter 2014, 53). For the open-ended age interval 110+, monthly probabilities are equal to 1/12. The probabilities of death are identical for all marital status of each sex, and are only specified for single women and single men in the rate files. Other marital status will follow SOCSIM’s rate default rules.

To write the input mortality files, please use the file Input/ltfper_1x1.txt (period life tables 1x1, Age interval × Year interval, for females and males for the U.S.A) downloaded from the HMD website.

The function below converts HMD data to SOCSIM format following the procedure explained above. To run it, type the name of the country as used in HMD (here, “USA”).

write_socsim_rates_HMD(Country = "USA")

2.3. Create initial population and marriage files

Now that we have the rate files, let’s create the initial population (.opop) and marriage (.opop) files. The initial .opop will have a size of 20000, with randomly assigned sex and dates of birth, and group 1 for all. Other columns will be completed during the simulation. The initial .omar will be an empty file.

# Set size of initial population
size_opop <-  20000

# Create data.frame with 14 columns and nrows = size_opop
presim.opop <- setNames(data.frame(matrix(data = 0, ncol = 14, nrow = size_opop)), 
                        c("pid","fem","group","nev","dob","mom","pop",
                          "nesibm","nesibp","lborn","marid","mstat","dod","fmult"))

# Add pid 1:sizeopop
presim.opop$pid <- 1:size_opop

# Add sex randomly
presim.opop$fem <- sample(0:1, nrow(presim.opop), replace = T)

# Add group 1 for all individuals
presim.opop$group <- 1

# Add random dates of birth (max age around 50)
presim.opop$dob <- sample(600:1200, nrow(presim.opop), replace = T)

# Write initial population for pre-simulation (without fertility multiplier)
write.table(presim.opop, "presim.opop", row.names = F, col.names = F)


## Create an empty data frame for presim.omar
presim.omar <- data.frame()

# Write empty omar for pre-simulation
write.table(presim.omar, "presim.omar", row.names = F, col.names = F)

2.4. Run a SOCSIM simulation using the `rsocsim` package

Now that we have created the input rate files and the initial population and marriage files, we can run the simulation. To use the socsim function, we need to specify the folder where the supervisory and the rate files are, the name of the supervisory file for the rates retrieved from HFD and HMD (1933-2020), and a seed number for the simulation.

# Specify the folder where the supervisory and the rate files are. 
# If the R session is running through the project, you can use the following command. 
folder <- getwd()

# Type the name of the supervisory file  stored in the above folder:
supfile <- "socsim_USA.sup"

# Choose a seed number (today's date) for the simulation:
seed <- "120423"

# Run a single SOCSIM-simulation with a given folder and the provided supervisory file
# using the "future" process method
rsocsim::socsim(folder, supfile, seed, process_method = "future")

The simulation results are ready. We can use the read_opop and read_omar functions from the rsocsim package to import the output population and marriage files into R.

## Read the opop file using the read_opop function
opop <- rsocsim::read_opop(folder = getwd(), supfile = "socsim_USA.sup", 
                           seed = "120423", suffix = "",  fn = NULL)

## [1] "read population file: U:/SOCSIM/rsocsim_workshop_paa/sim_results_socsim_USA.sup_120423_/result.opop"

## Read the omar file using the read_opop function
omar <- rsocsim::read_omar(folder = getwd(), supfile = "socsim_USA.sup", 
                           seed = "120423", suffix = "",  fn = NULL)

## [1] "read marriage file: U:/SOCSIM/rsocsim_workshop_paa/sim_results_socsim_USA.sup_120423_/result.omar"

Let’s have a glimpse on the simulated population and marriage files.

The population file (.opop) contains one record for each individual who was ever alive during the simulation. Each record (row) provides the following information for a given individual: person id (pid), sex (fem, 1 for female and 0 for male), group identifier (group), next scheduled event (nev), date of birth (dob), person id of mother (mom), person id of father (pop), person id of next eldest sibling through mother (nesibm), person id of next eldest sibling through father (nesibp), person id of last born child (lborn), id of most recent marriage in .omar ﬁle (marid), marital status at end of simulation (mstat), date of death (dod, or 0 if alive at end of simulation) and fertility multiplier (fmult). For more information on what these columns are, see: https://lab.demog.berkeley.edu/socsim/CurrentDocs/socsimOversimplified.pdf.

head(opop)

##   pid fem group nev  dob mom pop nesibm nesibp lborn marid mstat  dod    fmult
## 1   1   1     1  65  699   0   0      0      0     0     0     1 1502 0.172002
## 2   2   0     1  65 1070   0   0      0      0     0     0     1 1880 0.000000
## 3   3   1     1  65  857   0   0      0      0 23335  2414     3 1715 0.605228
## 4   4   0     1  65  859   0   0      0      0 23000  2233     4 1752 0.000000
## 5   5   1     1  65 1152   0   0      0      0 23947  2705     4 1820 0.701332
## 6   6   0     1  65 1069   0   0      0      0 28769  4770     3 2135 0.000000

The marriage file (.omar) contains one record for each marriage. Each marriage record provides the following information: marriage id number (mid), wife’s person id (wpid), husband’s person id (hpid), date marriage began (dstart), date marriage ended (dend, it’s 0 if in force at the end of simulation), reason marriage ended (rend), marriage id of wife’s next most recent prior marriage (wprior), marriage id of husband’s next most recent prior marriage (hprior).

head(omar)

##   mid  wpid  hpid dstart dend rend wprior hprior
## 1   1  4254 14055   1201 1535    3      0      0
## 2   2 12169  2424   1201 1680    3      0      0
## 3   3 19824 14932   1201 1565    3      0      0
## 4   4  6919  5970   1201 1514    3      0      0
## 5   5 12475 11182   1201 1609    3      0      0
## 6   6  1628 12760   1201 1700    3      0      0

2.5. Estimate age-specific rates from the SOCSIM microsimulation

To check the accuracy of our microsimulation for the USA, we can estimate and compare input and output age-specific fertility and mortality rates (ASFR and ASMR) by sex. For this purpose, we can use the estimate_fertility_rates and estimate_mortality_rates functions from the package. Please note that these calculate age-specific rates for both fertility and mortality, as they use mid-year population size by sex and age as an estimate of person-years lived during the year. Hence, the population in the denominator includes all those who were born before the $1^{st}$ of July of a given year and die in or after July of that year or are still alive at the end of the simulation. Due to the limited population size in a microsimulation (especially, of survivors at old ages), sometimes few or no individuals of a specific age are alive at a exact point in time (here, 1st July). Hence, it is possible to obtain rates higher than 1, equal to 0 (0 Events/Population), infinite (Events/0 Population) and NaN (0 Events/0 Population) values with these functions.
To run them, we need to define some arguments: the .opop file opop, final year of the simulation final_sim_year, the minimum and maximum years for which we want to estimate the rates, [year_min and year_max), the size of the year group year_group, the minimum and maximum age for fertility [age_min_fert and age_max_fert) or the maximum age for mortality age_max_mort) and the size of the age group age_group. We can compute the rates by different year and age groups (e.g., 1, 5, or 10), but please check that the minimum and maximum years and ages correspond to the groups size. Let’s run the code.

# Estimate age-specific fertility rates
asfr_sim <- rsocsim::estimate_fertility_rates(opop = opop,
                                             final_sim_year = 2020, #[Jan-Dec]
                                             year_min = 1935, # Closed [
                                             year_max = 2020, # Open )
                                             year_group = 5, 
                                             age_min_fert = 10, # Closed [
                                             age_max_fert = 55, # Open )
                                             age_group = 5) #[,)

# Estimate age-specific mortality rates
asmr_sim <- rsocsim::estimate_mortality_rates(opop = opop, 
                                             final_sim_year = 2020, # [Jan-Dec]
                                             year_min = 1935, # Closed
                                             year_max = 2020, # Open )
                                             year_group = 5,
                                             age_max_mort = 110, # Open )
                                             age_group = 5) #[,)

We can now plot our SOCSIM-derived estimates of ASFR and ASMR for women in some selected years.

# Choose years to plot (in intervals).
yrs_plot <- c("[1935,1940)", "[1980,1985)", "[2015,2020)") 

# Get the age levels to define them before plotting and avoid wrong order
age_levels <- levels(asmr_sim$age)

bind_rows(asfr_sim %>%
            mutate(rate = "ASFR",                   
                   sex = "female"),
          asmr_sim %>% 
            mutate(rate = "ASMR") %>% 
            filter(sex == "female")) %>% 
  mutate(age = factor(as.character(age), levels = age_levels)) %>% 
  # Filter rates of 0, infinite (N_Deaths/0_Pop) and NaN (0_Deaths/0_Pop) values
  filter(socsim !=0 & !is.infinite(socsim) & !is.nan(socsim)) %>% 
  filter(year %in% yrs_plot) %>% 
  ggplot(aes(x = age, y = socsim, group = year, colour = year)) +
  geom_line(linewidth = 1) +
  facet_wrap(. ~ rate, scales = "free") + 
  facetted_pos_scales(y = list(ASFR = scale_y_continuous(),
                               ASMR =  scale_y_continuous(trans = "log10"))) +
  theme_bw() +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  labs(title = "Age-Specific Fertility and Mortality Rates for Women in the USA \n retrieved from a SOCSIM microsimulation, 1933-2020",
       x = "Age", y = "Estimate")

2.6. Compare age-specific rates SOCSIM vs HFD/HMD

As an additional check, we can compare the age-specific fertility and mortality rates retrieved from the SOCSIM output with the original HFD and HMD data used as input for our microsimulation. Since our estimates of fertility and mortality rates from SOCSIM output are grouped by 5-year age groups and 5-year calendar periods in the previous example, we should first estimate the aggregated measures (5x5) from the original HFD and HMD data. For plotting the data together, we also need some data wrangling.

Let’s start with fertility estimates.

# Extract year and age breaks used in the estimate_fertility_rates() to apply the same values to HFD data
# Year breaks. Extract all the unique numbers from the intervals. 
year_breaks_fert <- unique(as.numeric(str_extract_all(asfr_sim$year, "\\d+", simplify = T)))

# Year range to filter HFD data
year_range_fert <- min(year_breaks_fert):max(year_breaks_fert-1)

# Age breaks of fertility rates. Extract all the unique numbers from the intervals 
age_breaks_fert <- unique(as.numeric(str_extract_all(asfr_sim$age, "\\d+", simplify = T)))

# Read the same HFD file used to write the input fertility files
HFD <- read.table(file = "Input/USAasfrRR.txt", 
                  as.is=T, header=T, skip=2, stringsAsFactors=F)

# Wrangle data and compute monthly fertility rates
HFD <- HFD %>% 
  mutate(Age = case_when(Age == "12-" ~ "12",
                         Age == "55+" ~ "55", 
                         TRUE ~ Age),
         Age = as.numeric(Age)) %>% 
  filter(Year %in% year_range_fert) %>% 
  mutate(year = cut(Year, breaks = year_breaks_fert, 
                    include.lowest = F, right = F, ordered_results = T),
         age = cut(Age, breaks = age_breaks_fert, 
                   include.lowest = F, right = F, ordered_results = T)) %>% 
  filter(!is.na(age)) %>% 
  group_by(year, age) %>%
  summarise(ASFR = mean(ASFR)) %>%
  ungroup() %>%
  mutate(Source = "HFD/HMD", 
         Rate = "ASFR")

# Wrangle SOCSIM data
SocsimF <- asfr_sim %>% 
  rename(ASFR = socsim) %>% 
  mutate(Source = "SOCSIM",
         Rate = "ASFR")

Then, the mortality estimates. Here, we compare the SOCSIM mid-year ASMR with central death rates from HMD life tables.

# Extract year and age breaks used in the estimate_mortality_rates() to apply the same values to HMD data
# Year breaks. Extract all the unique numbers from the intervals 
year_breaks_mort <- unique(as.numeric(str_extract_all(asmr_sim$year, "\\d+", simplify = T)))

# Year range to filter HMD data
year_range_mort <- min(year_breaks_mort):max(year_breaks_mort-1)

# Age breaks of mortality rates. Extract all the unique numbers from the intervals 
age_breaks_mort <- unique(as.numeric(str_extract_all(asmr_sim$age, "\\d+", simplify = T)))


# Read the same HMD female life table 1x1 file used to write the input mortality files
ltf <- read.table(file="Input/fltper_1x1.txt",
                  as.is=T, header=T, skip=2, stringsAsFactors=F)

# Read the same HMD male life table 1x1 file used to write the input mortality files
ltm <- read.table(file="Input/mltper_1x1.txt",
                  as.is=T, header=T, skip=2, stringsAsFactors=F)

# Wrangle HMD life tables
HMD <- ltf %>%
  select(Year, Age, mx) %>%
  mutate(Sex = "Female") %>% 
  bind_rows(ltm %>% 
              select(Year, Age, mx) %>%  
              mutate(Sex = "Male")) %>% 
  mutate(Age = ifelse(Age == "110+", "110", Age),
         Age = as.numeric(Age)) %>% 
  filter(Year %in% year_range_mort) %>% 
  mutate(year = cut(Year, breaks = year_breaks_mort, 
                    include.lowest = F, right = F, ordered_results = T),
         age = cut(Age, breaks = age_breaks_mort, 
                   include.lowest = F, right = F, ordered_results = T)) %>%
  filter(!is.na(age)) %>% 
  group_by(year, Sex, age) %>% 
  summarise(mx = mean(mx)) %>%
  ungroup() %>% 
    mutate(Source = "HFD/HMD",
           Rate = "ASMR")

# Wrangle SOCSIM data
SocsimM <- asmr_sim %>% 
  rename(mx = socsim) %>% 
  mutate(Sex = ifelse(sex == "male", "Male", "Female"),
         Source = "SOCSIM",
         Rate = "ASMR") %>% 
  select(year, Sex, age,  mx, Source, Rate)

Finally, we can plot together the age-specific fertility and mortality rates (ASFR and ASMR) from SOCSIM vs HFD and HMD for women in the USA.

## Plotting ASFR and ASMR (for females) from HFD/HMD vs SOCSIM 
bind_rows(HFD %>% rename(Estimate = ASFR), 
          SocsimF %>% rename(Estimate = ASFR)) %>% 
  mutate(Sex = "Female") %>%   
  bind_rows(HMD %>% rename(Estimate = mx),
            SocsimM %>% rename(Estimate = mx)) %>% 
  # Filter rates of 0, infinite (N_Deaths/0_Pop) and NaN (0_Deaths/0_Pop) values
  filter(Estimate != 0 & !is.infinite(Estimate) & !is.nan(Estimate)) %>% 
  filter(Sex == "Female") %>% 
  mutate(age = factor(as.character(age), levels = age_levels)) %>%
  filter(year %in% yrs_plot) %>% 
  ggplot(aes(x = age, y = Estimate, group = interaction(year, Source)))+
  facet_wrap(. ~ Rate, scales = "free") + 
  geom_line(aes(colour = year, linetype = Source), linewidth = 1)+
  scale_linetype_manual(values = c("HFD/HMD" = "solid","SOCSIM" = "11")) +
  facetted_pos_scales(y = list(ASFR = scale_y_continuous(),
                               ASMR =  scale_y_continuous(trans = "log10")))+
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  labs(title = "Age-Specific Fertility and Mortality rates in the USA (1933-2020) \n retrieved from HFD, HMD and a SOCSIM simulation", 
       x = "Age") + 
  theme_bw()

3. What can you do with simulation output? Example: estimates of bereavement

Instructor: Mallika Snyder, https://twitter.com/mallika_snyder

Objectives

Learn how to analyze SOCSIM output
Understand how SOCSIM records kin ties
Use SOCSIM output from the previous section to describe levels of family bereavement in the USA

After running our simulation code, SOCSIM provides us with a synthetic genealogy for the entire population. We know when people are born, when they die, when they marry, and who their parents and spouse are. What can we do with this information?

One advantage of using a tool like SOCSIM is that we can identify extended kin networks, especially for more distant kin that would be hard to link together using a census or a household survey. For example, how straightforward would it be to find an individual’s great-grandmother in a household survey, assuming odds of co-residence are not high? Of course, some kinds of kin are easier to identify than others, based on the information we have in SOCSIM (and some may be impossible to find). When identifying kin, we use lateral kin relationships—such as parents and children–or affinal kin relationships—such as spouses. This is the principle behind the retrieve_kin function included in this package, which can be used to identify kin from a variety of kin relationships (parents, grandparents, spouses, children, and much more). You can read the full documentation for this function by typing ?retrieve_kin into your RStudio console. An important note is that the kinship terms used in this function are currently North America-specific: other regions may have very different kinship systems, which may influence kin availability.

In addition to enabling us to more easily identify networks, SOCSIM also provides us with valuable information about the timing of vital events, like births, deaths, and marriages, in these networks. This can help us connect changes in the vital rates that go into SOCSIM to changes in the networks that our simulated people have available to them, often at a very fine temporal level (months or years). In this workshop, we will focus on just one example–kin loss–but there are many more possible ways to use SOCSIM to study these and other questions about kinship dynamics.

In previous research, SOCSIM has been especially helpful for studying kin loss in connection with mortality change (Zagheni 2011; Verdery et al. 2020; Snyder et al. 2022). Here we will focus on a simple example, changes in the rate of experiencing the loss of a relative over time. We will define as our reference group women aged 25-39 and alive in a given year: thus, the sample frame will change in each year. Note that the denominator is all women, not just those with a relative alive.

In our code, we will approach this by writing a function that we will loop over each year of interest. The function will take in a year, look for the individuals of the relevant age who are alive that year, identify the years of death of their kin, and then calculate the proportion of our sample who experienced a loss. For further intuition, we will also calculate the share of women who had a relative alive in the previous year. The code is a simplified version of functions used in a previous research paper (Snyder et al. 2022).

3.1. Getting our estimates

The first thing we require is a helper function to identify the year of interest.

#We won't discuss how this function works, but it uses the final simulation year (January-December)
#and the last simulated month to convert our monthly dates into yearly ones
asYr <- function(month, last_month=last_month, final_sim_year=final_sim_year) {
  return(final_sim_year - trunc((last_month - month)/12))
}

Now we can load the data and define a few relevant parameters.

## Read the opop file using the read_opop function
opop <- rsocsim::read_opop(folder = getwd(), supfile = "socsim_USA.sup", 
                           seed = "120423", suffix = "",  fn = NULL)

## [1] "read population file: U:/SOCSIM/rsocsim_workshop_paa/sim_results_socsim_USA.sup_120423_/result.opop"

## Read the omar file using the read_opop function
omar <- rsocsim::read_omar(folder = getwd(), supfile = "socsim_USA.sup", 
                           seed = "120423", suffix = "",  fn = NULL)

## [1] "read marriage file: U:/SOCSIM/rsocsim_workshop_paa/sim_results_socsim_USA.sup_120423_/result.omar"

#Parameters specific to this simulation: will need to be changed
last_month <- max(opop$dob) # Last simulated month
final_sim_year <- 2020

#Cleaning our population file
opop <- opop %>% 
 #Fixing dates of death for individuals still living at the end
  mutate(dod = if_else(dod == 0, last_month, dod)) %>%
  #Dates of birth and death in years for both individual and parents
  mutate(dob_year = asYr(dob, last_month=last_month, final_sim_year=final_sim_year),
         dod_year = asYr(dod, last_month=last_month, final_sim_year=final_sim_year))

Now that we’ve generated the variables we want, let’s create a simple function that we can loop through.

#Estimate annual kin loss for women aged 25-39 in a particular year
getKinLoss <- function(year_of_interest, 
                       opop = opop, omar = omar){
pid_data <- opop %>%
  #Remove those not alive in the year of interest
  filter(data.table::between(lower = dob_year, upper = dod_year, 
                             year_of_interest, incbounds = FALSE),
  #Filter to women
         fem == 1) %>%
  #Find the age of individuals alive
  mutate(age = year_of_interest - dob_year) %>%
  #Filter to women aged 25-39
  filter(data.table::between(lower = 25, upper = 39, age, 
                             incbounds = TRUE))
#Find vector of person IDs
pid_vec <- pid_data$pid

#Use the built-in retrieve_kin function
#We will not identify extended kin relationships (extra_kintypes)
#Or separate kin by sex, but both can be estimated
kin <- rsocsim::retrieve_kin(opop = opop, omar = omar, pid = pid_vec, 
             extra_kintypes = NULL, kin_by_sex = FALSE)

#From retrieve_kin, we get a list of lists named after the type of kin
#We loop over all types of kin included
temp <- bind_rows(lapply(1:length(kin), function(x) {
#Find IDs of kin
kin2 <- unlist(kin[[x]], use.names = FALSE)
#Attach associated person IDs
names(kin2) <- rep(pid_vec, unlist(lapply(kin[[x]], length)))
#Find the years of birth of kin
attr(kin2, "kin.dob.year") <- opop$dob_year[match(kin2, opop$pid)]
#Now find the years of death of kin
attr(kin2, "kin.dod.year") <- opop$dod_year[match(kin2, opop$pid)]
#Use this to find which kin were alive in the previous year
attr(kin2, "kin.alive.before.year_of_interest") <- 
  if_else(data.table::between(lower = attr(kin2, "kin.dob.year"), 
                              upper = attr(kin2, "kin.dod.year"), 
                             (year_of_interest-1), NAbounds = NA, 
                             incbounds = FALSE), 
          TRUE, FALSE)
#And use that to find which kin have died in this year
attr(kin2, "kin.death.year_of_interest") <- 
  if_else(attr(kin2, "kin.dod.year") %in% year_of_interest, TRUE, FALSE)

#Find the number of people who had a living relative in the previous year
pid.withkin <- length(unique(names(kin2[attr(kin2,
                                             "kin.alive.before.year_of_interest") 
                                        == TRUE])))

#Find the number of individuals who have lost a relative
pid.losekin <- length(unique(names(kin2[attr(kin2, "kin.death.year_of_interest") 
                                        == TRUE])))
#Find the total number of individuals
pid.total <- length(pid_vec)

#Generate output as a tibble
output <- tibble(kintype = names(kin[x]),
                 year = year_of_interest,
                 n_total = pid.total,
                 n_losekin = pid.losekin,
                 n_withkin = pid.withkin,
                 pc_losekin_total = 100*(n_losekin/n_total),
                 pc_withkin = 100*(n_withkin/n_total))
return(output)
}))
}

Now loop through this function to generate a table with our data for each year. To limit the time this code will take to run, we’ll look estimate kin loss at 10-year intervals between 1950 and 2010.

#Loop through and bind rows
full_data <- bind_rows(lapply(seq.int(from = 1950, to = 2010, 
                                      by = 10), 
                              function(x) getKinLoss(year_of_interest = x, 
                                                      opop = opop, 
                                                     omar = omar)))

3.2. Examining our estimates

#Clean this data slightly
kinloss_data <- full_data %>%
  #Clean names of kin
  mutate(kintype = if_else(kintype == "ggparents", "Great-grandparents", 
                           if_else(kintype == "gparents", "Grandparents", 
                                   if_else(kintype == "gchildren", "Grandchildren",
                                           kintype))),
         kintype = str_to_title(kintype))

Plot population-level kin loss over time.

kinloss_data %>% 
 #Create plot
  ggplot(aes(x = year, y = pc_losekin_total, color = kintype)) + 
  geom_line() +
  labs(x = "Year", y = "% of women aged 25-39 who lose a relative", 
       color = "Type of kin lost") +
  #Fix the axes
  scale_x_continuous(expand = c(0,0)) + 
  scale_y_continuous(expand = c(0,0)) +
  #Set a theme
  theme_bw()

These results show us how rates of parental loss have changed for women over time: as we would expect for this age group, the highest rates of kin loss would be for grandparents and parents. The increasing rate of grandparental loss over time, along with a similar trend for great-grandparents, could relate to increasing life expectancies during this period: as grandparents live to older ages, the chance that an individual aged 25-39 has a living grandparent increases. This is brought out in the next plot, which shows changes in kin availability (the share of the population with a living relative in the year prior) over time.

#Plot changes in kin availability
kinloss_data %>% 
 #Create plot
  ggplot(aes(x = year, y = pc_withkin, color = kintype)) + 
  geom_line() +
  labs(x = "Year", 
       y = "% of women aged 25-39 with a living relative in year (t-1)",
       color = "Type of relative") +
  #Fix the axes
  scale_x_continuous(expand = c(0,0)) + 
  scale_y_continuous(expand = c(0,0)) +
  #Set a theme
  theme_bw()

To discuss: What are you interested in learning about, and how do you think SOCSIM could help you in your work?

4. Learn more

The ‘SOCSIM Oversimplified’ manual written by Carl Mason (2016) is a great starting point to learn more about how SOCSIM works: https://lab.demog.berkeley.edu/socsim/CurrentDocs/socsimOversimplified.pdf.

You can find more information about the rsocsim in the package’s website: https://mpidr.github.io/rsocsim/, including documentation for the functions currently available and work in progress.

We are working on preparing extended vignettes with exercises and examples for using rsocsim. Follow us on twitter for updates. Please also get in touch if you’d like to use rsocsim for your work; we want to hear from you!

Tom Theile: https://twitter.com/TheileTom
Diego Alburez: https://twitter.com/d_alburez
Liliana P. Calderón-Bernal, https://twitter.com/lp_calderonb
Mallika Snyder: https://twitter.com/mallika_snyder
Emilio Zagheni: https://twitter.com/ezagheni

References

Alburez‐Gutierrez, Diego, Carl Mason, and Emilio Zagheni. 2021. “The ‘Sandwich Generation’ Revisited: Global Demographic Drivers of Care Time Demands.” Population and Development Review 47 (4): 997–1023. https://doi.org/10.1111/padr.12436.

Hammel, E. A. 2005. “Demographic Dynamics and Kinship in Anthropological Populations.” Proceedings of the National Academy of Sciences 102 (6): 2248–53. https://doi.org/10.1073/pnas.0409762102.

Hammel, E., D Hutchinson, K Wachter, R Lundy, and R Deuel. 1976. The SOCSIM Demographic-Sociological Microsimulation Program: Operating Manual. Institute of International Studies. University of California Berkeley.

HMD. Human Mortality Database. n.d. “Max Planck Institute for Demographic Research (Germany) and University of California, Berkeley (USA) and French Institute for Demographic Studies (France) Available at www.mortality.org.”

Human Fertility Database. n.d. “Max Planck Institute for Demographic Research (Germany) and Vienna Institute of Demography (Austria) Available at www.humanfertility.org.”

Margolis, Rachel, and Ashton M. Verdery. 2019. “A Cohort Perspective on the Demography of Grandparenthood: Past, Present, and Future Changes in Race and Sex Disparities in the United States.” Demography 56 (4): 1495–1518. https://doi.org/10.1007/s13524-019-00795-1.

Mason, Carl. 2016. “Socsim Oversimplified. Berkeley: Demography Lab, University of California.”

Snyder, Mallika, Diego Alburez-Gutierrez, Iván Williams, and Emilio Zagheni. 2022. “Estimates from 31 Countries Show the Significant Impact of COVID-19 Excess Mortality on the Incidence of Family Bereavement.” Proceedings of the National Academy of Sciences 119 (26): e2202686119. https://doi.org/10.1073/pnas.2202686119.

Verdery, Ashton M., and Rachel Margolis. 2017. “Projections of White and Black Older Adults Without Living Kin in the United States, 2015 to 2060.” Proceedings of the National Academy of Sciences 114 (42): 11109–14. https://doi.org/10.1073/pnas.1710341114.

Verdery, Ashton M., Emily Smith-Greenaway, Rachel Margolis, and Jonathan Daw. 2020. “Tracking the Reach of COVID-19 Kin Loss with a Bereavement Multiplier Applied to the United States.” Proceedings of the National Academy of Sciences 117 (30): 17695. https://doi.org/10.1073/pnas.2007476117.

Wachter, Kenneth W. 2014. Essential Demographic Methods. Harvard University Press.

Wachter, Kenneth W. 1997. “Kinship Resources for the Elderly.” Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 352 (1363): 1811–17. https://doi.org/10.1098/rstb.1997.0166.

Wilmoth, J, K Andreev, D Jdanov, D. A. Glei, T Riffe, C Boe, M Bubenheim, et al. 2021. “Methods Protocol for the Human Mortality Database.” University of California, Berkeley; Max Planck Institute for Demographic Research, Rostock. https://www.mortality.org/.

Zagheni, Emilio. 2011. “The Impact of the HIV/AIDS Epidemic on Kinship Resources for Orphans in Zimbabwe.” Population and Development Review 37 (4): 761–83. https://doi.org/10.1111/j.1728-4457.2011.00456.x.

alburezg / rsocsim_workshop_paa

Demographic microsimulations in R using SOCSIM: Modelling population and kinship dynamics

Introduction: What is SOCSIM and how does it work?

1. Setup

1.1. Installation

1.2. Install other packages

2. Running a simulation and retrieving the input demographic rates

Objectives

2.1. The simulation: supervisory and rate files

2.2. Write the input rate files for a SOCSIM simulation for the USA, 1933-2020

2.2.1. Fertility files for SOCSIM

2.2.2. Mortality files for SOCSIM

2.3. Create initial population and marriage files

2.4. Run a SOCSIM simulation using the `rsocsim` package

2.5. Estimate age-specific rates from the SOCSIM microsimulation

2.6. Compare age-specific rates SOCSIM vs HFD/HMD

3. What can you do with simulation output? Example: estimates of bereavement

Objectives

3.1. Getting our estimates

3.2. Examining our estimates

4. Learn more

References

About

Languages

Demographic microsimulations in R using SOCSIM: Modelling population and kinship dynamics

Introduction: What is SOCSIM and how does it work?

1. Setup

1.1. Installation

1.2. Install other packages

2. Running a simulation and retrieving the input demographic rates

Objectives

2.1. The simulation: supervisory and rate files

2.2. Write the input rate files for a SOCSIM simulation for the USA, 1933-2020

2.2.1. Fertility files for SOCSIM

2.2.2. Mortality files for SOCSIM

2.3. Create initial population and marriage files

2.4. Run a SOCSIM simulation using the rsocsim package

2.5. Estimate age-specific rates from the SOCSIM microsimulation

2.6. Compare age-specific rates SOCSIM vs HFD/HMD

3. What can you do with simulation output? Example: estimates of bereavement

Objectives

3.1. Getting our estimates

3.2. Examining our estimates

4. Learn more

References

About

Languages

2.4. Run a SOCSIM simulation using the `rsocsim` package