Vote and seat estimation using Barometre data

This project contains code to estimate vote shares and seat distributions using CEO Barometre data. The code is designed to be run from the Snakefile (see the section on execution) but it can also be run interactively.

Structure of the code

src contains all the R scripts. The purpose of each script is listed below.
dta will host all the relevant results created by the scripts as well as the input data. Note that the scripts expect the raw data to live in dta/raw-dta. All the estimated models will be written to dta/models.
img will host the images generated by the scripts.
config includes a config.yaml file that defines configuration variables that are used throughout the code. Most of these variables are just the locations/names of the relevant folders.

Description of the scripts

The project is structured so that each script performs a single task. All the scripts read in some data and write out the results of the corresponding operations. The input and output data for each script are documented in the Snakefile.

Note that all the scripts read from config/config.yaml. This is a YAML configuration that sets the main paths that are used in the project, like the location and name of the folders, and the final colors and names used for each party. The variables defined in the configuration file are attached to the R global environment for ease of use.
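As a rough illustration of how a configuration can be attached to the global environment, the sketch below uses base R's list2env. The list stands in for what yaml::read_yaml("config/config.yaml") would presumably return. FOLDS and REPEATS are variables this README mentions in the section on the machine learning models, while DTA_FOLDER and IMG_FOLDER are hypothetical key names, and the scripts themselves may load the file differently.

```r
# Stand-in for: config <- yaml::read_yaml("config/config.yaml")
# DTA_FOLDER and IMG_FOLDER are hypothetical key names used for illustration.
config <- list(
  DTA_FOLDER = "dta",
  IMG_FOLDER = "img",
  FOLDS = 5,
  REPEATS = 5
)

# Copy every entry of the list into the global environment as a variable.
invisible(list2env(config, envir = globalenv()))

# The variables can now be used directly, for example to build paths.
models_path <- file.path(DTA_FOLDER, "models")
print(models_path)
```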

data-cleaning.R reads in the raw data in SPSS format and selects and transforms the variables that are used during the rest of the pipeline. It is worth noting that the file also transforms the names of the different parties to a format that can be used as factor names by R.
past-behavior.R estimates weights that match the respondents' reported past electoral behavior and the distribution of their primary language to known frame values. A proportion of respondents say that they don't remember who they voted for or even whether they voted in the last election. For these individuals, model predictions are used.
expected-behavior.R estimates the electoral behavior of all respondents in the survey at the individual level. There are two behaviors of interest: whether the respondent will vote and the party they will vote for. A proportion of respondents do not report one or both behaviors and for them the expected behavior is, as before, assigned using two predictive models. The model for party choice assigns a party to each respondent but the model for turnout assigns a probability of voting. A cutoff probability is then estimated from the ROC curve of the model.
vote-shares.R uses the individual predictions about past and expected behavior and estimates vote shares at the Catalonia level. Individuals who reported that they don't know who they will vote for are assigned the predictions from the party choice model. Individuals with a probability of voting below the cutoff are expected not to vote.
district-shares.R estimates district-level vote shares using a combination of survey data at the district level and some priors to compensate for the small sample size. The priors are set to the expected deviation between the electoral results in each district and those in Catalonia as a whole in the previous election. This script uses the package dshare, which needs to be installed separately.
seat-estimates.R uses the district-level vote shares to simulate the distribution of seats for each party. This script uses the package escons which needs to be installed separately.
report-figures.R prepares the final figures included in the report. Note: This file is not executed via the Snakefile and will likely contain dependencies different from those listed in the renv.lock.
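
To make the cutoff step in expected-behavior.R and vote-shares.R more concrete, here is a minimal base-R sketch. It assumes the cutoff is the threshold that maximizes Youden's J statistic (sensitivity + specificity - 1) along the ROC curve. This README does not say which criterion the scripts actually use, and roc_cutoff, prob and voted are hypothetical names with made-up toy values.

```r
# Pick the probability threshold that maximizes Youden's J.
# prob:  predicted probability of voting (hypothetical toy values below)
# voted: observed turnout, 1 = voted, 0 = did not vote
roc_cutoff <- function(prob, voted) {
  thresholds <- sort(unique(prob))
  j <- vapply(thresholds, function(t) {
    predicted <- prob >= t
    sensitivity <- sum(predicted & voted == 1) / sum(voted == 1)
    specificity <- sum(!predicted & voted == 0) / sum(voted == 0)
    sensitivity + specificity - 1
  }, numeric(1))
  # which.max returns the first threshold that attains the maximum.
  thresholds[which.max(j)]
}

prob  <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9)
voted <- c(0,   0,   1,   0,   1,   1,   1,   1)

cutoff <- roc_cutoff(prob, voted)

# Respondents below the cutoff are treated as non-voters.
will_vote <- prob >= cutoff
print(cutoff)
print(will_vote)
```

Other criteria, such as the point closest to the top-left corner of the ROC plot, would fit the same description and could give a different cutoff.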

The Snakefile will ensure that the scripts performing data analysis are executed in the correct order.
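
For readers unfamiliar with how district vote shares become seats, the sketch below allocates seats with the D'Hondt rule and a 3% district threshold, the rule used in elections to the Parliament of Catalonia. It is not the escons API, which is not documented here: dhondt is a hypothetical helper, the party labels and shares are invented, and the threshold is applied to the listed shares only, which simplifies the legal definition.

```r
# Allocate `seats` among parties with the D'Hondt highest-averages rule.
# shares: named vector of vote shares in one district (invented values below)
dhondt <- function(shares, seats, threshold = 0.03) {
  # Parties below the threshold get zero quotients and cannot win seats.
  eligible <- shares / sum(shares) >= threshold
  # One row per party, one column per divisor 1, 2, ..., seats.
  quotients <- outer(shares * eligible, seq_len(seats), "/")
  # Linear (column-major) indices of the `seats` largest quotients.
  winners <- order(quotients, decreasing = TRUE)[seq_len(seats)]
  # Recover the row, that is the party, for each winning quotient.
  party_index <- (winners - 1) %% length(shares) + 1
  setNames(tabulate(party_index, nbins = length(shares)), names(shares))
}

shares <- c(A = 0.40, B = 0.33, C = 0.19, D = 0.06, E = 0.02)
allocation <- dhondt(shares, seats = 7)
print(allocation)
```

Repeating this allocation over many simulated draws of the district shares would give a distribution of seats per party, which is presumably the kind of output seat-estimates.R produces.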

Execution and reproducibility

The project can be executed via the Snakefile, which will run all the scripts in the correct order. More information about this file can be found in the Snakemake documentation. Make sure that Snakemake is installed on your machine, for instance, using

pip3 install snakemake

and then run:

snakemake --cores all

Alternatively, the scripts can be run separately from the shell or from an interactive session. In this case, it is important to remember that all paths are currently set relative to the top folder. In other words, make sure that getwd() points to the folder in which the Snakefile lives -- that is, the folder above where all the R scripts live.

The order of execution is the order in which the files are listed above. The full project can be run manually using:

Rscript src/data-cleaning.R
Rscript src/past-behavior.R
Rscript src/expected-behavior.R
Rscript src/vote-shares.R
Rscript src/district-shares.R
Rscript src/seat-estimates.R

The project dependencies are listed in renv.lock. Check the renv package for more information about how to install them in a separate environment for reproducibility.

The machine learning models

The project uses three machine learning models: one to estimate past behavior, another to estimate expected party choice, and a third to estimate whether the respondent will vote. All three models use the same structure (including similar RHS variables) and very similar code. It is important to keep in mind that these models may take several hours to run.

One way to reduce runtime is to limit the size of the grid used for parameter search. For instance, the following snippet defines a search grid with 180 points (3 values of eta × 3 values of max_depth × 20 values of nrounds):

grid_partychoice <- expand.grid(eta=c(.01, .005, .001),
                                max_depth=c(1, 2, 3),
                                min_child_weight=1,
                                subsample=0.8,
                                colsample_bytree=0.8,
                                nrounds=seq(1, 15, length.out=20)*100,
                                gamma=0)

The grid is then run 5 times over each of the 5 folds (see the variables FOLDS and REPEATS in config/config.yaml). That means 180 × 5 × 5 = 4,500 runs of a given model. It is possible to make the grid smaller by more carefully selecting or searching some of the parameters above -- perhaps setting eta to a single, small value and focusing on identifying good values of nrounds.
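
The arithmetic above, and the suggested reduction, can be checked with a few lines of R. The sketch rebuilds the grid from the snippet, counts its rows, and then fixes eta at a single value. The choice of eta = 0.01 and the name grid_small are only examples, and FOLDS and REPEATS are set by hand here rather than read from config/config.yaml.

```r
# Values taken from the text above; normally read from config/config.yaml.
FOLDS <- 5
REPEATS <- 5

# Full grid, as defined in the snippet above.
grid_partychoice <- expand.grid(eta = c(.01, .005, .001),
                                max_depth = c(1, 2, 3),
                                min_child_weight = 1,
                                subsample = 0.8,
                                colsample_bytree = 0.8,
                                nrounds = seq(1, 15, length.out = 20) * 100,
                                gamma = 0)

# Illustrative smaller grid: eta fixed at one value, everything else unchanged.
grid_small <- expand.grid(eta = .01,
                          max_depth = c(1, 2, 3),
                          min_child_weight = 1,
                          subsample = 0.8,
                          colsample_bytree = 0.8,
                          nrounds = seq(1, 15, length.out = 20) * 100,
                          gamma = 0)

# Number of model runs = grid points x folds x repeats.
full_runs <- nrow(grid_partychoice) * FOLDS * REPEATS
small_runs <- nrow(grid_small) * FOLDS * REPEATS

cat("Full grid:", nrow(grid_partychoice), "points,", full_runs, "runs\n")
cat("Small grid:", nrow(grid_small), "points,", small_runs, "runs\n")
```

Fixing eta cuts the number of runs to a third, at the cost of not comparing learning rates.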
