Geocoding Brazilian Polling Stations with Administrative Data Sets

This repository contains the code to geocode polling stations in Brazil. We leverage administrative datasets to geocode all polling stations used in elections from 2006 to 2020.

We detail our methodology and its limitations in this document. As we explain there, our method often performs better than commercial solutions like the Google Maps Geocoding Service, particularly in rural areas. Despite our best efforts, however, this procedure will inevitably make mistakes, and consequently some coordinates will be incorrect.

The latest dataset of geocoded polling stations can be found in the compressed csv file linked to on the release page. Version notes can be found here.

Data

The dataset (geocoded_polling_stations.csv.gz) contains the following variables:

local_id: Unique identifier for the polling station in a given election. This will vary across time, even for polling stations that are active in multiple elections.
ano: Election year
sg_uf: State abbreviation
cd_localidade_tse: Municipal identifier used by the TSE.
cd_localidade_ibge: Municipal identifier used by the IBGE
nr_zona: Electoral zone number
nr_locvot: Polling station number
nr_cep: Brazilian postal code
nm_localidade: Municipality
nm_locvot: Name of polling station
ds_endereco: Street address
ds_bairro: Neighborhood
pred_long: Longitude as selected by our model.
pred_lat: Latitude as selected by our model
pred_dist: Predicted distance between chosen longitude and latitude and true longitude and latitude. This variable can be used to filter coordinates based on their likely accuracy.
tse_lat: Latitude provided by the TSE. This is only available for a small subset of data.
tse_long: Longitude provided by the TSE. This is only available for a small subset of data.
long: Longitude as predicted by the model or provided by the TSE.
lat: Latitude as predicted by the model or provided by the TSE.
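As a minimal sketch of how pred_dist might be used as a filter, the snippet below keeps only rows whose predicted error falls at or below a cutoff. The toy rows, the cutoff value, and the names max_pred_dist, coord_filtered, and share_kept are assumptions for illustration. The units of pred_dist are not stated in this section, so check them before choosing a threshold.

```r
library(data.table)

# Toy rows standing in for geocoded_polling_stations.csv.gz (values are made up)
polling_coord <- data.table(
  local_id  = 1:4,
  ano       = c(2018, 2018, 2018, 2018),
  pred_long = c(-46.63, -43.17, -60.02, -38.51),
  pred_lat  = c(-23.55, -22.91, -3.12, -12.97),
  pred_dist = c(0.05, 0.40, 3.20, 0.90)
)

# Hypothetical cutoff: choose one that suits your application
max_pred_dist <- 1

# Keep only rows whose predicted error is at or below the cutoff
coord_filtered <- polling_coord[pred_dist <= max_pred_dist]

# Share of polling stations retained after filtering
share_kept <- nrow(coord_filtered) / nrow(polling_coord)
```

Reporting the share of rows retained alongside any results makes it easier to judge how much the cutoff changes the sample.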

We also created panel identifiers that track a given polling station over time. Note that to construct these identifiers, we had to use fuzzy string matching on the address and polling station name. The dataset panel_ids.csv.gz has the following variables:

ano: year
panel_id: unique panel identifier. Units with the same panel_id are classified to be the same polling station in two different election years according to our fuzzy matching procedure.
local_id: polling station identifier. Use this variable to merge with the coordinates data.
panel_match_prop: this variable measures the quality of the match. This is the proportion of words in the polling station name and address that are exactly the same across years. A 1 indicates a perfect match between polling station name and address.

Note that for a small number of cases, a given polling station can be matched to multiple polling stations from an earlier year. This occurs when a later potential match is the best match for multiple polling stations in an earlier election.
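The following sketch shows one way to attach coordinates to the panel identifiers and to flag panels affected by multiple matches. It uses toy data in place of the real files. The cutoff min_match_prop and the object names are hypothetical. Merging on ano in addition to local_id is a conservative assumption on our part, not a documented requirement.

```r
library(data.table)

# Toy stand-ins for panel_ids.csv.gz and geocoded_polling_stations.csv.gz
panel_ids <- data.table(
  ano              = c(2016, 2018, 2016, 2018, 2016),
  panel_id         = c(1, 1, 2, 2, 2),
  local_id         = c(101, 201, 102, 202, 103),
  panel_match_prop = c(1, 1, 0.6, 0.6, 0.9)
)
polling_coord <- data.table(
  local_id = c(101, 201, 102, 202, 103),
  ano      = c(2016, 2018, 2016, 2018, 2016),
  long     = c(-46.63, -46.63, -43.17, -43.18, -43.20),
  lat      = c(-23.55, -23.55, -22.91, -22.91, -22.95)
)

# Attach coordinates to the panel via local_id (ano added as a safeguard)
panel_coord <- merge(panel_ids, polling_coord, by = c("local_id", "ano"))

# Count stations per panel_id and year; more than one signals a multiple match
panel_coord[, n_in_year := .N, by = .(panel_id, ano)]
ambiguous_panels <- unique(panel_coord[n_in_year > 1, panel_id])

# Hypothetical quality cutoff: drop weak matches and ambiguous panels
min_match_prop <- 0.8
panel_clean <- panel_coord[
  panel_match_prop >= min_match_prop & !(panel_id %in% ambiguous_panels)
]
```

Whether to drop ambiguous panels entirely or keep only the highest-scoring match is a design choice that depends on the analysis.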

Code

Running the Geocoding Pipeline

We used the open source language R (version 4.0.3) to process the files and geocode the polling stations. To manage the pipeline that imports and processes all the data, we use the targets package.

Assuming all the relevant data is in the ./data folder, you can reconstruct the dataset using the following code:

#Set working directory to project directory
setwd(".")
renv::restore() #to install necessary packages
targets::tar_make() # to run pipeline

Options to modify how the pipeline runs (e.g. parallel processing options) can be found in the _targets.R file. The pipeline itself is also defined in the _targets.R file. We use the renv package to manage package dependencies. To ensure that you are using the right package versions, invoke renv::restore() when the working directory is set to the GitHub repo directory.

Given the size of some of the data files, you will likely need at least 50GB of RAM to run the code.

Merging Coordinates with Electoral Data

While one can get disaggregated electoral data directly from the TSE, I recommend obtaining polling station-level data from CEPESP DATA, as it has been cleaned, aggregated, and standardized.

For merging with electoral data provided by the TSE, you will typically have to work with data reported at the "seção" level, which is below the polling station level. Generally, one will need to aggregate the "seção"-level data to the polling station level, using municipality code, electoral zone code, and polling station code. Once aggregated, you can then merge with the coordinates data provided here.

As an example, I provide code for merging the 2018 electorate data, which is reported at the "seção" level, with the coordinates data.

library(data.table) #for importing and aggregating data

polling_coord <- fread("geocoded_polling_stations.csv.gz")
#Subset on 2018 polling stations
coord_2018 <- polling_coord[ano == 2018, ]

#import 2018 electorate data from TSE
electorate_2018 <- fread("eleitorado_local_votacao_2018.csv", encoding = "Latin-1")

#aggregate data to the polling station level
electorate_local18 <- electorate_2018[, .(electorate = sum(QT_ELEITOR)),
        by = c("CD_MUNICIPIO", "NR_ZONA", "NR_LOCAL_VOTACAO")
]

#merge by municipality, zone, and polling station identifier
coord_electorate18 <- merge(coord_2018, electorate_local18,
        by.x = c("cd_localidade_tse", "nr_zona", "nr_locvot"),
        by.y = c("CD_MUNICIPIO", "NR_ZONA", "NR_LOCAL_VOTACAO")
)
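Because the merge above keeps only rows present in both tables, it can be useful to check how many polling stations fail to match. The sketch below illustrates this with toy data and a full outer merge. All values, and the names merged_all, n_missing_electorate, and n_missing_coords, are made up for illustration.

```r
library(data.table)

# Toy stand-ins for coord_2018 and electorate_local18 (values are made up)
coord_2018 <- data.table(
  cd_localidade_tse = c(71072, 71072, 60011),
  nr_zona           = c(1, 1, 5),
  nr_locvot         = c(1015, 1023, 1040),
  long              = c(-46.63, -46.64, -43.17),
  lat               = c(-23.55, -23.56, -22.91)
)
electorate_local18 <- data.table(
  CD_MUNICIPIO     = c(71072, 71072, 60011),
  NR_ZONA          = c(1, 1, 5),
  NR_LOCAL_VOTACAO = c(1015, 1023, 1058),
  electorate       = c(2500, 1800, 900)
)

# Full outer merge so that unmatched rows from either side are retained
merged_all <- merge(coord_2018, electorate_local18,
  by.x = c("cd_localidade_tse", "nr_zona", "nr_locvot"),
  by.y = c("CD_MUNICIPIO", "NR_ZONA", "NR_LOCAL_VOTACAO"),
  all = TRUE
)

# Stations with coordinates but no electorate count, and the reverse
n_missing_electorate <- merged_all[is.na(electorate), .N]
n_missing_coords     <- merged_all[is.na(long), .N]
```

Nonzero counts on either side are worth inspecting before analysis, since they may point to identifier mismatches rather than genuinely missing stations.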

Data Sources

Because of the size of some of the administrative datasets, we cannot host all the data necessary to run the code on GitHub. Datasets marked with a * can be found at the associated link in the table below but not in this GitHub repo. All other data can be found in the data folder.

Data	Source
2010 CNEFE*	IBGE FTP Server
2017 CNEFE*	IBGE Website
INEP School Catalog	INEP Website
Polling Stations Geocoded by TSE*	TSE
Polling Station Addresses	Centro de Política e Economia do Setor Público
Census Tract Shape Files*	`geobr` Package
Municipal Demographic Variables	Atlas do Desenvolvimento Humano no Brasil

Acknowledgements

Thanks to:

Yuri Kasahara for ideas and assistance in debugging
George Avelino, Mauricio Izumi, Gabriel Caseiro, and Daniel Travassos Ferreira at FGV/CEPESP for data and advice
Marco Antonio Faganello for excellent assistance at the early stages of the project.

Other Approaches

Spatial Maps at http://spatial2.cepesp.io
Pindograma
