This repository contains the code to geocode polling stations in Brazil. We leverage administrative datasets to geocode all polling stations used in elections from 2006 to 2020.
We detail our methodology and its limitations in this document. As we explain there, our method often performs better than commercial solutions like the Google Maps Geocoding Service, particularly in rural areas. Despite our best efforts, the procedure will inevitably make mistakes, so some coordinates will be incorrect.
The latest dataset of geocoded polling stations can be found in the compressed csv file linked to on the release page. Version notes can be found here.
The dataset (`geocoded_polling_stations.csv.gz`) contains the following variables:
- `local_id`: Unique identifier for the polling station in a given election. This will vary across time, even for polling stations that are active in multiple elections.
- `ano`: Election year
- `sg_uf`: State abbreviation
- `cd_localidade_tse`: Municipal identifier used by the TSE
- `cd_localidade_ibge`: Municipal identifier used by the IBGE
- `nr_zona`: Electoral zone number
- `nr_locvot`: Polling station number
- `nr_cep`: Brazilian postal code
- `nm_localidade`: Municipality
- `nm_locvot`: Name of polling station
- `ds_endereco`: Street address
- `ds_bairro`: Neighborhood
- `pred_long`: Longitude as selected by our model
- `pred_lat`: Latitude as selected by our model
- `pred_dist`: Predicted distance between chosen longitude and latitude and true longitude and latitude. This variable can be used to filter coordinates based on their likely accuracy.
- `tse_lat`: Latitude provided by the TSE. This is only available for a small subset of data.
- `tse_long`: Longitude provided by the TSE. This is only available for a small subset of data.
- `long`: Longitude as predicted by the model or provided by the TSE
- `lat`: Latitude as predicted by the model or provided by the TSE
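As a minimal sketch of how `pred_dist` might be used as a filter, the snippet below keeps only rows whose predicted error falls at or below a cutoff. The toy table stands in for the real file and all of its values are invented. The cutoff `max_pred_dist` is a hypothetical choice, expressed in whatever unit `pred_dist` is recorded in, and should be set according to the accuracy your application needs.

```r
library(data.table)

# Toy stand-in for geocoded_polling_stations.csv.gz; all values are invented.
polling_coord <- data.table(
  local_id  = c(1L, 2L, 3L, 4L),
  ano       = c(2018L, 2018L, 2018L, 2018L),
  long      = c(-46.63, -43.17, -60.02, -38.50),
  lat       = c(-23.55, -22.91, -3.10, -12.97),
  pred_dist = c(0.05, 0.40, 2.75, NA)
)

# Hypothetical cutoff, in the same unit as pred_dist.
max_pred_dist <- 1

# Keep rows whose predicted error is at or below the cutoff.
# Rows with a missing pred_dist, if any exist, are dropped explicitly here;
# whether to keep them instead is a judgment call.
accurate <- polling_coord[!is.na(pred_dist) & pred_dist <= max_pred_dist]
```

A stricter cutoff trades coverage for accuracy, so it may be worth checking how many polling stations survive the filter before settling on a value.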
We also created panel identifiers that track a given polling station over time. Note that to construct these, we had to use fuzzy string matching of address and polling station name. The dataset `panel_ids.csv.gz` has the following variables:
- `ano`: Year
- `panel_id`: Unique panel identifier. Units with the same `panel_id` are classified as the same polling station in two different election years according to our fuzzy matching procedure.
- `local_id`: Polling station identifier. Use this variable to merge with the coordinates data.
- `panel_match_prop`: Measures the quality of the match. This is the proportion of words in the polling station name and address that are exactly the same across years. A 1 indicates a perfect match between polling station name and address.
Note that for a small number of cases, a given polling station can be matched to multiple polling stations from an earlier year. This occurs when a later potential match is the best match for multiple polling stations in an earlier election.
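The following sketch shows one way the panel identifiers could be attached to the coordinates and restricted to high-quality matches. The two toy tables stand in for `panel_ids.csv.gz` and the coordinates file, with invented values. The threshold `min_match_prop` is hypothetical, and merging on both `local_id` and `ano` is a conservative assumption rather than a documented requirement.

```r
library(data.table)

# Toy stand-ins for the two datasets; all values are invented.
polling_coord <- data.table(
  local_id = c(101L, 102L, 201L, 202L),
  ano      = c(2016L, 2016L, 2018L, 2018L),
  long     = c(-46.63, -43.17, -46.63, -43.18),
  lat      = c(-23.55, -22.91, -23.55, -22.90)
)
panel_ids <- data.table(
  ano              = c(2016L, 2016L, 2018L, 2018L),
  panel_id         = c(1L, 2L, 1L, 2L),
  local_id         = c(101L, 102L, 201L, 202L),
  panel_match_prop = c(1, 0.6, 1, 0.6)
)

# Hypothetical minimum match quality.
min_match_prop <- 0.9

# Attach coordinates to each panel row (assumption: match on id and year).
panel_coord <- merge(panel_ids, polling_coord, by = c("local_id", "ano"))

# Keep only matches at or above the chosen quality threshold.
good_panel <- panel_coord[panel_match_prop >= min_match_prop]

# Count how many election years each remaining panel unit spans.
years_per_panel <- good_panel[, .(n_years = uniqueN(ano)), by = panel_id]
```

Because a polling station can occasionally be matched to several stations from an earlier year, it may also be worth counting rows per `panel_id` and year before treating the result as a balanced panel.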
We used the open-source language R (version 4.0.3) to process the files and geocode the polling stations. To manage the pipeline that imports and processes all the data, we use the `targets` package.
Assuming all the relevant data is in the `./data` folder, you can reconstruct the dataset using the following code:
#Set working directory to project directory
setwd(".")
renv::restore() #to install necessary packages
targets::tar_make() # to run the pipeline
Options to modify how the pipeline runs (e.g. parallel processing options) can be found in the `_targets.R` file. The pipeline itself is defined in the `_targets.R` file as well. We use the `renv` package to manage package dependencies. To ensure that you are using the right package versions, invoke `renv::restore()` when the working directory is set to the GitHub repo directory.
Given the size of some of the data files, you will likely need at least 50 GB of RAM to run the code.
While one can get disaggregated electoral data directly from the TSE, I recommend obtaining polling station-level data from CEPESP DATA, as it has been cleaned, aggregated, and standardized.
For merging with electoral data provided by the TSE, you will typically have to work with data reported at the "seção" level, which is below the polling station level. Generally, one will need to aggregate the "seção"-level data to the polling station level, using municipality code, electoral zone code, and polling station code. Once aggregated, you can then merge with the coordinates data provided here.
As an example, I provide code for merging the 2018 electorate data, which is reported at the "seção" level, with the coordinates data.
library(data.table) #for importing and aggregating data
polling_coord <- fread("geocoded_polling_stations.csv.gz")
#Subset on 2018 polling stations
coord_2018 <- polling_coord[ano == 2018, ]
#import 2018 electorate data from TSE
electorate_2018 <- fread("eleitorado_local_votacao_2018.csv", encoding = "Latin-1")
#aggregate data to the polling station level
electorate_local18 <- electorate_2018[, .(electorate = sum(QT_ELEITOR)),
by = c("CD_MUNICIPIO", "NR_ZONA", "NR_LOCAL_VOTACAO")
]
#merge by municipality, zone, and polling station identifier
coord_electorate18 <- merge(coord_2018, electorate_local18,
by.x = c("cd_localidade_tse", "nr_zona", "nr_locvot"),
by.y = c("CD_MUNICIPIO", "NR_ZONA", "NR_LOCAL_VOTACAO")
)
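The merge above keeps only polling stations present in both tables. As a hedged sketch of how one might check what falls out, the snippet below runs a full outer merge on toy data and lists the unmatched rows on each side. The tables mimic the column names used above, but every value is invented, and the helper names `keys_coord` and `keys_tse` are introduced here only for illustration.

```r
library(data.table)

# Toy stand-ins for coord_2018 and electorate_local18; all values are invented.
coord_2018 <- data.table(
  cd_localidade_tse = c(1L, 1L, 2L),
  nr_zona           = c(10L, 10L, 20L),
  nr_locvot         = c(1001L, 1002L, 1001L),
  long              = c(-46.63, -46.60, -43.17),
  lat               = c(-23.55, -23.50, -22.91)
)
electorate_local18 <- data.table(
  CD_MUNICIPIO     = c(1L, 1L, 3L),
  NR_ZONA          = c(10L, 10L, 30L),
  NR_LOCAL_VOTACAO = c(1001L, 1002L, 1001L),
  electorate       = c(500L, 320L, 410L)
)

# Illustrative helper vectors holding the merge keys on each side.
keys_coord <- c("cd_localidade_tse", "nr_zona", "nr_locvot")
keys_tse   <- c("CD_MUNICIPIO", "NR_ZONA", "NR_LOCAL_VOTACAO")

# Full outer merge: unmatched rows from either table are kept with NAs.
merged_all <- merge(coord_2018, electorate_local18,
  by.x = keys_coord, by.y = keys_tse, all = TRUE
)

# Polling stations with coordinates but no electorate record.
no_electorate <- merged_all[is.na(electorate)]

# Polling stations with an electorate record but no coordinates.
no_coord <- merged_all[is.na(long)]
```

If either of the unmatched sets is large on the real data, that may point to differences in how the identifiers are coded between the two sources.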
Because of the size of some of the administrative datasets, we cannot host all the data necessary to run the code on GitHub.
Datasets marked with a * can be found at the associated link in the table below but not in this GitHub repo.
All other data can be found in the `data` folder.
Data | Source |
---|---|
2010 CNEFE* | IBGE FTP Server |
2017 CNEFE* | IBGE Website |
INEP School Catalog | INEP Website |
Polling Stations Geocoded by TSE* | TSE |
Polling Station Addresses | Centro de Política e Economia do Setor Público |
Census Tract Shape Files* | geobr Package |
Municipal Demographic Variables | Atlas do Desenvolvimento Humano no Brasil |
Thanks to:
- Yuri Kasahara for ideas and assistance in debugging
- George Avelino, Mauricio Izumi, Gabriel Caseiro, and Daniel Travassos Ferreira at FGV/CEPESP for data and advice
- Marco Antonio Faganello for excellent assistance at the early stages of the project
- Spatial Maps at http://spatial2.cepesp.io