Title of Manuscript: TODO
Authors: TODO
Corresponding Author: TODO
Code/Analysis By: Richard Barnes (richard.barnes@berkeley.edu)
DOI Number of Manuscript TODO
Code Repositories
This repository contains the analysis code used to produce some numbers, statistics, and figures for the manuscript above.
TODO
This step may be skipped if you have downloaded this code from Zenodo.
After cloning this repo you must acquire compactnesslib by running:
git submodule init
git submodule update
Compactnesslib relies on no external libraries beyond the C++11 standard library.
To generate the numbers used in the paper follow the following steps:
Export the CENSUSDIR
variable set to an appropriate value. Mine is one of:
export CENSUSDIR=/home/rick/data/gis/census
export CENSUSDIR=/home/rbarnes1/scratch/census
Run the following commands to acquire data
wget --continue --no-directories --directory-prefix=$CENSUSDIR -i data_files
wget --continue -r --no-parent -e robots=off --no-directories 'https://www2.census.gov/geo/tiger/TIGER2010BLKPOPHU/'
Because the census website is silly, you may have to fiddle with things to work around their anti-webbot security.
After acquiring the data, you must reorganize it into congressional_districts
,
tracts/
, block_groups/
, and blocks/
subdirectories within CENSUSDIR
as
appropriate. A quick examination of the following commands will give you the
idea.
You must also unpack the data.
The data must be in a planar projection for the calculations to yield correct results. We use an Albers Equal Area projection suitable for use on the conterminous United States. If area and perimeter are not being considered, just geographic proximity, then this projection can also be used for Alaska, Hawaii, and outlying areas.
The following lines reproject the data:
ls $CENSUSDIR/tracts/* | sed 's/.shp//' | xargs -n 1 -I {} ogr2ogr -f "ESRI Shapefile" {}_reproj.shp {}.shp -t_srs '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,-0,-0,-0,0 +units=m +no_defs'
ls $CENSUSDIR/block_groups/* | sed 's/.shp//' | xargs -n 1 -I {} ogr2ogr -f "ESRI Shapefile" {}_reproj.shp {}.shp -t_srs '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,-0,-0,-0,0 +units=m +no_defs'
ls $CENSUSDIR/blocks/* | sed 's/.shp//' | xargs -n 1 -I {} ogr2ogr -f "ESRI Shapefile" {}_reproj.shp {}.shp -t_srs '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,-0,-0,-0,0 +units=m +no_defs'
Since C++11 is used, on XSEDE's comet, you'll want:
module load gcc/6.2.0 #Only needed on XSEDE
Regardless, run:
cd scripts
make -j 4 #Set 4 to the number of cores you have available. Compilation is fast regardless
cd ..
On Comet you'll also need the C++ standard run-time library loaded:
export LD_PRELOAD=/share/apps/compute/gcc-6.2.0/lib64/libstdc++.so.6
Scores are generated with:
ls $CENSUSDIR/tracts/*_reproj.shp | sed 's/\.shp//' | xargs -t -n 1 -I {} ./scripts/net_compact.exe -sub {}.shp -sup $CENSUSDIR/congressional_districts/tl_2010_us_cd111_reproj.shp -outpre {}_out -shp -noprintparent
ls $CENSUSDIR/block_groups/*_reproj.shp | sed 's/\.shp//' | xargs -t -n 1 -I {} ./scripts/net_compact.exe -sub {}.shp -sup $CENSUSDIR/congressional_districts/tl_2010_us_cd111_reproj.shp -outpre {}_out -noshp -noprintparent
ls $CENSUSDIR/blocks/*_reproj.shp* | sed 's/\.shp//' | xargs -t -n 1 -I {} ./scripts/net_compact.exe -sub {}.shp -sup $CENSUSDIR/congressional_districts/tl_2010_us_cd111_reproj.shp -outpre {}_out -noshp -noprintparent
Spotchecks for visualization can be generated with:
ls $CENSUSDIR/block_groups/*_reproj.shp | sed 's/\.shp//' | shuf | head -n 3 | xargs -t -n 1 -I {} ./scripts/net_compact.exe -sub {}.shp -sup $CENSUSDIR/congressional_districts/tl_2010_us_cd111_reproj.shp -outpre {}_out -shp -printparent
ls $CENSUSDIR/blocks/*_reproj.shp* | sed 's/\.shp//' | shuf | head -n 3 | xargs -t -n 1 -I {} ./scripts/net_compact.exe -sub {}.shp -sup $CENSUSDIR/congressional_districts/tl_2010_us_cd111_reproj.shp -outpre {}_out -shp -printparent
The population data files were acquired above using:
wget --continue -r --no-parent -e robots=off --no-directories 'https://www2.census.gov/geo/tiger/TIGER2010BLKPOPHU/'
These files are assumed to be unpacked and living in a $CENSUSDIR/popdata
directory.
Now, we aggregate the block ids and population values using:
echo "BLOCKID10 POP10" > $CENSUSDIR/popdata/popdata.csv
for f in $CENSUSDIR/popdata/*dbf; do echo "$f"; dbfdump $f | awk '{print $5" "$8}' | tail -n +2 >> $CENSUSDIR/popdata/popdata.csv; done
mkdir -p results/
python3 -i ./scripts/aggregate_scores.py $CENSUSDIR/congressional_districts/tl_2010_us_cd111_reproj.shp '$CENSUSDIR/tracts/*.scores' '$CENSUSDIR/block_groups/*.scores' '$CENSUSDIR/blocks/*.scores' $CENSUSDIR/popdata/popdata.csv aggregate-results.csv
In this case, the output is a file called aggregate-results.csv
.
The table below (drawn from here) shows the GEOID structure in TIGER/Line Shapefiles for some of the most common legal and statistical geographies, as well as example GEOIDs for different geographic areas.
Area Type GEOID Structure Number of Digits Example Geographic Area Example GEOID
State STATE 2 Texas 48
County STATE+COUNTY 2+3=5 Harris County, TX 48201
County Subdivision STATE+COUNTY+COUSUB 2+3+5=10 Pasadena CCD, Harris County, TX 4820192975
Places STATE+PLACE 2+5=7 Houston, TX 4835000
Census Tract STATE+COUNTY+TRACT 2+3+6=11 Census Tract 2231 in Harris County, TX 48201223100
Block Group STATE+COUNTY+TRACT+BLOCK GROUP 2+3+6+1=12 Block Group 1 in Census Tract 2231 in Harris County, TX 482012231001
Block* STATE+COUNTY+TRACT+BLOCK 2+3+6+4=15 (Note – some blocks also contain a one character suffix (A, B, C, ect.) Block 1050 in Census Tract 2231 in Harris County, TX 482012231001050
Congressional District (113th Congress) STATE+CD 2+2=4 Connecticut District 2 0902
State Legislative District (Upper Chamber) STATE+SLDU 2+3=5 Connecticut State Senate District 33 09033
State Legislative District (Lower Chamber) STATE+SLDL 2+3=5 Connecticut State House District 147 09147
ZCTA ** ZCTA 5 Suitland, MD ZCTA 20746
* The block group code is not included in the census block GEOID code because the first digit of a census block code represents the block group code.
** ZIP Code Tabulation Areas (ZCTAs) are generalized areal representations of United States Postal Service (USPS) ZIP Code service areas.
Outputs were generated using XSEDE's Comet supercomputer, which is supported by National Science Foundation grant number ACI-1053575 and was available to RB through the National Science Foudnation's Graduate Research Fellowship. RB was supported by the Department of Energy's Computational Science Graduate Fellowship (Grant No. DE-FG02-97ER25308). Travel funding for RB was provided by the Metric Geometry and Gerrymandering Group. RB's laptop, on which development took place, was provided by Adam Clark. In-kind support was provided by Isaac B., Hannah J., Vivian L., and Jerry W.