JoelGrayson / ai-final-project


I attempted to create a model that predicts how a county votes in a presidential election (Democratic or Republican) based on information collected from the American Community Survey (ACS). The program renders a map of the counties in red and blue. I chose the ACS over the decennial census because it collects data every year and has more features. The features are Housing Occupancy, Units in Structure, Year Structure Built, Rooms, Bedrooms, Housing Tenure, Year Householder Moved into Unit, Vehicles Available, House Heating Fuel, Selected Characteristics, Occupants per Room, Value, Mortgage Status, Selected Monthly Owner Costs (SMOC), Selected Monthly Owner Costs as a Percentage of Household Income (SMOCAPI), Gross Rent, and Gross Rent as a Percentage of Household Income (GRAPI).

First, I downloaded the ACS 5-year average data for 2020 and 2010 from data.census.gov's advanced search, which I had learned how to use at a community board member census training session. Then, I had to preprocess it by removing the margin-of-error and notes columns. I dropped all the various symbols that corresponded to null, such as '(X)' and '**'. I downloaded the presidential voting data from sources that I found in the footnotes of Wikipedia pages, since the Federal Election Commission's centralized data is not in a developer-friendly format (PDFs and mixed Excel spreadsheets). I downloaded from kaggle.com/datasets/unanimad/us-election-2020 for the 2020 election and github.com/john-guerra/US_Elections_Results for 2012. The two were in different formats and needed to be preprocessed separately in the files preprocessing/data_2020.py and preprocessing/data_2010_2012.py. The 2020 data had county and state names instead of FIPS codes, so I inserted a FIPS column using another dataframe that mapped state and county names to FIPS codes. Then, I was able to merge the data frames on FIPS codes using an inner join. I iterated through the 2020 rows to create the party column: the percentage of all Democratic and Republican voters who voted Democratic.
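The merge-and-label step can be sketched like this with pandas; the column names and toy values here are placeholders, not the real ACS headers:

```python
import pandas as pd

# Toy stand-ins for the preprocessed ACS and vote tables
# (column names are hypothetical; the real files use ACS table headers).
acs = pd.DataFrame({'fips': ['01001', '01003'], 'median_rooms': [5.4, 5.9]})
votes = pd.DataFrame({'fips': ['01001', '01003'],
                      'dem_votes': [7503, 24578],
                      'rep_votes': [19838, 83544]})

# Inner join on FIPS keeps only counties present in both datasets.
merged = acs.merge(votes, on='fips', how='inner')

# "party" is the Democratic share of the two-party (Dem + Rep) vote.
merged['party'] = merged['dem_votes'] / (merged['dem_votes'] + merged['rep_votes'])
```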

One bug I faced was that I stored my FIPS codes as numbers instead of strings, which caused problems when rendering the map because it did not recognize "1000" as "01000." I padded a zero at the front to solve this problem.
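The fix is a one-liner once the codes are stored as strings (the pandas column name below is a placeholder):

```python
# FIPS codes read as integers lose their leading zero: "01000" becomes 1000,
# which the map renderer then fails to match. Zero-pad to five characters.
fips = 1000
fips_str = str(fips).zfill(5)  # prints 01000
print(fips_str)

# For a whole pandas column the equivalent would be:
# df['fips'] = df['fips'].astype(str).str.zfill(5)
```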

After preprocessing, I tried many different models with these results:

| Model | 2010-2012 | 2020 |
|---|---|---|
| Continuous Mean Classifier (Baseline) | MSE 0.0248 | MSE 0.0267 |
| Continuous Linear Regression | MSE 0.0476 | MSE 0.0838 |
| Continuous Support Vector Regression | MSE 0.409 | MSE 0.268 |
| Binary Majority Classifier (Baseline) | MSE 0.386<br>Log Loss 13.91<br>Accuracy 0.614 | MSE 0.167<br>Log Loss 6.036<br>Accuracy 0.833 |
| Binary Logistic Regression | MSE 0.228<br>Log Loss 8.215<br>Accuracy 0.772<br>Precision 0.758<br>Recall 0.602 | MSE 0.114<br>Log Loss 4.121<br>Accuracy 0.886<br>Precision 0.634<br>Recall 0.750 |
| Binary Decision Trees | MSE 0.273<br>Log Loss 9.827<br>Accuracy 0.727<br>Precision 0.641<br>Recall 0.669 | MSE 0.148<br>Log Loss 5.340<br>Accuracy 0.852<br>Precision 0.557<br>Recall 0.567 |
| Binary Random Forest | MSE 0.202<br>Log Loss 7.311<br>Accuracy 0.797<br>Precision 0.790<br>Recall 0.647 | MSE 0.098<br>Log Loss 3.541<br>Accuracy 0.902<br>Precision 0.864<br>Recall 0.490 |
| Binary KNN | MSE 0.230<br>Log Loss 8.294<br>Accuracy 0.770<br>Precision 0.734<br>Recall 0.633 | MSE 0.132<br>Log Loss 4.759<br>Accuracy 0.868<br>Precision 0.893<br>Recall 0.240 |
| Binary Naive Bayes | MSE 0.366<br>Log Loss 13.207<br>Accuracy 0.634<br>Precision 0.578<br>Recall 0.189 | MSE 0.135<br>Log Loss 4.875<br>Accuracy 0.865<br>Precision 0.885<br>Recall 0.221 |
| Binary Bernoulli Naive Bayes | MSE 0.274<br>Log Loss 9.866<br>Accuracy 0.726<br>Precision 0.618<br>Recall 0.763 | MSE 0.177<br>Log Loss 6.385<br>Accuracy 0.823<br>Precision 0.478<br>Recall 0.635 |
| Binary Support Vector Classifier | MSE 0.260<br>Log Loss 9.355<br>Accuracy 0.740<br>Precision 0.653<br>Recall 0.698 | MSE 0.143<br>Log Loss 5.166<br>Accuracy 0.857<br>Precision 0.569<br>Recall 0.596 |

I then used plotly's choropleth functionality to render the county-level maps in color.

Among the continuous models, none performed better than the baseline, and Random Forest performed best among the binary models. This suggests that housing data alone is not a strong predictor of how a county will vote in an election, a surprising conclusion.

Looking at the map rendered from the 2020 random forest binary model's predictions, one can see that the model clearly does not understand which counties are supposed to be red or blue. It seemingly chooses blue or red at random rather than following the expected relationship. The average party value for 2020 is 0.338, suggesting the model should predict Republican most of the time. This is because many Democratic counties are urban and have a large population (such as New York County at 1.6 million). To prevent all the counties from being weighted equally, I passed the total number of votes per county as the weights, the optional third parameter of the model.fit method.
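With scikit-learn that weighting looks roughly like this; the feature values, labels, and vote counts below are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins: one housing feature per county, a binary party label,
# and each county's total vote count as its weight.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])  # 0 = Republican, 1 = Democratic
total_votes = np.array([5_000, 8_000, 900_000, 1_600_000])

model = RandomForestClassifier(random_state=0)
# sample_weight (the optional third argument of fit) makes populous
# urban counties count proportionally more than tiny rural ones.
model.fit(X, y, sample_weight=total_votes)
```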

In linear_model_features.py, you can see how the housing factors affect a county's vote; running the program lists a coefficient for each feature. Unfortunately, many of the features are inconsistent between the two times I ran the program. However, there were some understandable features, such as BEDROOMS \> Total housing units \> 5 or more bedrooms (DP04\_0044PE) being negative in both cases, since richer people tend to be more Republican. Interestingly, all the ROOMS \> Total housing units features were negative and the UNITS IN STRUCTURE \> Total housing units features were positive, but they were negative or positive to different degrees, causing the differences.
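The coefficient inspection can be sketched like this with scikit-learn; the data, feature names, and overall setup here are invented stand-ins, not the script's real contents:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two housing features vs. Democratic vote share.
feature_names = ['5 or more bedrooms (DP04_0044PE)', '9 rooms or more']
X = np.array([[10.0, 12.0], [4.0, 5.0], [12.0, 14.0], [3.0, 4.0]])
y = np.array([0.30, 0.60, 0.25, 0.65])

model = LinearRegression().fit(X, y)
# A negative coefficient pushes the prediction toward Republican,
# a positive one toward Democratic.
for name, coef in zip(feature_names, model.coef_):
    print(f'{coef:+.4f}  {name}')
```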

To improve the results, I should likely use the decennial census data, which has more economic, population, and racial features that would correlate better with party. Although my model may not work as well as I had hoped, I learned a lot throughout this final project, such as how to manage a large Python code base over time and separate the process into three stages (preprocessing, models, and rendering the map).

Throughout the process, I used different modules and folders to keep the code organized. The top-level folder is a package, indicated by `__init__.py`, allowing all imports in modules and submodules to use that relative path. This lets me import `preprocessing/helpers/fips2county_name.py` from the file `map/render_map` with `from preprocessing.helpers.fips2county_name import fips2county_name`.

I used many regular expressions, my old friend. For example, I replaced `(^\d{4}(?!\d))` with `0$1` to prepend a 0 to four-digit FIPS codes in the csv file, and `(\d{5})` with `"$1"` to surround the FIPS codes with quotes.
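The same two replacements can be scripted with Python's `re` module (editor-style `$1` backreferences become `\1` in Python); the sample line is made up:

```python
import re

line = '1001,Autauga County,Alabama'

# Prepend a 0 to a four-digit FIPS code at the start of a line.
# (For a whole multi-line file, add flags=re.M.)
line = re.sub(r'^(\d{4})(?!\d)', r'0\1', line)
# Surround the five-digit FIPS code with quotes.
line = re.sub(r'(\d{5})', r'"\1"', line)

print(line)  # → "01001",Autauga County,Alabama
```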

Quick Facts

  • 3143 counties in the United States
  • 3106 counties in 2020 preprocessed data
  • 3114 counties in 2010-2012 preprocessed data (found with `awk -F ',' '{print $1}' <'./preprocessing/dist/2010-acs-and-2012-votes.csv' | sort | uniq | wc --lines`)
  • Elections inspected: 2012 and 2020
  • Corresponding census data: 2010 and 2020 (ACS)

Notes

  • How the election works - each county has multiple precincts, each corresponding to a polling station. The counties then report their results to the state.
  • Choropleth map - shows a value for each region on a map through color intensity
  • Federal Information Processing Standards (FIPS) codes are standardized five-digit numeric codes identifying counties
  • Election Districts (EDs in Alaska)
    • Boroughs are Alaska's form of counties (parishes for Louisiana)

Running Requirements

pip packages: `kaleido pandas requests plotly`
Pro tip: pipe `python3 model_comparison.py` into glow for rendering markdown. Install glow here.
