KaterynaD/2016-US-President-Election-Primary-Results-Analysis

Data Source: Kaggle 2016 US Presidential Election dataset

primary_results_candidates_correlation.py explores the correlations between candidates based on the vote fractions they received in the same counties. The results are saved in the CandidateCorrelation folder.
1. The two most anti-correlated Democratic candidates are Hillary Clinton and Bernie Sanders. The script calculates the Pearson correlation coefficient, the p-value, and the StdErr.
  Here are the p-value and StdErr:
  Here is the joint plot of these two candidates:
2. The more interesting question is which Republican candidates are anti-correlated. According to my analysis they are:
  - Donald Trump and Marco Rubio
  - Marco Rubio and Ted Cruz
  Both pairs have an R-value of -0.49. The next pair is Marco Rubio and Mike Huckabee, with an R-value of -0.42. The negative correlation is strong, and it is significant because the p-value is far below 0.001.
  
  Here are the p-value and StdErr:
  Here are the joint plots of the first two pairs:
3. Each primary offers a choice among Democratic candidates only or Republican candidates only, so comparing Democrats to Republicans based on these results does not make much sense. However, let's look at the picture as a whole:
  
  or in this view:
  Now let's look at how high the p-values are for correlations between Democratic and Republican candidates. We cannot trust such results.
4. Finally, here is the pair plot for the data set:
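
The three statistics reported above (Pearson coefficient, p-value, StdErr) can be sketched as follows. This is a minimal standard-library illustration, not the code of primary_results_candidates_correlation.py. The function name pearson_summary and the sample vote fractions are hypothetical. The p-value uses a normal approximation to the t distribution, which is reasonable only for large samples such as thousands of counties; scipy.stats.linregress returns the exact rvalue, pvalue and stderr.

```python
import math
from statistics import NormalDist, mean


def pearson_summary(x, y):
    """Return (r, approximate two-sided p-value, standard error of the slope).

    x and y are equal-length sequences, e.g. the vote fractions of two
    candidates in the same counties.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

    # Pearson correlation coefficient, clamped against rounding error.
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))

    # Standard error of the slope of the regression of y on x.
    stderr = math.sqrt((1.0 - r * r) * syy / sxx / (n - 2))

    if abs(r) == 1.0:
        p_value = 0.0
    else:
        # t statistic for H0: r == 0. ASSUMPTION: the normal distribution is
        # used in place of Student's t, which is only accurate for large n.
        t = r * math.sqrt((n - 2) / (1.0 - r * r))
        p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return r, p_value, stderr


if __name__ == "__main__":
    # Hypothetical vote fractions for two candidates in six counties.
    candidate_a = [0.62, 0.55, 0.48, 0.70, 0.41, 0.58]
    candidate_b = [0.36, 0.44, 0.50, 0.29, 0.57, 0.40]
    r, p, se = pearson_summary(candidate_a, candidate_b)
    print(f"r={r:.3f}  p~{p:.3g}  stderr={se:.3f}")
```

With real data, x and y would be the per-county vote fractions of two candidates, aligned so that each position refers to the same county.
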
county_facts_candidates_correlation.py explores the correlations between candidates and county facts based on the vote fractions the candidates received in each county. The results are saved in the FactCandidateCorrelation folder.
1. There is a strong positive correlation between the percentage of Asian residents and Bernie Sanders's vote fraction. In contrast, Hillary Clinton's vote fraction is anti-correlated with the Asian percentage and strongly positively correlated with the White percentage.
  
  The p-value is small enough to trust the results.
2. Here is a similar analysis for Republicans. The results are sparser, but we can see a strong positive relationship between the percentage of housing units in multi-unit structures and the vote fractions of John Kasich, Marco Rubio, and Rand Paul.
  There is also a strong correlation between the percentage with a bachelor's degree or higher and the vote fractions of the same Republican candidates.
  The p-value is very low, so we can trust the results.
  Interestingly, Donald Trump's vote fraction is strongly anti-correlated with the percentage with a bachelor's degree or higher, with a low p-value.
  It has a moderate positive correlation with the percentage of persons 65 years and over. However, the p-value is high in this case.
  Marco Rubio's vote fraction is strongly anti-correlated with the percentage of persons 65 years and over, and the p-value is very low.
3. Here is the full picture: R-value and p-value. The fact dictionary is here.
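
The fact-versus-candidate comparison above amounts to computing one correlation per (county fact, candidate) pair and then looking at the largest ones. Below is a minimal sketch of that idea, assuming the facts and vote fractions are already aligned by county. The function names, the fact keys, and the numbers are all hypothetical and are not taken from county_facts_candidates_correlation.py or from the dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fact_candidate_correlations(facts, fractions):
    """Correlate every county fact with every candidate's vote fraction.

    facts:     {fact_name: [value per county]}
    fractions: {candidate: [vote fraction per county]}, same county order
    Returns {candidate: [(fact_name, r), ...]} sorted by |r|, largest first.
    """
    return {
        candidate: sorted(
            ((name, pearson_r(values, fraction)) for name, values in facts.items()),
            key=lambda pair: abs(pair[1]),
            reverse=True,
        )
        for candidate, fraction in fractions.items()
    }


if __name__ == "__main__":
    # Made-up values for five counties; the keys are illustrative only.
    facts = {
        "pct_bachelors_or_higher": [18.0, 25.5, 31.2, 12.4, 40.1],
        "pct_65_and_over": [17.3, 14.8, 12.1, 19.6, 10.9],
    }
    fractions = {
        "candidate_a": [0.44, 0.39, 0.35, 0.51, 0.28],
        "candidate_b": [0.12, 0.18, 0.22, 0.09, 0.30],
    }
    for candidate, ranked in fact_candidate_correlations(facts, fractions).items():
        for name, r in ranked:
            print(f"{candidate:12s} {name:26s} r={r:+.2f}")
```

Sorting by absolute value puts the strongest positive and negative relationships first. A p-value per pair is still needed before trusting any of them.
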

LinearRegression.py predicts primary vote fractions from demographic county facts. The vote fractions of Hillary Clinton and Bernie Sanders are the most strongly related to the county facts: the variance score is above 0.6 for these two candidates. Prediction quality for the remaining candidates is low, with variance scores of about 0.4 or less.

Ordinary least squares works perfectly well for this data. The other methods give slightly better results, but the improvement is not significant.
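
To make the comparison between ordinary least squares and a penalized method concrete, here is a small standard-library sketch that fits both by solving the normal equations and then reports the mean squared error and an explained-variance score on a held-out set. It does not reproduce LinearRegression.py. The function names and the synthetic data are hypothetical, and treating the reported variance value as an explained-variance score is an assumption.

```python
import random


def fit_linear(X, y, alpha=0.0):
    """Solve the normal equations for y ~ b0 + X.b.

    alpha=0.0 is ordinary least squares; alpha>0 is ridge regression with an
    unpenalized intercept. Returns [intercept, coef_1, ..., coef_n].
    """
    rows = [[1.0] + [float(v) for v in row] for row in X]  # intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X + alpha*I | X^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    for i in range(1, k):
        A[i][i] += alpha
    for i in range(k):
        A[i].append(sum(r[i] * t for r, t in zip(rows, y)))
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system; features are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for idx in range(k):
            if idx != col:
                f = A[idx][col]
                A[idx] = [a - f * b for a, b in zip(A[idx], A[col])]
    return [A[i][k] for i in range(k)]


def predict(coefs, X):
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in X]


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true)."""
    def var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)
    residuals = [a - b for a, b in zip(y_true, y_pred)]
    return 1.0 - var(residuals) / var(y_true)


if __name__ == "__main__":
    # Synthetic stand-in for (county facts, vote fraction); not real data.
    rng = random.Random(0)
    X = [[rng.random(), rng.random()] for _ in range(200)]
    y = [0.2 + 0.5 * a - 0.3 * b + rng.gauss(0, 0.05) for a, b in X]
    X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]
    for label, alpha in (("LeastSquares", 0.0), ("Ridge 0.010", 0.01)):
        coefs = fit_linear(X_train, y_train, alpha)
        print(label,
              "MSE train=%.4f" % mse(y_train, predict(coefs, X_train)),
              "MSE test=%.4f" % mse(y_test, predict(coefs, X_test)),
              "variance=%.3f" % explained_variance(y_test, predict(coefs, X_test)))
```

With a penalty as small as 0.01 the ridge coefficients stay very close to the least squares ones, which is consistent with the nearly identical rows for the two methods in the results table.
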

Hillary Clinton vote fraction prediction: residual plot for ordinary least squares, non-normalized data

Hillary Clinton vote fraction prediction: joint plot for ordinary least squares, non-normalized data

The files with the predicted data and plots for each candidate can be found in the LinearRegressionPredictionPrimary folder.

Model fit for each candidate with different methods and parameters. More data can be found here.

candidate	method	normalize	MSE Train set	MSE Test set	Variance
Hillary Clinton	LeastSquares	Y	0.011	0.011	0.614
Hillary Clinton	LeastSquares	N	0.011	0.011	0.614
Hillary Clinton	Ridge 0.010	Y	0.011	0.011	0.616
Hillary Clinton	Ridge 0.010	N	0.011	0.011	0.614
Hillary Clinton	Lasso 0.000	Y	0.012	0.011	0.627
Hillary Clinton	Lasso 0.000	N	0.011	0.011	0.618
Hillary Clinton	BayesianRidge	Y	0.011	0.011	0.620
Hillary Clinton	BayesianRidge	N	0.011	0.011	0.610
Bernie Sanders	LeastSquares	Y	0.010	0.010	0.642
Bernie Sanders	LeastSquares	N	0.010	0.010	0.642
Bernie Sanders	Ridge 0.010	Y	0.010	0.010	0.643
Bernie Sanders	Ridge 0.010	N	0.010	0.010	0.642
Bernie Sanders	Lasso 0.000	Y	0.011	0.010	0.649
Bernie Sanders	Lasso 0.000	N	0.010	0.010	0.646
Bernie Sanders	BayesianRidge	Y	0.010	0.010	0.643
Bernie Sanders	BayesianRidge	N	0.010	0.010	0.640
Donald Trump	LeastSquares	Y	0.005	0.006	0.401
Donald Trump	LeastSquares	N	0.005	0.006	0.401
Donald Trump	Ridge 0.010	Y	0.005	0.006	0.426
Donald Trump	Ridge 0.010	N	0.005	0.006	0.402
Donald Trump	Lasso 0.000	Y	0.005	0.006	0.417
Donald Trump	Lasso 0.000	N	0.005	0.006	0.407
Donald Trump	BayesianRidge	Y	0.005	0.006	0.428
Donald Trump	BayesianRidge	N	0.005	0.006	0.411
Marco Rubio	LeastSquares	Y	0.004	0.004	0.228
Marco Rubio	LeastSquares	N	0.004	0.004	0.228
Marco Rubio	Ridge 0.010	Y	0.004	0.004	0.242
Marco Rubio	Ridge 0.010	N	0.004	0.004	0.228
Marco Rubio	Lasso 0.000	Y	0.004	0.005	0.226
Marco Rubio	Lasso 0.000	N	0.004	0.004	0.242
Marco Rubio	BayesianRidge	Y	0.004	0.004	0.253
Marco Rubio	BayesianRidge	N	0.004	0.004	0.243
Ted Cruz	LeastSquares	Y	0.009	0.008	0.326
Ted Cruz	LeastSquares	N	0.009	0.008	0.326
Ted Cruz	Ridge 0.010	Y	0.009	0.008	0.348
Ted Cruz	Ridge 0.010	N	0.009	0.008	0.326
Ted Cruz	Lasso 0.000	Y	0.009	0.008	0.374
Ted Cruz	Lasso 0.000	N	0.009	0.008	0.322
Ted Cruz	BayesianRidge	Y	0.009	0.008	0.362
Ted Cruz	BayesianRidge	N	0.009	0.008	0.318

The remaining scripts were used to generate data for Tableau.

About

Correlation analysis between candidates and county facts based on 2016 US President Election Primary Results by county
