jeneyring / GSS_survey

Zodiac Prediction Model using the GSS 2021 survey

About the Project:

The GSSurvey asks US adults questions about demographics, opinions, religious/spiritual beliefs, political and life viewpoints, and social class, and it even records participants' zodiac signs.

This is a personal-interest project. It compares GSS participants' zodiac responses with their other survey answers to see whether a Classification Model can accurately predict their zodiac sign/birth month.

Project Goals

The goal of this project is to build a Classification Model that beats the Baseline model's 10% accuracy when predicting a person's zodiac sign from their GSSurvey answers.
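The README does not say how the 10% baseline was computed. One common convention, assumed here, is to always predict the most frequent class; note that guessing uniformly among 12 signs would give about 8.3%. A minimal dependency-free sketch on toy labels (not real GSS responses; `baseline_accuracy` is a hypothetical helper, not a function from this repo):

```python
from collections import Counter

def baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)

# Toy labels for illustration only, not real GSS responses.
toy_zodiac = ["Leo"] * 3 + ["Aries"] * 2 + ["Virgo"] * 2 + ["Libra"] * 2 + ["Pisces"]

baseline_label, baseline_acc = baseline_accuracy(toy_zodiac)
print(f"Always predicting {baseline_label!r} gives accuracy {baseline_acc:.2f}")
```

Any model accuracy on the validate or test split would then be compared against this number.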

This project goes through the full Data Science pipeline: acquiring the 2021 GSS survey responses, cleaning and preparing the dataset, exploring it and running chi^2 tests for 6 different hypothesis questions, and then fitting 5 different Classification Models on the main drivers found.

Background Data information:

ABOUT THE GSS: Since 1972, the General Social Survey (GSS) has been conducted regularly by NORC at the University of Chicago to measure contemporary American society through the responses of adult survey participants. This project uses the GSS 2021 dataset.

Because the GSS 2021 is a general survey of each participant's beliefs, opinions, demographics, and relationships, this project uses the zodiac responses to test whether a person's sign can actually be predicted from their other responses, and whether a model can be built on that.

The columns pulled from the GSS data for this project were chosen by targeting key points of interest that are often covered in astrology:

  • Career
  • Relationships
  • Opinions
  • Health (physical and mental)
  • Political stance/view points
  • Religion/Spirituality
  • Interests and Overall outlook of others and life

ABOUT THE ZODIAC SIGNS: As views on astrology appear to be shifting, often in connection with self-identity and decision-making, this project looks at zodiac signs (using each participant's Sun sign) to determine whether the responses of the GSS's US adult participants actually correlate with the astrological Sun signs they reported.

Data Dictionary:

Some of the columns and datasets may not fully make sense to those seeing this data for the first time. The table below defines the features and terminology used in this project.

| Target | Datatype | Definition |
| --- | --- | --- |
| GSS 2021 | 4032 non-null: Category | General Social Survey of 2021 |

| Feature | Datatype | Definition |
| --- | --- | --- |
| zodiac | 4032 non-null: Category | participant's zodiac birth sign |
| sexornt | 4032 non-null: Category | participant's sexual orientation |
| marital | 4032 non-null: Category | participant's marital status |
| res16 | 4032 non-null: Category | participant's region of living |
| degree | 4032 non-null: Category | participant's highest earned education degree |
| class | 4032 non-null: Category | participant's self rated social class |
| relig | 4032 non-null: Category | participant's religion |
| postlifev | 4032 non-null: Category | participant's view of life after death |
| sprtprsn | 4032 non-null: Category | participant's self rating of being a spiritual person |
| fairv | 4032 non-null: Category | participant's view of others being fair |
| trustv | 4032 non-null: Category | participant's view of others being trustworthy |
| conmedic | 4032 non-null: Category | participant's trust of medical system |
| happy | 4032 non-null: Category | participant's overall happiness with life |
| workhard | 4032 non-null: Category | participant's 1-5 rating of most important life lesson of working hard |
| occ10 | 4032 non-null: Category | participant's occupation trade |
| hlthmntl | 4032 non-null: Category | participant's self rating of their mental health |
| socrel | 4032 non-null: Category | participant's rating of how often they spend time with relatives/family |
| decevidc | 4032 non-null: Category | participant's rating of how much they agree of needing sufficient evidence for decisions |
| polviews | 4032 non-null: Category | participant's rating of their political viewpoints |

Sample of Hypothesis Questions Asked:

Each question below tests one key categorical feature as a possible predictor of zodiac sign.

Question: Can relationship/marital status help classify a person's zodiac sign?

  • H_0: Marital status is independent of zodiac sign.
  • H_a: Marital status is dependent on zodiac sign.

Question: Can opinions on working hard in life help classify a person's zodiac sign?

  • H_0: workhard opinion is independent of zodiac sign.
  • H_a: workhard opinion is dependent on zodiac sign.

Question: Can political viewpoints help classify a person's zodiac sign?

  • H_0: Political viewpoints are independent of zodiac sign.
  • H_a: Political viewpoints are dependent on zodiac sign.
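Each hypothesis pair above can be checked with a chi^2 test of independence on the contingency table of the feature against zodiac sign. The sketch below computes the test statistic and degrees of freedom by hand on toy counts (not GSS data); `chi2_independence` is a hypothetical helper, not a function from this repo. A library routine such as `scipy.stats.chi2_contingency` would also return the p-value needed to accept or reject H_0.

```python
from collections import Counter

def chi2_independence(rows, feature, target):
    """Chi-square statistic and degrees of freedom for feature vs. target."""
    observed = Counter((r[feature], r[target]) for r in rows)
    feature_totals = Counter(r[feature] for r in rows)
    target_totals = Counter(r[target] for r in rows)
    n = len(rows)

    stat = 0.0
    for f, f_total in feature_totals.items():
        for t, t_total in target_totals.items():
            # Expected count if the feature and target were independent.
            expected = f_total * t_total / n
            stat += (observed[(f, t)] - expected) ** 2 / expected

    dof = (len(feature_totals) - 1) * (len(target_totals) - 1)
    return stat, dof

# Toy counts for illustration only: two marital categories, two signs.
toy_rows = (
    [{"marital": "married", "zodiac": "Leo"}] * 10
    + [{"marital": "married", "zodiac": "Aries"}] * 20
    + [{"marital": "single", "zodiac": "Leo"}] * 30
    + [{"marital": "single", "zodiac": "Aries"}] * 40
)

stat, dof = chi2_independence(toy_rows, "marital", "zodiac")
print(f"chi2 = {stat:.4f}, dof = {dof}")
```

A large statistic relative to the chi^2 distribution with `dof` degrees of freedom is evidence against H_0; a small one means independence cannot be rejected.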

Project Steps:

Acquire

Within the wrangle.py file there are functions that will help to:

  • Pull in the GSS 2021 survey responses with no duplicates
  • Read the SQL acquire query to a csv file
  • Assign the data to a variable

Prepare

Within the wrangle.py file there are functions for the prepare stage that:

  • Clean up nulls by dropping rows with a missing zodiac value and filling unanswered values in the other columns with 'unknown', to keep data that would otherwise be lost
  • Drop any unnecessary and duplicated columns
  • Reset dtypes to object and int
  • Split the data into train, validate and test
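The null handling and the three-way split described above can be sketched without any dependencies. This is an illustration on toy rows, not the code in wrangle.py: the function names, the 60/20/20 proportions, and the seed are all assumptions.

```python
import random

def prepare(rows, target="zodiac", fill="unknown"):
    """Drop rows with no target value; fill other missing answers with `fill`."""
    cleaned = []
    for row in rows:
        if row.get(target) is None:
            continue  # a row without a zodiac sign cannot be used for training
        cleaned.append({k: (fill if v is None else v) for k, v in row.items()})
    return cleaned

def split(rows, validate_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle and split rows into train, validate and test (fractions assumed)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_validate = int(len(shuffled) * validate_frac)
    test = shuffled[:n_test]
    validate = shuffled[n_test:n_test + n_validate]
    train = shuffled[n_test + n_validate:]
    return train, validate, test

# Toy rows for illustration only; None stands for an unanswered question.
toy_rows = [
    {"zodiac": "Leo", "marital": "married"},
    {"zodiac": None, "marital": "single"},
    {"zodiac": "Aries", "marital": None},
] * 4

cleaned = prepare(toy_rows)
train, validate, test = split(cleaned)
print(len(cleaned), len(train), len(validate), len(test))
```

Filling with 'unknown' keeps a non-answer as its own category, so a row is only lost when the target itself is missing.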

Explore

Within the explore.ipynb file, there is a step-by-step breakdown of my Exploration methods. Main features found and mentioned are:

  • workhard: how participants rate (1-5) 'working hard' as a key life lesson they would teach a child.
  • hlthmntl: each participant's self rating of their mental health
  • socrel: how often participants spend time with relatives/family.
  • decevidc: how participants rate needing sufficient evidence in making decisions

Model

The model.ipynb notebook holds each type of model used in this project: a Decision Tree Classifier, a Random Forest, and 5 different KNN models (with varying numbers of nearest neighbors).
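To illustrate what varying the number of nearest neighbors means, here is a dependency-free KNN stand-in that uses Hamming distance over categorical answers. It is not the notebook's code; in practice a library class such as scikit-learn's `KNeighborsClassifier` with its `n_neighbors` parameter is the usual route. The data, helper names, and the k values tried are all hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two categorical answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training rows closest to x."""
    nearest = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, X, y, k):
    """Share of rows in X whose predicted label matches y."""
    hits = sum(knn_predict(train_X, train_y, x, k) == label for x, label in zip(X, y))
    return hits / len(y)

# Toy (workhard, hlthmntl) answers and signs, for illustration only.
train_X = [("1", "good"), ("1", "good"), ("5", "poor"), ("5", "poor"), ("3", "fair")]
train_y = ["Leo", "Leo", "Aries", "Aries", "Leo"]
val_X = [("1", "good"), ("5", "poor")]
val_y = ["Leo", "Aries"]

# Compare several neighbor counts on the validate split.
scores = {k: accuracy(train_X, train_y, val_X, val_y, k) for k in (1, 3, 5)}
for k, score in scores.items():
    print(f"k={k}: validate accuracy {score:.2f}")
```

With k equal to the size of the training set, every prediction collapses to the majority class, which is why the neighbor count is worth tuning against the validate split.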

The function for the final chosen model can be found in model.py and Final Report.

Conclusion

The best Classification model, Random Forest, did not beat the baseline prediction of 10%.

Moving forward, I would want to try creating cluster features and using feature selection such as SelectKBest to find better predictors for this model.

I would also want to add responses from past GSS years to see whether more data improves the predictions.
