QMCBT-JustinEvans / project-4_Individual

Created a Life Expectancy target, feature-engineered as the mean of multiple numeric features that represent life expectancy under different independent circumstances.

Predicting Life Expectancy

Project Overview:

This is an individual project to demonstrate my ability to source and acquire data independently, outside the classroom environment. It is an opportunity to showcase my ability to plan and execute a project that follows the full Data Science pipeline from start to finish, ending in a presentable deliverable with meaningful and valuable content. I chose data from the World Health Organization (WHO) because it is trustworthy and contains global input that spans many years.

Goals:

  • Locate, Acquire, and Prepare data from a reputable source
  • Explore, Model, and Evaluate to select a model that will outperform Baseline
  • Provide valid and insightful Observations and Recommendations

Reproduction of this Data:

Initial Thoughts

I started with an enormous data set that I thought would be incredible for predictions, but after I cleaned the data and prepared it for Machine Learning, there were so many NaN entries that the data was unusable.

The Plan

  • Follow the Data Science process map

  • Acquire data from WHO data webservice

  • Prepare data

  • Explore data in search of correlation to life expectancy

  • Answer the following initial questions:

    • Question 1. Is there a relationship between Gender and life_expectancy?

    • Question 2. Is the Life Expectancy of Females greater than the Life Expectancy of Males?

    • Question 3. Is there a relationship between Year and life_expectancy?

    • Question 4. Is there a relational difference between the four observable Years in our data?

  • Develop a Model to predict life expectancy.

    • Use drivers identified to explore and build predictive models.
    • Evaluate models on train and validate data using MSE (Mean Squared Error)
    • Select the best model based on the lowest MSE
    • Evaluate the best model on test data
  • Draw conclusions
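
The modeling steps in the plan can be sketched as follows. This is a minimal illustration using only the Python standard library, not the notebook's actual code: the data values, the `year_index` feature, and the helper names `mse` and `fit_simple_linear` are all hypothetical. It compares a mean baseline with a single-feature linear fit, both scored by MSE on a validate split.

```python
from statistics import fmean


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return fmean([(a - p) ** 2 for a, p in zip(actual, predicted)])


def fit_simple_linear(x, y):
    """Ordinary least squares with one feature; returns (slope, intercept)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up numbers for illustration only; these are not WHO data.
train_year_index = [0, 1, 2, 3]
train_target = [46.0, 47.0, 48.5, 49.0]
validate_year_index = [0, 1, 2, 3]
validate_target = [45.5, 47.5, 48.0, 49.5]

# Baseline: always predict the mean of the training target.
baseline_prediction = fmean(train_target)
baseline_mse = mse(validate_target, [baseline_prediction] * len(validate_target))

# Candidate model: a straight line fitted on train, scored on validate.
slope, intercept = fit_simple_linear(train_year_index, train_target)
model_predictions = [slope * x + intercept for x in validate_year_index]
model_mse = mse(validate_target, model_predictions)

print(f"baseline MSE: {baseline_mse:.3f}")
print(f"model MSE:    {model_mse:.3f}")
```

Only the model with the lowest validate MSE would then be scored once on the held-out test split, so that the test score stays an honest estimate.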

Data Dictionary

  • Documentation is hyperlinked
GHO Code | Documentation | Global Health Observatory (GHO) Code Description
--- | --- | ---
WHOSIS_000001 | 📖 | Life expectancy at birth (years)
WHOSIS_000002 | 📖 | Healthy life expectancy (HALE) at birth (years)
WHOSIS_000007 | 📖 | Healthy life expectancy (HALE) at age 60 (years)
WHOSIS_000015 | 📖 | Life expectancy at age 60 (years)
YEAR | | The Year that the data was collected
SEX_BTSX | | Both Sex; this is a combination of both Male and Female Gender (1=True and 0=False)
SEX_MLE | | Male Gender (1=True and 0=False)
SEX_FMLE | | Female Gender (1=True and 0=False)
life_expectancy | | This feature is made from the mean of each of the WHOSIS features
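
As a rough sketch of how the `life_expectancy` target described above could be derived, the snippet below averages the four WHOSIS columns for a single record. The helper name `add_life_expectancy` and the example values are hypothetical, and the standard library is used only to keep the example self-contained.

```python
from statistics import mean

# The four WHO indicator codes listed in the data dictionary.
WHOSIS_COLUMNS = [
    "WHOSIS_000001",
    "WHOSIS_000002",
    "WHOSIS_000007",
    "WHOSIS_000015",
]


def add_life_expectancy(row):
    """Return a copy of row with life_expectancy set to the mean of the WHOSIS columns."""
    values = [row[column] for column in WHOSIS_COLUMNS]
    return {**row, "life_expectancy": mean(values)}


# Made-up values for illustration only; these are not taken from the WHO data.
example_row = {
    "WHOSIS_000001": 80.0,
    "WHOSIS_000002": 70.0,
    "WHOSIS_000007": 18.0,
    "WHOSIS_000015": 24.0,
    "YEAR": 2015,
    "SEX_BTSX": 0,
    "SEX_MLE": 0,
    "SEX_FMLE": 1,
}

result = add_life_expectancy(example_row)
print(result["life_expectancy"])
```

With a pandas DataFrame `df` holding the same columns, the equivalent row-wise mean would be `df[WHOSIS_COLUMNS].mean(axis=1)`.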

Takeaways and Conclusions

Exploration Summary of Findings:

  • Gender seems to have an impact on life expectancy
  • Both Male and Female life expectancies rise over the years
  • In general, year over year, life expectancy seems to maintain an upward trend
  • Women have a higher life expectancy than men
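
The gender comparison behind Question 2 can be illustrated with a two-sample t statistic. The sketch below assumes Welch's unequal-variance form, which may differ from the variant used in the notebook, and the sample values and the function name `welch_t_statistic` are made up for illustration.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t_statistic(sample_a, sample_b):
    """Welch's t statistic for the difference in means (sample_a minus sample_b)."""
    mean_difference = fmean(sample_a) - fmean(sample_b)
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return mean_difference / standard_error


# Made-up life_expectancy values for illustration only; these are not WHO data.
female = [50.0, 51.0, 49.0, 52.0]
male = [47.0, 48.0, 46.0, 49.0]

t_statistic = welch_t_statistic(female, male)
print(f"t = {t_statistic:.3f}")
```

A positive statistic points in the direction of the hypothesis that female life expectancy is higher. Turning it into a p-value needs the t distribution, which a library routine such as `scipy.stats.ttest_ind(female, male, equal_var=False, alternative="greater")` provides.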

Modeling Summary:

  • All models did slightly better than baseline.

  • None of the models were within acceptable proximity to actual target results

  • Our top model, the Simple Linear Regression Model, was run on test data. It beat baseline as expected and even outperformed its validation score by approximately three basis points.

For this iteration of modeling, we have a model that beats baseline.

Conclusions:

  • Exploration:
    • We asked four questions and used T-Test and ANOVA statistical testing to evaluate our hypotheses
    • In general, year over year, life expectancy seems to maintain an upward trend
    • Women have a higher life expectancy than men
  • Modeling:
    • We trained and evaluated 6 different Linear Regression Models, all of which outperformed baseline
    • We chose the Simple Linear Regression Model as our best performing model
    • When evaluated on test, it continued to outperform baseline and surpassed its previous performance on validate
  • Recommendations:
    • I think we should hold off on deploying this model.
    • Even though it beat baseline, its predictions came nowhere near the actual values.
    • We can acquire a much better dataset given more time
  • Next Steps:
    • I would like to request more time to investigate the data available on the Athena data webservice managed by the World Health Organization.
    • I also came across some similar projects that I can reference to research their findings in comparison to my own.

