LaibahAshfaq / Water-Well-Classification

Water Well Classification

water well pic

Overview

Tanzania is currently going through a water crisis. Out of its population of 59 million people, 16 million people (28% of the population) lack access to safe water, and 44 million people (73%) lack access to safely managed household sanitation facilities (water.org, 2023). There are approximately 60,000 water wells in the country; many are nonfunctional and require repair. Using data from the Tanzanian Ministry of Water, our project aims to produce a machine-learning model that can accurately predict whether a well is functional or nonfunctional.

Business Understanding

The stakeholders that would benefit from a model that predicts water well functionality are the UN's Water Aid Org, the Government of Tanzania, and local non-profit groups. The model that would most benefit the population in need is one with a low false negative rate, meaning it rarely predicts that a well is functional when it is actually nonfunctional. Such a misclassification would leave the local population without water. We also want to minimize the false positive rate, so that stakeholders do not send mechanics to fix a perfectly functional well, which would waste time.
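The two error rates described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the project: the function name `error_rates` and the example labels are hypothetical, and it assumes nonfunctional wells are the positive class (1) and functional wells the negative class (0).

```python
def error_rates(y_true, y_pred):
    """Return (false_negative_rate, false_positive_rate) for binary labels.

    Assumption: 1 = nonfunctional (positive class), 0 = functional.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # broken well flagged as broken
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # broken well missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # working well flagged as broken
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # working well left alone
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


# Made-up labels for illustration only (not project data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
fnr, fpr = error_rates(y_true, y_pred)
print(f"false negative rate: {fnr:.2f}, false positive rate: {fpr:.2f}")
```

A false negative here is a broken well that nobody is sent to repair, which is why the false negative rate is treated as the more costly of the two.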

Data Understanding

Our source for the dataset is the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water. We found this data through the DrivenData competition Pump It Up: Data Mining the Water Table (https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/). The original dataset had 59,400 entries and 41 features. After strict EDA and some feature engineering, we were left with 48,651 entries and 12 features to work with.

Features:

  • basin: Geographic water basin
  • region: Geographic location
  • population: Population around the well
  • construction_year: Year the well was constructed
  • extraction_type_class: Type of extraction the well uses
  • payment_type: How the water is paid for
  • water_quality: Quality of the well water
  • quality_group: Quality of the water
  • quantity: Quantity of well water
  • source: Source of the water
  • waterpoint_type: The kind of waterpoint
  • status_group: Whether the well is functional (0) or nonfunctional (1)

Exploratory Data Analysis

We looked at the distribution of features across the two classes.

Modelling

The best model was a k-nearest neighbours (KNN) classifier, with hyperparameters tuned through a grid search.
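To show what a grid search over KNN hyperparameters involves, here is a dependency-free sketch. It is not the project's code: the data is synthetic, the grid values (`k`, Minkowski order `p`) are assumptions, and for brevity it scores each candidate by recall on a single holdout split rather than with cross-validation.

```python
import random
from collections import Counter
from itertools import product


def knn_predict(train_X, train_y, x, k, p):
    """Majority vote among the k nearest training points (Minkowski distance of order p)."""
    dists = sorted(
        (sum(abs(a - b) ** p for a, b in zip(row, x)) ** (1 / p), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def recall(y_true, y_pred):
    """Share of positive (1 = nonfunctional) cases that were predicted as positive."""
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Synthetic two-class data for illustration only (two Gaussian clusters).
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(60)]
y = [0] * 60 + [1] * 60

# Shuffle and split into training and validation sets.
idx = list(range(len(X)))
rng.shuffle(idx)
train_idx, valid_idx = idx[:90], idx[90:]
train_X, train_y = [X[i] for i in train_idx], [y[i] for i in train_idx]
valid_X, valid_y = [X[i] for i in valid_idx], [y[i] for i in valid_idx]

# Hypothetical grid: number of neighbours and distance order (1 = Manhattan, 2 = Euclidean).
grid = {"k": [1, 3, 5, 7], "p": [1, 2]}

best = None  # (score, k, p) of the best combination seen so far
for k, p in product(grid["k"], grid["p"]):
    preds = [knn_predict(train_X, train_y, x, k, p) for x in valid_X]
    score = recall(valid_y, preds)
    if best is None or score > best[0]:
        best = (score, k, p)

print(f"best recall={best[0]:.2f} with k={best[1]}, p={best[2]}")
```

In practice, scikit-learn's `GridSearchCV` wrapped around `KNeighborsClassifier` does the same job with cross-validation built in; the loop above only makes the mechanics visible.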

Summary Metrics:

  • Precision score: 0.72
  • Recall score: 0.77
  • F1 score: 0.74
  • Cross-validation accuracy score: 0.80
  • ROC AUC: 0.88

We assessed the model with several metrics. Recall, our most important metric, reflects how many nonfunctional wells the model misses (false negatives: calling a well functional when it is not). The F1 score balances those false negatives against false positives (calling a well nonfunctional when it is functional).
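As a quick reference, the sketch below shows how precision, recall and F1 relate to each other. The function names and the confusion-matrix counts are hypothetical (they are not the project's actual counts); the last lines only check that the reported F1 is consistent with the reported precision and recall.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts (positive = nonfunctional)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=8, fp=2, fn=2)
example_f1 = f1_score(p, r)

# Consistency check using the scores reported in this README.
reported_f1 = f1_score(0.72, 0.77)
print(round(reported_f1, 2))
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two scores, so a model cannot reach a high F1 by maximizing recall alone.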

The AUC is comparable to our baseline, but the recall score is much higher at 0.77, and the F1 score, at 0.74, is the highest of the models we tried, which indicates a better balance.

Of the models we tried, the final model is the best at minimizing both false negatives and false positives, which benefits both the people who need a functioning well and the non-profit groups sent to check on nonfunctional wells.

We found the most important features for this model using permutation importance. This technique shuffles the values of one feature at a time, so the feature no longer carries information, and measures how much the recall score drops as a result. The larger the drop, the more important the feature.
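Permutation importance can be sketched without any libraries. The example below is illustrative only: the toy model, the synthetic data and the helper names are assumptions, not the project's KNN model or dataset. It uses recall as the score, as described above.

```python
import random


def recall(y_true, y_pred):
    """Share of positive (1) cases that were predicted as positive."""
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in recall when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = recall(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - recall(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic data: the label depends only on feature 0; feature 1 is noise.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(100)]
y = [1 if row[0] > 0.5 else 0 for row in X]


def toy_model(row):
    """Hypothetical stand-in for a fitted classifier; it ignores feature 1."""
    return 1 if row[0] > 0.5 else 0


importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling feature 0 hurts recall, while shuffling the noise feature changes nothing. scikit-learn offers the same idea as `sklearn.inspection.permutation_importance`, which accepts a `scoring` argument such as `"recall"`.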

Evaluation

Judged by recall, this model did considerably better than the others. Our first priority is to minimize false negatives, so recall is the metric we optimized for, and we accepted a marginally lower F1 score in exchange for a higher recall.

Recommendations

Some recommendations for next steps include looking at the most important features and prioritizing them:

  1. The quantity of water was the most important feature, so we should examine how it affects functionality and whether changes in quantity make a well more or less likely to be functional.

  2. Water quality should also be tested to determine whether the water is drinkable. More quantitative data on water quality, such as salt and fluoride content, can help determine how drinkable the water from particular wells and their sources is.

  3. Water wells should be located close to villages, and more data should be collected on the distance between wells and villages to analyze how accessible the wells are. The nearer a well is, the better the life outcomes for the people it serves, especially women, who are disproportionately affected by distant wells because they are the main group collecting water for their families.

Next Steps

  1. Use this model to predict well functionality in other countries, so that wells predicted to be nonfunctional can be checked proactively.
