Water Well Classification

Overview

Tanzania is currently going through a water crisis. Of its population of 59 million, 16 million people (28%) lack access to safe water, and 44 million (73%) lack access to safely managed household sanitation facilities (water.org, 2023). There are approximately 60,000 water wells in the country, many of which are nonfunctional and in need of repair. Using data from the Tanzanian Ministry of Water, our project aims to produce a machine-learning model that can accurately predict whether a well is functional or nonfunctional.

Business Understanding

The stakeholders who would benefit from a model that predicts water well functionality are the UN's Water Aid Org, the Government of Tanzania, and local non-profit groups. The model that would most benefit the population in need has a low false negative rate, meaning it rarely predicts that a well is functional when it is actually nonfunctional. Such a misclassification would leave a broken well unrepaired and the local population without water. We also want to minimize the false positive rate, so that stakeholders do not send mechanics to fix a perfectly functional well, which would waste time.
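To make the two error types concrete, here is a minimal sketch that counts them for a handful of wells. It assumes the label convention used in this project (0 = functional, 1 = nonfunctional, so "positive" means nonfunctional). The labels are made up for illustration, and `count_errors` is a hypothetical helper, not part of the project code.

```python
# Assumed convention: 0 = functional, 1 = nonfunctional ("positive" class).

def count_errors(y_true, y_pred):
    """Return (false_negatives, false_positives) for binary 0/1 labels."""
    # False negative: the well is nonfunctional but the model says functional.
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # False positive: the well is functional but the model says nonfunctional.
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return false_negatives, false_positives

# Made-up labels for six wells, purely for illustration.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

fn, fp = count_errors(y_true, y_pred)
print(f"broken wells missed (false negatives): {fn}")
print(f"working wells flagged (false positives): {fp}")
```

In this toy example, one broken well would go unrepaired and one working well would receive an unnecessary visit.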

Data Understanding

Our dataset comes from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water. We found the data through the DrivenData competition Pump It Up: Data Mining the Water Table (https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/). The original dataset had 59,400 entries and 41 features. After some strict EDA and a bit of feature engineering, we were left with 48,651 entries and 12 features to work with.

Features:

basin: Geographic water basin
region: Geographic location
population: Population around the well
construction_year: year the well was constructed
extraction_type_class: type of extraction used to make the well
payment_type: how it was paid for
water_quality: quality of the well water
quality_group: quality of the water
quantity: quantity of well water
source: source of water
waterpoint_type: the kind of waterpoint
status_group: whether the well is functional (0) or nonfunctional (1)

Exploratory Data Analysis

We looked at the distribution of each feature across the two classes.
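One simple way to compare a categorical feature across the two classes is to compute, for each category, the share of wells in each status group. The sketch below does this in plain Python. The rows are invented for illustration (they are not taken from the dataset), and `class_distribution` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Invented (quantity, status_group) pairs; 0 = functional, 1 = nonfunctional.
rows = [
    ("enough", 0), ("enough", 0), ("enough", 1),
    ("dry", 1), ("dry", 1),
    ("insufficient", 0), ("insufficient", 1),
]

def class_distribution(rows):
    """Map each category to the fraction of its wells in each status group."""
    counts = defaultdict(Counter)
    for value, status in rows:
        counts[value][status] += 1
    return {
        value: {status: n / sum(c.values()) for status, n in c.items()}
        for value, c in counts.items()
    }

dist = class_distribution(rows)
for value, shares in dist.items():
    print(value, shares)
```

A category whose wells fall almost entirely into one status group is a hint that the feature carries signal for the classifier.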

Modelling

We found the best model to be a k-nearest neighbors (KNN) model with hyperparameters tuned by grid search.
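As a rough sketch of what such a search can look like in scikit-learn, the example below tunes a KNN classifier with `GridSearchCV`, scoring on recall. The synthetic data, the scaling step, and the parameter grid are all assumptions made for illustration; the grid the project actually searched is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the well data (assumption): label 1 = nonfunctional.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN is distance-based, so features are scaled before fitting.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid only; these values are not the project's.
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

# scoring="recall" picks the combination that misses the fewest class-1 wells.
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
```

Scoring on recall matches the stated priority of minimizing false negatives; swapping in `scoring="f1"` would instead favor the balance between the two error types.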

Summary Metrics:

Precision score: 0.72
Recall score: 0.77
F1 score: 0.74
Cross-validation accuracy score: 0.80
ROC AUC: 0.88

We assessed the model with several metrics. Recall, our most important metric, reflects how many nonfunctional wells the model misses (false negatives, where a well is called functional when it is not). The F1 score balances recall against precision, which is lowered by false positives (where a well is called nonfunctional when it is functional).
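For reference, these metrics follow directly from the error counts. The sketch below uses the standard definitions, with hypothetical helper names, and checks that the reported F1 score is consistent with the reported precision and recall.

```python
def precision(tp, fp):
    """Of the wells predicted nonfunctional, the fraction that really are."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the truly nonfunctional wells, the fraction the model catches."""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Plugging in the reported precision (0.72) and recall (0.77):
reported_f1 = round(f1(0.72, 0.77), 2)
print(reported_f1)  # 0.74
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is why it serves as the balance metric here.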

The AUC is comparable to our baseline, but our recall score is much higher at 0.77, and the F1 score, the highest of any model at 0.74, shows a better balance between the two error types.

We can confidently say that the final model is best at minimizing both false negatives and false positives, which would be beneficial to both the people who need a functioning well and to the non-profit groups deployed to check up on non-functional wells.

Our most important features for this model were found using permutation importance, which shuffles the values of each feature in turn and measures how much the recall score drops as a result. The larger the drop, the more the model depends on that feature.
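A minimal sketch of this procedure with scikit-learn's `permutation_importance`, scored on recall, is shown below. The synthetic data, the fixed `n_neighbors=5`, and the generic feature indices are assumptions for illustration, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the well data (assumption): label 1 = nonfunctional.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Shuffle each feature column 10 times on held-out data and record
# the average drop in recall caused by breaking that feature.
result = permutation_importance(
    model, X_test, y_test, scoring="recall", n_repeats=10, random_state=0
)

# Rank features from most to least important.
ranking = sorted(
    enumerate(result.importances_mean), key=lambda pair: pair[1], reverse=True
)
for index, mean_drop in ranking:
    print(f"feature {index}: mean recall drop {mean_drop:.3f}")
```

Computing the importances on held-out data rather than the training set gives a better picture of which features the model relies on when generalizing.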

Evaluation

This model did considerably better than the others on recall. Our first priority is to minimize false negatives, so recall is the score we optimized for, and we gave up a marginally higher F1 score in exchange for greater recall.

Recommendations

Some recommendations for next steps include looking at the most important features and prioritizing them:

1. Water quantity was the most important feature, so we should examine how it affects functionality and whether changes in quantity lead to a well becoming nonfunctional.
2. Water quality should also be tested to determine whether the water is drinkable. More quantitative data on water quality, such as salt and fluoride content, can help us determine how drinkable the water from particular wells and sources is.
3. Wells should be located close to villages, and more data should be collected on the distance between wells and villages to analyze how accessible the wells are. The nearer a well is, the better the life outcomes for the people it serves, especially women, who are disproportionately affected by distant wells because they are the main group collecting water for their families.

Next Steps

We would like to extend this model to predict water well functionality in other countries, so that wells predicted to be nonfunctional can be checked proactively.

LaibahAshfaq / Water-Well-Classification