Presentation
Lab Notebook (mostly unused)
Located in East Africa within the African Great Lakes region, Tanzania has a population of over 57,000,000 that faces significant challenges in accessing clean water. Many wells need repairs or have stopped working, while others have been added. To help address this issue, the Ministry of Water needs a classification model to identify broken wells.
I looked at three questions:
Is there a pattern with regard to:
- WHO?
  - Government
  - private business
  - etc
- WHERE
  - Hotspots
  - Patterns in location
- PERMIT STATUS
  - Does it affect probability of repair status?
The Taarifa Waterpoints dashboard is an open-source platform aggregating data from the Tanzania Ministry of Water. It helps citizens stay informed about water-related issues in Tanzania, empowering them to participate in water resource management.
Data Source
I began with a baseline model, which had an accuracy score of 54% on the test data.
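The write-up does not say how the baseline was built. A common choice is to always predict the most frequent class, in which case the baseline accuracy is simply that class's share of the test set. The sketch below assumes that approach and uses made-up labels, not the Taarifa data; `majority_class` and `accuracy` are hypothetical helper names.

```python
from collections import Counter

def majority_class(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels for illustration only (not the project's data).
train = ["functional"] * 6 + ["broken"] * 4
test = ["functional", "broken", "functional", "broken", "functional"]

guess = majority_class(train)         # the baseline always predicts this label
baseline_preds = [guess] * len(test)
print(guess, accuracy(test, baseline_preds))
```

Any later model only earns its keep if it beats this kind of score on the same test split.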
I then developed a decision tree model, mainly for its speed when compared to a logistic model.
When working with models that identify broken wells, it is important to prioritize minimizing false positives, that is, working wells flagged as broken. Showing up to a functioning well is worse than ignoring a broken one, which is why the Precision metric was used. Precision is the share of wells predicted broken that are actually broken, so favoring it trades off against Accuracy and may result in more false negatives (broken wells that are missed). The final model precision was 79%, meaning 79% of the wells the model flagged as broken were actually broken.
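To make the metric concrete: with "broken" as the positive class, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below computes both on invented labels (not the project's predictions); the function name `precision_recall` is only illustrative.

```python
def precision_recall(y_true, y_pred, positive="broken"):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # broken, flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # working, flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # broken, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: 4 broken and 4 working wells.
y_true = ["broken"] * 4 + ["working"] * 4
y_pred = ["broken", "broken", "working", "working",
          "broken", "working", "working", "working"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three flagged wells are truly broken (precision of about 0.67), while only two of the four broken wells are found (recall of 0.50), which is the trade-off a precision-focused model accepts.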
For the WHO question, there was no clear pattern among the groups involved: instances were equally divided between breaking and not breaking.
For the WHERE question, Iringa has a high number of broken wells, which is highly correlated with the longitude/latitude. Although there are hotspots, the model itself cannot pinpoint their exact location. Other indicators of broken wells include high population and the year of construction.
Permit status doesn't seem to have an effect. Most wells in Tanzania are permitted, but permitted and unpermitted wells alike seem to be split down the middle between broken and working.
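One way to check a permit-status claim like this is to compare the share of broken wells within each permit group. The sketch below uses invented records; the field names `permit` and `broken` are placeholders and not necessarily the dataset's column names.

```python
from collections import defaultdict

def broken_rate_by_group(records, key):
    """Return the fraction of broken wells for each value of `key`."""
    totals = defaultdict(int)
    broken = defaultdict(int)
    for row in records:
        group = row[key]
        totals[group] += 1
        broken[group] += int(row["broken"])
    return {group: broken[group] / totals[group] for group in totals}

# Toy records: most wells are permitted, and both groups are half broken.
records = (
    [{"permit": True, "broken": True}] * 3
    + [{"permit": True, "broken": False}] * 3
    + [{"permit": False, "broken": True}] * 1
    + [{"permit": False, "broken": False}] * 1
)

rates = broken_rate_by_group(records, "permit")
print(rates)
```

If the rates are close across groups, as in this toy data, permit status carries little signal on its own; the same helper could be pointed at other categorical fields.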
I want to emphasize an important finding: communal wells and hand pump wells are the waterpoints most likely to be affected. It is crucial to prioritize fixing them because they serve a large group of people and are prone to breaking down as more people use them.
One of the biggest problems with the data is its overlapping categories, which could use some cleaning up. Additionally, managing multicollinearity is proving to be very difficult. It may be helpful to look into hotspots where certain places are more likely than others to have issues. For example, Iringa could be an excellent place to focus on, specifically looking into communal standing pipes, handpumps, pumps with low water, old equipment, and areas with a high population.