SodiqSrb / Data-cleaning


Data-cleaning

Model Prediction

Costa Rican Household Poverty Level Prediction

Roadmap

I spent the last couple of months analyzing data from sensors, surveys, and logs. No matter how many charts I created or how sophisticated the algorithms were, the results were always misleading.

Throwing a random forest at dirty data is like injecting it with a virus: a virus whose only purpose is to hurt your insights, because the data is already spewing garbage.

Even worse, when you show your new findings to the CEO, oops, guess what? They find a flaw, something that doesn't smell right; your discoveries don't match their understanding of the domain. After all, they are the domain experts who know better than you, the analyst or developer.

Right away, the blood rushes to your face, your hands shake, and a moment of silence follows, probably ending in an apology.

And that is not even the worst outcome. What if your findings were taken as a guarantee, and your company ended up making a decision based on them?

You ingested a bunch of dirty data, didn't clean it up, and told your company to act on results that turned out to be wrong. To avoid all of this, here is a simple but very effective way I clean data, no matter how big it is.
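As a rough illustration of that first pass, the sketch below shows the kind of cleaning steps involved. It is a minimal example under stated assumptions, not this repo's notebook: the file name `raw_data.csv` and the fill strategies are hypothetical, chosen for demonstration.

```python
import pandas as pd

# Load the raw data (hypothetical file name -- substitute your own source).
df = pd.read_csv("raw_data.csv")

# First look: shape, dtypes, and the columns with the most missing values.
print(df.shape)
df.info()
print(df.isna().sum().sort_values(ascending=False).head(10))

# Drop exact duplicate rows; they silently inflate every aggregate.
df = df.drop_duplicates()

# Handle missing values explicitly instead of letting the model guess:
# median for numeric columns, a sentinel label for everything else.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna("missing")

# Sanity-check ranges so impossible values surface before modelling.
print(df.describe())
```

Median filling and a "missing" sentinel are deliberately conservative defaults; the right strategy depends on the dataset.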

Languages and Utilities Used

  • Python
  • Anaconda
  • Jupyter Notebook

Environments Used

  • Windows 10 (21H2)

Program walk-through:

Import data:
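A minimal sketch of the import step is below. The file name `train.csv` follows the usual Kaggle layout for the Costa Rican Household Poverty Level Prediction competition; the actual path in this repo's notebook may differ.

```python
import pandas as pd

# Import the data (file name assumed from the Kaggle competition layout).
train = pd.read_csv("train.csv")

# Quick structural overview before any cleaning or modelling.
print(train.shape)
train.head()
```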
