I retrieve water contamination data for India over a period of four years, run MapReduce on it to extract useful insights, and then normalize and visualize the results using the scikit-learn and Matplotlib libraries in Python.
- Python
- Scikit-learn
- Matplotlib
- Pandas
- Seaborn
- Hadoop Commands
- Jupyter Notebook
- Oracle VM VirtualBox
The dataset is acquired from data.gov.in, which offers real-world government datasets.
The dataset is a structured, comma-delimited CSV file.
The data covers four years, i.e. 2009-2012.
The dataset has 8 attributes and is very large; the 2009 data alone has over 180,000 rows.
You can see the sample data in the file "Sample data.csv".
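A quick way to inspect such a CSV is with pandas. The sketch below uses an inline stand-in for "Sample data.csv"; the column names are illustrative guesses, not the dataset's actual headers:

```python
import io
import pandas as pd

# Stand-in for "Sample data.csv"; the column names here are illustrative
# assumptions, not the dataset's actual headers (it has 8 attributes).
csv_text = """State,District,Block,Panchayat,Village,Habitation,Contaminant,Year
ASSAM,BARPETA,CHENGA,P1,V1,H1,Iron,2009
ASSAM,BARPETA,CHENGA,P1,V2,H2,Arsenic,2009
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 8)
print(df["Contaminant"].value_counts())
```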
- hadoop fs -put (filename)
- hadoop fs -get (folder/file name)
- hadoop fs -cat (filename)
- hs (mapper filename) (reducer filename) (input filename) (output filename)
- hadoop fs -rm (filename)
- hadoop fs -rmr (foldername)
This command puts a file into HDFS. We use it to load our dataset into the cluster.
This command copies a file/folder from HDFS to your local machine. We use it to retrieve the results for further use in Jupyter Notebook.
This command prints a file's contents; we use it to read the output produced by the MapReduce job.
This command is an alias provided by Cloudera for running a MapReduce job with the given mapper and reducer.
This command deletes a file in HDFS.
This command recursively deletes a folder and all the files inside it in HDFS.
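The mapper and reducer passed to `hs` are typically Hadoop Streaming scripts. A minimal Python sketch of the idea, counting records per contaminant (the column index is an assumption, not necessarily this dataset's actual layout):

```python
from collections import defaultdict

def mapper(lines, contaminant_col=6):
    """Emit (contaminant, 1) pairs; the column index is an assumption."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) > contaminant_col:
            yield fields[contaminant_col], 1

def reducer(pairs):
    """Sum the counts per contaminant key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# In a real Hadoop Streaming job, the mapper and reducer would be separate
# scripts reading from stdin; here they are chained in-process for clarity.
sample = [
    "A,B,C,D,E,F,Iron,2009",
    "A,B,C,D,E,F,Iron,2009",
    "A,B,C,D,E,F,Arsenic,2009",
]
print(reducer(mapper(sample)))  # {'Iron': 2, 'Arsenic': 1}
```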
After running the Hadoop MapReduce commands, I obtain output that looks like this:
```
"Arsenic"   9499.0
"Fluoride"  33299.0
"Iron"      101708.0
"Nitrate"   2551.0
"Salinity"  32609.0
MAX: "Iron" 101708.0
MIN: "Nitrate" 2551.0
```
This output represents only the results of running MapReduce on the 2009 data.
Similarly, I ran the code on the remaining data and obtained the corresponding results for the other years.
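The MAX/MIN lines in the output above can be recomputed in Python from the emitted per-contaminant totals (2009 values shown):

```python
# Per-contaminant totals from the 2009 MapReduce output shown above.
totals = {
    "Arsenic": 9499.0,
    "Fluoride": 33299.0,
    "Iron": 101708.0,
    "Nitrate": 2551.0,
    "Salinity": 32609.0,
}

max_key = max(totals, key=totals.get)
min_key = min(totals, key=totals.get)
print("MAX:", max_key, totals[max_key])  # MAX: Iron 101708.0
print("MIN:", min_key, totals[min_key])  # MIN: Nitrate 2551.0
```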
Then, I transfer this data from the virtual machine to my Windows host and load it into a DataFrame for the normalization and visualization part.
This data is now loaded into the DataFrame. We can then call the scikit-learn methods to normalize the data and visualize it henceforth.
You can find the code in the .ipynb file in the repository.
This image shows the normalized data: for each element, it shows the trend it follows, i.e. how much its occurrence in the country has changed over the course of four years. For more details, refer to the report included in the repository, which presents a more detailed analysis of the data, with additional images showing how the trend for each individual element has changed.
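A trend plot of this kind can be produced along these lines with Matplotlib (the normalized values here are again hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

years = [2009, 2010, 2011, 2012]
# Hypothetical normalized occurrence values per contaminant.
trends = {"Iron": [1.0, 0.7, 0.4, 0.0],
          "Fluoride": [0.9, 1.0, 0.3, 0.0]}

fig, ax = plt.subplots()
for element, values in trends.items():
    ax.plot(years, values, marker="o", label=element)
ax.set_xlabel("Year")
ax.set_ylabel("Normalized occurrence")
ax.legend()
fig.savefig("trends.png")
```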