I retrieve water contamination data for India over a period of four years, run MapReduce on it to extract useful insights, and then normalize and visualize the results using the scikit-learn and Matplotlib libraries in Python.
- Python
- Scikit-learn
- Matplotlib
- Pandas
- Seaborn
- Hadoop Commands
- Jupyter Notebook
- Oracle VM VirtualBox
The dataset is acquired from data.gov.in, which offers real-world government datasets.
The dataset is a structured, comma-delimited CSV file.
The data covers four years, i.e. 2009-2012.
The dataset has 8 attributes and is very large; the 2009 data alone has over 180,000 rows.
You can see the sample data in the file "Sample data.csv".
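A quick way to inspect such a CSV is with pandas. The sketch below uses an inline stand-in for "Sample data.csv"; the column names are illustrative guesses, not the dataset's actual headers:

```python
import io
import pandas as pd

# Stand-in for "Sample data.csv"; the column names here are illustrative
# assumptions, not the dataset's actual headers (it has 8 attributes).
csv_text = """State,District,Block,Panchayat,Village,Habitation,Contaminant,Year
ASSAM,BARPETA,CHENGA,P1,V1,H1,Iron,2009
ASSAM,BARPETA,CHENGA,P1,V2,H2,Arsenic,2009
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 8)
print(df["Contaminant"].value_counts())
```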
- hadoop fs -put (filename)
- hadoop fs -get (folder/file name)
- hadoop fs -cat (filename)
- hs (mapper filename) (reducer filename) (input filename) (output filename)
- hadoop fs -rm (filename)
- hadoop fs -rmr (foldername)
This command puts a file into HDFS. We use it to load our dataset into the cluster.
This command copies a file/folder from HDFS to your local machine. We use it to retrieve the results for further use in Jupyter Notebook.
This command prints a file's contents; we use it to read the output produced by the MapReduce job.
This command is an alias provided by Cloudera for running a MapReduce job with the given mapper and reducer.
This command deletes a file in HDFS.
This command recursively deletes a folder and all the files inside it in HDFS.
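The mapper and reducer passed to `hs` are typically Hadoop Streaming scripts. A minimal Python sketch of the idea, counting records per contaminant (the column index is an assumption, not necessarily this dataset's actual layout):

```python
from collections import defaultdict

def mapper(lines, contaminant_col=6):
    """Emit (contaminant, 1) pairs; the column index is an assumption."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) > contaminant_col:
            yield fields[contaminant_col], 1

def reducer(pairs):
    """Sum the counts per contaminant key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# In a real Hadoop Streaming job, the mapper and reducer would be separate
# scripts reading from stdin; here they are chained in-process for clarity.
sample = [
    "A,B,C,D,E,F,Iron,2009",
    "A,B,C,D,E,F,Iron,2009",
    "A,B,C,D,E,F,Arsenic,2009",
]
print(reducer(mapper(sample)))  # {'Iron': 2, 'Arsenic': 1}
```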
After running the Hadoop MapReduce commands, I obtain output that looks like this:
```
"Arsenic"   9499.0
"Fluoride"  33299.0
"Iron"      101708.0
"Nitrate"   2551.0
"Salinity"  32609.0
MAX: "Iron" 101708.0
MIN: "Nitrate" 2551.0
```
This output represents only the results of running MapReduce on the 2009 data.
Similarly, I ran the code on the remaining data and obtained the corresponding results for the other years.
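The MAX/MIN lines in the output above can be recomputed in Python from the emitted per-contaminant totals (2009 values shown):

```python
# Per-contaminant totals from the 2009 MapReduce output shown above.
totals = {
    "Arsenic": 9499.0,
    "Fluoride": 33299.0,
    "Iron": 101708.0,
    "Nitrate": 2551.0,
    "Salinity": 32609.0,
}

max_key = max(totals, key=totals.get)
min_key = min(totals, key=totals.get)
print("MAX:", max_key, totals[max_key])  # MAX: Iron 101708.0
print("MIN:", min_key, totals[min_key])  # MIN: Nitrate 2551.0
```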
Then, I transfer this data from the virtual machine to my Windows host and load it into a DataFrame for the normalization and visualization part.
This data is now loaded into the DataFrame. We can then call the scikit-learn methods to normalize the data and visualize it henceforth.
You can find the code in the .ipynb file in the repository.
This image shows the normalized data: for each element, it shows the trend it follows, i.e. how much its occurrence in the country has changed over the course of four years. For more details, refer to the report included in the repository, which presents a more detailed analysis of the data, with additional images showing how the trend for each individual element has changed.
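A trend plot of this kind can be produced along these lines with Matplotlib (the normalized values here are again hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

years = [2009, 2010, 2011, 2012]
# Hypothetical normalized occurrence values per contaminant.
trends = {"Iron": [1.0, 0.7, 0.4, 0.0],
          "Fluoride": [0.9, 1.0, 0.3, 0.0]}

fig, ax = plt.subplots()
for element, values in trends.items():
    ax.plot(years, values, marker="o", label=element)
ax.set_xlabel("Year")
ax.set_ylabel("Normalized occurrence")
ax.legend()
fig.savefig("trends.png")
```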