CPCBCCR Data Scraper

Scrapes data from CPCB's CCR dashboard.

Disclaimer

The work in this repo builds on top of the work done by Thejesh GN. I take no responsibility for Thej's work and Thej takes no responsibility for the work I have done in this repo. Please contact individual authors for any queries.

Code

This code uses the data.db file for everything.

First order of business is to set up the sites table in the db.

1. How to set up sites in DB

Add your sites to CSV

Go to CPCB's CCR website and select the state, city and station of your choice.
Open the Network tab in Dev Tools and click 'Submit' on the webpage.
In the Network tab, click on the POST request called fetch_table_data. Under the Request tab you'll see the payload of the request. Scroll to filtersToApply > parameterNames > station and copy the station code, which should look something like site_123.
Edit sites.csv and add the state, city, site and site_name.
Leave the header row as is. Leave the remaining columns blank.

Get available parameters for each site

Use csvjson.com's csv2json tool to create a sites.json file out of your edited sites.csv. Save that JSON in your root directory.
Run yarn or npm install in your root directory.
Run node cpcb_station_params.js and you'll get a sites_with_params.json file which expands sites.json by adding the list of available parameters for each site.
Use csvjson.com's json2csv tool to create a CSV and save it as sites_with_params.csv in your root directory.
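If you'd rather not use a web tool for the two conversions above, here is a minimal Python sketch using only the standard library. It assumes a flat CSV with a header row and a JSON array of flat objects. The function names are made up for this example, and the output may not match csvjson.com's exactly: every value stays a string, and nested values are not flattened.

```python
import csv
import json


def csv_to_json(csv_path, json_path):
    """Convert a CSV file with a header row into a JSON array of objects."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    return rows


def json_to_csv(json_path, csv_path):
    """Convert a JSON array of flat objects into a CSV file with a header row."""
    with open(json_path, encoding="utf-8") as f:
        rows = json.load(f)
    # Collect column names in first-seen order across all objects
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

For example, csv_to_json("sites.csv", "sites.json") would stand in for the csv2json step, and json_to_csv("sites_with_params.json", "sites_with_params.csv") for the json2csv step. Check the result before relying on it.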

Add the sites data to db

Download and install a tool like DB Browser for SQLite.
Open the data.db file using DB Browser.
Click on Import > Table from CSV and select the sites_with_params.csv file from before. Name the imported table sites. You can delete the pre-existing sites table to replace it.
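The same import can also be scripted instead of done through a GUI. The sketch below is one possible way, using Python's built-in csv and sqlite3 modules. It assumes every column can be stored as TEXT and that the first CSV row is the header. The function name import_sites is hypothetical and not part of this repo, and DB Browser may pick different column types on import.

```python
import csv
import sqlite3


def import_sites(csv_path, db_path, table="sites"):
    """Replace `table` in the SQLite file with the contents of the CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for row in reader if row]  # skip blank lines

    # Quote column names so unusual headers cannot break the SQL
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    placeholders = ", ".join("?" for _ in header)

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits the inserts on success
            conn.execute('DROP TABLE IF EXISTS "{}"'.format(table))
            conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
            conn.executemany(
                'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), rows
            )
    finally:
        conn.close()
    return len(rows)
```

Calling import_sites("sites_with_params.csv", "data.db") would drop any existing sites table and recreate it from the CSV, so back up data.db first.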

2. Scrape the data

Now that your sites are set up, you can begin to scrape data.

Use Python 3 and install the requests and dataset modules using pip, ideally inside a virtualenv using requirements.txt. The sqlite3 module ships with Python's standard library and does not need to be installed separately.

Run the following scripts in the given order:

get_availability.py: gets the months for which data is available for each site
check_availability.py: parses the JSON response from get_availability.py into a list
expedite.py: populates the params_query and params_ids columns in the sites table
setup_pull.py: edit this script to set up the dates for which you need to get data (lines 37-39); running this script sets up all the requests that need to be made to pull the data
pull.py: pulls the data set up in the previous script; the data received is JSON
parse.py: parses the JSON data and creates the final data table in db
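If you want to run the whole sequence in one go, a small runner like the one below is one option. It is only a sketch: run_pipeline and SCRIPTS are names made up for this example, it assumes the scripts sit in the current working directory and take no arguments, and it assumes you have already edited the dates in setup_pull.py.

```python
import subprocess
import sys

# Script order as listed above
SCRIPTS = [
    "get_availability.py",
    "check_availability.py",
    "expedite.py",
    "setup_pull.py",  # edit the dates in this script before running
    "pull.py",
    "parse.py",
]


def run_pipeline(scripts=SCRIPTS, run=subprocess.run):
    """Run each script with the current interpreter, stopping at the first failure."""
    completed = []
    for script in scripts:
        print("Running", script)
        # check=True raises CalledProcessError on a non-zero exit code
        run([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

Calling run_pipeline() runs all six scripts. Passing a shorter list, for example run_pipeline(["pull.py", "parse.py"]), resumes from a later step.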

Notes

While all scripts should run quite swiftly, pull.py is going to be the slowest. Each request to the CPCB server takes time, so be patient. Be kind to the server and leave a delay between successive requests.
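As an illustration of spacing out requests, here is a minimal sketch of a loop that pauses between calls. It is not taken from pull.py. The names pull_with_delay, jobs and fetch are hypothetical, and the 2-second default is an arbitrary assumption, not a limit published by CPCB.

```python
import time


def pull_with_delay(jobs, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(job) for each job, pausing between consecutive calls."""
    results = []
    for index, job in enumerate(jobs):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(job))
    return results
```

Here fetch would be whatever function sends one request, for example a wrapper around requests.post. The sleep parameter exists only so the pause can be swapped out in tests.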

You can browse the data for all stations in Delhi, Mumbai and Chennai from 01-01-2010 till 31-12-2020 in the reports directory. No need to fetch that again.

License

This code is licensed under GNU GPL v3.
Please credit by linking to https://thatgurjot.com and https://thejeshgn.com
