thesagarsehgal / SwatchBharatUrbanCrawler

This is a crawler built with Scrapy to crawl the https://sbmurban.org/ website. The repository shows how to crawl ASP.NET websites with Scrapy by handling the __VIEWSTATE form field.

Swatch Bharat Urban Crawler

This is a crawler that crawls the complete https://sbmurban.org/rrr-centers website and extracts all of the available information.

About

  • This crawler was built as a task for ATLAN.
  • The complete data from the website is crawled and stored in a single file.
  • This was a new task, and it involved learning how to scrape ASP.NET websites that use __VIEWSTATE (following https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition as a tutorial; see the sketch after this list).
  • At the end of scraping, a POST request is made to the specified URL.
  • A setup.py file has also been added.
  • The extracted file contains the following columns:
    • State
    • District
    • ULB Name
    • Ward
    • No. of Applications Received
    • No. of Applications Not Verified
    • No. of Applications Verified
    • No. of Applications Approved
    • No. of Applications Approved having Aadhar No.
    • No. of Applications Rejected
    • No. of Applications Pullback
    • No. of Applications Closed
    • No. of Constructed Toilet Photo
    • No. of Commenced Toilet Photo
    • No. of Constructed Toilet Photo through Swachhalaya
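As a rough illustration of the __VIEWSTATE technique described in the tutorial above, the sketch below shows how Scrapy's FormRequest.from_response carries the hidden ASP.NET state fields across a postback. The control name ddlState and the table selectors are hypothetical, not the exact ones used by sbmurban.org:

```python
import scrapy


class AspNetExampleSpider(scrapy.Spider):
    name = "aspnet_viewstate_example"
    start_urls = ["https://sbmurban.org/rrr-centers"]

    def parse(self, response):
        # FormRequest.from_response copies the hidden ASP.NET state fields
        # (__VIEWSTATE, __EVENTVALIDATION, ...) from the page's <form>, so
        # only the postback fields we care about need to be overridden.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "__EVENTTARGET": "ddlState",  # hypothetical control name
                "__EVENTARGUMENT": "",
            },
            callback=self.parse_postback,
        )

    def parse_postback(self, response):
        # Parse the table returned by the postback (selector is illustrative).
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}
```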

Doubts/Assumptions

  1. DOUBT=> How can we make a POST request every 5 minutes, when the crawl itself takes much longer than that?

ASSUMPTION=> To make the POST request every 5 minutes, we can put the project on ScrapingHub and schedule it to crawl every 5 minutes. The crawler is built so that it makes a POST request once crawling completes, so the data is posted automatically.
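A minimal sketch of posting the data on completion, using Scrapy's closed() shortcut for the spider_closed signal; the endpoint URL here is a placeholder, not the one specified in the task:

```python
import requests
import scrapy


class SwatchBharatUrbanSpider(scrapy.Spider):
    name = "swatchbharaturban_crawler"

    # ... parse methods omitted ...

    def closed(self, reason):
        # Called by Scrapy once the spider finishes crawling.
        with open("data/swatchbharat_data.csv", "rb") as f:
            requests.post(
                "https://example.com/upload",  # placeholder endpoint
                files={"file": ("swatchbharat_data.csv", f, "text/csv")},
                timeout=30,
            )
```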

  2. DOUBT=> How many output files are required? One file containing all the information, or 4 files containing information for the 4 different levels (State, District, ULB, and Ward)?

ASSUMPTION=> I have produced only the 1 CSV file whose table was shown in the task, containing all the information, since every other level can easily be derived from that file (a sketch follows).
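For example, the state- and district-level tables can be recovered from the ward-level CSV with a pandas group-by; the column names follow the list above, and the file path follows the setup section below:

```python
import pandas as pd

# Load the single ward-level CSV produced by the crawler.
df = pd.read_csv("swatchbharaturban_crawler/data/swatchbharat_data.csv")

# State-level totals: sum the numeric columns over districts, ULBs and wards.
state_level = df.groupby("State").sum(numeric_only=True)

# District-level totals, keeping the state for context.
district_level = df.groupby(["State", "District"]).sum(numeric_only=True)

state_level.to_csv("state_level.csv")
district_level.to_csv("district_level.csv")
```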

How to Set Up

  1. Clone the repository
git clone https://github.com/sagar-sehgal/SwatchBharaturban_Crawler
  2. Make a virtual environment
virtualenv venv --python=python3
  3. Activate the virtualenv
source venv/bin/activate
  4. Change into the repository directory
cd SwatchBharaturban_Crawler
  5. Install the dependencies
pip install -r requirements.txt
  6. Run the crawler
scrapy crawl swatchbharaturban_crawler

The crawled data will be stored in the swatchbharaturban_crawler/data/swatchbharat_data.csv file.
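One way such a CSV output can be wired up is through Scrapy's feed exports; this is a sketch using the FEEDS setting (Scrapy >= 2.1; overwrite requires >= 2.4), and the repository may configure the output differently, e.g. via FEED_URI or an item pipeline:

```python
# In settings.py (sketch; the repository may use a different mechanism):
FEEDS = {
    "data/swatchbharat_data.csv": {
        "format": "csv",
        "overwrite": True,  # Scrapy >= 2.4
    },
}
```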
