samparsky / web-crawler

This is a site crawler built with Scrapy that stores the crawled data in MongoDB.


Simple Web Crawler


This is a simple web crawler that crawls links and parses the results pages. It is built on the [Scrapy](https://scrapy.org/) crawling engine. It uses Extruct to parse the application/ld+json content of each page to retrieve basic event data, and XPath to query the rest of the page content.
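In the crawler itself, Extruct handles the application/ld+json extraction. As a rough illustration of what that step does, the sketch below pulls JSON-LD blocks out of an HTML page using only the standard library; the sample HTML and event values are invented for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_ld = False
        self.items = []  # parsed JSON-LD objects found in the page

    def handle_starttag(self, tag, attrs):
        # Flag that we are inside a JSON-LD script block.
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_ld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld = False

    def handle_data(self, data):
        # Parse the raw script body as JSON while inside a JSON-LD block.
        if self._in_ld:
            self.items.append(json.loads(data))


html = """
<html><head>
<script type="application/ld+json">
{"@type": "Event", "name": "Storytime", "startDate": "2024-05-01"}
</script>
</head></html>
"""

parser = JsonLdParser()
parser.feed(html)
print(parser.items[0]["name"])  # Storytime
```

Extruct performs the same job (and more) via `extruct.extract(...)`, which is what the spider relies on.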

To get started, install the dependencies:


pip install -r requirements.txt

To run the crawler:

 cd <directory>
 scrapy crawl wizard

MongoDB

The MongoDB collection schema is as follows:

    event_name  
    description 
    age_group    
    location     
    price        
    link		 
    event_link 
    date

The MongoDB database is mommy and the collection is crawl. To view the crawled data, run the following commands in the mongo shell:

 > use mommy
 > db.crawl.find()


