Nifoo / CrawlerScrapy

CrawlerScrapy is a web crawler targeting Jobbole.com and Linkedin.com, built with Scrapy, MySQL, and Elasticsearch.

Web Crawler using Scrapy

  • Applied login simulation, random User-Agent headers, and random IP proxies to avoid being blacklisted by anti-scraping measures. I eventually dropped the IP proxy part because the websites I crawled never blocked me.
  • Implemented two spiders: one collects tech article information from www.jobbole.com, the other collects users' public profiles from www.linkedin.com.
  • Implemented two storage backends: MySQL and Elasticsearch.
  • Implemented a pipeline that parses user profiles from LinkedIn, generates the inverted index and completion suggester field, and stores the result in Elasticsearch.
  • Chose LinkedinSpider + LinkedinPipeline + Elasticsearch to build the database for "Lnkn", a search engine that is another project of mine. See: https://github.com/Nifoo/Lnkn
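
Two of the pieces listed above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the class name `RandomUserAgentMiddleware`, the helper `build_profile_doc`, the `USER_AGENTS` pool, the field names, and the weights are all assumptions made for the example. The middleware relies only on the fact that Scrapy calls `process_request(request, spider)` on each downloader middleware, and the document builder assumes the target index maps `suggest` as a `completion` field.

```python
import random

# Hypothetical pool of User-Agent strings; a real crawler would use a larger list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request. Returning None
        # lets the request continue through the remaining middlewares.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None


def build_profile_doc(name, headline, skills):
    """Build a document with a completion-suggester field (assumed schema)."""
    # Collect secondary suggestion terms, dropping empties and duplicates
    # while preserving their original order.
    secondary = []
    for term in [headline, *skills]:
        if term and term not in secondary:
            secondary.append(term)

    # The person's name gets a higher weight (assumed value) so that it
    # ranks above headline and skill matches in autocomplete results.
    suggest = [{"input": [name], "weight": 10}]
    if secondary:
        suggest.append({"input": secondary, "weight": 3})

    return {
        "name": name,
        "headline": headline,
        "skills": list(skills),
        "suggest": suggest,
    }
```

In a Scrapy project the middleware would be enabled through the `DOWNLOADER_MIDDLEWARES` setting, for example with a hypothetical module path such as `"myproject.middlewares.RandomUserAgentMiddleware": 543`. The dictionary returned by `build_profile_doc` is what an item pipeline could pass to an Elasticsearch client for indexing; Elasticsearch builds the inverted index for the text fields itself at index time.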

Languages

Python 100.0%