LucasEduardoRomero / scrapy-start



A Repo for introducing the scrapy framework.

Run the Project

Steps and requirements to install and run the project

  • You will need Pipenv >= 2020.11.15. Otherwise the Scrapy install will throw an error.
  1. Clone the project
    git clone

  2. Create the virtualenv
    pipenv shell

  3. Install dependencies
    pipenv install

  4. You are ready to go!


The project consists of five Spiders / Crawlers spread across three files:

This file contains the Spiders that crawl the site.

  1. QuotesSpider
  • This class fetches the site content and crawls every quote, extracting its text (the quote itself), author and tags. Finally, it looks for a 'next' button, follows it to the next page and repeats the process.

  • Run the Spider. The output will be printed to the terminal.
    scrapy crawl quotes

  • Run the Spider and save the content in a JSON Lines file.
    scrapy crawl quotes -O quotes.jl

  2. AuthorSpider
  • This class is quite similar to QuotesSpider. It opens the same page, searches for links to each author's page, opens them and passes each response to another function called parse_author. Then it searches for the 'next' button to go to the next page and repeats the process.

  • The parse_author function receives the response of the request to the author's page and parses the name, birthdate and bio (a short text summarizing the author's life).

  • Run the Spider.
    scrapy crawl author -O author.jl

This file has just one Spider, which crawls the Zyte blog page.

  1. PostSpider

This Spider gets every post title (it's a link to the post page), requests it and forwards the response to the parse_post function. Then it searches for the 'next' button and repeats the process on the next page.

  • The parse_post function receives the content of the page and parses the post_title and the post_first_text (the first paragraph of the post).

  • Run the Spider
    scrapy crawl -O posts.jl
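
The `.jl` files written with `-O` are JSON Lines files: one JSON object per line. A small sketch of how such a file could be read back with the standard library is shown below. The helper name `read_json_lines` is hypothetical and not part of this project; the field names post_title and post_first_text are the ones described above.

```python
import json
from pathlib import Path


def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For instance, `[post["post_title"] for post in read_json_lines("posts.jl")]` would collect the scraped titles, assuming the spider has already been run and produced `posts.jl`.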

This file has 2 spiders: AosFatosSpider and AosFatosCrawler. Both spiders scrape all the posts from checked news (tab "Checamos") from

  1. AosFatosSpider
  • Gets the home page content and parses all the links on the "Checamos" tab, requests each one and forwards the response to the parse_category function.

  • parse_category gets all the posts from the page, requests each link and forwards the response to parse_fato. After that, it looks for the 'next' button, requests that page and repeats the process.

  • The parse_fato function parses the title, publication date and URL of the page, plus EVERY quote and its status (Verdadeiro, Inconclusivo, Falso), and returns them.

  2. AosFatosCrawler
  • This spider works in a very similar way to AosFatosSpider, but instead of coding each request and callback by hand, we just declare which links to follow and how to handle them, using LinkExtractor and Rules.


That's it ;)

For any feedback, you can:



