Scrapy-Start

A repo introducing the Scrapy framework.

Run the Project

Steps and requirements to install and run the project

  • You will need Pipenv >= 2020.11.15. Otherwise, the Scrapy install will throw an error.
  1. Clone the project and enter its directory
    git clone https://github.com/LucasEduardoRomero/scrapy-start
    cd scrapy-start

  2. Create the virtualenv
    pipenv shell

  3. Install dependencies
    pipenv install

  4. You are ready to go!

Robots

The project consists of five Spiders / Crawlers spread across three files:

quotes_spider.py

Here we have the file with the Spiders that crawl the quotes.toscrape.com site (a sketch of both Spiders appears at the end of this section).

  1. QuotesSpider
  • This Class is responsible for getting the site content and crawling every quote with its text (the quote itself), author, and tags. Finally, it searches for a 'next' button to follow to the next page and repeats the process.

  • Run the Spider. The output will be printed in the terminal
    scrapy crawl quotes

  • Run the Spider and save the output to a JSON Lines file.
    scrapy crawl quotes -O quotes.jl

  2. AuthorSpider
  • This Class is quite similar to QuotesSpider. It opens the same page, searches for links to the authors' pages, opens them, and passes each response to another function called parse_author. Then it searches for the 'next' button to go to the next page and repeats the process.

  • The parse_author function receives the response of the request to the author's page and parses the name, birthdate, and bio (a short text summarizing the author's life).

  • Run the Spider.
    scrapy crawl author -O author.jl
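
For reference, here is a minimal sketch of what the two Spiders in quotes_spider.py typically look like, following the standard Scrapy tutorial for quotes.toscrape.com. The spider names (quotes, author) and the CSS selectors are assumptions taken from that tutorial, not quoted from this repo; check the file itself for the exact code.

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com"]

        def parse(self, response):
            # Yield the text, author and tags of every quote on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }
            # Follow the 'next' button, if present, and repeat the process.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)


    class AuthorSpider(scrapy.Spider):
        name = "author"
        start_urls = ["https://quotes.toscrape.com"]

        def parse(self, response):
            # Follow every link to an author's page ...
            yield from response.follow_all(
                css=".author + a", callback=self.parse_author
            )
            # ... then follow the 'next' button and repeat the process.
            yield from response.follow_all(css="li.next a", callback=self.parse)

        def parse_author(self, response):
            # Parse the name, birthdate and bio from the author's page.
            yield {
                "name": response.css("h3.author-title::text").get(default="").strip(),
                "birthdate": response.css(".author-born-date::text").get(),
                "bio": response.css(".author-description::text").get(default="").strip(),
            }

With classes like these in the spiders directory, the scrapy crawl quotes and scrapy crawl author commands above pick them up automatically.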

tutorial_spider.py

This file has just one Spider, which crawls the Zyte blog page (a sketch appears after the run command below).

  1. PostSpider

This Spider gets every post title (it's a link to the post page), requests it, and forwards the response to the parse_post function. Then it searches for the 'next' button and repeats the process on the next page.

  • The parse_post function receives the content from the page and parses the post_title and the post_first_text (the first paragraph of the post).

  • Run the Robot (posts is assumed here; use the name attribute defined in the Spider)
    scrapy crawl posts -O posts.jl
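
The Zyte blog selectors are not shown in this README, so the sketch below only illustrates the shape of the Spider described above: a hypothetical name posts (matching the posts.jl command), placeholder CSS selectors, and the parse / parse_post split.

    import scrapy


    class PostSpider(scrapy.Spider):
        name = "posts"  # hypothetical name, assumed to match posts.jl above
        start_urls = ["https://www.zyte.com/blog/"]

        def parse(self, response):
            # Every post title is a link to the post page; request it and
            # forward the response to parse_post.
            # NOTE: the CSS selectors here are placeholders, not the real ones.
            for href in response.css("article h2 a::attr(href)").getall():
                yield response.follow(href, callback=self.parse_post)

            # Look for the 'next' button and repeat the process on the next page.
            next_page = response.css("a.next::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

        def parse_post(self, response):
            # Parse the post title and the first paragraph of the post body.
            yield {
                "post_title": response.css("h1::text").get(),
                "post_first_text": response.css("article p::text").get(),
            }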

fatos_spider.py

This file has 2 Spiders: AosFatosSpider and AosFatosCrawler. Both Spiders scrape all the posts from checked news (the "Checamos" tab) on aosfatos.org (a rough sketch of both appears at the end of this section).

  1. AosFatosSpider
  • Gets the home page content, parses all the links under the "Checamos" tab, requests each one, and forwards the response to the parse_category function

  • parse_category gets all the posts from the page, requests each link, and forwards the responses to parse_fato. After that, it looks for the 'next' button, makes the request, and repeats the process

  • The parse_fato function parses the title, date published, and URL from the page, plus EVERY quote and its status (Verdadeiro, Inconclusivo, Falso), and returns the result.

  2. AosFatosCrawler
  • This Spider works in a way very similar to AosFatosSpider, but instead of spelling out what to do at each step, we just declare how the crawl should behave, with LinkExtractor and Rules.
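
Neither Spider's code is quoted in this README, so the following is only a rough sketch of the two approaches under stated assumptions: the spider names, the URLs beyond aosfatos.org, and every CSS selector / LinkExtractor pattern are placeholders. The first class spells out each step; the second declares the same crawl with LinkExtractor and Rule on a CrawlSpider.

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class AosFatosSpider(scrapy.Spider):
        name = "aosfatos"  # placeholder name
        start_urls = ["https://www.aosfatos.org/"]

        def parse(self, response):
            # Parse the links under the "Checamos" tab on the home page and
            # forward each response to parse_category.
            for href in response.css("nav a[href*='checamos']::attr(href)").getall():
                yield response.follow(href, callback=self.parse_category)

        def parse_category(self, response):
            # Request every post on the page and forward it to parse_fato, then
            # look for the 'next' button, request it and repeat the process.
            for href in response.css("article a::attr(href)").getall():
                yield response.follow(href, callback=self.parse_fato)

            next_page = response.css("a.next::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse_category)

        def parse_fato(self, response):
            # Parse the title, publication date and URL, plus every quote and
            # its status (Verdadeiro, Inconclusivo, Falso).
            yield {
                "title": response.css("h1::text").get(),
                "date_published": response.css("time::attr(datetime)").get(),
                "url": response.url,
                "quotes": [
                    {
                        "quote": block.css("::text").get(),
                        "status": block.css("img::attr(alt)").get(),
                    }
                    for block in response.css("article blockquote")
                ],
            }


    class AosFatosCrawler(CrawlSpider):
        # Same crawl as AosFatosSpider, but instead of coding each step we
        # declare which links to follow with LinkExtractor and Rule.
        name = "aosfatos_crawler"  # placeholder name
        allowed_domains = ["aosfatos.org"]
        start_urls = ["https://www.aosfatos.org/noticias/"]

        rules = (
            # Follow pagination links (no callback, just keep crawling).
            Rule(LinkExtractor(restrict_css="a.next")),
            # Parse every checked-news post found on the listing pages.
            Rule(LinkExtractor(allow=r"/noticias/"), callback="parse_fato"),
        )

        def parse_fato(self, response):
            yield {"title": response.css("h1::text").get(), "url": response.url}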

Finally

That's it ;)

Any feedback is welcome.

Thanks
