razi-rais / TheguardianScrapper

A Scrapy webscraper that can scrape and store articles of theguardian.com

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

TheguardianScrapper

A Scrapy webscraper that can scrape and store articles of theguardian.com

Installation

Use the package manager pip to install required libraries.

pip install -r requirements.txt

Usage

To start scraping, make sure to create a cluster in MongoDB Atlas and use your connection credentials. Update settings.py:

MONGO_URI = 'Connection URI'
MONGO_DATABASE = 'Database Name'

Then, run the command :

scrapy crawl theguardian

To run the server API use the same credentials for MongoDB in server.py. Then, run the command :

env FLASK_APP=server.py flask run

API

The guardian spider crawls the following data:

Key Type Description
author Array of strings Author(s) of the article.
headline String Headline of the article.
content String The article's content (text only).
standfirst String The article's standfirst (text only).
label Array of strings The article's tags
url String The article's page url.
published_at Date Published date of the article.

The server API provides the following:

GET /articles

Get the list of crawled articles.

  • Path parameters :
Key Type Default value Description
page integer 1 Specify which page to query
num_articles integer 5 Specify number of articles in each page
  • Response :
{ 
  'status' : 'success',
  'page' : 'page number',
  'num_articles_found' : 'the total number of articles queried',
  'num_articles_per_page' : 'the number of articles in each page',
  'results' : [array of items queried]
}

GET /search/(content | headline | author)

Search for articles either keywords in content or headline, or author name.

  • Path parameters :
Key Type Default value Description
page integer 1 Specify which page to query
num_articles integer 5 Specify number of articles in each page
query string empty Pass a text query to search. This value should be URI encoded.
  • Response :
{ 
  'status' : 'success',
  'page' : 'page number',
  'num_articles_found' : 'the total number of articles queried',
  'num_articles_per_page' : 'the number of articles in each page',
  'results' : [array of items queried]
}

Known Issues

  • Article content selectors need improvements.
  • Search regexs need improvements.

TODO

  • Use Readability framework to improve content selector.

About

A Scrapy webscraper that can scrape and store articles of theguardian.com


Languages

Language:Python 100.0%