tattle-made / factchecking-sites-scraper

A repo to store helper functions for scraping + experiments/visualisations

Error logging for factchecking scraper

tarunima opened this issue

v3 of the fact-checking scraper has the following modules:

  • article crawler
  • article downloader
  • article parser
  • media downloader
  • data_uploader

At present, the scraper stops running if an error occurs in the downloader, parser, or uploader. For some sites there is ad hoc error handling for edge cases; for example, see the get_all_images function here: https://github.com/tattle-made/factchecking-sites-scraper/blob/master/scraper_v3/newschecker.py. But this error handling is not systematic.

We need systematic error handling and error logging for the scrapers. The expected behaviour is:

  • error in the article downloader or parser: log the URL and error type, skip the subsequent modules, and move on to the next URL in url_list.json
  • error in the media downloader: log the URL, media doc ID, and error type; leave the s3 URL null and proceed to the data uploader for the site, so that at the very least the text of the story and its metadata are uploaded to Mongo
  • error in the data uploader: log the URL and error type, then proceed to the next URL in url_list.json

Implementing this will require tweaks both in the individual module functions and in the main function. For managing control flow in the main function after errors at the article downloader/parser stage, see this suggestion by @RishavT: #53.
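
A minimal sketch of what that main loop could look like. The module functions (download_article, parse_article, download_media, upload_data) are hypothetical stubs standing in for the real v3 functions, whose names and signatures may differ; the flat list of URLs in url_list.json is also an assumption:

```python
import json
import logging

logger = logging.getLogger("scraper_errors")

# Placeholder stubs for the real v3 module functions; names and
# signatures here are assumptions, not the actual API.
def download_article(url):
    return "<html></html>"

def parse_article(html):
    return {"media": []}

def download_media(media_doc):
    return "s3://bucket/key"

def upload_data(doc):
    pass

def scrape_site():
    with open("url_list.json") as f:
        urls = json.load(f)

    for url in urls:
        # Error in the article downloader or parser: log the URL and
        # error type, skip the remaining modules, move to the next URL.
        try:
            html = download_article(url)
            doc = parse_article(html)
        except Exception as e:
            logger.error("%s | %s | %s", url, type(e).__name__, e)
            continue

        # Error in the media downloader: log the URL, media doc ID, and
        # error type, leave the s3 URL null, and still run the uploader
        # so the story text and metadata reach Mongo.
        for media in doc.get("media", []):
            try:
                media["s3_url"] = download_media(media)
            except Exception as e:
                logger.error("%s | media doc %s | %s | %s",
                             url, media.get("doc_id"), type(e).__name__, e)
                media["s3_url"] = None

        # Error in the data uploader: log and move on to the next URL.
        try:
            upload_data(doc)
        except Exception as e:
            logger.error("%s | %s | %s", url, type(e).__name__, e)
```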

All the errors for a given site's scraper can be logged in a single <site_name>_errors.txt file. The error file can live in the language-specific folder, for example tmp/<site_name>/<lang>/*_errors.txt.
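
For reference, a sketch of how such a per-site, per-language logger could be set up with Python's standard logging module. get_error_logger is a hypothetical helper, and the log line format is an assumption; only the directory layout follows the convention above:

```python
import logging
import os

def get_error_logger(site_name, lang):
    """Return a logger that appends to tmp/<site_name>/<lang>/<site_name>_errors.txt."""
    log_dir = os.path.join("tmp", site_name, lang)
    os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger(f"{site_name}.{lang}.errors")
    logger.setLevel(logging.ERROR)

    # Guard against attaching duplicate handlers if called more than once.
    if not logger.handlers:
        handler = logging.FileHandler(
            os.path.join(log_dir, f"{site_name}_errors.txt"))
        handler.setFormatter(logging.Formatter("%(asctime)s | %(message)s"))
        logger.addHandler(handler)
    return logger

# Usage, e.g. inside the main loop above:
# logger = get_error_logger("newschecker", "english")
# logger.error("%s | %s", url, "HTTPError")
```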