tattle-made / factchecking-sites-scraper

A repo to store helper functions for scraping + experiments/visualisations

Media URL being saved is incorrect

duggalsu opened this issue

Hi,

Issue: Currently, the post/article URL (i.e. the *.html page) is saved in the docs: { origURL } field for media items, i.e. all content items other than 'text'.

Affects: The following sites:

  • afp
  • digiteye
  • digiteye kannada
  • altnews
  • boomlive
  • factly
  • quint
  • vishvasnews
  • indiatoday

Correct Output: The media URL (e.g. *.jpg, *.png, ...) should be saved in the origURL field of each media document in the db.

Bug location: The bug exists in get_post_<domain>() functions.
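
To illustrate the intended behaviour, here is a minimal sketch of how a get_post_<domain>() function could build the docs list. It is not the code from the repo or from the PR: the helper name `build_docs`, the `mediaType` and `content` field names, and the shape of `items` are assumptions made for this example. Only `origURL` comes from the issue. It also assumes that 'text' items keep the post URL, since the issue excludes them from the bug.

```python
from urllib.parse import urljoin


def build_docs(post_url, items):
    """Build the docs list for one post (hypothetical helper).

    items: list of (media_type, value) pairs, where value is the text
    body for 'text' items and the media src attribute otherwise.
    """
    docs = []
    for media_type, value in items:
        if media_type == "text":
            # Assumption: text items keep the post URL as their origin.
            docs.append(
                {"mediaType": "text", "content": value, "origURL": post_url}
            )
        else:
            # Media items store the URL of the media file itself,
            # resolved against the post URL in case src is relative.
            docs.append(
                {
                    "mediaType": media_type,
                    "content": None,
                    "origURL": urljoin(post_url, value),
                }
            )
    return docs
```

With this shape, an image whose src is `/img/a.jpg` on a post at `https://example.com/post/story.html` is stored with the image URL rather than the *.html URL.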

Fix: The fix is pending the merge of PR #11.

Further required work: Update all media documents in the existing database so that origURL holds the correct media URL.
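
A rough sketch of what the backfill could look like for a single stored post, written against plain dicts so it does not depend on any particular database client. The `postURL` and `mediaType` field names, the helper name `backfill_media_urls`, and the idea that the correct media URLs come from re-scraping the post in document order are all assumptions for illustration. The actual schema and update path may differ.

```python
def backfill_media_urls(post, media_urls):
    """Replace origURL on media docs that still hold the post URL.

    post: dict with 'postURL' and a 'docs' list (assumed schema).
    media_urls: correct media URLs, in the same order as the media
    docs in post['docs'] (assumed to come from a re-scrape).
    Returns the number of docs that were updated in place.
    """
    urls = iter(media_urls)
    updated = 0
    for doc in post["docs"]:
        if doc.get("mediaType") == "text":
            continue  # text docs are not affected by this bug
        correct = next(urls, None)
        # Only touch docs that still carry the buggy value.
        if correct is not None and doc.get("origURL") == post["postURL"]:
            doc["origURL"] = correct
            updated += 1
    return updated
```

Checking that origURL still equals the post URL before overwriting makes the pass safe to re-run: documents saved after the fix are left unchanged.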