tattle-made / factchecking-sites-scraper

A repo to store helper functions for scraping + experiments/visualisations

Download Fact Check Data for 2019 in a JSON format with the s3 url modified

dennyabrain opened this issue

The fact check dataset is stored on MongoDB Atlas, in a database called factcheck_sites, in a collection called stories.

Here is an example of one doc from stories:

{
  "_id": {
    "$oid": "5db02a1888ab2b22f0a5942e"
  },
  "postID": "1a5ad9881e3e42978bde82472194f000",
  "postURL": "https://www.altnews.in/madhu-kishwar-tweets-photoshopped-image-of-amul-ad-targeting-gandhi-family/",
  "domain": "altnews.in",
  "headline": "Madhu Kishwar tweets photoshopped image of Amul Ad targeting Gandhi family",
  "date_accessed": "October 23, 2019",
  "date_updated": "September 01, 2019",
  "author": {
    "name": "Jignesh Patel",
    "link": "https://www.altnews.in/author/jignesh/"
  },
  "docs": [
    {
      "doc_id": "0719485be1454eb3bc7bbc71adfa17da",
      "postID": "1a5ad9881e3e42978bde82472194f000",
      "domain": "altnews.in",
      "origURL": "https://www.altnews.in/madhu-kishwar-tweets-photoshopped-image-of-amul-ad-targeting-gandhi-family/",
      "s3URL": null,
      "possibleLangs": [],
      "isGoodPrior": [
        {
          "$numberInt": "0"
        },
        {
          "$numberInt": "0"
        }
      ],
      "mediaType": "text",
      "content": "“बिना कुछ कहे, सब कुछ कह दिया। (Says everything without saying anything -translated)”, reads a tweet by academician and writer Madhu Purnima Kishwar with a photograph of billboard, which showed the trademark Amul girl along with cartoons of Congress leader Rahul Gandhi and Priyanka Gandhi. The billboard had an inscribed message that was targeted at the dynasty politics and allegation of corruptions by the Gandhi family. The message reads, “नाना ने खाया , दादीने खाया , पापा ने खाया , मम्मी ने खाया आओ बहना तुम भी खालो जीजू को भी यहाँ बुला लो (Grandfather ate, Grandmother ate, Father ate, Mother ate and Sister you also eat and also call brother-in-law -translation)”, reads the Hindi text in the billboard. The word ‘खाया (ate)’, contextually refers to corruption.\nPhotoshopped image\nAlt News found that the Ad banner used in the billboard is photoshopped with the Hindi text stated above. A reverse search of the image on Google reveals that there are several images of the same car and billboard with different banners, which goes to suggest that it has been photoshopped.\nMoreover, the Amul Ad banner, which comprises of the iconic Amul girl, Priyanka Gandhi and Rahul Gandhi has also been photoshopped with the Hindi text and the Amul slogan (The Taste of India). Amul had tweeted the original photo of the Amul Ad dedicated to Priyanka Gandhi’s entry into active politics before the 2019 general elections.\nBoom also spoke to daCunha, the advertisement agency behind the campaign and confirmed, “the viral photo was fake and it didn’t come from the agency.” \nIn conclusion, academician Madhu Kishwar tweeted a photoshopped image of Amul Ad, which targeted the Gandhi family portraying it as the official advertisement from the company. In the past as well, the writer and academician has been found spreading misinformation on several occasions (1, 2, 3, 4). 
Last December, Kishwar tweeted an old video of a rally with a false claim that Pakistani flags were waved by the Muslim community in celebration of the Congress’ electoral victory in three assembly elections. When it was pointed out that she had posted misinformation, she tweeted another misleading video to defend her last tweet.\nDonate Now\nEnter your email address to subscribe to Alt News and receive notifications of new posts by email.\nSend this to a friend",
      "nowDate": "October 23, 2019"
    },
    {
      "doc_id": "5eb1d53cfc7c4c8db993f9d7428a9b76",
      "postID": "1a5ad9881e3e42978bde82472194f000",
      "domain": "altnews.in",
      "origURL": "http://www.altnews.in/wp-content/uploads/2019/08/2019-08-31-22_54_36-Google-Search.png",
      "s3URL": "REMOVED_TO_PREVENT_ABUSE",
      "possibleLangs": [],
      "isGoodPrior": [
        {
          "$numberInt": "0"
        },
        {
          "$numberInt": "0"
        }
      ],
      "mediaType": "image",
      "content": null,
      "nowDate": "October 23, 2019",
      "onPortal": true
    }
  ]
}

This doc represents an article on a fact check website. Each article/story has a docs array in which every item is one media element from the article: a text portion, an image or a video.
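To make the structure concrete, here is a minimal sketch of walking the docs array and grouping media elements by mediaType. The helper name `group_docs_by_media_type` is hypothetical, and the sample story is a trimmed-down copy of the example above, not a full record.

```python
from collections import defaultdict


def group_docs_by_media_type(story):
    """Group the media elements in a story's docs array by their mediaType."""
    grouped = defaultdict(list)
    for doc in story.get("docs", []):
        grouped[doc.get("mediaType")].append(doc)
    return dict(grouped)


# Trimmed-down version of the example document (most fields omitted).
story = {
    "postID": "1a5ad9881e3e42978bde82472194f000",
    "docs": [
        {"doc_id": "0719485be1454eb3bc7bbc71adfa17da", "mediaType": "text"},
        {"doc_id": "5eb1d53cfc7c4c8db993f9d7428a9b76", "mediaType": "image"},
    ],
}

for media_type, docs in group_docs_by_media_type(story).items():
    print(media_type, [d["doc_id"] for d in docs])
```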

Notice that in the example above, the doc whose mediaType is image has "s3URL": "REMOVED_TO_PREVENT_ABUSE". When you connect to our database, you will see a real URL to our S3 bucket, of the form https://name-of-our-bucket.ap-south-1.amazonaws.com/OBJECTID

When we release this dataset, we want users to access the files via a custom file server (which acts as a proxy for the S3 bucket), rather than sharing our S3 URL with them directly. This applies only to docs with mediaType image or video, not to text.

The URL for our file server will take the form https://fs.tattle.co.in/service/factcheck/file/OBJECTID (Note that this endpoint is not functional as of now but will be soon.)
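One possible way to do the rewrite is sketched below. It assumes that OBJECTID is the last path segment of the S3 URL, as the URL form given above suggests. The function names `to_file_server_url` and `rewrite_story` are hypothetical and not part of this repo.

```python
import copy
from urllib.parse import urlparse

FILE_SERVER_BASE = "https://fs.tattle.co.in/service/factcheck/file/"
MEDIA_TYPES_TO_REWRITE = {"image", "video"}  # text docs are left untouched


def to_file_server_url(s3_url):
    """Map an S3 URL to the file server URL for the same object.

    Assumption: OBJECTID is the final path segment of the S3 URL.
    """
    object_id = urlparse(s3_url).path.rstrip("/").rsplit("/", 1)[-1]
    if not object_id:
        raise ValueError(f"no object id found in {s3_url!r}")
    return FILE_SERVER_BASE + object_id


def rewrite_story(story):
    """Return a copy of the story with s3URL rewritten on image/video docs."""
    rewritten = copy.deepcopy(story)
    for doc in rewritten.get("docs", []):
        # Skip text docs and docs whose s3URL is null.
        if doc.get("mediaType") in MEDIA_TYPES_TO_REWRITE and doc.get("s3URL"):
            doc["s3URL"] = to_file_server_url(doc["s3URL"])
    return rewritten
```

Working on a deep copy keeps the records read from the database unchanged, which makes it easy to compare the original and rewritten URLs while checking the output.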

The objective is to produce a JSON file of the 2019 fact check data in which each S3 URL is replaced by the corresponding file server URL.
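A rough sketch of the export step follows. It makes several assumptions: "2019" is decided by the date_updated field (the issue does not say which date field to use), dates always look like "September 01, 2019", and `stories` is any iterable of story dicts, for example the result of a database query. The names `is_from_year` and `export_stories` are hypothetical.

```python
import json
from datetime import datetime

# Matches strings such as "September 01, 2019". Month-name parsing with %B is
# locale-dependent; this assumes an English or C locale.
DATE_FORMAT = "%B %d, %Y"


def is_from_year(story, year=2019, field="date_updated"):
    """Return True if the story's date field parses to the given year."""
    raw = story.get(field)
    if not raw:
        return False
    try:
        return datetime.strptime(raw, DATE_FORMAT).year == year
    except ValueError:
        # Unparseable dates are treated as not matching.
        return False


def export_stories(stories, path, year=2019):
    """Write the stories from the given year to a JSON file; return the count."""
    selected = [s for s in stories if is_from_year(s, year)]
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps Hindi text readable in the output file;
        # default=str is a fallback for values json cannot serialize natively.
        json.dump(selected, fh, ensure_ascii=False, indent=2, default=str)
    return len(selected)
```

Usage would be something like `export_stories(stories, "factcheck_2019.json")`, where the output filename is only an example.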

Please collect the database credentials from one of the Tattle admins (admin@tattle.co.in).