eklem / nrk-sapmi-crawler

Crawler for NRK Sapmi news bulletins that will be the basis for Sami stopword lists and an example search engine for content in Sami.

First crawl / returning crawl strategy

eklem opened this issue

Initial

  • Define URL to crawl
  • Grab TITLE or H1 title of page
  • Check if content from previous crawl exists
  • If content exists, set previousCrawl = true, find the newest article stub and set newestArticleCrawled to its timestamp (see the sketch below)
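
A minimal sketch of that check, assuming earlier crawls are stored as a JSON array of article stubs with a timestamp field. The file name, the stub structure and the function name are assumptions, not the actual format used by the crawler:

import { promises as fsPromises } from 'fs';

async function checkPreviousCrawl (file) {
  let previousCrawl = false;
  let newestArticleCrawled = 0;
  try {
    // assumed format: a JSON array of article stubs, each with a timestamp
    const stubs = JSON.parse(await fsPromises.readFile(file, 'utf8'));
    if (stubs.length > 0) {
      previousCrawl = true;
      // the newest article stub is the one with the highest timestamp
      newestArticleCrawled = Math.max(...stubs.map(stub => stub.timestamp));
    }
  } catch (err) {
    // no file yet (or unreadable JSON) => treat it as a first crawl
  }
  return { previousCrawl, newestArticleCrawled };
}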

Repeat

  • Crawl content
  • Add content to file
  • Click the "+ Vis flere" ("show more") button
  • Repeat until an articleStub timestamp = newestArticleCrawled || no more content (button not clickable any more or doesn't exist anymore).
  • Keep everything in memory until the end.
  • Do some try/catch around getting content/clicking the button, or count content length.
  • Maybe delete already-crawled content from the HTML? Then the page doesn't get too long, and it's easy to figure out what to crawl (everything that is there).
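
A rough sketch of that loop, assuming Puppeteer is used for the crawling. The "+ Vis flere" button selector, the .bulletin selector and the data-timestamp attribute are placeholders and would have to be replaced with whatever the live page actually uses:

import puppeteer from 'puppeteer';

const SHOW_MORE = 'button.show-more'; // placeholder selector for "+ Vis flere"

async function crawlBulletins (url, newestArticleCrawled) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  let stubs = [];
  let done = false;
  while (!done) {
    // read all article stubs currently in the DOM (placeholder extraction)
    stubs = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.bulletin')).map(el => ({
        title: el.querySelector('h2') ? el.querySelector('h2').textContent : '',
        timestamp: Number(el.getAttribute('data-timestamp')) || 0
      }))
    );

    // stop when we've reached content from the previous crawl
    if (newestArticleCrawled && stubs.some(stub => stub.timestamp <= newestArticleCrawled)) {
      break;
    }

    // click "+ Vis flere"; if the button is gone or not clickable, we're done
    try {
      await page.click(SHOW_MORE);
      // crude wait for new items; waiting for a selector/response would be better
      await new Promise(resolve => setTimeout(resolve, 1000));
    } catch (err) {
      done = true;
    }
  }

  await browser.close();
  return stubs; // kept in memory until the end, write to file afterwards
}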

For deleting, this works:

// class names go in one space-separated string
var myObj = document.getElementsByClassName("teaser widget brief emphasis-medium bulletin")
myObj[0].remove()

That only works partially. A list of the elements is still there on the page.
Tried this instead, but then the page just reloads all the elements. So if you delete all the elements the first time, it reloads 10 of them.

var myObj = document.getElementById("live")
myObj.children[0].remove()

More elaborate:
https://stackoverflow.com/questions/4777077/removing-elements-by-class-name

Maybe do a for-loop and remove all similar elements.
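
Something like this should do it. Since getElementsByClassName returns a live HTMLCollection, it's easiest to iterate backwards (or copy it into an array first) so that removing elements doesn't shift the indexes:

var crawled = document.getElementsByClassName("teaser widget brief emphasis-medium bulletin")
// iterate backwards because the collection is live and shrinks on remove()
for (var i = crawled.length - 1; i >= 0; i--) {
  crawled[i].remove()
}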

fs.promises with async/await

import fs from 'fs';
const fsPromises = fs.promises;

async function listDir() {
  try {
    // await here so a rejected promise is caught by the catch block below
    return await fsPromises.readdir('path/to/dir');
  } catch (err) {
    console.error('Error occurred while reading directory!', err);
  }
}
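
For reference, a hypothetical call site (the directory path above is just a placeholder):

async function main () {
  const files = await listDir();
  console.log(files);
}

main();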

So, it seems the limit on the news bulletin backlog is 1000 articles:

https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items

If you set the limit to more than 1000 you get an HTTP 400 Bad Request error.
So, this means several things:

  • Recurring crawling will be easier.
  • Get the JSON above and extract article IDs
  • Check which ones are new and crawl those. Check against already crawled data to be sure. Also add the ID to the crawled document objects for easy comparison (see the sketch after this list).
  • You can set up tests for news items still being crawlable (similar enough structure) by looking at an old document or two.
  • You can set up a test for ensuring that the JSON list of IDs still works okay.
  • A test can also tell us when it's time to re-crawl the JSON list and new items, if we don't just do it every month or every second month.
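
A hedged sketch of that ID comparison, using node-fetch. The items/id field names in the JSON response are assumptions and have to be checked against the real response:

import fetch from 'node-fetch';

const listUrl =
  'https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items';

// crawledDocs: documents from earlier crawls, each assumed to carry an id field
async function findNewArticleIds (crawledDocs) {
  const response = await fetch(listUrl);
  const json = await response.json();

  // assumption: the response has an items array where every item has an id
  const allIds = json.items.map(item => item.id);
  const knownIds = new Set(crawledDocs.map(doc => doc.id));

  // only the IDs we haven't crawled yet
  return allIds.filter(id => !knownIds.has(id));
}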

For the quality of stopword-training:

  • Possibly, we won't have enough data to create a high-quality stopword list from the start.
  • Because of this, we should continue to crawl new data as it comes in and re-train the stopword lists.
  • And continue to manually update the red-listed words.

So the project will then be a more ongoing thing, unless we ask permission to get more content manually. The risk of asking is getting a no, and possibly also a rewritten robots.txt disallowing us to crawl.

A crawled flag in the list of IDs is set to false and flipped to true after a crawl. There should also be the possibility to set a crawl-from-scratch flag to true (sketched below).
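
A minimal sketch of what that could look like; the shape of the list is an assumption, and the IDs below are placeholders, not real article IDs:

// hypothetical ID list kept between crawls
const idList = [
  { id: 'article-id-1', crawled: true },
  { id: 'article-id-2', crawled: false }
];

const crawlFromScratch = false; // set to true to ignore the flags and re-crawl everything

// pick which articles to fetch on this run
const toCrawl = idList.filter(entry => crawlFromScratch || !entry.crawled);

// after a successful crawl, flip the flags
toCrawl.forEach(entry => { entry.crawled = true });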