eklem / nrk-sapmi-crawler

Crawler for NRK Sapmi news bulletins that will be the basis for Sami stopword lists and an example search engine for content in Sami.

First crawl / returning crawl strategy

eklem opened this issue

Initial

  • Define URL to crawl
  • Grab TITLE or H1 title of page
  • Check if content from previous crawl exists
  • If content exists, set previousCrawl = true, find the newest article stub and set newestArticleCrawled to its timestamp (see the sketch below)
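
A minimal sketch of that check, assuming earlier crawls are stored as a JSON array of article stubs with a timestamp field. The file name, the stub structure and the function name are assumptions, not the actual format used by the crawler:

import { promises as fsPromises } from 'fs';

async function checkPreviousCrawl (file) {
  let previousCrawl = false;
  let newestArticleCrawled = 0;
  try {
    // assumed format: a JSON array of article stubs, each with a timestamp
    const stubs = JSON.parse(await fsPromises.readFile(file, 'utf8'));
    if (stubs.length > 0) {
      previousCrawl = true;
      // the newest article stub is the one with the highest timestamp
      newestArticleCrawled = Math.max(...stubs.map(stub => stub.timestamp));
    }
  } catch (err) {
    // no file yet (or unreadable JSON) => treat it as a first crawl
  }
  return { previousCrawl, newestArticleCrawled };
}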

Repeat

  • Crawl content
  • Add content to file
  • Click the "+ Vis flere" ("show more") button
  • Repeat until an articleStub timestamp = newestArticleCrawled || no more content (button not clickable any more or doesn't exist anymore).
  • Keep everything in memory until the end.
  • Do some try/catch around getting content/clicking the button, or count content length.
  • Maybe delete already-crawled content from the HTML? Then the page doesn't get too long, and it's easy to figure out what to crawl (everything that is there).
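
A rough sketch of that loop, assuming Puppeteer is used for the crawling. The "+ Vis flere" button selector, the .bulletin selector and the data-timestamp attribute are placeholders and would have to be replaced with whatever the live page actually uses:

import puppeteer from 'puppeteer';

const SHOW_MORE = 'button.show-more'; // placeholder selector for "+ Vis flere"

async function crawlBulletins (url, newestArticleCrawled) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  let stubs = [];
  let done = false;
  while (!done) {
    // read all article stubs currently in the DOM (placeholder extraction)
    stubs = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.bulletin')).map(el => ({
        title: el.querySelector('h2') ? el.querySelector('h2').textContent : '',
        timestamp: Number(el.getAttribute('data-timestamp')) || 0
      }))
    );

    // stop when we've reached content from the previous crawl
    if (newestArticleCrawled && stubs.some(stub => stub.timestamp <= newestArticleCrawled)) {
      break;
    }

    // click "+ Vis flere"; if the button is gone or not clickable, we're done
    try {
      await page.click(SHOW_MORE);
      // crude wait for new items; waiting for a selector/response would be better
      await new Promise(resolve => setTimeout(resolve, 1000));
    } catch (err) {
      done = true;
    }
  }

  await browser.close();
  return stubs; // kept in memory until the end, write to file afterwards
}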

For deleting, this works:

// class names go in one space-separated string
var myObj = document.getElementsByClassName("teaser widget brief emphasis-medium bulletin")
myObj[0].remove()

That only works partially. A list of the elements is still there on the page.
Tried this instead, but then the page just reloads all the elements. So if you delete all the elements the first time, it reloads 10 of them.

var myObj = document.getElementById("live")
myObj.children[0].remove()

More elaborate:
https://stackoverflow.com/questions/4777077/removing-elements-by-class-name

Maybe do a for-loop and remove all similar elements.
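
Something like this should do it. Since getElementsByClassName returns a live HTMLCollection, it's easiest to iterate backwards (or copy it into an array first) so that removing elements doesn't shift the indexes:

var crawled = document.getElementsByClassName("teaser widget brief emphasis-medium bulletin")
// iterate backwards because the collection is live and shrinks on remove()
for (var i = crawled.length - 1; i >= 0; i--) {
  crawled[i].remove()
}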

fs.promises with async/await

import fs from 'fs';
const fsPromises = fs.promises;

async function listDir() {
  try {
    // await here so a rejected promise is caught by the catch block below
    return await fsPromises.readdir('path/to/dir');
  } catch (err) {
    console.error('Error occurred while reading directory!', err);
  }
}
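
For reference, a hypothetical call site (the directory path above is just a placeholder):

async function main () {
  const files = await listDir();
  console.log(files);
}

main();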

So, it seems the limit on the news bulletin backlog is 1000 articles:

https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items

If you set the limit to more than 1000 you get an HTTP 400 Bad Request error.
So, this means several things:

  • Recurring crawling will be easier.
  • Get the JSON above and extract article IDs
  • Check which ones are new and crawl those. Check against already crawled data to be sure. Also add the ID to the crawled document objects for easy comparison (see the sketch after this list).
  • You can set up tests for news items still being crawlable (similar enough structure) by looking at an old document or two.
  • You can set up a test for ensuring that the JSON list of IDs still works okay.
  • A test can also tell us when it's time to re-crawl the JSON list and new items, if we don't just do it every month or every second month.
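
A hedged sketch of that ID comparison, using node-fetch. The items/id field names in the JSON response are assumptions and have to be checked against the real response:

import fetch from 'node-fetch';

const listUrl =
  'https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items';

// crawledDocs: documents from earlier crawls, each assumed to carry an id field
async function findNewArticleIds (crawledDocs) {
  const response = await fetch(listUrl);
  const json = await response.json();

  // assumption: the response has an items array where every item has an id
  const allIds = json.items.map(item => item.id);
  const knownIds = new Set(crawledDocs.map(doc => doc.id));

  // only the IDs we haven't crawled yet
  return allIds.filter(id => !knownIds.has(id));
}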

For the quality of stopword-training:

  • Possibly, we won't have enough data to create a high-quality stopword list from the start.
  • Because of this, we should continue to crawl new data as it comes in and re-train the stopword lists.
  • And continue to manually update the red-listed words.

So the project will then be a more ongoing thing, unless we ask permission to get more content manually. The risk of asking is getting a no, and possibly also a rewritten robots.txt disallowing us to crawl.

A crawled flag in the list of IDs is set to false and flipped to true after a crawl. There should also be the possibility to set a crawl-from-scratch flag to true (sketched below).
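
A minimal sketch of what that could look like; the shape of the list is an assumption, and the IDs below are placeholders, not real article IDs:

// hypothetical ID list kept between crawls
const idList = [
  { id: 'article-id-1', crawled: true },
  { id: 'article-id-2', crawled: false }
];

const crawlFromScratch = false; // set to true to ignore the flags and re-crawl everything

// pick which articles to fetch on this run
const toCrawl = idList.filter(entry => crawlFromScratch || !entry.crawled);

// after a successful crawl, flip the flags
toCrawl.forEach(entry => { entry.crawled = true });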