First crawl / returning crawl strategy
eklem opened this issue
Initial
- Define the URL to crawl
- Grab the TITLE or H1 of the page
- Check if content from a previous crawl exists
- If content exists, set `previousCrawl = true`, find the newest article stub and set `newestArticleCrawled` to its timestamp
Repeat
- Crawl content
- Add content to file
- Click the `+ Vis flere` ("Show more") button. Repeat until an articleStub timestamp equals `newestArticleCrawled`, or there's no more content (the button isn't clickable or doesn't exist anymore).
- Keep everything in memory until the end.
- Do some try/catch on getting content/clicking button or count content length.
- Maybe delete already-crawled content from the HTML? Then the page doesn't get too long, and it's easy to figure out what to crawl (everything that's there).
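The repeat loop above could be sketched roughly like this. All the helper names (`crawlVisibleArticles`, `oldestVisibleTimestamp`, `clickVisFlereButton`) are assumptions, passed in as callbacks so the loop itself stays testable:

```javascript
// Sketch of the repeat loop. The helpers object is an assumption: it wraps
// whatever browser automation is used to read stubs and click the button.
async function crawlUntilKnown (page, newestArticleCrawled, helpers) {
  const collected = [] // keep everything in memory until the end
  while (true) {
    collected.push(...helpers.crawlVisibleArticles(page))
    const oldest = helpers.oldestVisibleTimestamp(page)
    if (oldest <= newestArticleCrawled) break // reached already-crawled stubs
    const clicked = await helpers.clickVisFlereButton(page)
    if (!clicked) break // "+ Vis flere" gone or not clickable
  }
  return collected
}
```

On the real page the already-visible stubs stay in the DOM after a click, so either crawl only the newly added stubs or dedupe by ID afterwards.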
For deleting, this works (note that `getElementsByClassName` only accepts a single class-name string, so the extra class names originally passed as arguments were silently ignored):

```js
var myObj = document.getElementsByClassName('teaser')
myObj[0].remove()
```
That only works partially: a list of the elements is still there.
Tried this, but then the page just reloads all the elements. So if you delete all the elements, it reloads 10 the first time:
```js
var myObj = document.getElementById('live')
myObj.children[0].remove()
```
More elaborate:
https://stackoverflow.com/questions/4777077/removing-elements-by-class-name
Maybe do a for-loop and remove all similar elements.
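The for-loop idea can be sketched like this, with one caveat: `getElementsByClassName` returns a *live* HTMLCollection, so a normal forward for-loop skips every other element as the list shrinks under it. Removing index 0 until the collection is empty avoids that (the function name is an assumption, not project code):

```javascript
// Remove every element matching the class name. The collection returned by
// getElementsByClassName is live, so we always remove index 0 instead of
// iterating forward over a shrinking list.
function removeAllByClassName (doc, className) {
  const elements = doc.getElementsByClassName(className)
  while (elements.length > 0) {
    elements[0].remove()
  }
}
```

In the browser console this would be `removeAllByClassName(document, 'teaser')`. As noted above, the page may just lazy-load elements back in afterwards.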
`fs.promises` with async/await:
```js
import fs from 'fs'
const fsPromises = fs.promises

async function listDir () {
  try {
    // await here, otherwise the catch block never sees a rejected promise
    return await fsPromises.readdir('path/to/dir')
  } catch (err) {
    console.error('Error occurred while reading directory!', err)
  }
}
```
So, seems the limit on news bulletin backlog is 1000 articles:
https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items
If you set the limit to more than 1000 you get an HTTP 400 Bad Request error.
So, this means several things:
- Recurring crawling will be easier.
- Get the JSON above and extract article IDs
- Check which ones are new and crawl them. Check against the crawled data to be sure. Also add the ID to the crawled document objects for easy comparison.
- You can set up tests for news items still being crawlable (similar enough structure) by looking at an old document or two.
- You can set up a test for ensuring that the JSON list of IDs still works okay.
- A test can also tell us when it's time to re-crawl the JSON list and new items, if we don't just do it every month (or every second).
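The recurring crawl check above could look something like this. The URL is the one from the issue; the JSON field names (`items`, `id`) are assumptions about the serum API response, not verified:

```javascript
// Sketch: fetch the bulletin index and find IDs we haven't crawled yet.
const LIST_URL =
  'https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items'

// Pure diff: which IDs from the index are not among the crawled docs?
function extractNewIds (items, crawledDocs) {
  const crawledIds = new Set(crawledDocs.map((doc) => doc.id))
  return items.map((item) => item.id).filter((id) => !crawledIds.has(id))
}

async function findNewArticles (crawledDocs) {
  const response = await fetch(LIST_URL) // limit > 1000 gives HTTP 400
  const { items } = await response.json()
  return extractNewIds(items, crawledDocs)
}
```

Keeping the diff in its own pure function makes the "JSON list still works" test easy to write against fixture data.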
For the quality of stopword-training:
- Possibly, we won't have enough data to create a high-quality stopword list from the start.
- Because of this, we should continue to crawl new data as it comes in and re-train the stopword lists.
- And continue to manually update the red-listed words.
So the project will be more of an ongoing thing, unless we ask permission to get more content manually. The risk of asking is getting a no, and also a rewritten robots.txt disallowing us to crawl.
A `crawled` flag in the list of IDs, set to false. It will be set to true after a crawl. There should also be a possibility to set a crawl-from-scratch flag to true.
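The flag bookkeeping could be sketched like this. The object shape `{ id, crawled }` and both function names are assumptions:

```javascript
// Mark a single ID as crawled in the ID list.
function markCrawled (idList, id) {
  const entry = idList.find((item) => item.id === id)
  if (entry) entry.crawled = true
}

// Flip every flag back to false when the crawl-from-scratch flag is set.
function resetForFullCrawl (idList) {
  for (const entry of idList) entry.crawled = false
}
```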