Bucketing is only partially working.

Question

krtek4 opened this issue 9 years ago · comments

The tester part of the bucket system works great, but the crawling part does not fully work.

The crawler fetches (and de facto adds) URLs in the order they appear in the document. This means that if the first thousand URLs of a document all go to the same bucket, we will fetch those 1000 URLs before filling any other bucket.

This is cumbersome when you want to quickly get a varied sample of URLs tested.

I can imagine two solutions:

1. Change the fetching order of the crawler (maybe not feasible).
2. Do not use the 'fetchcomplete' event, but 'discoverycomplete' or 'queueadd', which are fired earlier in the process (no URL has been fetched at that point). That way, all URLs of the first page are enqueued without having to wait for them to be fetched.
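
To illustrate the second idea, here is a minimal sketch of filling buckets at enqueue time rather than at fetch time. It makes several assumptions: the crawler is replaced by a plain Node.js EventEmitter stand-in, the 'queueadd' listener only reads a `url` property from its first argument, and `bucketKey` and `collectBuckets` are hypothetical helpers that bucket by first path segment. It is not the project's actual implementation.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: use the first path segment of the URL as the bucket key.
function bucketKey(url) {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  return segments.length > 0 ? segments[0] : '(root)';
}

// Hypothetical helper: fill buckets as soon as URLs are enqueued,
// without waiting for any of them to be fetched.
function collectBuckets(emitter, maxPerBucket) {
  const buckets = new Map();
  emitter.on('queueadd', (queueItem) => {
    const key = bucketKey(queueItem.url);
    if (!buckets.has(key)) {
      buckets.set(key, []);
    }
    const bucket = buckets.get(key);
    // Ignore URLs for buckets that are already full.
    if (bucket.length < maxPerBucket) {
      bucket.push(queueItem.url);
    }
  });
  return buckets;
}

// Stand-in for the crawler (assumption): it only emits 'queueadd' events.
const fakeCrawler = new EventEmitter();
const buckets = collectBuckets(fakeCrawler, 2);

const discovered = [
  'http://example.com/blog/1',
  'http://example.com/blog/2',
  'http://example.com/blog/3',
  'http://example.com/shop/1',
];
for (const url of discovered) {
  fakeCrawler.emit('queueadd', { url });
}

console.log(buckets);
```

With this approach every bucket seen on the first page gets a candidate URL immediately, even if one bucket dominates the start of the document.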

Gilles Crettenand · Answer 1 · Fri Jan 29 2016 17:16:52 GMT+0800 (China Standard Time)

Small hint: it might be possible to move our bucket implementation so that it replaces Crawler.prototype.queueURL, which would let us decide the order in which SimpleCrawler fetches URLs.

Then, once a URL is fetched and validated, we can pass it directly to pa11y because it will be in the right "order" already.
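
The ordering logic such a replacement could rely on might look like the sketch below. It only shows one possible policy (round-robin across buckets) and does not touch SimpleCrawler itself: `firstPathSegment` and `interleaveByBucket` are hypothetical names, and bucketing by first path segment is an assumption made for the example.

```javascript
// Hypothetical helper: use the first path segment of the URL as the bucket key.
function firstPathSegment(url) {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  return segments.length > 0 ? segments[0] : '(root)';
}

// Hypothetical helper: reorder URLs so that each bucket gets one URL per round
// instead of one bucket monopolizing the start of the queue.
function interleaveByBucket(urls, keyFn) {
  // Group URLs by bucket, keeping discovery order inside each bucket.
  const groups = new Map();
  for (const url of urls) {
    const key = keyFn(url);
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(url);
  }

  // Take the n-th URL of every bucket on round n until all URLs are placed.
  const ordered = [];
  const queues = [...groups.values()];
  for (let round = 0; ordered.length < urls.length; round++) {
    for (const queue of queues) {
      if (round < queue.length) {
        ordered.push(queue[round]);
      }
    }
  }
  return ordered;
}

const fetchOrder = interleaveByBucket(
  [
    'http://example.com/a/1',
    'http://example.com/a/2',
    'http://example.com/a/3',
    'http://example.com/b/1',
    'http://example.com/c/1',
  ],
  firstPathSegment
);

console.log(fetchOrder);
```

Here the three `/a/` URLs no longer block `/b/1` and `/c/1`: every bucket is represented after the first round, so fetched and validated URLs could be handed to pa11y as they arrive.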