liip / TheA11yMachine

The A11y Machine is an automated accessibility testing tool which crawls and tests pages of any web application to produce detailed reports.

Home Page: https://www.liip.ch/


Maximum URLs to compute is different even with same configuration.

sparshi opened this issue

I randomly picked a website to understand how crawling works in a11ym. I observed that the number of URLs crawled was different on every run.
$ a11ym https://www.drupal.org/

First run: 128/128 URLs were computed.
Second run: 123/128 URLs were computed.
Third run: 105/128 URLs were computed.

Note: the values for "maxURLs" and "maxDepth" were left unaltered for every execution, taking their default values of 128 and 3 respectively.

The algorithm which computes URLs is unordered and asynchronous.

Let me explain how it works.

  • A list of URLs is given and enqueued (in queue A),
  • Each URL is opened and parsed, and the URLs present in the document are enqueued (in queue A),
  • URLs in queue A are dequeued one at a time… or almost. Actually, you can specify a number of workers: it represents the maximum number of URLs that can be dequeued and computed at the same time,
  • Queue A is testQueue in the code. Each URL in this queue aims at being tested by the tester module. So basically, everything inside this queue is a test candidate (see the sketch after this list).
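
As a rough illustration, queue A behaves like a queue from the async library with a fixed concurrency. The sketch below is not the actual a11ym code; the runTests worker is a hypothetical stub standing in for the tester module.

// Minimal sketch of queue A (testQueue), assuming the async library.
// URLs are enqueued as test candidates; at most `workers` of them are
// tested at the same time. `runTests` is a hypothetical stub.
const async = require('async');

const workers = 4; // maximum number of URLs tested concurrently

const testQueue = async.queue(function runTests(url, done) {
    console.log('testing', url);
    // ... run the accessibility tests against `url` here ...
    done();
}, workers);

// Seed the queue with the start URL.
testQueue.push('https://www.drupal.org/');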

Now, how does the crawler work exactly? We said that each URL is opened and parsed, and that the URLs present in the document are enqueued. Great. But how are the URLs enqueued? In which order? To provide more significant results as fast as possible, URLs are not put in queue A directly. Why? Imagine a menu called Products. The first items in the menu could lead to the “same” page, i.e. products of the same kind, so probably with the same HTML (modulo product information and details). Our goal is to check very different pages as fast as possible.

So when a URL is opened and parsed, all the URLs it contains are enqueued in different sub-queues. To simplify, say we have several queues: queue B_i, where i is the name of the queue. We call this name a bucket. For instance, given 3 URLs: /foo/bar, /foo/baz, and /qux/hello, you will have: queue B_foo = [/foo/bar, /foo/baz], and queue B_qux = [/qux/hello].
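
To make the bucket idea concrete, here is a hypothetical way to derive the bucket name from the first path segment of a URL; the real crawler may compute it differently.

const url = require('url');

// Hypothetical bucket naming: use the first path segment of the URL.
// /foo/bar and /foo/baz share the bucket "foo"; /qux/hello gets "qux".
function bucketOf(href) {
    const path = url.parse(href).pathname || '/'; // e.g. "/foo/bar"
    return path.split('/')[1] || '';              // e.g. "foo"
}

console.log(bucketOf('https://example.org/foo/bar'));   // "foo"
console.log(bucketOf('https://example.org/foo/baz'));   // "foo"
console.log(bucketOf('https://example.org/qux/hello')); // "qux"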

URLs in the queues B_i are all dequeued asynchronously. When a URL is dequeued, it is opened, parsed, and the new URLs are extracted and pushed into queue B_i. Per queue B_i, only one URL is computed at a time. It means that /foo/baz is computed only once /foo/bar has been totally computed.

When a URL is opened, it is automatically added to queue A.

Consequently, queue A is likely to receive not /foo/bar, /foo/baz, and /qux/hello, but more likely something like /foo/bar, /qux/hello, and /foo/baz. Or maybe /qux/hello, /foo/bar, /foo/baz. Each URL can add more URLs to its queue B_i, and more queues B_j can be created (for instance queue B_about for /about/us). Queue B_about has no priority over queue B_foo or queue B_qux, but while the latter are being computed, queue B_about has time to add its first item into queue A.
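
Putting the pieces together, here is a sketch of the per-bucket behaviour under the same assumptions as above; fetchAndExtract is a hypothetical stub for “open, parse, extract links”, and none of these names are the actual a11ym internals.

const async = require('async');
const url = require('url');

// Queue A and the bucket naming helper, as in the sketches above.
const testQueue = async.queue(function (u, done) { console.log('testing', u); done(); }, 4);
function bucketOf(href) { return (url.parse(href).pathname || '/').split('/')[1] || ''; }

// Hypothetical stub: open `href`, parse the HTML, call back with the extracted links.
function fetchAndExtract(href, callback) { callback([]); }

// One queue B_i per bucket, each with concurrency 1: URLs of the same
// bucket are crawled strictly one after the other, while different
// buckets progress in parallel and interleave their pushes into queue A.
const buckets = {};

function enqueueForCrawling(href) {
    const name = bucketOf(href);
    if (!buckets[name]) {
        buckets[name] = async.queue(crawlOne, 1);
    }
    buckets[name].push(href);
}

function crawlOne(href, done) {
    testQueue.push(href);                        // an opened URL becomes a test candidate (queue A)
    fetchAndExtract(href, function (foundUrls) {
        foundUrls.forEach(enqueueForCrawling);   // new URLs go back into their buckets
        done();
    });
}

enqueueForCrawling('https://example.org/foo/bar');
enqueueForCrawling('https://example.org/foo/baz');
enqueueForCrawling('https://example.org/qux/hello');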


So. Your results are not identical because the algorithm's behavior is non-deterministic. In the end, given enough time, all URLs must be computed, but not in the same order nor in the same amount of time, especially if the limit on the maximum number of URLs to compute is low.

In your example, it is not normal that the run stops at 123 or 105 URLs if 128 URLs have been found the first time. Maybe there is a special URL that makes the crawler stop. This special URL is never met during the first run, is met at position 123 the second time, and at position 105 the third time. That's my guess.

Do you have an idea about which URL it could be?

Hi Ivan,

Thank you so much for the detailed insights about the crawling algorithm, it really helped me build an understanding of it.

Do you have an idea about which URL it could be?
I ran a11ym again on https://www.drupal.org/ around five times. In just one run, crawling stopped before 128 (the maximum URL limit); in the other four runs the maximum number of URLs was reached (that might be because the culprit URL hadn't been encountered yet).

Culprit URL: https://www.drupal.org/promet-source
So, in order to re-confirm whether that URL is actually the culprit or not, I hit it again explicitly, and crawling continued from it.
$ a11ym https://www.drupal.org/promet-source

As per my understanding, if the disruption is caused by the above-mentioned URL, then the tool should not be able to crawl it again. I am not sure why the crawling stopped at this particular URL only. Please guide.

Also, can we track this disruption in some log?

Thanks & Regards,
Sparshi Dhiman

I am sorry but I cannot reproduce. I have been able to successfully generate a report for 128 URLs 3 consecutive times. What is your OS? What is your version of PhantomJS? What is your version of NodeJS?

OS Version : MacOS Sierra 10.12.1
PhantomJS Version : 2.1.1
NodeJS Version: v6.10.3
NPM Version: 3.10.10

Hi Ivan,

I agree that there is no regular pattern to the crawling issue, which in fact makes it difficult to track down the actual root cause.

I have attached a PDF file to show how the issue is occurring at my end.

I picked another small website “http://zomig.com/” to test.
This time I changed the maximum URL limit to 500, i.e. a11ym http://zomig.com/ -m 500, just to be sure the tool parses all the URLs of the website in every execution.

In total, I performed four executions; the observations, with supporting screenshots, are provided in the attached PDF. Please find the same and kindly provide your inputs.
A11ym-CrawlingQuery.pdf

Regards,
Sparshi Dhiman

Please try with NodeJS 7.x. This is likely to be NodeJS crashing because it is NodeJS.

Thank you Ivan, but upgrading NodeJS to 7.x didn't solve the issue.

I'm going to take a guess here. I've found that the check below causes the tool to quit if the queue isn't populated fast enough or at a consistent rate. So any time the queue drops to 0, no matter whether the maximum number of URLs has been reached, it will quit.

TheA11yMachine/lib/a11ym.js

Lines 220 to 222 in fb6acb8

if (0 === testQueue.running()) {
quit();
}

You can comment out that check; however, if the crawler doesn't reach your maximum number of URLs, the queue will remain open and you will have to quit manually.

Hope this helps.

UPDATE:
async has two listeners that you could probably leverage.

drain - a callback that is called when the last item from the queue has returned from the worker.
testQueue.drain = function() { quit(); };

empty - a callback that is called when the last item from the queue is given to a worker.
testQueue.empty = function() { quit(); };

Docs
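
For reference, here is a tiny standalone example (not a11ym code) of an async v2 queue showing when each callback fires:

const async = require('async');

const queue = async.queue(function (task, done) {
    setTimeout(function () {
        console.log('processed', task);
        done();
    }, 100);
}, 2);

// empty fires when the last queued item has been handed to a worker
// (workers may still be running at this point).
queue.empty = function () { console.log('queue is empty'); };

// drain fires when the last item has been fully processed,
// i.e. every worker has returned.
queue.drain = function () { console.log('queue is drained'); };

queue.push(['a', 'b', 'c']);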