HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community

Home Page: https://almanac.httparchive.org

Fetch well-known URLs

nrllh opened this issue · comments

I think we should discuss fetching well-known URLs (e.g., robots.txt, ads.txt, security.txt, etc.) because I see two problems here:

  1. Even if one of these URLs is being fetched by our crawler, not everyone knows about it. (I noticed that last year security.txt was used by the SEO chapter; we could also have used it for our Security chapter, but we didn't because we (or I 😊) didn't know about it.)

  2. Fetching these URLs may take more time, but I think it's generally worthwhile to analyze them and enrich our analysis. We could fetch these files once a year for the Almanac.

My suggestion is to collect interesting URLs in this issue so all contributors know which additional URLs are being fetched.

I found a list, but it doesn't include all URLs: https://en.wikipedia.org/wiki/List_of_/.well-known/_services_offered_by_webservers - for example, manifest.json and hackers.txt are missing from that list.

So what do you think?

For the security chapter, security.txt, hackers.txt, and robots.txt would be interesting.

| name | path |
| --- | --- |
| robots.txt | /robots.txt |
| security.txt | /.well-known/security.txt |
| hackers.txt | /hackers.txt |

cc @SaptakS @tomvangoethem

I think these URLs will be really helpful for more interesting analyses. Totally support this.

@nrllh - Last year, I tried to do this for the eCommerce chapter by looking at the following well-known URLs:

.well-known/assetlinks.json
.well-known/apple-app-site-association

My commit from last year is in custom_metrics/ecommerce.js (https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/custom_metrics/ecommerce.js)

I was trying to find out how many eCommerce sites have an Android/iOS app and use these standards to declare the app association. I didn't end up including any insights in the eCommerce chapter because I ran out of time, and for some platforms I was getting an empty assetlinks.json file; to get something meaningful for the chapter, I would have needed to further parse the content of the file, since just detecting the presence of the file was not enough.
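
For anyone picking this up, here is a rough sketch of that kind of check (not the actual ecommerce.js code; it assumes fetch is available in the custom metric context and returns a small summary instead of the raw file):

```js
// Illustrative sketch (not the actual ecommerce.js custom metric): check
// whether a site declares an Android app association via assetlinks.json
// and return a small summary rather than the raw file.
async function checkAssetLinks(origin) {
  try {
    const response = await fetch(`${origin}/.well-known/assetlinks.json`);
    if (!response.ok) {
      return { present: false, status: response.status };
    }
    // assetlinks.json is a JSON array of statements; an empty array means the
    // file exists but declares no app association.
    const statements = await response.json();
    const packages = Array.isArray(statements)
      ? statements.map(s => s.target && s.target.package_name).filter(Boolean)
      : [];
    return { present: true, status: response.status, androidPackages: packages.length };
  } catch (e) {
    // Covers network errors, timeouts, and invalid JSON.
    return { present: false, error: String(e) };
  }
}

// e.g. checkAssetLinks(location.origin).then(summary => console.log(summary));
```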

Tagging the eCommerce 2021 team in case they want to pick this up - @bobbyshaw @rrajiv

One issue with WPT is that fetching additional URLs in the custom metric does not necessarily make their requests/responses available in the network log that makes up the requests and response_bodies tables. So we would either have to dump the response bodies in the output of the custom metric or only output some summary statistics about the file. The latter is preferable because we don't know how long some of these files will be and we don't want to bloat the HAR file (made available in the pages tables). @nrllh could you clarify which approach you're proposing?

cc @pmeenan in case there's a way to make custom metric requests visible in the requests/bodies.

I actually quite like the fact that it doesn't appear in requests/bodies. It means you're not polluting the real page load data. Otherwise, the number of requests goes up by one for each additional URL fetched, and the number of 404s could skyrocket, since most sites won't have a lot of these URLs.

I do agree we should ideally do the processing in the custom metric, though, and only save summary results back rather than the full file.

Yeah, agree with @tunetheweb - they aren't part of the page load, they shouldn't be in the main requests data.

A few things to watch out for:

  • Make sure to have aggressive timeouts on the fetches so we don't stall the crawl if a lot of sites let the requests time out.
  • If possible, fetch everything async, do the processing, and then await all so they can run in parallel (preferably with all of the fetches in a single custom metric, or a small number of them, because the custom metrics are serialized); a sketch of this pattern follows the list.
  • Watch out for storing/processing non-200 responses, in case a friendly 404 page is returned.
  • If the responses may be big, storing the full response will bloat the page data table/queries.
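
A minimal sketch of that pattern (hypothetical names; it assumes fetch and AbortController are available in the custom metric context, and the timeout and body-size cap are placeholder values):

```js
// Hypothetical sketch of the pattern above: fire all well-known fetches in
// parallel, give each an aggressive timeout, skip bodies for non-200
// responses (friendly 404 pages), and cap how much of the body we keep.
const WELL_KNOWN_PATHS = ['/robots.txt', '/.well-known/security.txt', '/hackers.txt'];
const TIMEOUT_MS = 5000;       // assumed per-request budget
const MAX_BODY_LENGTH = 10000; // assumed cap to avoid bloating the page data

function fetchWithTimeout(url) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), TIMEOUT_MS);
  return fetch(url, { signal: controller.signal })
    .finally(() => clearTimeout(timer));
}

async function collectWellKnown(origin) {
  const results = {};
  // All fetches start immediately and are awaited together.
  await Promise.all(WELL_KNOWN_PATHS.map(async path => {
    try {
      const response = await fetchWithTimeout(origin + path);
      if (!response.ok) {
        // Don't store bodies for non-200 responses.
        results[path] = { found: false, status: response.status };
        return;
      }
      const body = await response.text();
      results[path] = { found: true, status: response.status, body: body.slice(0, MAX_BODY_LENGTH) };
    } catch (e) {
      results[path] = { found: false, error: String(e) };
    }
  }));
  return results;
}
```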

On the last point, we could add more processing to the HARs if we want to store response bodies but prune them out of the page data and into the bodies tables. We could use a well-known metric name that includes the file name, something like "response-body-security.txt", and then the HAR processing could prune out anything that starts with response-body.
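
To illustrate the pruning idea (this is not the actual HAR-processing code; it just assumes the custom metrics arrive as a flat name/value object):

```js
// Illustrative sketch: split custom metrics whose names start with
// "response-body-" out of the page data so they can be routed toward a
// bodies table instead.
function splitResponseBodyMetrics(customMetrics) {
  const pageMetrics = {};
  const responseBodies = {};
  for (const [name, value] of Object.entries(customMetrics)) {
    if (name.startsWith('response-body-')) {
      responseBodies[name.slice('response-body-'.length)] = value;
    } else {
      pageMetrics[name] = value;
    }
  }
  return { pageMetrics, responseBodies };
}
```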

There is a lot to explore in these files. In robots.txt, it'll be interesting to analyze (for the security chapter) potential exploitation vectors (e.g., secret login links). In security.txt, we can check which reporting methods are most commonly used.

That's why I think providing the response body of these files would be better than providing some statistics; otherwise it could become a limitation for future analyses. Of course, only if it doesn't cause too much overhead.
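
As an illustration of the security.txt analysis mentioned above, whether it runs inside the custom metric or later over stored bodies, a minimal sketch that summarizes the declared reporting contacts:

```js
// Illustrative sketch: reduce a security.txt body to a few summary stats
// about the reporting methods it declares, instead of keeping the full file.
function summarizeSecurityTxt(body) {
  const contacts = body
    .split('\n')
    .map(line => line.trim())
    .filter(line => /^contact:/i.test(line))
    .map(line => line.slice(line.indexOf(':') + 1).trim());
  return {
    contactCount: contacts.length,
    hasEmail: contacts.some(c => c.startsWith('mailto:') || c.includes('@')),
    hasUrl: contacts.some(c => /^https?:\/\//.test(c)),
  };
}
```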

To be completely sure: currently the consensus is for analysts to use custom metrics to collect information on the content of .well-known URLs, right? Or is this information going to be included in the crawl dataset, such that it will be available by query?

Just want to avoid redundant data/work :)

@GJFR yes custom metrics are the preferred approach for this but the window to get it in before the July crawl is closing quickly. Per @pmeenan's suggestion, we should combine any custom metrics that rely on external fetches so that they can be parallelized and share the same timeout logic. So whoever implements this should extend ecommerce.js and rename it to something more generic like well_known.js or external_resources.js. I would still discourage returning the entire contents of the resource and opt for more specific/aggregatable summary stats instead.

I've extended and renamed ecommerce.js to well_known.js in HTTPArchive/legacy.httparchive.org@2a441a0.

It should be easy to extend for other well-known URLs and external sources by just adding parseResponse calls, passing the desired URL and, if required, a parser function.
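
Purely as a hypothetical example of what such an extension could look like (the exact parseResponse signature and what the parser receives are defined in the linked commit, not here):

```js
// Hypothetical additions, based only on the description above: a URL plus an
// optional parser function. Check well_known.js in the linked commit for the
// real signature and what the parser is given.
parseResponse('/.well-known/change-password'); // presence/status check only
parseResponse('/robots.txt', body => ({
  // Assuming the parser receives the body text: record a couple of summary
  // stats instead of the full file.
  disallowCount: (body.match(/^disallow:/gim) || []).length,
  hasSitemap: /^sitemap:/im.test(body),
}));
```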