HTTPArchive / bigquery

BigQuery import and processing pipelines

The 2016_11_15_chrome_requests_bodies table has incorrect URLs

foolip opened this issue

Example query:
https://bigquery.cloud.google.com:443/savedquery/762219082167:af96186e5c904f698b123b74869fd98f

For example, https://wiggio.com/images/facebook_home.png (from page http://www.wiggio.com/) shows up among the results, with a body containing "OpenTok.js 2.9.3 41dae66" near the beginning. This appears to be a mix-up, and it is far from the only one.

I don't know if the error is in the original HARs.

@igrigorik

Err... @pmeenan tracked down and fixed a related issue back in ~August. I wonder if we had a regression?

It is a problem in the uploaded HAR and in the HAR from the original test.

Looking into what caused the misalignment now. I did fix a similar issue before (as well as an issue with invalid UTF-8 strings), but the UI is showing the correct bodies (which is the earlier fix), so it might be something HAR-specific.

Made a few changes that will hopefully help, but I won't know for sure until I can look at a newer data set. One issue is that the HARs were always for the first run instead of the median run. That shouldn't have affected the body association; it just makes this harder to investigate, because we only archive the bodies for the median run, so I don't have the source data for some of the HARs.

I also switched the HAR export to use the newer ID-based association, but I'll need to verify that it worked as expected.
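
The thread does not show how the exporter actually pairs requests with bodies, so the following is only an illustrative sketch of why an ID-based association is generally more robust than a positional one. Every name here (`attach_by_position`, `attach_by_id`, the request IDs and URLs) is hypothetical and not taken from the real export code.

```python
# Hypothetical illustration only; not taken from the actual HAR exporter.

def attach_by_position(requests, bodies):
    """Pair the n-th request with the n-th body (fragile if a body is missing)."""
    return {req["url"]: body for req, body in zip(requests, bodies)}


def attach_by_id(requests, bodies_by_id):
    """Pair each request with the body stored under its own ID."""
    return {req["url"]: bodies_by_id.get(req["id"]) for req in requests}


requests = [
    {"id": "r1", "url": "/logo.png"},
    {"id": "r2", "url": "/app.js"},
    {"id": "r3", "url": "/style.css"},
]

# Assume the body for r1 was never captured.
bodies_by_id = {"r2": "function main() {}", "r3": "body { margin: 0 }"}

# Positional pairing shifts every body up by one slot.
by_position = attach_by_position(requests, list(bodies_by_id.values()))

# ID-based pairing leaves the missing body empty instead of borrowing another.
by_id = attach_by_id(requests, bodies_by_id)
```

In this made-up case the positional pairing hands the script body to `/logo.png`, which is the same kind of symptom reported above, while the ID-based pairing simply reports no body for that request.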

Keeping this query here to reuse later:

SELECT pages.wptid,bodies.page,bodies.url
FROM [httparchive:har.2016_11_15_chrome_requests_bodies] as bodies
JOIN EACH [httparchive:runs.2016_11_15_pages] as pages
ON bodies.page=pages.url
WHERE bodies.url LIKE '%.png'
AND bodies.body CONTAINS 'function'
AND NOT bodies.body CONTAINS 'DOCTYPE'
AND NOT bodies.body CONTAINS 'doctype'
AND NOT bodies.body CONTAINS 'html';

Filtering out the HTML eliminates a lot of cases where "friendly" not-found HTML responses were being sent for image requests. Requiring the URL to end with .png helps filter out things like .pngfix.js while still catching a good number of non-PNG responses. (I may join with the requests table to check the actual MIME type, but until then this works well enough.)
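
For spot-checking individual HAR entries outside BigQuery, the same heuristic can be sketched in Python. This is a minimal sketch that assumes the query's string matching is case-sensitive and that the body is already available as text; the function name `looks_like_mismatched_png` is made up for illustration.

```python
def looks_like_mismatched_png(url: str, body: str) -> bool:
    """Apply the query's heuristic to a single (url, body) pair.

    Flags a URL ending in .png whose body looks like script rather
    than an image or an HTML error page.
    """
    # Equivalent of: bodies.url LIKE '%.png'
    if not url.endswith(".png"):
        return False
    # Equivalent of: bodies.body CONTAINS 'function'
    if "function" not in body:
        return False
    # Exclude "friendly" not-found pages served as HTML for image requests.
    html_markers = ("DOCTYPE", "doctype", "html")
    return not any(marker in body for marker in html_markers)


# Example modeled on the report above (the body text is abbreviated and invented).
suspect = looks_like_mismatched_png(
    "https://wiggio.com/images/facebook_home.png",
    "/* OpenTok.js 2.9.3 41dae66 */ (function () {})();",
)
```

Like the query, this only narrows down candidates; confirming a real mismatch still means comparing against the response's actual MIME type.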

Thanks for looking into this so quickly, @pmeenan!