HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community

Home Page: https://almanac.httparchive.org

Add June 2022 data to the almanac dataset

rviscomi opened this issue

The 2022_06_01 crawl is complete and we have the raw [pages, requests, response bodies, lighthouse] tables available on BigQuery. To make previous years' queries compatible with the June crawl, we need to add it to the following tables in the almanac dataset:

These are no longer required (we moved to custom metrics last year but ran them one last time to allow comparison):

To clarify, should we still generate the 2022 data for these tables for comparison, or is it not needed because the data is already in custom metrics?

Don’t plan on doing the comparison this year (it was very messy last year because the results were so different, so I wouldn’t advise anyone else to do it either!), so it’s not needed this year AFAIK. Let’s make a clean break.

Sounds good, thanks. Removed them from the checklist.

@patrickhulce the latest table in lighthouse-infrastructure.third_party_web I see is dated 2022_01_01. Is it possible to get an updated version based on 2022_06_01 HTTP Archive data?

We're holding off on this because we discovered a data quality issue with the crawl.

Resuming this now that the June crawl has been successfully rerun.

Ping @patrickhulce about getting an updated lighthouse-infrastructure.third_party_web.2022_06_01 table.
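
For reference, the dated table naming used in this thread (`lighthouse-infrastructure.third_party_web.2022_06_01`) can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `crawl_table_id` and `is_stale` are hypothetical and not part of any HTTP Archive tooling, and it assumes the `project.dataset.YYYY_MM_DD` pattern seen in the table names above.

```python
from datetime import date

# Project and dataset names as they appear in this thread.
PROJECT = "lighthouse-infrastructure"
DATASET = "third_party_web"


def crawl_table_id(crawl: date) -> str:
    """Build the dated table ID (project.dataset.YYYY_MM_DD) for a crawl.

    Hypothetical helper; assumes the naming pattern seen in the thread.
    """
    return f"{PROJECT}.{DATASET}.{crawl.strftime('%Y_%m_%d')}"


def is_stale(latest_table: date, crawl: date) -> bool:
    """Return True when the newest available table predates the crawl."""
    return latest_table < crawl


latest = date(2022, 1, 1)   # newest table mentioned in the thread
wanted = date(2022, 6, 1)   # the June 2022 crawl

print(crawl_table_id(wanted))
print(is_stale(latest, wanted))
```

With the dates from this issue, the check reports that the 2022_01_01 table is older than the June crawl, which is why an updated table is being requested.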