HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community

Home Page: https://almanac.httparchive.org

Structured Data 2022

rviscomi opened this issue

If you're interested in contributing to the Structured Data chapter of the 2022 Web Almanac, please reply to this issue and indicate which role or roles best fit your interest and availability: author, reviewer, analyst, and/or editor.

Content team

Lead: @DataBytzAI
Authors: @cyberandy @DataBytzAI @JohnBarrettWDW
Reviewers: @SeoRobt @jasonbellwebdataworks @jonoalderson
Analysts: @rviscomi
Editors: @JasmineDW
Coordinator: @siakaramalegos
Expand for more information about each role 👀
  • The content team lead is the chapter owner and responsible for setting the scope of the chapter and managing contributors' day-to-day progress.
  • Authors are subject matter experts and lead the content direction for each chapter. Chapters typically have one or two authors. Authors are responsible for planning the outline of the chapter, analyzing stats and trends, and writing the annual report.
  • Reviewers are also subject matter experts and assist authors with technical reviews during the planning, analyzing, and writing phases.
  • Analysts are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.
  • Editors are technical writers who have a penchant for both technical and non-technical content correctness. Editors have a mastery of the English language and work closely with authors to help wordsmith content and ensure that everything fits together as a cohesive unit.
  • The section coordinator is the overall owner for all chapters within a section like "User Experience" or "Page Content" and helps to keep each chapter on schedule.

Note: The time commitment for each role varies by the chapter's scope and complexity as well as the number of contributors.

For an overview of how the roles work together at each phase of the project, see the Chapter Lifecycle doc.

Milestone checklist

0. Form the content team

  • May 1: The content team has at least one author, reviewer, and analyst

1. Plan content

  • May 15: The content team has completed the chapter outline in the draft doc

2. Gather data

  • June 1: Analysts have added all necessary custom metrics and drafted a PR (example) to track query progress
  • June 1 - 15: HTTP Archive runs the June crawl

3. Validate results

  • August 1: Analysts have queried all metrics and saved the output to the results sheet

4. Draft content

  • September 1: The content team has written, reviewed, and edited the chapter in the doc

5. Publication

  • September 15: The completed chapter and all required metadata and figures are converted to markdown and submitted to GitHub
  • September 26: Target launch date 🚀

Chapter resources

Refer to these 2022 Structured Data resources throughout the content creation process:

📄 Google Docs for outlining and drafting content
🔍 SQL files for committing the queries used during analysis
📊 Google Sheets for saving the results of queries
📝 Markdown file for publishing content and managing public metadata
💬 #web-almanac-structured-data on Slack for team coordination

Hi @rviscomi happy to contribute as author 🤓

Hi @rviscomi happy to help as author .... lot of work to be done in this space!

Would love to lead this one again!

@jonoalderson please do 🙏

Hello @rviscomi I'd like to be an editor!

Hey everyone, would be thrilled to participate as a reviewer!

@jonoalderson @cyberandy would be great to have a call for intros and planning ... esp since you folk have been there before :D

Hey @DataBytzAI yep, sounds like a plan! Carved out some time later this week to start getting things in motion 👍

Hello! I spend my days hands-on with semantic web technologies; particularly defining, describing, and linking structured data in a distributed KG system. These are also topics I speak about at tech conferences. If there is any room to participate as an author, I would be honored to be considered.

If not, perhaps I can add value in another role. Either way, I thank you for the consideration :)

@jonoalderson when suits Today/Thur/Fri for an intro call? .... I am London/Dublin time zone ... anyone else available to call?
@cyberandy @carducci ?

sorry @JasmineDW @siakaramalegos @SeoRobt meant to mention you as well to join call if possible!!

@DataBytzAI I'm available almost anytime tomorrow after 4:00PM London/Dublin time.

Friday I can connect anytime after 2:00PM London/Dublin time.

Thanks @DataBytzAI great to e-meet you! I'm available anytime tomorrow or Friday, currently in Central Standard Time zone.

Hey folks, any time Friday afternoon works for me.
Shall we aim for 4pm UTC on Friday?

Structured data call Friday 16:00 UTC (17:00 London time), good for everyone else?
@cyberandy @carducci @JasmineDW @siakaramalegos @SeoRobt

cc: @jonoalderson

Assuming the time works for everyone, I have created a zoom room.

Michael C (He/Him) is inviting you to a scheduled Zoom meeting.

Topic: Structured Data 2022
Time: Apr 29, 2022 04:00 PM Universal Time UTC

Join Zoom Meeting
https://us06web.zoom.us/j/81102069402?pwd=UDBqclNMWWtZK0tVdzBtTUY1ajBDUT09

Meeting ID: 811 0206 9402
Passcode: 312040
One tap mobile
+12532158782,,81102069402#,,,,*312040# US (Tacoma)
+13462487799,,81102069402#,,,,*312040# US (Houston)

Dial by your location
+1 253 215 8782 US (Tacoma)
+1 346 248 7799 US (Houston)
+1 669 900 6833 US (San Jose)
+1 301 715 8592 US (Washington DC)
+1 312 626 6799 US (Chicago)
+1 929 436 2866 US (New York)
Meeting ID: 811 0206 9402
Passcode: 312040
Find your local number: https://us06web.zoom.us/u/kcti4pHfFj

@cyberandy @DataBytzAI @JasmineDW @SeoRobt @siakaramalegos

cc: @jonoalderson

Hey @carducci thanks for setting that up. Confirming that the meeting will be at 6 PM UTC? The invite shows 4 PM UTC. Talk tomorrow!

Are we moving to 18:00 UTC? Happy to update the meeting but I want to confirm

fine here (7pm London time)

I'm flexible.

Sincere apologies @carducci, the Zoom invite for 16:00 UTC is correct and works for everyone.

I'll be there, though annoyingly I have a hard stop for another call at 16:30 UTC. :(

Hey folks, on your call, can you also discuss roles? Jono volunteered to be a lead though the Almanac does like to switch these up each year if possible. So let us know if anyone else wants to volunteer for that and then Rick can make a final decision.

You also have a lot of co-author volunteers but not as many reviewers. Reviewers also give feedback/guidance at the outline stage so your role is not only at the very end but influences content. So if you think you might have less overall time to give, then being a reviewer instead could be an option.

Don't forget to join the Slack (link in issue above)!

Oh you still need an analyst too so if you could help recruit for that, it would be great. If someone knows at least some SQL we can get them trained up for the rest. And the queries are done in the almanac account so they won't be charged for it. It's a really cool data set to explore for anyone interested.

Hi @siakaramalegos ... happy to volunteer for lead if Rick wants to rotate that role!

Nice one, that'd be great! I'm very happy to step into some author shoes.

REMINDER: CALL IN 5 MINUTES. The Zoom details are the same as in the invite above.

Tentatively adding myself as the analyst to help meet Milestone 0.

Just following up. When can we all get together for that mind-mapping session? Does the same time as last Friday work for folks on the 6th?

I'm also travelling this Friday, free next week

@cyberandy @DataBytzAI @jonoalderson @SeoRobt @JasmineDW could you all make sure you have access to the planning doc and start adding your ideas to the outline? What's new this year with structured data, or what would be good to revisit from previous years? We're trying to have that completed by May 15 in order to have enough time to add any necessary custom metrics before the June crawl kicks off. Thanks!

@cyberandy @jonoalderson @SeoRobt @JasmineDW @carducci
Can we plan a call tomorrow (Monday) to catch up and rocket-ship our end of things?

Actions we need to take:

@carducci kindly offered to host a mind-map session - this would be of great benefit to bring things together quickly

All present on our last call also agreed they can contribute a brain-dump on their particular area of interest/expertise to help us kick-start content planning .... could everyone find 15 minutes to do this asap?

FYI I also had a call with Rick about the availability of deeper site info on structured data and will update folks during our call.

@cyberandy @jonoalderson @SeoRobt @JasmineDW @carducci

Can you all please go to the outline document and add your contact details and role (as in the head of this thread!) to confirm participation?

Thanks :)
https://docs.google.com/document/d/1yHTdPPvpv380BLQWHo2aUWepOUBpTlfkx1O9zey8-5I/edit#

Sure works for me, I'll add my comments to the doc today.

Done. Provisionally shifting myself into an editor role, given that I have less availability than expected. Still here and engaged, but can't commit to quite as much active writing as anticipated.

Are you sure you don't mean Reviewer (check technical accuracy and also help suggest and review content) rather than Editor (check for typos, formatting and consistency with the rest of the Web Almanac)?

Reviewers will of course do some parts of editing as part of reviewing but Reviewers are technical experts in the subject matter, while Editors aren't necessarily.

Yes. Evidently more coffee needed this morning.

You're not the first (and won't be the last!) to confuse those roles! Need to think of a better way of making them clearer...

@jonoalderson would you be able to squeeze in a few minutes to do a brain-dump on things you would like to see in this year's content?

Yeah, definitely!

Hello! I would love to contribute as a reviewer or as an editor if you are still looking for contributors!

Awesome! .... let's get you connected up ..... please go to the outline document and add your name in the role you'd prefer!

https://docs.google.com/document/d/1yHTdPPvpv380BLQWHo2aUWepOUBpTlfkx1O9zey8-5I/edit#

You will have to request access first :)

Thanks @aparna-garimella !

Thank you!

Some semi-structured thoughts:

  • I'd love to see more on the relationships between entities. We scratched the surface of this with the sankey chart, but that was rough and ready (schema.org only, artificially ignored chained relationships, etc). It feels like some version of this could/should be the most important, understandable, and significant part of the output if 'done right'.
  • I'd love to understand more about what the (types of) things we're describing are, and also, where they 'live'. Feel like @cyberandy might have some good insight into that from WordLift.
  • E.g., if schema.org Person is common in the dataset, who is this describing? What kinds of person(s), or specific individuals? Etc etc, across various formats and data types.
  • I'd love more insight into who/what is consuming structured data, beyond just Google/Facebook/etc; even if only anecdotally.
    • I'd love to see examples of where structured data on the web is joining up with structured data behind the scenes; e.g., in business processes, analytics, etc.
    • Who's doing cool stuff on the front end? E.g., using for internal site search.
  • What's changed with/from those big providers? How are they driving adoption?
    • How are they distorting adoption? E.g., Google's "carousel" schema implementation is, IMO, a harmful bastardization(!) of itemList schema that pursues its own agenda at the expense of having 'good' structured data and definitions. Lots of other examples of this, but mostly just gripes with how Google make their own rules in this space.
  • I'd love to understand how much of the structured data is 'junk'/nonsense. E.g., there are lots of WordPress sites/themes which output either malformed e-commerce schema.org markup, or worse still, blindly add hentry classes to container elements with no rhyme/reason. How much of what's out there is just noise?
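
One rough way to put a number on the 'junk' question for JSON-LD specifically is to pull every `application/ld+json` script block out of a page and check whether it even parses. The sketch below is only an illustration using the Python standard library; the class and function names are hypothetical, it is not the Lighthouse check or an HTTP Archive custom metric, and a block that parses as JSON can still be meaningless schema.org markup.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # The type attribute may be missing or valueless, hence the "or ''".
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def count_jsonld(html):
    """Return (parseable, unparseable) counts of JSON-LD blocks in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    parseable = 0
    for block in parser.blocks:
        try:
            json.loads(block)
            parseable += 1
        except json.JSONDecodeError:
            pass  # malformed JSON counts as noise
    return parseable, len(parser.blocks) - parseable


sample_html = (
    '<script type="application/ld+json">{"@type": "Person"}</script>'
    '<script type="application/ld+json">{bad,}</script>'
    "<script>var x = 1;</script>"
)
result = count_jsonld(sample_html)
```

Microdata and RDFa noise, such as the stray hentry classes mentioned above, would need a different kind of check.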

Awesome feedback @jonoalderson! Just to elaborate on a few points:

  1. relationships between entities: we shall do better 💯 Agree it's a good starting point; if the data is ready we could also make the chart interactive 😎. Last year I used RAWGraphs, which is built on top of D3.js

  2. what's inside these classes: yes, I was thinking about adding some NLP - this would be particularly useful for descriptive types like FAQPage. Once we have an overview of the available data we can dive deeper. Also interesting would be the correlation between entity types & chains with sectors

  3. for the consuming part the most frequent use cases that I can think of are:

    • internal search engines
    • recommendation engines (and internal links)
    • web reporting
    • natural language generation (and ML training in general)

    We need to find nice ways to explore and present these use-cases.

What about a mind-mapping session some time tomorrow (Wednesday, May 11th) between 5:30 and 7:00 UTC?

I like the idea of identifying consumers. Here's some I keep an eye on:

Google Merchant Centre: Structured Data is used to verify, update or even create a feed. They use inProductGroupWithID and additionalProperty properties to identify products. Some of the Structured Data they support clashes with what Google Search supports.
https://support.google.com/merchants/answer/6386198?hl=en&ref_topic=6386199

Facebook Ads: Their pixel reads product structured data which can update a catalog. They use the productID and additionalProperty parameters to identify products.
https://developers.facebook.com/docs/marketing-api/catalog/guides/microdata-tags

Pinterest: Structured Data for product pins
https://developers.pinterest.com/docs/rich-pins/product/

Twitter: Basic meta tags for cards.
https://developer.twitter.com/en/docs/twitter-for-websites/cards/overview/markup
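
As a rough illustration of the identifiers mentioned above, the sketch below builds schema.org Product markup for two size variants that share one `inProductGroupWithID` value and carry an `additionalProperty` entry. The helper name, SKUs, and property values are all made up for the example, and it does not claim that any of the consumers listed will accept this exact shape.

```python
import json


def product_variant(sku, group_id, name, size, colour):
    """Build a schema.org Product dict for one variant (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "sku": sku,
        "name": name,
        # Same value on every variant, so a consumer can group them.
        "inProductGroupWithID": group_id,
        "size": size,
        "color": colour,
        # Free-form extra attributes expressed as PropertyValue entries.
        "additionalProperty": [
            {"@type": "PropertyValue", "name": "fit", "value": "regular"},
        ],
    }


# Illustrative data only: two sizes of the same (made-up) product.
variants = [
    product_variant("TSHIRT-RED-S", "TSHIRT-RED", "Red T-shirt", "S", "red"),
    product_variant("TSHIRT-RED-M", "TSHIRT-RED", "Red T-shirt", "M", "red"),
]

# Serialized form that would go inside a <script type="application/ld+json"> tag.
markup = json.dumps(variants[0], indent=2)
```

Each variant is emitted as its own Product here; other groupings are possible, and which one a given consumer expects would need checking against its documentation.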

@carducci I'm available between 5:30 - 6:30 UTC

Oh, something I keep running into that @Tiggerito hints at, that might be interesting to explore as a brief footnote...
There's no clean/sensible/performant way of reconciling the combinations of requirements from this set of consumers.

E.g., there's no good approach which describes a product with variations (e.g., size/colour) in a way that's valid for Google Search, Google Merchant Center, Facebook (og: or Marketplace). Optimizing for more than one creates irreconcilable conflicts in the approach/expectations.

Fun stuff.

@cyberandy @SeoRobt @JasmineDW @aparna-garimella @carducci
cc: @jonoalderson

How is everyone for a call/mind-mapping session TODAY? ... circa 18:00 London time?

Tied up unfortunately :(

@cyberandy @SeoRobt @JasmineDW @aparna-garimella @carducci

Ok to move to 18:00 London time with you all?

@DataBytzAI I'm good for 18:00 London time

Likewise I am tied up today, unfortunately :(.

sorry I was in another meeting by then, shall we set a bi-weekly meeting around that time?

OUTLINE UPDATE

@cyberandy @carducci @JasmineDW @siakaramalegos @SeoRobt @jonoalderson @michaelcary

Hi folks,

I have gathered all of the comments and rejigged the chapter outline to hopefully reflect all of the discussion thus far.
It would be great if you could please take a few minutes and go through it and comment! .... in particular, see the comment/highlights on certain line-items that give more explanation of the chapter topic.

@rviscomi as discussed, we will need a new metric gathered, the one we observed that checks for JSON-LD validity in Lighthouse.

cheers,

Allen.

Hi team, I wanted to check in on the analysis. It looks like you might be behind (or I'm not finding the resources). This step is critical to get started and progress made in order to keep the writing and publication date on track. What are the next steps to get started? Do you need any help?

@DataBytzAI | @cyberandy @DataBytzAI | @SeoRobt @jonoalderson | @rviscomi

Hi @DataBytzAI I think @rviscomi is the one that needs your time more than me so that he can figure out which metrics to write queries for.

@DataBytzAI @NishuGoel any progress on this? It looks like the draft PR doesn't have any queries added to it yet. We only have about 8 days for the queries to be written, reviewed, run, data put into the sheets document, and charts created. Otherwise the writing could be at risk to miss the publish date.

All 2021 queries have been migrated to 2022 and results saved to the chapter sheet. I've also added all the same pivot tables and charts that were used last year, except for the Sankey diagram in figure 12. @GregBrimble did you create that last year, and would you be able to recreate it with this year's data?

@DataBytzAI over to you to start drafting the chapter. If you have any other particular metrics in mind beyond what's available in the sheet, let me know.

Not me! That was @tunetheweb 's fine work: #2560

I would love to take credit for that lovely diagram, but I just added it to git and inserted it into the chapter markdown. @cyberandy created the diagram with some useful input from @jonoalderson and less-useful nitpicking from me 😁

@DataBytzAI when do you think you can begin the draft? Just as a reminder, the due date at the end of the month is for post-review and post-edit, so you'll need to set aside at least a week for those and preferably more.

Sorry for the disruption @rick! 😆

Hi all, great job getting started on the draft! Can it be ready for review within the next 1-2 days? The reviewers need time to review it and then authors need time to incorporate feedback, and then we need to do a writing style/usage editing cycle all before August 31 to stay on track. Let us know your status

@DataBytzAI | @cyberandy @DataBytzAI @JohnBarrettWDW | @SeoRobt @jonoalderson

Hi @siakaramalegos ... we expect to have the main body of it completed today, tomorrow latest...
Andrea is doing the sankey chart for relationship data ...@cyberandy can you give us an update on that?
Otherwise I have sent a message to the others on the list for any contributions they can add or for them to please commence input/editing! .... I think we are on track for final review before Monday!

Great progress so far team! What's next to finish up the draft and reviews? We'd really like to get this published in the first cycle. @DataBytzAI @cyberandy @DataBytzAI @JohnBarrettWDW | @SeoRobt @jonoalderson

Thanks @siakaramalegos ... late start but glad we were able to bring it together!
We feel we have the core of things complete ... @cyberandy was doing some charts that I think are complete - Andrea you might comment please? .... also @jonoalderson and @SeoRobt are currently doing their final review I think - folks can you comment on where you are with that? ... finally, hats off to the amazing @JasmineDWillson who has wielded her quill and ink and brought excellent structure to bear on the overall document as it now stands :D

I'm unexpectedly behind due to some illness this week 😩

I'm aiming to carve out a chunk of the weekend to review, but don't wait on me if I'm too late to the party.

I'm planning on adding a couple of small comments over the weekend for my final input. Nothing major on my end!

@cyberandy what do we need to do to add your charts (and any related comment) to the document? ...
Far as I can see that's the only thing we have outstanding now?

cc: @siakaramalegos @rviscomi

@DataBytzAI | @cyberandy @DataBytzAI @JohnBarrettWDW @SeoRobt @jonoalderson what's the status on the chapter? Are all parts written? It looks like it only has 1 technical review so far - any way to get another? We're already more than a week behind schedule and the editing cannot happen until technical review is complete.

@siakaramalegos working on a technical - brb!

@siakaramalegos ... done - we have @jasonbellwebdataworks onboard ... will have that review completed today fingers crossed!

@DataBytzAI any update on resolving the remaining feedback in the doc and regarding the sankey diagrams? Getting close to the deadline to be included in the Monday release.

Sorry to hear about @jonoalderson, @DataBytzAI shall I review the document on Drive or here?