ampproject/error-tracker

AMP Project's error logging server

I2I: Error Reporting Migration Plan

rcebulko opened this issue

With the migration to Cloud Functions, several things have to happen to fully migrate error reporting and introduce new features. Since Cloud Functions does not provide built-in traffic migration or versioning, we have stable and beta channels with different URLs that the Runtime diverts traffic between, allowing us to test new error reporting deployments. Stable is promoted explicitly by pushing a git tag, while beta releases are built automatically from master on each commit.

The new error reporting endpoints live at new URLs:

  • stable: https://us-central1-amp-error-reporting.cloudfunctions.net/r
  • beta: https://us-central1-amp-error-reporting.cloudfunctions.net/r-beta
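
As a minimal sketch of how a client might target these two channels (the request parameters shown here are assumptions for illustration, not the endpoint's documented API):

```ts
// Sketch of reporting to the stable vs. beta Cloud Functions endpoints.
// The payload shape is an illustrative assumption; the actual parameters
// accepted by /r are defined by the error-tracker service itself.

const ENDPOINTS = {
  stable: 'https://us-central1-amp-error-reporting.cloudfunctions.net/r',
  beta: 'https://us-central1-amp-error-reporting.cloudfunctions.net/r-beta',
} as const;

type Channel = keyof typeof ENDPOINTS;

async function reportError(
  channel: Channel,
  message: string,
  stack: string
): Promise<void> {
  // Hypothetical field names (m = message, s = stack), for illustration only.
  const params = new URLSearchParams({m: message, s: stack});
  await fetch(`${ENDPOINTS[channel]}?${params}`, {method: 'POST'});
}
```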

Migrating to the new error reporting conventions includes a handful of changes:

  • introducing the stable Cloud Functions endpoint /r to the AMP Runtime and slowly migrating error reporting traffic over
  • introducing the beta endpoint /r-beta to the AMP Runtime and migrating a slice of traffic over
  • changing service bucket names and version identifiers to more useful, readable groupings
  • splitting expected errors into a separate project for monitoring

To minimize risk of something breaking, and to minimize the impact of the changes on developer workflow, I propose the following plan.

Week of March 30

  • introduce the stable endpoint as an AMP Runtime experiment (ampproject/amphtml#27501) for 20% of traffic in Beta/Experimental (a diversion sketch follows this list)
  • introduce the same experiment for 20% of traffic in Beta/Experimental opt-in
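
As referenced above, a minimal sketch of what a percentage-based diversion could look like; the hashing scheme, function name, and legacy URL shown are assumptions for illustration, not the Runtime's actual experiment framework (see ampproject/amphtml#27501 for the real change):

```ts
// Hypothetical 20% diversion between the legacy App Engine endpoint and
// the new Cloud Functions endpoint. A deterministic per-client bucket
// keeps a given client reporting to the same endpoint across page loads.

const LEGACY_ENDPOINT = 'https://amp-error-reporting.appspot.com/r'; // assumed legacy URL
const NEW_ENDPOINT =
  'https://us-central1-amp-error-reporting.cloudfunctions.net/r';

function endpointFor(clientId: string, experimentFraction = 0.2): string {
  // Cheap string hash into an unsigned 32-bit integer.
  let hash = 0;
  for (const ch of clientId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  // Map the hash to [0, 1) and compare against the diversion fraction.
  return hash / 2 ** 32 < experimentFraction ? NEW_ENDPOINT : LEGACY_ENDPOINT;
}
```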

Week of April 6

  • the 20% new-endpoint traffic experiment reaches Beta/Experimental 1%

This week allows us to confirm the new reporting endpoint reports correctly and scales as needed.

Week of April 13

  • the 20% new-endpoint traffic experiment reaches Stable
  • ramp traffic to the new endpoint to 100% in Beta/Experimental (ampproject/amphtml#27682)
  • modify the split for the beta endpoint: divert 10% of new-endpoint traffic to /r-beta (ampproject/amphtml#27682)
  • split expected errors into a separate logging project in the beta endpoint (#124; see the sketch below)

This week allows us to verify that the new reporting endpoint can scale to production traffic and validates the 90%/10% split to the beta endpoint.
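
A rough sketch of what the expected-error split could look like on the logging side, assuming hypothetical project IDs and an `isExpected` flag (the real implementation lives in #124):

```ts
import {Logging} from '@google-cloud/logging';

// Hypothetical project IDs; the actual project names may differ.
const ERRORS_PROJECT = 'amp-error-reporting';
const EXPECTED_ERRORS_PROJECT = 'amp-expected-errors';

// Route a report to a separate Cloud project when it is flagged as
// expected, keeping the main project's signal clean for monitoring.
function loggingClientFor(isExpected: boolean): Logging {
  return new Logging({
    projectId: isExpected ? EXPECTED_ERRORS_PROJECT : ERRORS_PROJECT,
  });
}

// Usage: loggingClientFor(true).log('reported-errors') writes expected
// errors to the separate project instead of the main one.
```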

Week of April 20

  • new-endpoint traffic reaches 100% in Stable (split 90% to /r, 10% to /r-beta)

By this point, all errors are reported through the new endpoints, so we can start making changes to service buckets and version names. Maintaining reporting volume in both the legacy and new buckets throughout the transition ensures a useful amount of data is available in at least one of them at all times.

Week of April 27

  • promote the expected-error split to the stable endpoint
  • rename version IDs to readable names in the beta endpoint (#127; see the sketch below)
  • introduce service bucket renaming into the beta endpoint (#128)
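
As a rough illustration of the version-ID renaming: AMP runtime version strings encode a channel prefix plus a version number, so a readable grouping might be derived as below. The prefix table and output format are assumptions, not the actual scheme from #127:

```ts
// Hypothetical mapping from a raw AMP runtime version ID (RTV) to a
// human-readable name. The channel prefixes and output format below are
// illustrative assumptions, not the scheme actually shipped in #127.

const CHANNEL_PREFIXES: Record<string, string> = {
  '00': 'Experimental',
  '01': 'Stable',
  '03': 'Beta',
};

function readableVersion(rtv: string): string {
  const channel = CHANNEL_PREFIXES[rtv.slice(0, 2)] ?? 'Unknown';
  return `${channel} (${rtv.slice(2)})`;
}

// readableVersion('012004172112280') === 'Stable (2004172112280)'
```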

Week of May 4

  • promote service bucket and version renaming into the stable endpoint
  • party 🎉

Week of May 11

  • Deep Dive on all Error Reporting process & infrastructure changes

/cc @rsimha @ampproject/wg-runtime

Due to cherry-picks and P0s, there was no release last week. I've updated all dates here to reflect the delay.

At this point, with 20% of errors in Beta/Experimental (and 100% of Nightly) reporting to the new endpoint, we can see a very tight sync between the legacy and new endpoints:

[chart: reporting volume, legacy vs. new endpoints]

Green is GAE-Production; orange is GAE-Canary; purple is GCF-Canary.

This seems to indicate the new endpoint is logging as expected.

Promotion of the 20% split to the new endpoint seems to have gone off without a hitch so far:

[chart: reporting volume after promoting the 20% split]

Relative increases and decreases to reporting volume, throttle rates, and a handful of other metrics all seem to be in line with expectations.

It's vividly clear that the traffic to the new endpoint went up by almost exactly the 10x we'd expect:

[chart: new-endpoint traffic volume]

It also appears that scaling up has made response times much more consistent (and lower overall).

[chart: endpoint response times]

Looks like this is scaling very well!
