ampproject/error-tracker

AMP Project's error logging server

I2I: Error Reporting Migration Plan

rcebulko opened this issue

With the migration to Cloud Functions, several things have to happen to fully migrate error reporting and introduce new features. Since Cloud Functions does not provide built-in traffic migration or versioning, we have stable and beta channels with different URLs that the Runtime diverts traffic between, allowing us to test new error reporting deployments. Stable is promoted explicitly by pushing a git tag, while beta releases are built automatically from master on each commit.

The new error reporting endpoints live at new URLs:

  • stable: https://us-central1-amp-error-reporting.cloudfunctions.net/r
  • beta: https://us-central1-amp-error-reporting.cloudfunctions.net/r-beta
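
As a minimal sketch of how a client might target these two channels (the request parameters shown here are assumptions for illustration, not the endpoint's documented API):

```ts
// Sketch of reporting to the stable vs. beta Cloud Functions endpoints.
// The payload shape is an illustrative assumption; the actual parameters
// accepted by /r are defined by the error-tracker service itself.

const ENDPOINTS = {
  stable: 'https://us-central1-amp-error-reporting.cloudfunctions.net/r',
  beta: 'https://us-central1-amp-error-reporting.cloudfunctions.net/r-beta',
} as const;

type Channel = keyof typeof ENDPOINTS;

async function reportError(
  channel: Channel,
  message: string,
  stack: string
): Promise<void> {
  // Hypothetical field names (m = message, s = stack), for illustration only.
  const params = new URLSearchParams({m: message, s: stack});
  await fetch(`${ENDPOINTS[channel]}?${params}`, {method: 'POST'});
}
```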

Migrating to the new error reporting conventions includes a handful of changes:

  • introducing the stable Cloud Functions endpoint /r to the AMP Runtime and slowly migrating error reporting traffic over
  • introducing the beta endpoint /r-beta to the AMP Runtime and migrating a slice of traffic over
  • changing service bucket names and version identifiers to more useful, readable groupings
  • splitting expected errors into a separate project for monitoring

To minimize risk of something breaking, and to minimize the impact of the changes on developer workflow, I propose the following plan.

Week of March 30

  • introduce the stable endpoint as an AMP Runtime experiment (ampproject/amphtml#27501) for 20% of traffic in Beta/Experimental (a diversion sketch follows this list)
  • introduce the same experiment for 20% of traffic in Beta/Experimental opt-in
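
As referenced above, a minimal sketch of what a percentage-based diversion could look like; the hashing scheme, function name, and legacy URL shown are assumptions for illustration, not the Runtime's actual experiment framework (see ampproject/amphtml#27501 for the real change):

```ts
// Hypothetical 20% diversion between the legacy App Engine endpoint and
// the new Cloud Functions endpoint. A deterministic per-client bucket
// keeps a given client reporting to the same endpoint across page loads.

const LEGACY_ENDPOINT = 'https://amp-error-reporting.appspot.com/r'; // assumed legacy URL
const NEW_ENDPOINT =
  'https://us-central1-amp-error-reporting.cloudfunctions.net/r';

function endpointFor(clientId: string, experimentFraction = 0.2): string {
  // Cheap string hash into an unsigned 32-bit integer.
  let hash = 0;
  for (const ch of clientId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  // Map the hash to [0, 1) and compare against the diversion fraction.
  return hash / 2 ** 32 < experimentFraction ? NEW_ENDPOINT : LEGACY_ENDPOINT;
}
```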

Week of April 6

  • the 20% new-endpoint traffic experiment reaches Beta/Experimental 1%

This week allows us to confirm the new reporting endpoint reports correctly and scales as needed.

Week of April 13

  • the 20% new-endpoint traffic experiment reaches Stable
  • ramp traffic to the new endpoint to 100% in Beta/Experimental (ampproject/amphtml#27682)
  • modify the split for the beta endpoint: divert 10% of new-endpoint traffic to /r-beta (ampproject/amphtml#27682)
  • split expected errors into a separate logging project in the beta endpoint (#124; see the sketch below)

This week allows us to verify that the new reporting endpoint can scale to production traffic and validates the 90%/10% split to the beta endpoint.
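
A rough sketch of what the expected-error split could look like on the logging side, assuming hypothetical project IDs and an `isExpected` flag (the real implementation lives in #124):

```ts
import {Logging} from '@google-cloud/logging';

// Hypothetical project IDs; the actual project names may differ.
const ERRORS_PROJECT = 'amp-error-reporting';
const EXPECTED_ERRORS_PROJECT = 'amp-expected-errors';

// Route a report to a separate Cloud project when it is flagged as
// expected, keeping the main project's signal clean for monitoring.
function loggingClientFor(isExpected: boolean): Logging {
  return new Logging({
    projectId: isExpected ? EXPECTED_ERRORS_PROJECT : ERRORS_PROJECT,
  });
}

// Usage: loggingClientFor(true).log('reported-errors') writes expected
// errors to the separate project instead of the main one.
```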

Week of April 20

  • new-endpoint traffic reaches 100% in Stable (split 90% to /r, 10% to /r-beta)

By this point, all errors are reported through the new endpoints, so we can start making changes to service buckets and version names. Maintaining reporting volume in both the legacy and new buckets throughout the transition ensures a useful amount of data is available in at least one of them at all times.

Week of April 27

  • promote the expected-error split to the stable endpoint
  • rename version IDs to readable names in the beta endpoint (#127; see the sketch below)
  • introduce service bucket renaming into the beta endpoint (#128)
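
As a rough illustration of the version-ID renaming: AMP runtime version strings encode a channel prefix plus a version number, so a readable grouping might be derived as below. The prefix table and output format are assumptions, not the actual scheme from #127:

```ts
// Hypothetical mapping from a raw AMP runtime version ID (RTV) to a
// human-readable name. The channel prefixes and output format below are
// illustrative assumptions, not the scheme actually shipped in #127.

const CHANNEL_PREFIXES: Record<string, string> = {
  '00': 'Experimental',
  '01': 'Stable',
  '03': 'Beta',
};

function readableVersion(rtv: string): string {
  const channel = CHANNEL_PREFIXES[rtv.slice(0, 2)] ?? 'Unknown';
  return `${channel} (${rtv.slice(2)})`;
}

// readableVersion('012004172112280') === 'Stable (2004172112280)'
```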

Week of May 4

  • promote service bucket and version renaming into the stable endpoint
  • party 🎉

Week of May 11

  • Deep Dive on all Error Reporting process & infrastructure changes

/cc @rsimha @ampproject/wg-runtime

Due to cherry-picks and P0s, there was no release last week. I've updated all dates here to reflect the delay.

At this point, with 20% of errors in Beta/Experimental (and 100% of Nightly) reporting to the new endpoint, we can see a very tight sync between the legacy and new endpoints:

[chart: reporting volume, legacy vs. new endpoints]

Green is GAE-Production; orange is GAE-Canary; purple is GCF-Canary.

This seems to indicate the new endpoint is logging as expected.

Promotion of the 20% split to the new endpoint seems to have gone off without a hitch so far:

[chart: reporting volume after promoting the 20% split]

Relative increases and decreases to reporting volume, throttle rates, and a handful of other metrics all seem to be in line with expectations.

It's vividly clear that the traffic to the new endpoint went up by almost exactly the 10x we'd expect:

[chart: new-endpoint traffic volume]

It also appears that scaling up has made response times much more consistent (and lower overall).

[chart: endpoint response times]

Looks like this is scaling very well!
