ampproject / error-tracker

AMP Project's error logging server

I2I: Migrate error reporting handler to serverless deployment

rcebulko opened this issue

Background

Currently, this application runs on Google App Engine (Flexible) with automatic instance scaling between 4 and 100 instances (see app.yaml), typically hovering around 20 instances.

As a result of settings that have been deprecated since the error-tracker was last deployed, re-deploying it now causes a breakage related to GAE's readiness and liveness checks. @jridgewell and I have tried without success to identify the source. As an example of the impact, LTS errors can't currently be logged, since the update to support them is seemingly impossible to deploy.

Proposal

One way to simplify the deployment process and unblock updates to this app is to move to a serverless deployment on Cloud Functions. As can be seen in the app entrypoint, the server defines very few endpoints (health checks, which respond 200, and /r, which delegates work to an errorTracker handler). It would be relatively straightforward to instead package these handler functions as two Cloud Functions, or even just one, since the health check would no longer be relevant.
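
As a rough illustration of how small the single-function version could be, here is a minimal sketch. HTTP-triggered Cloud Functions on Node.js receive Express-style request and response objects; everything else here (the `HttpRequest`/`HttpResponse` stand-in types, `placeholderErrorTracker`, `createEntryPoint`, and the 200/OK response) is hypothetical and not taken from the actual app.

```typescript
// Minimal stand-ins for the Express-style objects an HTTP-triggered function
// receives. Only the members this sketch touches are declared.
interface HttpRequest {
  method: string;
  query: Record<string, string | undefined>;
}
interface HttpResponse {
  status(code: number): HttpResponse;
  send(body?: string): void;
}
type HttpHandler = (req: HttpRequest, res: HttpResponse) => void;

// Hypothetical placeholder for the app's existing errorTracker handler.
// The real handler's behavior and response are not reproduced here.
const placeholderErrorTracker: HttpHandler = (_req, res) => {
  res.status(200).send('OK');
};

// No router and no health-check route: the platform calls one function per
// request, so the entry point only has to delegate to the handler.
function createEntryPoint(errorTracker: HttpHandler): HttpHandler {
  return (req, res) => errorTracker(req, res);
}

// In a real deployment this function would be exported under the name the
// Cloud Function is deployed with (assumed here to be `r`).
const r: HttpHandler = createEntryPoint(placeholderErrorTracker);
```

The point of the sketch is only that the /r route maps onto a single function with the same (req, res) signature, so the existing handler could be reused largely as-is.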

Challenges

1. Caching the source maps and RTV diversions

Currently each GAE instance of the app fetches https://cdn.ampproject.org/rtv/metadata to identify which RTVs are currently being served, and drops errors from all older RTVs. The app uses a 5-50 minute cache for the diversion list, and these caches can at times be out of sync across instances. It also caches source maps for 2 weeks to provide unminified line numbers in stack traces.
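
The caching behavior described above could be sketched roughly as follows. This is not the app's actual implementation: `ActiveRtvCache`, `refreshIfStale`, and the injected `loadActiveRtvs` loader are hypothetical names, and the loader stands in for whatever code requests the rtv/metadata endpoint and extracts the list of served RTVs.

```typescript
/** Per-instance cache of the RTVs currently being served (hypothetical). */
class ActiveRtvCache {
  private rtvs: Set<string> = new Set();
  private expiresAt = 0; // Epoch milliseconds; 0 means "never populated".
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  /** True when the list must be re-fetched before it is trusted again. */
  isStale(now: number): boolean {
    return now >= this.expiresAt;
  }

  /** Stores a freshly fetched list and restarts the TTL. */
  update(activeRtvs: string[], now: number): void {
    this.rtvs = new Set(activeRtvs);
    this.expiresAt = now + this.ttlMs;
  }

  /** True when a report comes from an RTV that is not in the cached list. */
  shouldDrop(rtv: string): boolean {
    return !this.rtvs.has(rtv);
  }
}

/** Refreshes the cache through the injected loader only when it is stale. */
async function refreshIfStale(
  cache: ActiveRtvCache,
  loadActiveRtvs: () => Promise<string[]>,
  now: number = Date.now(),
): Promise<void> {
  if (cache.isStale(now)) {
    cache.update(await loadActiveRtvs(), now);
  }
}
```

Because each instance holds its own `ActiveRtvCache`, two instances that refreshed at different times can briefly disagree about which RTVs are active, which matches the out-of-sync behavior noted above.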

Cloud Functions are executed independently with no guarantees about the environment or local state. Initially, some form of Cloud Storage or Memorystore was considered to hold this RTV cache and share it across functions. However, according to the Optimizing Networking section, global state is often maintained between function invocations.

Cloud Functions spins up as many instances as required to handle the load, and only the exported function is called. Dependency loading, and creation/instantiation of global variables, happens only a) at the first function deployment and b) when new instances are started to scale up. In other words, if the cache is declared in the global scope outside the handler function, it can still be shared by function invocations on that instance. We will need to test/monitor the behavior of this scaling, but it appears the behavior will be similar to GAE instance scaling.
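
A minimal sketch of that pattern, assuming a string-keyed cache: the `Map` is created at module scope, so it is built once when an instance starts and then reused by every invocation that instance serves. The names `instanceCache` and `getOrCompute` are hypothetical.

```typescript
// Module scope: evaluated once when an instance starts, not on every request.
const instanceCache = new Map<string, string>();

// Stand-in for work done inside the handler. The expensive `compute` step
// (for example, fetching a source map) runs only on a cache miss for this
// instance; later invocations on the same instance reuse the stored value.
function getOrCompute(key: string, compute: () => string): string {
  const cached = instanceCache.get(key);
  if (cached !== undefined) {
    return cached;
  }
  const value = compute();
  instanceCache.set(key, value);
  return value;
}
```

Nothing here is shared across instances, and an instance can be recycled at any time, so the cache has to be treated as an optimization rather than as durable state.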

The same logic applies to the keys fetched at startup, though it may be possible to instead embed the keys directly in the Cloud Function configuration in Pantheon, eliminating the need to fetch them explicitly.

2. New handler URL

Cloud Functions have a distinct handler URL of the form https://[REGION]-[PROJECT_NAME].cloudfunctions.net/[FUNCTION_NAME]. This means to actually use a serverless deployment, error-reporting code in amphtml (and any other projects that report to the app currently) would need to be updated to use the new URL.

We would likely want to do this behind a flag in the Experimental build, or configure error reporting to report to the new URL X% of the time and ramp up, so we can migrate to the new deployment gradually. Despite the clients reporting to two different endpoints, the errors themselves would still end up in the same place, so this migration should not impact our ability to monitor or assess error levels.
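
The percentage-based ramp-up could look something like the sketch below. Both URLs are placeholders rather than the real endpoints, and `chooseReportingUrl` and its `rampFraction` parameter are hypothetical names, not existing amphtml code.

```typescript
// Placeholder endpoints, not the real ones. The second follows the
// https://[REGION]-[PROJECT_NAME].cloudfunctions.net/[FUNCTION_NAME] form.
const LEGACY_URL = 'https://legacy.example.com/r';
const FUNCTION_URL = 'https://region-project.cloudfunctions.net/r';

/**
 * Picks the reporting endpoint for one error report.
 * `rampFraction` is the share of reports (0 to 1) sent to the new URL;
 * `random` is injectable so the choice can be made deterministic in tests.
 */
function chooseReportingUrl(
  rampFraction: number,
  random: () => number = Math.random,
): string {
  // Clamp so that out-of-range settings behave like 0% or 100%.
  const fraction = Math.min(1, Math.max(0, rampFraction));
  return random() < fraction ? FUNCTION_URL : LEGACY_URL;
}
```

Raising `rampFraction` from 0 toward 1 moves traffic over gradually, and setting it back to 0 acts as an immediate rollback to the existing endpoint.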

One alternative was to forward requests from the App Engine app to the Cloud Functions URL, but since we can't update the GAE app, that idea is moot.

/cc @ampproject/wg-infra @ampproject/wg-runtime

Glad to see progress here. Are you planning on working on this soon?

> global state is often maintained between function invocations.

Excellent. I think https://cloud.google.com/functions/docs/bestpractices/tips#use_global_variables_to_reuse_objects_in_future_invocations is the more relevant documentation, but it's the same point. This is excellent, and means we won't have to do much refactoring.

> This means to actually use a serverless deployment, error-reporting code in amphtml (and any other projects that report to the app currently) would need to be updated to use the new URL.

Happy to help here.

> Glad to see progress here. Are you planning on working on this soon?

Probably, given that the inability to deploy updates to the service makes it a blocker for most other work around error reporting.

> Excellent. I think https://cloud.google.com/functions/docs/bestpractices/tips#use_global_variables_to_reuse_objects_in_future_invocations is the more relevant documentation, but it's the same point. This is excellent, and means we won't have to do much refactoring.

Yep I saw that too, and yeah this definitely simplifies some things.

> Happy to help here.

I'll loop you in once we're in a position to start feeding traffic to the function-based version.

This was done by @rcebulko a long time ago :D