artsy / README

:wave: - The documentation for being an Artsy Engineer

[RFC] Get Sentry Errors to Meaningful State for Our Critical 3 Apps/Services

ashkan18 opened this issue

Proposal:

We need to bring the Sentry error rate for our Criticality 3 apps/services down to an acceptable, meaningful number and get below our current 500k event limit, so we can monitor errors, properly resolve them, and eventually get to 0 events in production (it's possible, yes we can!).

Option 1

On Oct 1st, we archive all existing errors for our highest-criticality services/apps on Sentry to get to a "clean state".

From Oct 1st, each practice assigns a dedicated person at their practice meeting to watch their specific critical apps/services on Sentry until the next meeting.

When a new Sentry error happens in one of our Criticality 3 apps, the dedicated person needs to take one of the following actions:

  • If the error should be caught and is not a real bug (i.e. we're missing a catch), wrap the failing code in the relevant service with a proper try/catch (see the sketch after this list).
  • If the error is a real bug and can't be fixed in ~10 minutes, open a Jira ticket, add it to the backlog of the appropriate product team, and let that team's TL know.
  • Resolve the issue in Sentry.
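
As a rough illustration of the first bullet, here is a minimal TypeScript sketch using the @sentry/node SDK. The `fetchArtwork` call and `NotFoundError` class are hypothetical stand-ins for whatever code is actually throwing; the idea is that expected failures get handled locally while genuine bugs still reach Sentry with context.

```ts
import * as Sentry from "@sentry/node";

Sentry.init({ dsn: process.env.SENTRY_DSN });

// Hypothetical error type for an expected "record is missing" condition.
class NotFoundError extends Error {}

// Hypothetical stand-in for the real call that can throw.
async function fetchArtwork(id: string): Promise<{ id: string }> {
  throw new NotFoundError(`artwork ${id} not found`);
}

async function findArtworkOrNull(id: string): Promise<{ id: string } | null> {
  try {
    return await fetchArtwork(id);
  } catch (error) {
    if (error instanceof NotFoundError) {
      // Expected condition: handle it here so it never shows up in Sentry.
      return null;
    }
    // Anything else is a real bug: report it with context and re-throw.
    Sentry.captureException(error, { tags: { loader: "artwork" } });
    throw error;
  }
}
```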

Option 2

Similar to option 1, but slightly less aggressive: we handle this on staging.

Every Monday, we clean up Sentry errors on staging. Each practice assigns a dedicated person each week to monitor staging Sentry for critical apps/services.

When a new error happens:

  • If the error should be caught and is not a real bug (i.e. we're missing a catch), wrap the failing code in the relevant service with a proper try/catch (as in option 1).
  • If the error is a real bug and can't be fixed in ~10 minutes, open a Jira ticket, add it to the backlog of the appropriate product team, and let that team's TL know.
  • Resolve the issue in Sentry.

Reasoning

Our Sentry error rate in some of our critical apps is at a level that makes it super hard, if not impossible, to know which errors are real and need to be dealt with, and which are already resolved.

Benefits of keeping a healthy Sentry error log:

  • We can monitor errors actively and prevent/detect/resolve incidents sooner
  • We can find production errors faster and more easily
  • We can go below our current 500k event limit and even move to a smaller plan (#reduce-burn); ask @joeyAghion how much we are paying for Sentry
  • No need for a bug bash as a one-day event

Exceptions:

none

Additional Context:

Some context on Slack
Example fix PR: https://github.com/artsy/gravity/pull/12577

How is this RFC resolved?

Deciding between options 1 and 2, and making sure we don't go above our limits from Nov 1st.

I'm a 👍 on this. Option 1️⃣ makes sense to me – why not be aggressive about error-handling? I also dislike the idea of further complicating our weekly staging/production process.

Also down for option 1. At some point, declaring bankruptcy isn't a bad idea. I don't think we've really had a particularly great process or hygiene around keeping Sentry clear. I personally go through phases of checking it more regularly vs. not.

It seems 'cleaner' to be able to start fresh and with renewed process and sense of ownership. While I totally understand the arguments for not doing that and trying to go through the existing issues, it feels very onerous.

I would be cautiously in favor of 1️⃣

I am also in favor of option 1️⃣. Sentry, as a tool to spot regressions, doesn't make sense unless we are actively using it, so with the goal being to address all errors from that perspective, a clean slate seems like the right choice.

I'm open to 1️⃣. Personally, I expect the designated individuals to get overwhelmed pretty quickly, but that frustration could be a good thing. The attempt will give us a first-hand sense of what it takes and we can evolve these expectations based on that.

Who decides whether a bug is front-end or back-end or mobile, and so which practice's designated individual will triage it?

Also, a clarification: are you suggesting we do this for all 12 critical systems at once? That might be a lot of volume. Maybe fine, or maybe worth starting slightly more narrowly.

@joeyAghion maybe we can start with the services/apps with the most events and leave the quieter ones for later. To begin with, if we go with the list below, I can imagine the following practices for each:

  • MP - platform/front end
  • Eigen - front end iOS
  • Force - front end
  • Gravity - platform

Especially if we're settling around option 1️⃣ and are interested in discussing implementation further (which projects get cleaned, who the point people are, etc.), that could be a good platform practice meeting topic.

Just discussed this with the FE iOS Practice. If we have the Mobile Experience team in place by October 3, ideally this would be something an IC from that team could take on.

Specifically with iOS, I wanted to bring up something different from our web software. Previous versions of Artsy's app are still being used, even though they might be outdated. Here is a crash from this morning generated by a version of our app that we shipped 9 months ago (a fix for the bug has since been shipped, but this user hasn't upgraded to that new version). That's all to say that wiping the slate clean would be particularly valuable for our iOS software; as @kierangillen pointed out, this generates a lot of noise that makes Sentry less helpful.

Nice, let’s do it 1️⃣👍

I'm 👍 1️⃣.

I'd also suggest each team use this opportunity to brainstorm how we handle errors with Sentry, according to the project's needs. For example: ways to speed up searching for errors, how to proactively alert on errors, whether we want to use tagging or other native Sentry features, etc.

I'll be very interested in comparing notes and learning from each project in a few months.
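
To make the tagging idea above concrete, here is a minimal sketch using the standard @sentry/node scope and tag APIs. The tag names, the release variable, and the captured message are hypothetical and only for illustration.

```ts
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV, // lets us separate staging vs. production in the Sentry UI
  release: process.env.RELEASE_SHA,  // hypothetical: however we identify a deploy
});

// Tags are indexed and searchable in Sentry, which speeds up triage queries
// like "all unresolved errors tagged team:platform".
Sentry.setTag("team", "platform");

// Scoped tags apply only to events captured inside the callback.
Sentry.withScope((scope) => {
  scope.setTag("feature", "artwork-ingestion"); // hypothetical feature name
  Sentry.captureMessage("example event for illustration only");
});
```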

Also 👍 on 1️⃣. To be clear, you are only proposing that we "resolve" a record in Sentry once we either suppress it at the app level via some form of try/catch or fix the underlying bug, not just after getting alerted?

I will add instructions around this, but yes, there is no point in resolving it in Sentry if it's not actually fixed/handled properly in the code and will just come back again 🙃

Resolution

We decided to go with option 1 as offered by this RFC.

Level of Support

2: Positive feedback.

Additional Context:

People who reviewed and read this RFC were all for doing this and in favor of option 1. There was some feedback on how to exercise this option and how to share our learnings between teams as each team focuses on its own project in Sentry.
Some of our projects, like Eigen, have more complicated and different release cycles; we need to accommodate that and possibly adjust this cleanup considering those projects' limitations (e.g. old versions out in the field) and release cycles.

Next Steps

  • On Oct 1st, we will clean up Sentry errors for the following projects in production.
    • MP
    • Eigen
    • Force
    • Gravity
  • From Oct 1st, each practice will need to actively monitor and manage incoming Sentry errors:
    • (Optional) Each practice can create its own Slack channel for the Sentry errors coming in for its projects. Whoever is looking at a specific Sentry error can tag it in Slack with a 👋 emoji to avoid duplicate attempts at fixing the same thing.
    • A Sentry issue can be resolved if:
      • A PR is opened to fix the issue.
      • A PR is opened to properly catch the error and log it. When opening the PR, make sure:
    • A Sentry issue can be ignored in Sentry if:
      • The error can safely be ignored (e.g. it comes from an older iOS version; see the config sketch after this list).
      • The error was caused by a one-time spike or temporary service unavailability.
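
As a complement to manually ignoring issues in the Sentry UI, here is a minimal sketch (TypeScript, using the documented `ignoreErrors` and `beforeSend` options of the Sentry JavaScript SDKs) of filtering known noise at the SDK level so it never counts against our event quota. The retired release names and the ignored error message are hypothetical placeholders, not an agreed-upon policy.

```ts
import * as Sentry from "@sentry/node";

// Hypothetical list of old releases whose known crashes we no longer want to see.
const RETIRED_RELEASES = new Set(["ios-5.1.0", "ios-5.2.0"]);

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // Drop events whose message matches known, transient noise before they are sent
  // (example pattern only; e.g. connection resets during a brief service blip).
  ignoreErrors: ["ECONNRESET"],
  beforeSend(event) {
    // Drop events reported by releases we have already fixed and retired.
    if (event.release && RETIRED_RELEASES.has(event.release)) {
      return null; // returning null discards the event entirely
    }
    return event;
  },
});
```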

Exceptions

A Sentry issue can be left as-is if the person looking at it could not realistically fix it in ~10 minutes and the error seems more involved. In this case, we should create a Jira ticket on the proper product board and notify the TL and MP about it.