OperationCode / operationcode_backend

This is the backend repo for the Operation Code website

Home Page:https://operationcode.org

Bug: Various sentry bugs

apex-omontgomery opened this issue

Using this as a placeholder for all the new bugs until we can write out full issues:

1. ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pool...
2. Slack::Client::InviteFailed: {"ok"=>false, "error"=>"token_revoked"}
3. ActiveJob::DeserializationError: Error while trying to deserialize arguments: FATAL: database "opcode-postgres" does not exist
4. ActiveRecord::NoDatabaseError: FATAL: database "opcode-postgres" does not exist
5. ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pool...

It appears that when new users sign up, we get errors 2, 4, and 3 together at the same time.

I think 2 can be corrected by adding a legacy token to the environment variables.

The other ones will take more troubleshooting and I'd like to get the full stack traces.

Two core issues were identified, both related to sidekiq, the redis queue, and background jobs:

  1. slack api problems
  2. database problems

Slack API Root Cause

The slack api problems were caused by disabling a slack application on the workspace.

This happened during a great (if a little hasty) idea of mine to clean up all the random slack applications on the workspace. One of the applications appeared to be non-functional, but its key was being used in an OC repo. This repo is hosted somewhere and is used as a way to communicate with slack to invite users. It turned out that BE was sending this repo HTTP requests to invite users, and the repo was processing them.

We thought that removing the code that added items to this queue would resolve the problem. When the errors kept occurring, we realized that the jobs were being retried up to the default retry maximum. By keeping the ActiveJob in place for this event, but removing the call to the external api and logging the queued jobs instead, we will slowly drain these requests.
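A minimal sketch of that drain approach, written in plain Ruby so it runs without Rails. The class name `SlackInviteDrainJob`, its constructor, and the return value are hypothetical; in the real app the class would presumably keep its existing name and inherit from the app's job base class, so that jobs already sitting in the queue can still be picked up.

```ruby
require 'logger'

# Hypothetical stand-in for the existing invite job. The external call has
# been removed; the job only records what it was asked to do.
class SlackInviteDrainJob
  def initialize(logger: Logger.new($stdout))
    @logger = logger
  end

  # Accept whatever arguments the queued job was enqueued with, log them,
  # and return without contacting the external service. Because nothing
  # raises, the job completes and is not scheduled for another retry.
  def perform(*args)
    @logger.info("Skipping Slack invite for queued arguments: #{args.inspect}")
    :skipped
  end
end

SlackInviteDrainJob.new.perform('new-user@example.com')
```

The point of the design is that a job which finishes without raising leaves the retry cycle, so the backlog shrinks on its own while the log keeps a record of who still needs an invite.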

Currently @hollomancer is manually inviting users and seeing a better response. We think this is due to customized messages.

Database Issues Root Cause

Again, this was a sidekiq issue. The primary error was a failure to find the database opcode-postgres. Since the database was found by the main rails application but not by sidekiq, this pointed to a config issue.

We stopped adding new sidekiq jobs in the pathways that caused the issue, but the errors continued. As above, we realized that the failed jobs were being retried, and after some time we noticed that no new jobs were being created. Sidekiq will retry 25 times over 21 days, so eventually these errors would go away.
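The "25 times over 21 days" figure can be roughly reproduced. The sketch below assumes the backoff formula given in Sidekiq's error-handling documentation, `(count ** 4) + 15` seconds plus a random jitter term; the jitter is left out so the result is deterministic, and the method names are made up for this example.

```ruby
# Deterministic part of the delay (in seconds) before retry number `count`.
# Assumption: Sidekiq's documented backoff is (count ** 4) + 15 plus random
# jitter; the jitter is omitted here.
def base_retry_delay(count)
  (count**4) + 15
end

# Total time spent waiting across all retries, ignoring jitter.
def total_base_retry_seconds(max_retries = 25)
  (0...max_retries).sum { |count| base_retry_delay(count) }
end

seconds = total_base_retry_seconds
days = seconds / 86_400.0
puts "Total wait across 25 retries: #{seconds} s (about #{days.round(1)} days before jitter)"
```

Without jitter this comes to about 20.4 days, which is consistent with the roughly three-week window after which the failing jobs stop reappearing.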

Investigation showed that Infra uses the same environment variables for opcode-postgres in both containers. John Harris stated:

Sure @nellshamrell @long_common_name, the POSTGRES_HOST is being set as an env var in both the app and sidekiq containers. It's being set to the same value (`opcode-postgres`) in both though. That is a host that resolves via in-cluster DNS to the service defined in `database-service-prod.yaml` (which in turn points out to the prod rds instance). DNS resolves per namespace, so we did this so the deployment yaml could stay env agnostic, and we could just create a stage db yaml that just had a different externalname and apply to the stage env.

A couple of red herrings came up. Both BE and Infra mention a config file that doesn't exist:
https://github.com/OperationCode/operationcode_infra/blob/b33df9647f38785a2d6a0823d238cbab567b25a7/kubernetes/operationcode_backend/deployment.yml#L91

Ultimately, it was found that we were setting a DATABASE_URL variable for sidekiq only.

@robbkidd noted that the database config expects a different format, and that we were putting the host name in where a database url was expected.

With the line that sets this value removed, sidekiq uses the default rails database connection settings in database.yml.
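To illustrate the format mismatch, here is a small sketch using only Ruby's standard `uri` library. It is not a trace of Rails' own resolver, and the helper name, user, password, port, and database name are hypothetical. It shows that a bare host name carries no scheme or host when read as a URL, so the entire string ends up in the path, the position where a database name normally sits. That would be one plausible explanation for the host name surfacing in the "database does not exist" errors.

```ruby
require 'uri'

# Split a connection string into the pieces a URL-style database setting
# is expected to carry.
def describe_database_url(value)
  uri = URI.parse(value)
  { scheme: uri.scheme, host: uri.host, database: uri.path.sub(%r{\A/}, '') }
end

# A bare host name: no scheme, no host, and the whole string lands in the path.
p describe_database_url('opcode-postgres')

# A full URL (credentials, port, and database name are made up for illustration).
p describe_database_url('postgres://user:secret@opcode-postgres:5432/operationcode')
```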

This was verified by running:

make run # terminal 1
docker-compose run web rails console  # terminal 2
SendEmailToLeadersJob.perform_later(1) # terminal 2

Since Robb had no users in his local db, he received this error:

sidekiq_1             | 2018-09-05T02:03:28.851Z 1 TID-y8095 SendEmailToLeadersJob JID-4f55751d6f3130f259798328 INFO: Performing SendEmailToLeadersJob from Sidekiq(default) with arguments: 1
...
sidekiq_1             | 2018-09-05T02:03:28.952Z 1 TID-y8095 WARN: ActiveRecord::RecordNotFound: Couldn't find User with 'id'=1
sidekiq_1             | 2018-09-05T02:03:28.952Z 1 TID-y8095 WARN: /bundle/gems/activerecord-5.0.2/lib/active_record/core.rb:173:in `find'
sidekiq_1             | /app/app/jobs/send_email_to_leaders_job.rb:5:in `perform' 

This error is better than the previous one, because it shows that sidekiq reached the database:

sidekiq_1             | 2018-09-05T02:01:39.532Z 1 TID-ppzq5 WARN: ActiveRecord::NoDatabaseError: FATAL:  database "operationcode-psql" does not exist

As of today our sentry errors are way down:

image 1

Recommended actions:

  1. Configure sentry to alert on problems more effectively, but in a way that doesn't create alert fatigue.
  2. Better documentation on the background jobs.
  3. Better documentation of environment variables
  4. Standardize usage of environment variables
  5. Investigate sidekiq configs to make the traces more human readable.
  6. Investigate ways to automate custom messages for slack invites; @AshTemp has some ideas.
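As one possible starting point for items 3 and 4, the app could check its expected environment variables in one place at boot. The sketch below is only an illustration: `POSTGRES_HOST` comes from the discussion above, while the other variable names, the constant, and the helper are hypothetical.

```ruby
# Names the app is assumed to need at boot. POSTGRES_HOST appears in the
# discussion above; the other names are hypothetical placeholders.
REQUIRED_ENV_VARS = %w[POSTGRES_HOST POSTGRES_USER POSTGRES_PASSWORD].freeze

# Return the required names that are unset or blank in the given environment.
# `env` can be ENV itself or any Hash-like object keyed by variable name.
def missing_env_vars(env, required = REQUIRED_ENV_VARS)
  required.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_env_vars(ENV)
warn "Missing environment variables: #{missing.join(', ')}" unless missing.empty?
```

Keeping the list in one constant also doubles as lightweight documentation of what each container is expected to receive.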

Personal comments:

I think this issue took so long due to my lack of domain and language knowledge. I really thank people like @nellshamrell and @robbkidd for filling in the gaps. @ohaiwalt was also a boss in regards to understanding Infra.

In addition, I was personally hampered by the lack of logs and by having to get sentry access. I think we should find ways for open source contributors to view these items without compromising OC security. Perhaps this is another use case for a staging environment.