Bug: Various sentry bugs
apex-omontgomery opened this issue
Using this as a placeholder for all the new bugs until we can write out full issues:
1. ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pool...
2. Slack::Client::InviteFailed: {"ok"=>false, "error"=>"token_revoked"}
3. ActiveJob::DeserializationError: Error while trying to deserialize arguments: FATAL: database "opcode-postgres" does not exist
4. ActiveRecord::NoDatabaseError: FATAL: database "opcode-postgres" does not exist
5. ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pool...
It appears that when new users sign up, we get errors (2, 4, 3) as a group at the same time.
I think 2 can be corrected by adding a legacy token to environment vars.
The other ones will take more troubleshooting and I'd like to get the full stack traces.
Two core issues were identified, both related to Sidekiq / Redis queue / background jobs:
- slack api problems
- database problems
Slack API Root Cause
The Slack API problems were caused by disabling a Slack application on the workspace.
This happened during a great (but a little hasty) idea of mine to clean up all the random Slack applications on the workspace. One of the applications appeared to be non-functional, but its key was being used in an OC repo. This repo is hosted somewhere and is used as a way to communicate with Slack to invite users. It turned out that BE was sending this repo an HTTP request to invite users, and this repo was processing them.
By removing the code that added items to this queue, we thought this would be resolved. When the issues still occurred, we realized that the already-queued jobs were being retried up to the default retry maximum. By keeping the ActiveJob in place for this event, but removing the call to the external API and logging the queue, we will slowly drain these requests.
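The "keep the job, drop the external call" approach could look roughly like the sketch below. The class name, argument, and log message are hypothetical, not taken from the backend repo. In the real app this would be an ActiveJob class; plain Ruby is used here so the sketch runs on its own.

```ruby
require "logger"

# Hypothetical stand-in for the Slack invite job. In the app this would
# inherit from ApplicationJob; the name and argument are assumptions.
class SlackInviteJob
  def initialize(logger: Logger.new($stdout))
    @logger = logger
  end

  # Keeps the same signature as before, so jobs that are already queued
  # (and still being retried) can run and finish instead of failing again.
  def perform(email)
    # The external invite request used to happen here. It is now only logged.
    @logger.info("Skipping Slack invite for #{email}: external invite call removed")
    :skipped
  end
end

SlackInviteJob.new.perform("new.user@example.com")
```

Because the job class still exists and accepts the same arguments, retried jobs complete successfully and leave the queue rather than raising until their retries run out.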
Currently @hollomancer is manually inviting users and seeing a better response. We think this is due to customized messages.
Database Issues Root Cause
Again this was a Sidekiq issue. The primary error was a failure to find the database opcode-postgres. Since the database was found by the main Rails application but not by Sidekiq, this pointed to a config issue.
We stopped adding new Sidekiq jobs in the pathways that caused the issue, but the errors continued. As above, we realized that the failed jobs were being retried, and after some time we confirmed that no new jobs were being created. Sidekiq will retry 25 times over about 21 days, so eventually these errors would go away on their own.
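As a rough check on the "25 times over 21 days" figure, the sketch below adds up the deterministic part of Sidekiq's retry backoff. The formula `(count ** 4) + 15` plus a random jitter term is what I understand the Sidekiq error-handling docs to describe. Treat it as an assumption for the version we run. The jitter is left out so the result is reproducible.

```ruby
# Deterministic part of the delay before retry number `count` (0-based), in seconds.
# Assumption: backoff formula as described in the Sidekiq docs, with the random
# jitter term omitted.
def base_retry_delay(count)
  (count**4) + 15
end

MAX_RETRIES = 25 # Sidekiq's default retry count

total_seconds = (0...MAX_RETRIES).sum { |count| base_retry_delay(count) }
total_days = (total_seconds / 86_400.0).round(1)

puts "#{MAX_RETRIES} retries span at least #{total_seconds} s (~#{total_days} days)"
# prints "25 retries span at least 1763395 s (~20.4 days)"
```

The jitter adds a bit more on top, which is consistent with the roughly 21 days quoted above.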
Investigation showed that in Infra we use the same environment variables for opcode-postgres, and john harris stated:
Sure @nellshamrell @long_common_name, the `POSTGRES_HOST` is being set as an env var in both the app and sidekiq containers. It's being set to the same value (`opcode-postgres`) in both though. That is a host that resolves via in-cluster dns to the service defined in `database-service-prod.yaml` (which in turn points out to the prod rds instance). Dns resolves per namespace, so we did this so the deployment yaml could stay env agnostic, and we could just create a stage db yaml that just had a different externalname and apply to the stage env.
A couple of red herrings came up:
both BE and Infra mention a config file that doesn't exist:
https://github.com/OperationCode/operationcode_infra/blob/b33df9647f38785a2d6a0823d238cbab567b25a7/kubernetes/operationcode_backend/deployment.yml#L91
But ultimately it was found that we were setting the variable DATABASE_URL for Sidekiq only.
@robbkidd noted that the database config expects a different format: we were putting the host name in where a full database URL was expected.
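To illustrate the format mismatch, here is a small sketch using Ruby's standard `URI` library (not Rails' own resolver) to show how a bare host name and a full database URL break down. The user, password, port, and database name in the full URL are made-up placeholders.

```ruby
require "uri"

# Break a DATABASE_URL-style string into the parts a URL parser sees.
def describe(url)
  uri = URI.parse(url)
  { scheme: uri.scheme, host: uri.host, path: uri.path }
end

# What we were setting: just the host name.
bare_host = describe("opcode-postgres")

# What a database URL normally looks like (all values here are placeholders).
full_url = describe("postgres://app_user:secret@opcode-postgres:5432/app_db")

p bare_host
p full_url
```

With the bare host there is no scheme and no host at all. The whole string ends up as the path. My understanding (an assumption, not verified against the Rails source for our version) is that Rails takes the database name from the URL path, which would be consistent with the `database "opcode-postgres" does not exist` errors.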
By removing the line that sets this value, Sidekiq would use the default Rails database connection settings in database.yml.
This was seen by:
make run # terminal 1
docker-compose run web rails console # terminal 2
SendEmailToLeadersJob.perform_later(1) # terminal 2
Since Robb had no users in his local db, he received this error:
sidekiq_1 | 2018-09-05T02:03:28.851Z 1 TID-y8095 SendEmailToLeadersJob JID-4f55751d6f3130f259798328 INFO: Performing SendEmailToLeadersJob from Sidekiq(default) with arguments: 1
...
sidekiq_1 | 2018-09-05T02:03:28.952Z 1 TID-y8095 WARN: ActiveRecord::RecordNotFound: Couldn't find User with 'id'=1
sidekiq_1 | 2018-09-05T02:03:28.952Z 1 TID-y8095 WARN: /bundle/gems/activerecord-5.0.2/lib/active_record/core.rb:173:in `find'
sidekiq_1 | /app/app/jobs/send_email_to_leaders_job.rb:5:in `perform'
This error is better than:
sidekiq_1 | 2018-09-05T02:01:39.532Z 1 TID-ppzq5 WARN: ActiveRecord::NoDatabaseError: FATAL: database "operationcode-psql" does not exist
As of today our Sentry errors are wayyyy down.
Recommended actions:
- Configure sentry to better alert to problems, but in a way that doesn't create alert fatigue
- Better documentation on the background jobs.
- Better documentation of environment variables
- Standardize usage of environment variables
- Investigate sidekiq configs to make the traces more human readable.
- Investigate ways to automate custom messages for slack invites @AshTemp has some ideas.
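One possible starting point for the environment variable items above is a small boot-time check, sketched below. The list of required keys and the error messages are hypothetical. The idea is only to fail loudly when a variable is missing or when DATABASE_URL holds something that is not a full URL.

```ruby
require "uri"

# Hypothetical list of variables every container is expected to have.
REQUIRED_KEYS = %w[POSTGRES_HOST].freeze

# Returns a list of human-readable problems found in the given env hash.
def env_problems(env)
  problems = REQUIRED_KEYS
             .select { |key| env[key].nil? || env[key].empty? }
             .map { |key| "#{key} is not set" }

  url = env["DATABASE_URL"]
  if url
    begin
      uri = URI.parse(url)
      unless uri.scheme && uri.host
        problems << "DATABASE_URL is not a full URL (missing scheme or host)"
      end
    rescue URI::InvalidURIError
      problems << "DATABASE_URL could not be parsed"
    end
  end

  problems
end

# Example: the Sidekiq container as it was configured during this incident.
puts env_problems("POSTGRES_HOST" => "opcode-postgres", "DATABASE_URL" => "opcode-postgres")
```

Running something like this in both the app and Sidekiq containers at startup would flag the host-as-URL mistake before any job is processed.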
Personal comments:
I think this issue took so long due to my lack of domain and language knowledge. I really thank people like @nellshamrell and @robbkidd for filling in the gaps. @ohaiwalt was also a boss in regards to understanding Infra.
In addition, I was personally hampered by the lack of logs and by the difficulty of getting Sentry access. I think we should find ways to let open source contributors view these items without compromising OC security. Perhaps this is another use case for a staging environment.