open-contracting / deploy

Deployment configuration and scripts

Home Page: https://ocdsdeploy.readthedocs.io/en/latest/

Upgrade RabbitMQ on ocp13 and ocp23 like we do Elasticsearch

jpmckinney opened this issue

i.e. apt-mark hold rabbitmq-server and then only upgrade for security issues.

Why

When RabbitMQ is upgraded, it doesn't simply reload – it shuts down and restarts, which involves closing all connections.

We’re using Pika (the most popular Python library for RabbitMQ), which has a BlockingConnection class that provides a synchronous (a.k.a. blocking) layer on top of Pika’s asynchronous core. We use it because all our code is otherwise synchronous (though concurrent via threads), and it would be a big lift to adopt Python’s native async features. However, when an async event occurs, like the connection closing, the exception arises as if from nowhere (i.e. there is no backtrace in Sentry, the error cannot be caught by existing try/except blocks, etc.). The consequence is that dozens of distinct issues are created on Sentry for the kingfisher-process, pelican-backend and data-registry Docker apps – across two servers (ocp13 and ocp23).
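
For context, here is a minimal sketch of the blocking consumer pattern described above (not our actual code; the host, queue name and callback are placeholders):

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="example_queue", durable=True)

    def callback(ch, method, properties, body):
        ...  # process the message
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="example_queue", on_message_callback=callback)

    # start_consuming() blocks for the lifetime of the worker. When RabbitMQ
    # restarts, pika.exceptions.ConnectionClosedByBroker is raised out of this
    # call (or out of whichever channel method happens to be executing), not
    # from the message-handling code, so try/except blocks around the message
    # handling never see it.
    channel.start_consuming()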

Unless we go fully async (and I’m only guessing the situation would be simpler if we did), there’s nothing we can do at the application level to catch or recover from these ConnectionClosedByBroker errors.

To-do:

  • Set up monitoring to alert when RabbitMQ updates are released
  • Write patching process
  • Put packages on hold

@jpmckinney, would it help if we stopped the Docker application while RabbitMQ is patched, or would this create other alerts?

Stopping Docker should be fine, as the application is designed to shut down gracefully and survive restarts.

I spent a bit of time yesterday looking into an asynchronous RabbitMQ client, and it might not be so much work – I'll do some testing today, and see if it's promising. If so, we can skip this issue.

I successfully updated our RabbitMQ client to work asynchronously and survive RabbitMQ restarts, so this issue is resolved.

I just need to update the client on data-registry and kingfisher-collect (I already did kingfisher-process, pelican-backend and pelican-frontend). I'm still using a blocking (synchronous) client for short-lived message publication, as otherwise I would need to run the client's IO loop in a new thread, which is extra complexity for an unlikely scenario (i.e. RabbitMQ restarting during one of those short-lived connections).
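
For reference, the short-lived blocking publication is roughly the following sketch (the exchange, queue name and message body are illustrative, not our actual code):

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.basic_publish(
        exchange="",
        routing_key="example_queue",
        body=b'{"id": 1}',
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()

The connection exists only for the duration of the publish, so the window in which a RabbitMQ restart could interrupt it is small.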

The rest is just some notes to remind myself and others at a future date why the connection delay is 15s.


Looking at a recent RabbitMQ restart, I see (abbreviated):

08:58:25 RabbitMQ is asked to stop...
08:58:26 Stopping application 'rabbit'
08:58:26 stopped TCP listener on [::]:5672
08:58:26 [error] Error on AMQP connection <0.35485.0> (172.24.0.2:38242 -> 172.17.0.1:5672, vhost: '/', user: 'pelican_backend', state: running), channel 0:
08:58:26 [error] operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
... more of the same within 1 millisecond, then a 5 second gap in log messages ...
08:58:31 [notice] Application rabbit exited with reason: stopped
08:58:31 [info] Successfully stopped RabbitMQ and its dependencies
08:58:31 [info] Halting Erlang VM with the following applications:
08:58:35 [notice] Logging: configured log handlers are now ACTIVE
08:58:35 [info] Starting RabbitMQ 3.12.1 on Erlang 26.0.2 [jit]
08:58:36 [info] Ready to start client connection listeners
08:58:36 [info] started TCP listener on [::]:5672
08:58:36 [info] Server startup complete; 3 plugins started.

So, it's 10 seconds from the exception being received by the client, to new connections being possible. The client defaults to reconnecting every 15 seconds.

We can make it attempt to reconnect every 1 second, and therefore reconnect 4-5s faster.

However, we presently log connection failures at the ERROR level, because we want to know if, for example, RabbitMQ is down for an extended period (we have other ways of monitoring RabbitMQ, but not necessarily in all cases).

So, to avoid spurious error messages being logged to Sentry during a routine restart, we use the longer, 15-second connection delay.
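
As a rough illustration of the trade-off (a hedged sketch, not the client's actual reconnection code; the delay constant, logger and exception handling are assumptions):

    import logging
    import time

    import pika
    import pika.exceptions

    logger = logging.getLogger(__name__)
    RECONNECT_DELAY = 15  # seconds

    def connect_with_retry(params: pika.ConnectionParameters) -> pika.BlockingConnection:
        while True:
            try:
                return pika.BlockingConnection(params)
            except pika.exceptions.AMQPConnectionError:
                # Logged at ERROR so that an extended outage reaches Sentry.
                logger.error("Connection failed, retrying in %ss", RECONNECT_DELAY)
                time.sleep(RECONNECT_DELAY)

With a 1-second delay, a loop like this would log roughly ten ERROR-level failures during the ~10-second restart window above; with a 15-second delay, at most one.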