django / daphne

Django Channels HTTP/WebSocket server

Django running behind multiple Daphne instances opens too many database connections

samul-1 opened this issue · comments

I have a Django application that's deployed on Dokku and runs Daphne. The application is connected to a Postgres database and has a CONN_MAX_AGE setting of 60 seconds for that connection. I also have the ATOMIC_REQUESTS setting enabled.
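For concreteness, the setup described above corresponds roughly to a settings fragment like this (database name is hypothetical; host and values are taken from this thread):

```python
# settings.py fragment matching the configuration described above:
# persistent 60-second connections plus a transaction per request.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "sai_evo",                   # hypothetical database name
        "HOST": "dokku-postgres-sai-evo-db",
        "PORT": 5432,
        "CONN_MAX_AGE": 60,                  # keep each connection open for 60 s
        "ATOMIC_REQUESTS": True,             # wrap every request in a transaction
    }
}
```

Note that CONN_MAX_AGE is per worker thread, per process: every thread in every replica can hold its own idle connection for up to 60 seconds, so the worst-case open-connection count scales with replicas × threads, not with replicas alone.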

Daphne is run this way:

daphne core.asgi:application --port $PORT --bind 0.0.0.0 -v2

I used to run only one Daphne instance, but I recently switched to running 3 replicas for better performance.
Ever since I did that, I started seeing this error in Django:

OperationalError
connection to server at "dokku-postgres-sai-evo-db" (172.17.0.11), port 5432 failed: FATAL: sorry, too many clients already

This is a very serious error as it causes the whole application to go down until some connections are dropped.

It appears that the different instances of Daphne aren't able to re-use the idle connections left open by the other instances and are therefore opening new connections to the database.

My current max_connections setting in Postgres is set at 350, so it's a pretty high threshold and this issue shouldn't be happening.

For full context, this is the repository of my application: https://github.com/Evo-Learning-project/sai_evo_backend
On top of the 3 replicas of Daphne, I also have 2 replicas of Celery running.

Is there something I'm missing? How do I make the instances of Daphne aware of the usable, open connections with the db?

It looks like you'll need to use a connection pool such as pg_bouncer
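If PgBouncer (or a similar proxy) does end up in front of Postgres, the Django side would change roughly like this, assuming transaction-pooling mode; the host and service name are hypothetical:

```python
# Django settings pointed at a PgBouncer running in transaction-pooling
# mode. Django should not also hold connections open in that setup.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "pgbouncer",                  # hypothetical proxy service name
        "PORT": 6432,                         # PgBouncer's default listen port
        "CONN_MAX_AGE": 0,                    # let PgBouncer do the pooling
        "DISABLE_SERVER_SIDE_CURSORS": True,  # needed under transaction pooling
    }
}
```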

Due to running on Dokku, it's probably gonna be hard to install pg_bouncer or any db-side tools. Do you think this one could be a good alternative? https://github.com/lcd1232/django-postgrespool2

I've not seen that one, I'm afraid. But you can give it a try and report back.

I tried django-postgrespool2, and it appears that the daphne instances don't share the same pool.
Apparently, the number of connections keeps growing until it hits the limit on postgres. The app keeps working as it uses the connections inside the pool, but eventually it'll try to create a new connection and fail.

What's something else I can try before resorting to pg_bouncer, as it's pretty hard to install given my current architecture?

I'm also not sure that my intuition is right and maybe there's something completely different going on altogether; is there some way I can check who's creating the connection, and somehow trace the issue down to the single instances of daphne?
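One way to trace this: ask Postgres itself who is holding connections via the pg_stat_activity view. A small sketch, usable with any DB-API cursor (psycopg2, or Django's connection.cursor()):

```python
# Group the server's view of open connections by application/address/state,
# to see which client is accumulating them.
CENSUS_SQL = """
    SELECT application_name, client_addr, state, count(*)
    FROM pg_stat_activity
    WHERE datname = current_database()
    GROUP BY application_name, client_addr, state
    ORDER BY count(*) DESC
"""

def connection_census(cursor):
    """Return (application_name, client_addr, state, count) rows."""
    cursor.execute(CENSUS_SQL)
    return cursor.fetchall()
```

To make the rows attributable to a specific Daphne replica, you can give each one a distinct libpq application_name, e.g. `"OPTIONS": {"application_name": "daphne-1"}` in the DATABASES entry (the name is hypothetical).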

The v4 beta version (pip install --pre ...) allows you to set an ASGI_THREADS environment variable, so you could set that to max-connections/number of workers maybe 🤔

* Added support for ``ASGI_THREADS`` environment variable, setting the maximum
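The suggested sizing is simple division; a back-of-envelope sketch with the numbers from this thread (the reserved-connection headroom and Celery concurrency are assumptions, adjust to your actual settings):

```python
# Rough per-replica thread budget so that replicas * threads stays
# under the Postgres max_connections limit.
max_connections = 350     # Postgres max_connections (from this thread)
reserved = 10             # assumed headroom for superuser/admin connections
daphne_replicas = 3
celery_replicas = 2
celery_concurrency = 8    # assumed Celery -c value; each child holds a conn

celery_budget = celery_replicas * celery_concurrency
per_daphne = (max_connections - reserved - celery_budget) // daphne_replicas
print(per_daphne)         # a safe ASGI_THREADS ceiling per Daphne instance
```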

This may sound like a silly question, but after reading some docs I am having some doubts, so better ask: since Daphne can use multiple threads, is there really an advantage to having more processes of Daphne?

My application usually has at most 400 users connected at the same time; the server on which it runs has 6 cores. Is there a real advantage to having replicas at this traffic volume, or am I better off giving up having multiple processes, which would at least solve this issue?

SO... 😜

The ASGI server itself runs in a single thread, using asyncio to handle multiple tasks concurrently.

But every time you do anything CPU intensive that single thread is blocked. So we hand things off to threads in order to not block the main event loop whilst we're rendering a template, say.

Assuming you don't block it, a single instance can handle a lot of connections. (Implement an echo server and see.)
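Such an echo server is a few lines of raw ASGI; a minimal sketch you could run under Daphne (e.g. `daphne echo:application`, filename assumed):

```python
# echo.py: a minimal ASGI websocket echo app. There's no CPU work per
# frame, so a single event loop can multiplex a very large number of
# concurrent connections.
async def application(scope, receive, send):
    if scope["type"] != "websocket":
        raise NotImplementedError("websocket-only demo")
    while True:
        event = await receive()
        if event["type"] == "websocket.connect":
            await send({"type": "websocket.accept"})
        elif event["type"] == "websocket.receive":
            # Echo the frame straight back to the client.
            await send({"type": "websocket.send",
                        "text": event.get("text", "")})
        elif event["type"] == "websocket.disconnect":
            break
```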

But what are your views doing?

If you're using the Django ORM then that's run in a thread pool, because the underlying drivers are sync. But the bottleneck here is likely your database rather than the ASGI server in front of it.
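The hand-off mechanism is the standard executor pattern; a self-contained sketch of what happens when an async view hits the sync ORM (`fetch_user_count` is a hypothetical stand-in for a blocking query):

```python
# Sketch of the sync-to-thread hand-off: the blocking call runs in a
# worker thread while the event loop keeps servicing other connections.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)  # cf. the ASGI_THREADS knob

def fetch_user_count():
    time.sleep(0.05)   # stands in for a blocking database round-trip
    return 42

async def view():
    loop = asyncio.get_running_loop()
    # Awaiting here yields control to the loop until the thread is done.
    return await loop.run_in_executor(executor, fetch_user_count)
```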

Unfortunately there's no simple answer here: you have to profile and see.

It's worth acknowledging that the GIL means only one Python thread can ever be executing at a time. Yes, putting a template render in a thread can avoid blocking the event loop for too long, but it means the render takes longer, and there's the overhead of switching in and out of the thread.

At some scale you will always need to use more than one process. But it does heavily depend on what you're doing... websocket connections typically take little processing time each, since they can be mostly idle. Rendering web pages takes more processing time. You do have to profile and see.

@adamchainz just to give some context about my application, it's a DRF app so no rendering. I only run a REST API, with a couple WS entry points which are the reason I use daphne.

I'm doing some investigation right now, and setting CONN_MAX_AGE to 0 + django-postgrespool2 gives me a number of connections to the db equal to the number of daphne workers; those connections are not closed even with CONN_MAX_AGE=0.

The fact they aren't closed can probably be explained by my use of django-postgrespool2, but the fact that I have as many connections as I have workers is a bit strange, considering I tested this on a staging environment where I was the only user, just refreshing the page on my frontend.

If all daphne workers shared the same pool, they'd probably be able to always pick that same one connection opened initially, without having to resort to open a new one, wouldn't they?
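They can't: a pool is an ordinary in-process object, and each Daphne replica is a separate OS process with its own memory, so each replica necessarily builds its own pool. A small demonstration of that mechanic, with a fake pool standing in for django-postgrespool2 (`fork` start method assumed, i.e. Linux/macOS):

```python
# Each "replica" process creates its own pool object; nothing is shared.
import multiprocessing
import os

class FakePool:
    """Stands in for a django-postgrespool2 / SQLAlchemy pool object."""

def _replica(queue):
    pool = FakePool()              # re-created independently in each process
    queue.put((os.getpid(), id(pool)))

def count_pool_owners(n=3):
    """Start n replica processes; return how many distinct processes
    (hence distinct pools) there were."""
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    procs = [ctx.Process(target=_replica, args=(queue,)) for _ in range(n)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in range(n)]
    for p in procs:
        p.join()
    return len({pid for pid, _ in results})
```

Sharing connections across processes is exactly the job an external pooler like PgBouncer does: it sits in front of the database as a single process that all replicas connect to.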

a DRF app so no rendering

Serialization is the JSON-API equivalent of template rendering.

I often recommend folks run a WSGI app for the base load, and then have an ASGI "sidecar" for the async views that need it. (You don't have to do it that way, but the scaling patterns are much better known and generally simpler, so you're not left wondering what the right thing to do is if you're unsure about profiling your application's exact use.)