jackc / pgx

PostgreSQL driver and toolkit for Go

Runaway connection attempts with `pgxpool`

mbrancato opened this issue

Describe the bug
I have an application using a pgxpool.Pool (pgx/v4 v4.18.1) to manage connections to a PgBouncer instance, and ultimately to a PostgreSQL 15 server.

To Reproduce
I am currently trying to reproduce this in a simple manner. The connection setup looks like this:

	poolConfig, err := pgxpool.ParseConfig("postgres://postgres:postgres@localhost:5432/postgres")
	if err != nil {
		sugar.Fatalf("error parsing DB URL: %v", err)
	}
	poolConfig.MaxConns = 4
	poolConfig.MinConns = 4
	poolConfig.ConnConfig.Logger = zapadapter.NewLogger(logger)
	poolConfig.ConnConfig.LogLevel = pgx.LogLevelDebug

	dbPool, err := pgxpool.ConnectConfig(ctx, poolConfig)
	if err != nil {
		sugar.Fatalf("error connecting to DB: %v", err)
	}

We're using QueryRow() etc. directly on the pool, not Acquire(); the resulting Row is eventually passed to Scan().
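
For reference, here is a minimal sketch of how queries are issued (the function name, SQL, and timeout are illustrative, not the actual application code):

	package db

	import (
		"context"
		"time"

		"github.com/jackc/pgx/v4/pgxpool"
	)

	// GetUserName shows the pattern: QueryRow is called directly on the pool
	// (no explicit Acquire), and the returned Row is passed to Scan, which
	// releases the underlying connection back to the pool.
	func GetUserName(ctx context.Context, pool *pgxpool.Pool, id int64) (string, error) {
		// Per-query timeout derived from the caller's request context.
		ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
		defer cancel()

		var name string
		err := pool.QueryRow(ctx, "SELECT name FROM users WHERE id = $1", id).Scan(&name)
		return name, err
	}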

Expected behavior
The pool should fill and keep 4 connections.

Actual behavior
The pool fills up, then for some unknown reason it constantly attempts to create more connections. The data from pool.Stat() indicates there are already 4 total connections, yet the pool keeps trying to create more.

At the debug log level, pgxpool is not clearly indicating why it is creating multiple connections, but it is easy to see the attempts:
[screenshot: debug logs showing the repeated connection attempts]

Note: the connection failures here are all PgBouncer running out of client connections (max_client_conn), even though that limit is well above pool size * number of copies of the app.

 server error (FATAL: no more connections allowed (max_client_conn) (SQLSTATE 08P01))

Stat info for the above logs:
before connection runaway (2023-10-26 05:34:14.968):

{
    "max_lifetime_destroy_count": 0,
    "constructing_connections": 0,
    "acquired_connections": 0,
    "empty_acquire_count": 0,
    "new_connections_count": 4,
    "acquire_count": 1,
    "acquire_duration": 2.6e-7,
    "max_idle_destroy_count": 0,
    "total_connections": 4,
    "canceled_acquire_count": 0,
    "message": "Database pool stats",
    "idle_connections": 4,
    "timestamp": "2023-10-26T09:34:14.968Z",
    "max_connections": 4
}

shortly after connection runaway (2023-10-26 05:34:29.969):
Note: I thought this was probably when it received its first query attempt, but no query is logged at this point (queries do eventually show up with the logger attached).

{
    "idle_connections": 0,
    "canceled_acquire_count": 0,
    "constructing_connections": 0,
    "max_connections": 4,
    "message": "Database pool stats",
    "total_connections": 4,
    "max_lifetime_destroy_count": 0,
    "empty_acquire_count": 5,
    "timestamp": "2023-10-26T09:34:29.969Z",
    "max_idle_destroy_count": 0,
    "acquire_duration": 34.920843834,
    "acquire_count": 10,
    "acquired_connections": 4,
    "new_connections_count": 4
}

1 minute after connection runaway (2023-10-26 05:35:24.975):

{
    "max_lifetime_destroy_count": 0,
    "idle_connections": 0,
    "new_connections_count": 15,
    "max_idle_destroy_count": 0,
    "empty_acquire_count": 299,
    "acquired_connections": 3,
    "acquire_count": 304,
    "message": "Database pool stats",
    "constructing_connections": 0,
    "canceled_acquire_count": 4,
    "timestamp": "2023-10-26T09:35:24.975Z",
    "acquire_duration": 2790.153294167,
    "max_connections": 4,
    "total_connections": 3
}
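
(The stats above are emitted by a small periodic logger; roughly like the sketch below, where the interval is illustrative and dbPool and sugar are the variables from the setup snippet. The JSON field names map onto the pgxpool.Stat getters.)

	// Hypothetical stats logger (not the exact code used): periodically dump
	// pool.Stat() so connection churn is visible in the logs.
	go func() {
		ticker := time.NewTicker(15 * time.Second)
		defer ticker.Stop()
		for range ticker.C {
			s := dbPool.Stat()
			sugar.Infow("Database pool stats",
				"total_connections", s.TotalConns(),
				"idle_connections", s.IdleConns(),
				"acquired_connections", s.AcquiredConns(),
				"constructing_connections", s.ConstructingConns(),
				"new_connections_count", s.NewConnsCount(),
				"acquire_count", s.AcquireCount(),
				"empty_acquire_count", s.EmptyAcquireCount(),
				"canceled_acquire_count", s.CanceledAcquireCount(),
				"acquire_duration", s.AcquireDuration().Seconds(),
				"max_idle_destroy_count", s.MaxIdleDestroyCount(),
				"max_lifetime_destroy_count", s.MaxLifetimeDestroyCount(),
				"max_connections", s.MaxConns(),
			)
		}
	}()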

Version

  • Go: 1.20
  • PostgreSQL: 15 (CloudSQL), with PgBouncer 1.20.1
  • pgx: v4.18.1

Additional context
I started setting MaxConns and MinConns to mitigate this, with no luck.

I haven't seen anything like that before.

A few things to try:

  • Connect directly to the PG server to eliminate PgBouncer as a factor
  • When you have the simpler reproduction case, also try it with v5.

Actually, it doesn't appear that the pool is overfilling; total connections stay at or below MaxConns. That suggests existing connections are dying and being replaced. Are you canceling contexts? That terminates the underlying connection. Based on your stats, only 15 total connections were made. It's quite possible that canceled queries are triggering the new connections.
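
Here's a minimal sketch of the effect I'm describing (connection string, timeouts, and the pg_sleep query are made up): a query whose context times out takes its connection down with it, and the pool then dials a replacement, so NewConnsCount keeps growing even though TotalConns never exceeds MaxConns.

	package main

	import (
		"context"
		"fmt"
		"log"
		"time"

		"github.com/jackc/pgx/v4/pgxpool"
	)

	func main() {
		cfg, err := pgxpool.ParseConfig("postgres://postgres:postgres@localhost:5432/postgres")
		if err != nil {
			log.Fatal(err)
		}
		cfg.MinConns = 4
		cfg.MaxConns = 4
		cfg.HealthCheckPeriod = time.Second // backfill to MinConns quickly for the demo

		pool, err := pgxpool.ConnectConfig(context.Background(), cfg)
		if err != nil {
			log.Fatal(err)
		}
		defer pool.Close()

		// A slow query with a short deadline: when the context times out, pgx
		// cancels the query and closes that connection instead of returning it.
		ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
		defer cancel()
		var one int
		err = pool.QueryRow(ctx, "SELECT 1 FROM pg_sleep(5)").Scan(&one)
		fmt.Println("query error:", err) // timeout: context deadline exceeded

		// The health check then creates a replacement connection to satisfy
		// MinConns, so the new-connection counter keeps climbing.
		time.Sleep(3 * time.Second)
		s := pool.Stat()
		fmt.Printf("new=%d total=%d max=%d\n", s.NewConnsCount(), s.TotalConns(), s.MaxConns())
	}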

Thanks for the suggestion @jackc
There was a timeout on the context passed to pgxpool.ConnectConfig(ctx, poolConfig). I switched it to a plain context.Background(), but that didn't seem to fix the issue. If a query's context times out (the context passed to QueryRow), then immediately afterward the pool appears to attempt to create new connections.

The surprising thing is that the connection is orphaned or something when a query times out.

{"severity":"ERROR","timestamp":"2023-10-27T13:30:09.268Z","message":"Query","pid":3727644839,"err":"timeout: context deadline exceeded","sql":"....","time":10.0033073,"args":["..."]}
{"severity":"ERROR","timestamp":"2023-10-27T13:30:09.270Z","message":"Query","err":"timeout: context deadline exceeded","sql":"...","time":10.000524572,"args":["..."],"pid":3765987911}
{"severity":"DEBUG","timestamp":"2023-10-27T13:30:09.271Z","message":"Connecting to database"}
{"severity":"INFO","timestamp":"2023-10-27T13:30:09.271Z","message":"Dialing PostgreSQL server","host":"..."}
{"severity":"DEBUG","timestamp":"2023-10-27T13:30:09.272Z","message":"Connecting to database"}
{"severity":"INFO","timestamp":"2023-10-27T13:30:09.273Z","message":"Dialing PostgreSQL server","host":"..."}
{"severity":"DEBUG","timestamp":"2023-10-27T13:30:09.275Z","message":"Connected to database"}

Looking at this, would a cancelled or timed-out context cause the connection to be disconnected? And does it notify the server to disconnect? I ask the latter because if there is a graceful disconnect, I wouldn't expect to get max connection errors.

#506 (comment)

I did notice I'm not seeing the connection Close() being called in the logs emitted, just constant new connection creation.

pgx/conn.go, line 257 (13468eb):

c.log(ctx, LogLevelInfo, "closed connection", nil)

We've never had an issue with v4 to date, but I might have to switch this over to v5 to see if that helps.

Looking at this, would a cancelled or timed-out context cause the connection to be disconnected?

Yes.

And does it notify the server to disconnect? I ask the latter because if there is a graceful disconnect, I wouldn't expect to get max connection errors.

It tries to cancel the running query and it sends the terminate message to the server. But it also closes the local end of the connection immediately.
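
If the goal is only to bound slow queries, one possible alternative (a sketch, not something I've verified in your setup, and it depends on PgBouncer passing the parameter through) is a server-side statement_timeout instead of a client-side deadline. The query then fails with an ordinary error (SQLSTATE 57014) and the connection stays open and goes back to the pool:

	cfg, err := pgxpool.ParseConfig("postgres://postgres:postgres@localhost:5432/postgres")
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical mitigation: let the server cancel statements after 10s.
	// A timed-out query returns an error, but the connection is not closed,
	// so the pool does not have to dial a replacement.
	cfg.ConnConfig.RuntimeParams["statement_timeout"] = "10000" // milliseconds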

hey @jackc

I've been able to put together some simple examples that show pgxpool attempting to create too many connections. As I worked on this, I started using multiple pgbouncer containers and thought the lack of peering between the pgbouncer replicas might be the cause, but ultimately I was able to see it happening directly against the database server.

I've created two branches to demo this, one for v4 and one for v5. Both seem to trigger the same behavior.

v4: https://github.com/mbrancato/pool-party/tree/pgx/v4/pgxpool
v5: https://github.com/mbrancato/pool-party/tree/pgx/v5/pgxpool

I don't have Docker set up on my development machine (I'm on macOS), but I ran the example against my local database both over TCP and a local socket. I wasn't able to reproduce the error.

But it doesn't actually surprise me too much that this could occur when the application pool size is exactly the same as the server's max connections. pgxpool considers a slot available for a new connection as soon as the termination message has been sent and the local side of the network connection has been closed, but it may take the server-side process a little while to actually terminate.
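
In other words, when the pools are sized exactly to the server limit there is no slack for backends that are still exiting (or for the temporary connections that cancel requests open). A back-of-the-envelope sizing sketch, with purely illustrative numbers and reusing a cfg from pgxpool.ParseConfig:

	// Illustrative only: keep the sum of all application pools below the
	// server-side limit so lingering backends and cancel requests have headroom.
	const (
		serverMaxClientConns = 40 // PgBouncer max_client_conn or PostgreSQL max_connections
		appReplicas          = 8  // copies of the application sharing the server
		headroom             = 8  // slack for terminating backends and cancel requests
	)

	cfg.MaxConns = int32((serverMaxClientConns - headroom) / appReplicas)
	// (40 - 8) / 8 = 4 connections per replica in this example.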

I'm using Rancher Desktop as my "Docker" implementation on macOS as well, mainly because it provides a quick way to set up multiple components, including a TCP load balancer and many pgbouncer copies.

I understand the example is using a small max connection count. At scale, I'm seeing these connections run away to several times the max pool size. I can also reproduce this behavior with pgbouncer in the mix, and it may be that pgbouncer does not try to discard the connection to the client until the client closes the socket.

The setup looks like this:

  graph LR;
      pgxpool-->pgbouncer;
      pgbouncer-->pgsql;

Where the pool / max connection values are:

  • pgxpool: 4 (MaxConns)
  • pgbouncer: 40 (max pooled client connections)
  • pgbouncer: 4 (actual database connections)
  • pgsql: 40 (max actual client connections)

I do see some of these errors, though not many. This indicates the cancel request was sent to the wrong copy of pgbouncer.

pool-party-pgbouncer-2   | 2023-10-31 02:48:13.600 UTC [1] LOG C-0x5588de338460: (nodb)/(nouser)@192.168.224.13:57658 closing because: failed cancel request (age=0s)

And that is likely the case because the cancel request was sent to the wrong pgbouncer copy. I imagine this is because the cancel request does not reuse the existing socket connection to the DB, so it can land on a different replica behind the load balancer.

Hmm... I'm not sure. The cancel request has to temporarily open a new connection in order to be issued. I'm not sure whether those are counted against PgBouncer's or PostgreSQL's max connection limits.

I hope to release v5.5.0 today. It includes a tweak to CancelRequest for PgBouncer (5d0f904). Perhaps that will improve things.

We seem to be running into a similar issue. We recently had a production incident where a single process was able to saturate the database's connections (direct to the DB, no PgBouncer in between), which is 20x our MaxConns setting.

[edit: seems completely unrelated. mb.]