jackc / pgx

PostgreSQL driver and toolkit for Go

High Empty Acquire/Acquire Duration

basilveerman opened this issue

Describe the bug
I see periodic instances where pgxpool records excessive Acquire Duration and Empty Acquire counts, which results in increased application latency. K8s pods that get into this state experience the increased latency until they are terminated/cycled or traffic on the affected pod drops below the level that needs 1-2 concurrent connections. The application is a thin API that translates REST requests into queries against a cluster of AWS Aurora read-only replicas.
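
For reference, the two counters above come from pgxpool.Stat(). Below is a minimal sketch of how we sample them; the reportPoolStats helper and the log format are illustrative, not our actual metrics exporter:

```go
package main

import (
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

// reportPoolStats is a hypothetical sampler; the metric names are illustrative.
func reportPoolStats(pool *pgxpool.Pool) {
	s := pool.Stat()
	log.Printf(
		"empty_acquire=%d acquire_count=%d acquire_duration=%s acquired=%d idle=%d total=%d",
		s.EmptyAcquireCount(), // acquires that had to wait because no idle conn was available
		s.AcquireCount(),
		s.AcquireDuration(), // cumulative time spent waiting in successful Acquires
		s.AcquiredConns(),
		s.IdleConns(),
		s.TotalConns(),
	)
}
```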

App:

  • pgx/pgxpool v5.4.3
  • 200-600 requests per second per pod, somewhat bursty during scaling events, with 1-10 concurrent requests
  • P50 of DB requests is 2ms, P95 is 7-10ms.
  • All DB requests use pgxpool.Query/QueryRow -> Scan
  • pgxpool config (see the sketch after this list):
    • MaxConnLifetime: "10m"
    • MaxConnLifetimeJitter: "1m"
    • MaxConnIdleTime: "5m"
    • MaxConns: 32
    • MinConns: 4
  • Conn lifetime is kept low, with high jitter, so reconnects push towards randomized host selection on the AWS Aurora backend (the Aurora RO DNS record cycles between replicas every 5s). We have work in flight to provide a custom LookupFunc that caches and periodically refreshes the DB replica addresses and randomizes instance selection on connection creation, as sketched below.
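
For concreteness, here is a minimal sketch of how the pool is configured with the values above, plus a rough version of the planned custom LookupFunc. The newPool helper, the dsn parameter, and the shuffle-based selection are assumptions for illustration, not our exact in-flight implementation:

```go
package main

import (
	"context"
	"math/rand"
	"net"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// newPool is a hypothetical constructor; the values mirror the config listed above.
func newPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	cfg.MaxConnLifetime = 10 * time.Minute
	cfg.MaxConnLifetimeJitter = time.Minute
	cfg.MaxConnIdleTime = 5 * time.Minute
	cfg.MaxConns = 32
	cfg.MinConns = 4

	// Rough sketch of the planned LookupFunc: resolve the Aurora RO endpoint and
	// shuffle the returned replica addresses so each new connection picks a random
	// instance instead of whatever the 5s-rotating DNS record happens to return.
	// The in-flight work also caches/refreshes these addresses, which is omitted
	// here. (On Go < 1.20 the global rand source should be seeded explicitly.)
	cfg.ConnConfig.LookupFunc = func(ctx context.Context, host string) ([]string, error) {
		addrs, err := net.DefaultResolver.LookupHost(ctx, host)
		if err != nil {
			return nil, err
		}
		rand.Shuffle(len(addrs), func(i, j int) { addrs[i], addrs[j] = addrs[j], addrs[i] })
		return addrs, nil
	}

	return pgxpool.NewWithConfig(ctx, cfg)
}
```

Shuffling at the LookupFunc level means every new physical connection, including those created by MinConns and lifetime-jitter churn, gets an independently randomized replica.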

To Reproduce
I have been unable to reproduce this in non-production environments. Load tests from 50-4000 qps per pod show great performance; however, load-test traffic is not an exact representation of production load.

Expected behavior
[screenshot]
Actual behavior
[screenshot]

Version

  • Go: go version go1.19.13 linux/amd64
  • PostgreSQL: PostgreSQL 14.10 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 9.5.0, 64-bit
  • pgx: github.com/jackc/pgx/v5 v5.4.3

I have traced the code paths in pgx/pool.go and puddle/pool.go but have not seen any explicit issues with locking or tracking connections, so I'm not sure whether this is an actual bug or an issue with our configuration. Any help would be appreciated.

Thanks!

I'd also be happy to work on a PR for this if needed, but so far I haven't been able to determine the root cause. Any pointers on where to look would be helpful.

I've investigated further and it appears this may be due to traffic patterns. All of our metrics have at best 1s monitoring intervals. It appears we may have microbursts of traffic (within tens of nanoseconds) that never push the Prometheus/CloudFront-reported resource usage over any thresholds within a monitoring period, but those requests currently queue up and wait at the API. If we increase the connection pool size, sending them directly to the DB overloads the active session count, which slows down all requests currently in flight.