Investigate PostgreSQL connection errors

Question

Investigate PostgreSQL connection errors

jpmckinney opened this issue 7 months ago · comments

On ocp13, there are log entries in /var/log/postgresql/postgresql-12-main.log like:

2024-01-20 22:05:47.271 UTC [2001370] [unknown]@[unknown] LOG:  could not accept SSL connection: Success
2024-01-20 22:05:47.273 UTC [2001383] kingfisher_process@kingfisher_process FATAL:  no pg_hba.conf entry for host "172.18.0.7", user "kingfisher_process", database "kingfisher_process", SSL off

These coincide with errors in Sentry for kingfisher-process like:

connection to server at "host.docker.internal" (172.17.0.1), port 5432 failed: SSL error: unsupported method
connection to server at "host.docker.internal" (172.17.0.1), port 5432 failed: FATAL:  no pg_hba.conf entry for host "172.18.0.7", user "kingfisher_process", database "kingfisher_process", SSL off

The connection string for the app is postgresql://kingfisher_process:PASSWORD@host.docker.internal:5432/kingfisher_process

Does /etc/postgresql/12/main/pg_hba.conf need to be changed? Its template is https://github.com/open-contracting/deploy/blob/main/salt/postgres/files/pg_hba.conf#L5-L18

Bob H. · Answer 1 · Tue May 07 2024 17:29:23 GMT+0800 (China Standard Time)

I have just tested psql connections from a container on ocp13 and I can connect successfully.
The Docker IP range is not explicitly allowed in pg_hba.conf however postgres.public_access is enabled on this server which allows these connections. I would also expect to see all connections fail if pg_hba.conf was the cause - this sounds like an intermittent issue.

These connections are failing is because of an SSL error (SSL error: unsupported method), when this happens it then falls back to "SSL disabled" which is blocked by pg_hba.conf.

One explanation could be resource exhaustion / server load, do these errors line up with alerts in Prometheus?

James McKinney · Answer 2 · Tue May 07 2024 22:43:51 GMT+0800 (China Standard Time)

I checked the Prometheus graphs for 2024-01-20, and it looks like normal load.

It does seem like an intermittent issue. We had seen it a few times, but not again since I reported this issue. I'll close it for now. We can re-open if it happens again (as we'll have corresponding logs to investigate with).

James McKinney · Answer 3 · Wed Jul 03 2024 02:16:50 GMT+0800 (China Standard Time)

@RobHooper This occurred yesterday on ocp13. You can see the "OperationalError" issues in Sentry for the registry-kingfisher-process project: https://open-contracting-partnership.sentry.io/issues/?notification_uuid=45e0f203-56e9-4e7e-a9b4-6211d0174981&project=5857059&query=is%3Aunresolved&referrer=issue-list&statsPeriod=14d

James McKinney · Answer 4 · Wed Jul 03 2024 05:39:05 GMT+0800 (China Standard Time)

I can confirm that it's happening presently and all 16 cores are at 100%.

Edit: I notice I can possibly save 50% of the queries from Kingfisher Process, so will deploy that in a bit. open-contracting/kingfisher-process@122f928

Edit2: By looking at the active queries, I saw another spot where the same query was being made repetitively. This was a bit trickier to fix and might leave a message stuck in the queue. open-contracting/kingfisher-process@8125a3b

James McKinney · Answer 5 · Wed Jul 03 2024 12:04:07 GMT+0800 (China Standard Time)

Okay, load isn't 100% on all CPUs now, so I'll re-close.