open-contracting / deploy

Deployment configuration and scripts

Home Page: https://ocdsdeploy.readthedocs.io/en/latest/

Investigate PostgreSQL connection errors

jpmckinney opened this issue · comments

On ocp13, there are log entries in /var/log/postgresql/postgresql-12-main.log like:

2024-01-20 22:05:47.271 UTC [2001370] [unknown]@[unknown] LOG:  could not accept SSL connection: Success
2024-01-20 22:05:47.273 UTC [2001383] kingfisher_process@kingfisher_process FATAL:  no pg_hba.conf entry for host "172.18.0.7", user "kingfisher_process", database "kingfisher_process", SSL off

These coincide with errors in Sentry for kingfisher-process like:

connection to server at "host.docker.internal" (172.17.0.1), port 5432 failed: SSL error: unsupported method
connection to server at "host.docker.internal" (172.17.0.1), port 5432 failed: FATAL:  no pg_hba.conf entry for host "172.18.0.7", user "kingfisher_process", database "kingfisher_process", SSL off

The connection string for the app is postgresql://kingfisher_process:PASSWORD@host.docker.internal:5432/kingfisher_process

Does /etc/postgresql/12/main/pg_hba.conf need to be changed? Its template is https://github.com/open-contracting/deploy/blob/main/salt/postgres/files/pg_hba.conf#L5-L18

I have just tested psql connections from a container on ocp13 and I can connect successfully.
The Docker IP range is not explicitly allowed in pg_hba.conf; however, postgres.public_access is enabled on this server, which allows these connections. I would also expect all connections to fail if pg_hba.conf were the cause. This sounds like an intermittent issue.

These connections are failing because of an SSL error (SSL error: unsupported method). When this happens, the client falls back to a connection with SSL disabled, which is blocked by pg_hba.conf.
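To illustrate the fallback: libpq's default sslmode is "prefer", which tries SSL first and retries without SSL if that fails. Setting sslmode=require in the connection string stops the retry, so only the SSL error would be reported. A minimal sketch, assuming the app uses the default sslmode (not confirmed here); with_sslmode is a hypothetical helper, not something in kingfisher-process:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_sslmode(dsn: str, mode: str = "require") -> str:
    """Return the connection URL with its sslmode parameter set to `mode`.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(dsn)
    # Keep existing query parameters, dropping any previous sslmode.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "sslmode"
    ]
    query.append(("sslmode", mode))
    return urlunsplit(parts._replace(query=urlencode(query)))


dsn = "postgresql://kingfisher_process:PASSWORD@host.docker.internal:5432/kingfisher_process"
print(with_sslmode(dsn))
```

This would not fix the underlying SSL error. It only removes the second, misleading pg_hba.conf message from the logs and from Sentry.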

One explanation could be resource exhaustion or server load. Do these errors line up with alerts in Prometheus?
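One way to check for a correlation is to count the relevant log entries per minute and compare the busy minutes against the Prometheus graphs. A rough sketch, assuming the log line format quoted above; errors_per_minute and the matched substrings are illustrative choices, not an existing script:

```python
import re
from collections import Counter

# Substrings taken from the two log entries quoted in this issue.
PATTERNS = ("could not accept SSL connection", "no pg_hba.conf entry")
# Captures the timestamp down to the minute, e.g. "2024-01-20 22:05".
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}")


def errors_per_minute(lines):
    """Count matching log entries, grouped by minute (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match and any(pattern in line for pattern in PATTERNS):
            counts[match.group(1)] += 1
    return counts


sample = [
    "2024-01-20 22:05:47.271 UTC [2001370] [unknown]@[unknown] LOG:  could not accept SSL connection: Success",
    '2024-01-20 22:05:47.273 UTC [2001383] kingfisher_process@kingfisher_process FATAL:  no pg_hba.conf entry for host "172.18.0.7", user "kingfisher_process", database "kingfisher_process", SSL off',
]
for minute, count in sorted(errors_per_minute(sample).items()):
    print(minute, count)
```

On the server, the same function could be fed the lines of /var/log/postgresql/postgresql-12-main.log instead of the sample list.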

I checked the Prometheus graphs for 2024-01-20, and it looks like normal load.

It does seem like an intermittent issue. We saw it a few times, but not since I reported this issue. I'll close it for now. We can re-open if it happens again (as we'll then have corresponding logs to investigate).

I can confirm that it's happening presently and all 16 cores are at 100%.

Edit: I noticed I can possibly eliminate 50% of the queries from Kingfisher Process, so I will deploy that in a bit. open-contracting/kingfisher-process@122f928

Edit2: By looking at the active queries, I saw another spot where the same query was being made repeatedly. This was a bit trickier to fix and might leave a message stuck in the queue. open-contracting/kingfisher-process@8125a3b

Okay, load isn't 100% on all CPUs now, so I'll re-close.