documentcloud / cloud-crowd

Parallel Processing for the Rest of Us

Home Page: https://github.com/documentcloud/cloud-crowd/wiki

Strange error has caused server to stop functioning

antunderwood opened this issue · comments

I recently had a problem (I think it has occurred before) where the server stops accepting connections. The server log shows:

WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
!! Unexpected error while processing request: PGError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
: SELECT * FROM "node_records" WHERE ("node_records"."host" = E'q1.bioinformatics:9063') LIMIT 1
!! Unexpected error while processing request: PGError: result has been cleared: SELECT * FROM "node_records" WHERE ("node_records"."host" = E'q1.bioinformatics:9063') LIMIT 1
!! Unexpected error while processing request: PGError: result has been cleared: SELECT * FROM "node_records" WHERE ("node_records"."host" = E'q1.bioinformatics:9063') LIMIT 1
------- many similar lines ------
!! Unexpected error while processing request: PGError: result has been cleared: SELECT * FROM "node_records" WHERE ("node_records"."host" = E'q1.bioinformatics:9063') LIMIT 1
!! Unexpected error while processing request: PGError: result has been cleared: BEGIN
!! Unexpected error while processing request: PGError: result has been cleared: SELECT * FROM "node_records" WHERE ("node_records"."host" = E'q1.bioinformatics:9063') LIMIT 1
!! Unexpected error while processing request: PGError: result has been cleared: SELECT * FROM "node_records" ORDER BY host desc

and so on

In the node.log:
Failed to connect to the central server (http://158.119.147.51:9173).
Failed to connect to the central server (http://158.119.147.51:9173).
and so on

This has left me with a job in the operations centre that will not go away, even with crowd cleanup --days 0.

Many thanks in advance if you can help with this.

Anthony

I'm not quite sure what happened here, but it sounds like the database got shut down ... perhaps by the OOM killer or some such.

If you'd like to go into CloudCrowd and clean up jobs and work units manually, you can always use crowd console.
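As a rough sketch of what that manual cleanup might look like: the helper below is hypothetical (it is not part of CloudCrowd) and assumes the console session exposes an ActiveRecord-backed job model, such as CloudCrowd::Job, that responds to find_by_id and whose records respond to destroy. The job id is a placeholder you would read off the operations centre.

```ruby
# Hypothetical helper, meant to be pasted into a `crowd console` session.
# `job_model` is assumed to be an ActiveRecord-style class (for example
# CloudCrowd::Job) that responds to find_by_id; `job_id` is the id of the
# stuck job.
def remove_stuck_job(job_model, job_id)
  job = job_model.find_by_id(job_id)
  # Nothing to do if the job has already been cleaned up.
  return :not_found if job.nil?

  job.destroy
  :destroyed
end

# Assumed usage inside the console (42 is a placeholder id):
#   remove_stuck_job(CloudCrowd::Job, 42)
```

Whether the job's work units are removed along with it depends on how the model's associations are declared, so it is worth checking for leftover work units afterwards.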