Deadlock in consume_begin
AndreyNudko commented
I have a maintenance script that periodically restarts several consumers from the latest Kafka offset:
psql -c 'SELECT pipeline_kafka.consume_end();'
psql -c "SELECT pipeline_kafka.consume_begin(..., start_offset := -1);"
psql -c "SELECT pipeline_kafka.consume_begin(..., start_offset := -1);"
# more like that for different topics
Sometimes this results in the following:
2016-10-12 00:00:05.595 UTC [13961] [local] 40P01 ERROR: deadlock detected
2016-10-12 00:00:05.596 UTC [13961] [local] 40P01 DETAIL: Process 13961 waits for ExclusiveLock on relation 16388 of database 13527; blocked by process 13958.
Process 13958 waits for ExclusiveLock on relation 16410 of database 13527; blocked by process 13961.
Process 13961: SELECT pipeline_kafka.consume_begin(...);
Process 13958: <command string not enabled>
13961:
- holds ExclusiveLock on brokers
- tries to get ExclusiveLock on consumers (from consume_begin)
13958:
- holds ExclusiveLock on consumers
- tries to acquire ExclusiveLock on brokers
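For anyone trying to reproduce this: the relation OIDs from the log can be mapped to table names, and the lock requests on them listed, with a query against the pg_locks system view. This is a minimal sketch, assuming psql connects to the same database as in the log. The helper names locks_query and show_locks and the PSQL variable are made up for illustration. The deadlock is transient, so this only shows something while the locks are still held.

```shell
#!/bin/sh
# PSQL is an assumption: override it to point at another binary or wrapper.
PSQL="${PSQL:-psql}"

# Hypothetical helper: build a query that resolves relation OIDs to names
# and lists every lock request (granted or waiting) on those relations.
locks_query() {
    oids=$(printf '%s,' "$@")   # e.g. "16388,16410,"
    oids=${oids%,}              # drop the trailing comma
    printf 'SELECT pid, relation::regclass AS rel, mode, granted FROM pg_locks WHERE relation IN (%s) ORDER BY pid;\n' "$oids"
}

# Hypothetical helper: run that query through psql.
show_locks() {
    $PSQL -c "$(locks_query "$@")"
}
```

With the OIDs from the log above this would be called as `show_locks 16388 16410`. Rows with granted = f are the waiting side of each lock.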
After browsing through the code, I think the problem is the interaction between kafka_consume_main (the consumer started by the first command) and kafka_consume_begin (the second command).
kafka_consume_main calls load_consumer_state, which does:
// ExclusiveLock on consumers
ResultRelInfo *consumers = relinfo_open(get_rangevar(CONSUMER_RELATION), ExclusiveLock);
...
// ExclusiveLock on brokers, apparently not released by relinfo_close
consumer->brokers = get_all_brokers();
kafka_consume_begin takes the same two locks in the opposite order:
// ExclusiveLock on brokers, apparently not released by relinfo_close
if (!get_all_brokers())
...
// ExclusiveLock on consumers
consumers = relinfo_open(get_rangevar(CONSUMER_RELATION), ExclusiveLock);
The workaround is to put a sleep between the pipeline_kafka.consume_begin calls.
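The workaround could look roughly like this. It is a sketch only: run_staggered, PSQL and PAUSE_SECS are hypothetical names, and the 5-second default is a guess rather than a measured value. A pause presumably only makes the race less likely (it gives the previously started consumer time to finish loading its state), it does not remove it.

```shell
#!/bin/sh
# PSQL and PAUSE_SECS are assumptions for illustration; override as needed.
PSQL="${PSQL:-psql}"
PAUSE_SECS="${PAUSE_SECS:-5}"

# Hypothetical helper: run each SQL statement in its own psql call,
# sleeping between consecutive statements, and stop at the first failure.
run_staggered() {
    first=1
    for stmt in "$@"; do
        if [ "$first" -eq 0 ]; then
            sleep "$PAUSE_SECS"   # pause before every statement except the first
        fi
        first=0
        $PSQL -c "$stmt" || return 1
    done
}
```

The maintenance script would then pass its statements as arguments, e.g. `run_staggered 'SELECT pipeline_kafka.consume_end();' "SELECT pipeline_kafka.consume_begin(..., start_offset := -1);"` followed by the remaining consume_begin statements, with the elided arguments filled in as in the real script.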
Usman Masood commented
We should probably downgrade the lock for get_all_brokers() to be an AccessShareLock. I'll send out a fix today.