pipelinedb / pipeline_kafka

PipelineDB extension for Kafka support

Deadlock in consume_begin

AndreyNudko opened this issue

I have a maintenance script that periodically restarts several consumers from the latest Kafka offset:

psql -c 'SELECT pipeline_kafka.consume_end();'
psql -c "SELECT pipeline_kafka.consume_begin(..., start_offset := -1);"
psql -c "SELECT pipeline_kafka.consume_begin(..., start_offset := -1);"
# more like that for different topics

Sometimes this results in the following:

2016-10-12 00:00:05.595 UTC [13961] [local] 40P01 ERROR:  deadlock detected
2016-10-12 00:00:05.596 UTC [13961] [local] 40P01 DETAIL:  Process 13961 waits for ExclusiveLock on relation 16388 of database 13527; blocked by process 13958.
        Process 13958 waits for ExclusiveLock on relation 16410 of database 13527; blocked by process 13961.
        Process 13961: SELECT pipeline_kafka.consume_begin(...);
        Process 13958: <command string not enabled>

13961:

  • holds ExclusiveLock on brokers
  • tries to acquire ExclusiveLock on consumers (from consume_begin)

13958:

  • holds ExclusiveLock on consumers
  • tries to acquire ExclusiveLock on brokers

After browsing through the code, I think the problem is the lock order between kafka_consume_main (the consumer started by the first consume_begin) and kafka_consume_begin (the second consume_begin).

kafka_consume_main calls load_consumer_state, which does:

  // ExclusiveLock on consumers
  ResultRelInfo *consumers = relinfo_open(get_rangevar(CONSUMER_RELATION), ExclusiveLock);
  ...
  // ExclusiveLock on brokers, apparently not released by relinfo_close
  consumer->brokers = get_all_brokers();

kafka_consume_begin takes the same two locks in the opposite order:

  // ExclusiveLock on brokers, apparently not released by relinfo_close
  if (!get_all_brokers())
   ...
  // ExclusiveLock on consumers
  consumers = relinfo_open(get_rangevar(CONSUMER_RELATION), ExclusiveLock);
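
The two paths above acquire the same pair of locks in opposite order, which is what allows the wait cycle in the log. The sketch below is not pipeline_kafka code: it uses two pthread mutexes as hypothetical stand-ins for the relation locks on consumers and brokers, and every name in it is made up for illustration. It shows one general way to rule out such a cycle, namely having both paths agree on a single acquisition order. This is a different technique from the lock downgrade discussed in this thread.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the two relation locks (not the real ones). */
static pthread_mutex_t consumers_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t brokers_lock = PTHREAD_MUTEX_INITIALIZER;
static int completed = 0; /* only touched while both locks are held */

/* Single agreed order for every caller: consumers first, then brokers. */
static void lock_both(void)
{
    pthread_mutex_lock(&consumers_lock);
    pthread_mutex_lock(&brokers_lock);
}

/* Release in reverse order of acquisition. */
static void unlock_both(void)
{
    pthread_mutex_unlock(&brokers_lock);
    pthread_mutex_unlock(&consumers_lock);
}

/* Stand-in for the background worker path. */
static void *worker_path(void *arg)
{
    (void) arg;
    for (int i = 0; i < 1000; i++) {
        lock_both();
        completed++;
        unlock_both();
    }
    return NULL;
}

/* Stand-in for the SQL function path; it uses the same order as the worker. */
static void *begin_path(void *arg)
{
    (void) arg;
    for (int i = 0; i < 1000; i++) {
        lock_both();
        completed++;
        unlock_both();
    }
    return NULL;
}

/* Runs both paths concurrently and returns the number of finished iterations. */
int run_both_paths(void)
{
    pthread_t a, b;
    int rc;

    completed = 0;
    rc = pthread_create(&a, NULL, worker_path, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, begin_path, NULL);
    assert(rc == 0);
    (void) rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return completed;
}
```

Because neither thread can hold brokers_lock while waiting for consumers_lock, both loops always run to completion. If begin_path instead locked brokers first, as kafka_consume_begin does with the real locks, the two threads could block each other indefinitely.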

The workaround is to put a sleep between the pipeline_kafka.consume_begin calls.

We should probably downgrade the lock for get_all_brokers() to be an AccessShareLock. I'll send out a fix today.