dropbox / marshal

A Kafka consumer coordination library for Go.

Marshal consumer decides not to consume?

porcupie opened this issue

Hello

We are trying to use the marshal library so that we can support Kafka topics with multiple partitions. We are still testing with our single-partition topic, and I am not sure whether this is user error or something to do with the heartbeat/health-checking mechanisms in marshal. When we attach a marshal Consumer to a very high-traffic topic, it decides it is behind and cannot keep up. Subsequent attempts to reattach use the stored offset, which seems to max out at 2892349; the consumer decides that offset is out of range and abandons the partition. The offset then never changes across subsequent reconnects, releases, or restarts of the consumers.

Here is the short version of the error we are experiencing (topic: ypec, only 1 partition):

2016/02/18 22:49:38 [ypec:0] consumer attempting to claim
2016/02/18 22:49:39 [ypec:0] consumer claimed at offset 2892349 (is -2827884 behind)
2016/02/18 22:49:39 [ypec:0] error consuming: out of range, abandoning partition
2016/02/18 22:49:39 [ypec:0] releasing partition claim

I tried deleting the __marshal topic and starting from a fresh install. After recreating the topic and starting the marshal/debug client, I see similar messages:

2016/02/19 14:53:44 rationalize[0]: starting
2016/02/19 14:53:44 Waiting for all rationalizers to come alive.
2016/02/19 14:53:44 rationalize[0]: offsets 0 to 0
2016/02/19 14:53:44 All rationalizers alive, Marshaler now alive.
2016/02/19 14:53:44 <22.29 ms> construct Marshaler
2016/02/19 14:53:44 Topic ypec has 1 partitions.
2016/02/19 14:53:44 <0.10 ms> construct Consumer
2016/02/19 14:53:45 [ypec:0] consumer offsets: early = 0, cur/comm = 0/2892349, late = 159991
2016/02/19 14:53:45 [ypec:0] recovering committed offset of 2892349
2016/02/19 14:53:45 [ypec:0] consumer attempting to claim
2016/02/19 14:53:45 rationalize[0]: @0: [ClaimingPartition/0/1455922425/de9ca559/debug-client/ypec-logdata/ypec/0]
2016/02/19 14:53:46 [ypec:0] consumer claimed at offset 2892349 (is -2732358 behind)
2016/02/19 14:53:46 [ypec:0] error consuming: out of range, abandoning partition
2016/02/19 14:53:46 <9.57 ms> terminate Consumer
2016/02/19 14:53:46 <2079.58 ms> claim all partitions
2016/02/19 14:53:46 Marshal state dump beginning.
2016/02/19 14:53:46
2016/02/19 14:53:46 Group ID:    ypec-logdata
2016/02/19 14:53:46 Client ID:   debug-client
2016/02/19 14:53:46 Instance ID: de9ca559
2016/02/19 14:53:46
2016/02/19 14:53:46 Marshal topic partitions: 1
2016/02/19 14:53:46 Known Kafka topics:       8
2016/02/19 14:53:46 Internal rsteps counter:  1
2016/02/19 14:53:46
2016/02/19 14:53:46 State of the world:
2016/02/19 14:53:46
2016/02/19 14:53:46   GROUP: ypec-logdata
2016/02/19 14:53:46     TOPIC: ypec [on __marshal:0]
2016/02/19 14:53:46       *  0 [CLMD]: GPID ypec-logdata | CLID debug-client | LHB 1455922425 (1) | LOF 0 | PCL 0
2016/02/19 14:53:46
2016/02/19 14:53:46 Consumer states:
2016/02/19 14:53:46
2016/02/19 14:53:46   CONSUMER: 0 messages in queue
2016/02/19 14:53:46     TOPIC: ypec
2016/02/19 14:53:46       *  0 [CL+T]: offsets 0 <= 2892349 <= 159991 | 2892349
2016/02/19 14:53:46                    BC 0 | LHB 1455922426 (0) | OM 0 | CB 0
2016/02/19 14:53:46                    TRACK COMMITTED 0 | TRACK OUTSTANDING 0
2016/02/19 14:53:46                    PV 0.00 | CV 0.00
2016/02/19 14:53:46
2016/02/19 14:53:46 Marshal state dump complete.
2016/02/19 14:53:46 <0.00 ms> terminate Marshaler

Am I doing something wrong? There is definitely traffic on the ypec topic, as I can see it using kafkacat and the like. I also notice traffic on the __marshal topic, showing a heartbeat from that debug client. Here are the total topic contents after recreating it fresh and running the debug client:

ClaimingPartition/0/1455922425/de9ca559/debug-client/ypec-logdata/ypec/0
Heartbeat/0/1455922425/de9ca559/debug-client/ypec-logdata/ypec/0/2892349

Thanks for the report! I think the key is here:

2016/02/19 14:53:46       *  0 [CL+T]: offsets 0 <= 2892349 <= 159991 | 2892349

It looks like somehow you're getting a committed offset that is well outside the range of offsets that exist in the partition. Do you know if anything untoward happened to your cluster, such as removing data or Zookeeper issues?

In the meantime, it's slightly unclear what the behavior here should be. If it looks like the partition has shrunk, probably the only safe thing is to go back to the earliest offset. Resetting to the latest offset may not be correct. cc @DrTall

Hopefully you have some idea of how you got into this state. Did Marshal ever run successfully? If you have any logs from prior Marshal runs, do they show the committed heartbeat offsets?

If you ever have those logs or more information, please let us know.