BroadwayKafka 0.3.1 is skipping the current offset
alexandrexaviersm opened this issue
Hi folks 👋
BroadwayKafka version 0.3.1 (specifically PR #72) introduced an undesirable behavior: the consumer no longer resumes from the offset where it left off and instead always starts from the latest offset. I think this is a major issue because you can end up losing a lot of messages if your consumer is restarted.
For example, suppose you have a constant flow of messages, say a topic that receives about 10 messages per second, and you stop the server to make a deployment. The desired behavior is that when the consumer becomes active again, it resumes from the last committed offset, continuing the flow without losing any message. But with the behavior introduced in 0.3.1, the current_offset (last committed offset) is ignored and we only read the new messages that arrive after the consumer is active again.
In this example, if the consumer takes 10 seconds to come back up after the deployment, you will have missed 100 messages.
Maybe there was a misinterpretation of the :offset_reset_policy configuration, and now we're using it in cases where we shouldn't:
:offset_reset_policy - Optional. Defines the offset to be used when there's no initial offset in Kafka or if the current offset has expired.
Possible values are :earliest or :latest. Default is :latest.
As shown in the docs, I believe we should use this policy only when the offset is :undefined (new consumers) or the current_offset has expired. If your application already knows which offset it should use and that offset is still valid, then I think it's wrong to apply the :latest or :earliest option. I think this undesirable behavior is also related to issue #74.
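To make the proposed behavior concrete, here is a minimal sketch of the offset-resolution logic I'd expect. This is not the actual BroadwayKafka code; the function and variable names (resolve_offset/2, committed_offset) are hypothetical, for illustration only:

```elixir
# Hypothetical sketch: decide which offset a consumer should start from.
# `committed_offset` is what the consumer group last acked for the partition;
# `policy` is the configured :offset_reset_policy (:earliest or :latest).
defp resolve_offset(committed_offset, policy) do
  case committed_offset do
    # No offset committed yet (new consumer group), or the committed
    # offset has expired: only then fall back to the reset policy.
    :undefined -> policy
    # A valid committed offset exists: resume exactly where we left off,
    # regardless of the reset policy.
    offset when is_integer(offset) -> offset
  end
end
```

The point is that the reset policy is a fallback for missing or expired offsets, not something that should override a known, still-valid committed offset.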
I created PR #75, which I believe fixes the problem that issue #71 was trying to address, without introducing the side effects described here and in issue #74.
How to reproduce:
After initializing Kafka, create a topic
kafka-topics.sh --create --zookeeper localhost:2181 --partitions 1 --topic test --replication-factor 1
Starting a new project
mix new kafka_consumer --sup
defp deps do
[
{:broadway, "~> 1.0"},
{:broadway_kafka, "~> 0.3"}
]
end
Define a basic pipeline configuration
defmodule MyBroadway do
use Broadway
alias Broadway.Message
def start_link(_opts) do
Broadway.start_link(__MODULE__,
name: __MODULE__,
producer: [
module:
{BroadwayKafka.Producer,
[
hosts: [localhost: 9092],
group_id: "group_1",
topics: ["test"]
]},
concurrency: 1
],
processors: [
default: [
concurrency: 10
]
]
)
end
@impl true
def handle_message(_, message, _) do
message
|> Message.update_data(fn data ->
IO.inspect(data, label: "Got message")
{data, String.to_integer(data) * 2}
end)
end
end
Add it as a child in a supervision tree
children = [MyBroadway]
Supervisor.start_link(children, strategy: :one_for_one)
You can now test the pipeline by entering an iex session:
iex -S mix
Open another terminal window and send messages to Kafka
kafka-console-producer.sh --topic test --bootstrap-server localhost:9092
>1
>2
>3
You should see this output
iex> Got message: "1"
iex> Got message: "2"
iex> Got message: "3"
Now hit Ctrl-C twice to stop the Broadway consumer, then send more messages to Kafka:
kafka-console-producer.sh --topic test --bootstrap-server localhost:9092
>4
>5
>6
Start your Elixir application again:
iex -S mix
You can wait for a while, but the messages that were sent while the consumer was offline will not be consumed.
Try to send a new message:
kafka-console-producer.sh --topic test --bootstrap-server localhost:9092
>7
You should see this output
iex> Got message: "7"
This means that offsets 3, 4, and 5 (messages "4", "5", and "6") were skipped.
The desired behavior for a Kafka consumer is that it doesn't skip any available messages: if the last acked offset was 2, it should continue from offset 3 when it starts again, consuming the messages received while it was offline.