Improvements
simplesteph opened this issue
Hi,
We're giving this module a go but I have the following feedback:
- You're not using Kafka bootstrap brokers. This is now the standard way to discover all the brokers: you only specify two or three, and the entire cluster is discovered from them. Adding every broker explicitly may not be a long-term solution, because brokers go up and down and their IPs/DNS names change (see the sketch after this list). The current API is:
  `pipeline_kafka.add_broker ( hostname text )`
- This API is inherently flawed:
  `pipeline_kafka.consume_begin ( topic text, stream text, format := 'text', delimiter := E'\t', quote := NULL, escape := NULL, batchsize := 1000, maxbytes := 32000000, parallelism := 1, start_offset := NULL )`
  A start offset per topic does not make sense. Offsets are only meaningful per partition: admins can add partitions down the road to increase throughput, and offset 100 in a new partition doesn't really mean anything compared to offset 100 in another partition (see the sketch after this list).
- Kafka 0.10.1 now has the ability to map timestamps to offsets (see `offsetsForTimes` at https://kafka.apache.org/0101/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html); you'll notice that the API returns a map of offsets per partition. That would map nicely onto a WHERE query on a "kafka_timestamp", and it would be cool if PipelineDB incorporated it.
- I like the fact that you're storing offsets in PipelineDB instead of Kafka/ZooKeeper, because it will allow you to achieve exactly-once semantics. If you were to use ZooKeeper or Kafka for storing offsets, you'd end up either at-least-once or at-most-once, which may be an issue for a database.
- Avro support (integration with the Kafka schema registry).
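To make the broker point concrete, here's what registration looks like today next to a sketch of what bootstrap-style discovery could look like. The hostnames are made up, and the bootstrap call is hypothetical, not part of the current pipeline_kafka API:

```sql
-- Today: every broker has to be registered one by one
-- (hostnames are made up for illustration).
SELECT pipeline_kafka.add_broker('kafka-1.example.com:9092');
SELECT pipeline_kafka.add_broker('kafka-2.example.com:9092');
SELECT pipeline_kafka.add_broker('kafka-3.example.com:9092');

-- Bootstrap style: hand over only two or three seed brokers and let the
-- client discover the rest of the cluster from them.
-- (Hypothetical call -- pipeline_kafka does not expose this today.)
-- SELECT pipeline_kafka.add_bootstrap_brokers(
--     'kafka-1.example.com:9092,kafka-2.example.com:9092');
```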
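And for the two offset points, a sketch of the kind of API I'd expect. Only the first call exists today; the per-partition and timestamp parameters are made up to illustrate the idea:

```sql
-- Today: a single start_offset applies to the whole topic
-- (topic/stream names are placeholders).
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream', start_offset := 100);

-- Hypothetical: start offsets given per partition, mirroring how Kafka
-- itself tracks them (the start_offsets parameter is made up):
-- SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream',
--     start_offsets := '{"0": 1200, "1": 950, "2": 1043}');

-- Hypothetical: start from a timestamp that the consumer resolves to one
-- offset per partition via offsetsForTimes (start_timestamp is made up):
-- SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream',
--     start_timestamp := '2016-11-01 00:00:00+00');
```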
Let me know your thoughts,
Regards,
Stephane
Hi @simplesteph, thanks for the writeup! We actually have most of this stuff already in mind (see #42). And just to address one of your points:
> Start offset per topic does not make sense.

This actually isn't true. `start_offset` is for passing special offsets such as `0` (beginning of topic) or `-1` (all new messages).
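For example (a minimal sketch; the topic and stream names are placeholders):

```sql
-- Consume the topic from the very beginning (0 = start of topic).
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream', start_offset := 0);

-- Only consume messages that arrive from now on (-1 = all new messages).
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream', start_offset := -1);
```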
Closing as this is a duplicate of #42.
@simplesteph awesome, thank you! Let's continue the discussion on #42.