Improvements
simplesteph opened this issue
Hi,
We're giving this module a go but I have the following feedback:
- You're not using Kafka bootstrap brokers. This is now the standard way to discover all the brokers: you only specify two or three, and the entire cluster is discovered from them. Adding every broker explicitly may not be a long-term solution, because brokers go up and down and their IPs/DNS names change (see the sketch after this list). The current API is:
  `pipeline_kafka.add_broker ( hostname text )`
- This API is inherently flawed:
  `pipeline_kafka.consume_begin ( topic text, stream text, format := 'text', delimiter := E'\t', quote := NULL, escape := NULL, batchsize := 1000, maxbytes := 32000000, parallelism := 1, start_offset := NULL )`
  A start offset per topic does not make sense. Offsets are only meaningful per partition: admins can add partitions down the road to increase throughput, and offset 100 in a new partition doesn't really mean anything compared to offset 100 in another partition (see the sketch after this list).
- Kafka 0.10.1 now has the ability to map timestamps to offsets (see `offsetsForTimes` at https://kafka.apache.org/0101/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html); you'll notice that the API returns a map of offsets per partition. That would map nicely onto a WHERE query on a "kafka_timestamp", and it would be cool if PipelineDB incorporated it.
- I like the fact that you're storing offsets in PipelineDB instead of Kafka/ZooKeeper, because it will allow you to achieve exactly-once semantics. If you were to use ZooKeeper or Kafka for storing offsets, you'd end up either at-least-once or at-most-once, which may be an issue for a database.
- Avro support (integration with the Kafka schema registry).
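To make the broker point concrete, here's what registration looks like today next to a sketch of what bootstrap-style discovery could look like. The hostnames are made up, and the bootstrap call is hypothetical, not part of the current pipeline_kafka API:

```sql
-- Today: every broker has to be registered one by one
-- (hostnames are made up for illustration).
SELECT pipeline_kafka.add_broker('kafka-1.example.com:9092');
SELECT pipeline_kafka.add_broker('kafka-2.example.com:9092');
SELECT pipeline_kafka.add_broker('kafka-3.example.com:9092');

-- Bootstrap style: hand over only two or three seed brokers and let the
-- client discover the rest of the cluster from them.
-- (Hypothetical call -- pipeline_kafka does not expose this today.)
-- SELECT pipeline_kafka.add_bootstrap_brokers(
--     'kafka-1.example.com:9092,kafka-2.example.com:9092');
```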
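And for the two offset points, a sketch of the kind of API I'd expect. Only the first call exists today; the per-partition and timestamp parameters are made up to illustrate the idea:

```sql
-- Today: a single start_offset applies to the whole topic
-- (topic/stream names are placeholders).
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream', start_offset := 100);

-- Hypothetical: start offsets given per partition, mirroring how Kafka
-- itself tracks them (the start_offsets parameter is made up):
-- SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream',
--     start_offsets := '{"0": 1200, "1": 950, "2": 1043}');

-- Hypothetical: start from a timestamp that the consumer resolves to one
-- offset per partition via offsetsForTimes (start_timestamp is made up):
-- SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream',
--     start_timestamp := '2016-11-01 00:00:00+00');
```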
Let me know your thoughts,
Regards,
Stephane
Hi @simplesteph, thanks for the writeup! We actually have most of this stuff already in mind (see #42). And just to address one of your points:
> Start offset per topic does not make sense.

This actually isn't true. `start_offset` is for passing special offsets such as `0` (beginning of topic) or `-1` (all new messages).
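For example (a minimal sketch; the topic and stream names are placeholders):

```sql
-- Consume the topic from the very beginning (0 = start of topic).
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream', start_offset := 0);

-- Only consume messages that arrive from now on (-1 = all new messages).
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream', start_offset := -1);
```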
Closing as this is a duplicate of #42.
@simplesteph awesome, thank you! Let's continue the discussion on #42.