confluentinc / kafka-connect-hdfs

Kafka Connect HDFS connector

Small object files problem when using multiple schemas for a single topic

gokhansari opened this issue

I was trying to convert Avro messages to Parquet files and sink them to HDFS using the kafka-connect-hdfs connector. According to this Confluent blog post, it is possible to have multiple schemas for a single topic.
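For reference, here is a minimal sketch of the sink configuration I am working with (the connector name, hostnames, topic name, and sizes are placeholders):

```json
{
  "name": "hdfs-parquet-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "events",
    "hdfs.url": "hdfs://namenode:8020",
    "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
    "flush.size": "10000",
    "rotate.interval.ms": "600000",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```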

First I tried the schema references approach, which is the one the blog recommends, but I hit a partitioning problem in Kafka Connect: after the Avro-schema-to-Connect-schema conversion, the records end up inside a wrapper struct that the partitioner cannot unwrap dynamically when a topic carries multiple schema types. I opened this issue on kafka-connect-storage-common.
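To illustrate what I mean by schema references (the record names and subjects below are hypothetical), the subject for the topic is registered as an Avro union that references the individually registered event schemas, e.g. by POSTing to /subjects/events-value/versions on Schema Registry:

```json
{
  "schemaType": "AVRO",
  "schema": "[\"com.example.OrderCreated\", \"com.example.OrderCancelled\"]",
  "references": [
    { "name": "com.example.OrderCreated",   "subject": "order-created-value",   "version": 1 },
    { "name": "com.example.OrderCancelled", "subject": "order-cancelled-value", "version": 1 }
  ]
}
```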

Then I tried another approach: using TopicRecordNameStrategy to support multiple schemas. After a few attempts I noticed that the rotation strategies do not work properly: almost every file landing on HDFS contained only one or two messages, leaving an enormous number of small object files. Something was breaking the rotation strategy.
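For completeness, this is roughly how I switched the naming strategy. On the producer I set value.subject.name.strategy on the Avro serializer, and on the connector side my understanding is that AvroConverter forwards the same setting when it is prefixed with value.converter., e.g.:

```json
{
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081",
  "value.converter.value.subject.name.strategy": "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy"
}
```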
Sadly, I then found these lines in the documentation:

> Schema evolution only works if the records are generated with the default naming strategy, which is TopicNameStrategy. An error may occur if other naming strategies are used. This is because records are not compatible with each other. schema.compatibility should be set to NONE if other naming strategies are used. This may result in small object files because the sink connector creates a new file every time the schema ID changes between records.
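So with TopicRecordNameStrategy the connector effectively has to run with the following (a fragment of the same connector config sketched above):

```json
{
  "schema.compatibility": "NONE"
}
```

And since the different record types are interleaved in the topic, the schema ID changes between almost every pair of consecutive records, so almost every record triggers a file rotation. That matches the one-or-two-message files I am seeing.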

Having this many small Parquet files defeats the purpose of the Parquet format, and they are not easy to process once traffic is high enough.

In the end, I could not find a proper way to persist messages to HDFS through Kafka Connect when a single topic carries multiple schema types.

Any suggestions or ideas? I would appreciate your answers.