confluentinc / kafka-connect-hdfs

Kafka Connect HDFS connector


Temp file isn't committed and uploaded to storage when using AvroFormat

LeeSzewan opened this issue

Background:

Here is the connector config I'm using:

{
 "name": "sink__connector",
 "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
 "tasks.max": "1",
 "store.url": "hdfs://xxxx:8020",
 "topics": "test__xxx_9",

 "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
 "flush.size": "10000",
 "rotate.interval.ms": "600000",

 "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
 "path.format": "'year'=YYYY/'month'=MM/'day'=dd",
 "partition.duration.ms": "600000",
 "timestamp.extractor": "RecordField",
 "timestamp.field": "XLastUpdated",
 "timezone": "Asia/Shanghai",
 "locale": "zh",

 "hadoop.conf.dir": "/etc/hadoop/conf",
 "hadoop.home": "/opt/cloudera/parcels/CDH/lib/hadoop",
 "topics.dir": "/warehouse/tablespace/external/hive/developer.db",
 "logs.dir": "/tmp",

 "hive.integration": "true",
 "hive.metastore.uris": "thrift://xxxx:9083",
 "hive.home": "/opt/cloudera/parcels/CDH/lib/hive",
 "hive.database": "developer",
 "hive.table.name": "${topic}",
 "hive.conf.dir": "/etc/hive/conf",
 "schema.compatibility": "BACKWARD",

 "hdfs.authentication.kerberos": "true",
 "connect.hdfs.keytab": "${cm-agent:keytab}",
 "hdfs.namenode.principal": "hdfs/_@HOSTNAME",
 "connect.hdfs.principal": "${cm-agent:ENV:kafka_connect_service_principal}"
}
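
For anyone following along, here is a minimal sketch of how the path.format above maps a record timestamp to a partition directory. The connector's TimeBasedPartitioner parses path.format with Joda-Time tokens (where YYYY resolves to the year); the sketch below uses java.time, which needs yyyy instead, and all names in it are illustrative:

import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class PartitionPathSketch {
    public static void main(String[] args) {
        // java.time equivalent of path.format = 'year'=YYYY/'month'=MM/'day'=dd
        DateTimeFormatter fmt = DateTimeFormatter
                .ofPattern("'year'=yyyy/'month'=MM/'day'=dd")
                .withZone(ZoneId.of("Asia/Shanghai"));

        // Stand-in for the XLastUpdated value that the RecordField extractor reads.
        Instant xLastUpdated = Instant.parse("2020-12-28T08:30:00Z");

        // Prints: year=2020/month=12/day=28
        System.out.println(fmt.format(xLastUpdated));
    }
}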

Expected

When the HDFS sink connector starts buffering records, it writes a temp file at /warehouse/tablespace/external/hive/developer.db/+tmp/xxxx_tmp.avro
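
For context, committing a temp file amounts to moving it out of +tmp into the final partition directory; here is a minimal sketch of that rename on HDFS (the paths are placeholders mirroring this issue's layout, not the connector's actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath (e.g. /etc/hadoop/conf).
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical temp file and final destination.
        Path tmp = new Path("/warehouse/tablespace/external/hive/developer.db/+tmp/"
                + "test__xxx_9/year=2020/month=12/day=28/xxxx_tmp.avro");
        Path committed = new Path("/warehouse/tablespace/external/hive/developer.db/"
                + "test__xxx_9/year=2020/month=12/day=28/xxxx.avro");

        // Conceptually, a commit is an atomic rename out of +tmp.
        boolean ok = fs.rename(tmp, committed);
        System.out.println("committed: " + ok);
    }
}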

According to the information in #566:

The temp files are then moved to the final path if one or more of these are true:

  • flush.size records have accumulated in the temp file
  • rotate.interval.ms was reached
  • rotate.schedule.interval.ms was reached
  • record schema was changed

The temp file should be committed and uploaded to storage after 10 minutes, because I set "rotate.interval.ms": "600000", regardless of how many records the temp file contains.
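
For reference, here is a simplified sketch of a record-timestamp-based rotation check, loosely modeled on the connector's TopicPartitionWriter (the names and structure are mine, not the actual implementation). One thing it makes visible: because rotate.interval.ms is compared against extracted record timestamps, the check can only run when a new record arrives.

// Illustrative only; not the connector's actual code.
class RotationCheckSketch {
    private final long rotateIntervalMs;    // e.g. 600000
    private long baseRecordTimestamp = -1;  // timestamp of the first record in the open file

    RotationCheckSketch(long rotateIntervalMs) {
        this.rotateIntervalMs = rotateIntervalMs;
    }

    // Called with each incoming record's extracted timestamp (here, XLastUpdated).
    boolean shouldRotate(long recordTimestamp) {
        if (baseRecordTimestamp < 0) {
            baseRecordTimestamp = recordTimestamp;
        }
        return rotateIntervalMs > 0
                && recordTimestamp - baseRecordTimestamp >= rotateIntervalMs;
    }
}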

Actual

After being open for 10 minutes, the temp file is still under the +tmp path. It seems that the connector doesn't commit and upload this temp file.
Here is the log:

2021-11-15 21:03:25,436 INFO io.confluent.connect.hdfs.avro.AvroRecordWriterProvider: Opening record writer for: hdfs://xxx:8020//warehouse/tablespace/external/hive/developer.db//+tmp/test__xxx_9/year=2020/month=12/day=28/hour=16/minute=30/fc9ad086-ffc0-47e9-ab02-036d962a908d_tmp.avro

The temp file has been open since 2021-11-15 21:03:25.

It is now 2021-11-15 21:18; 15 minutes have passed and the temp file still hasn't been committed:
[screenshot]

And here is the source data; there are 11 records:
[screenshot]
I only got 9 records in HDFS/Hive. The remaining two records are still in the uncommitted temp file:
[screenshot]

PS

When I use ParquetFormat, the temp file is committed and uploaded to storage after being open for 10 minutes.
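
In case it helps with debugging, here is a small sketch (paths are placeholders) for listing whatever is still sitting un-committed under +tmp:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListTmpSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tmpRoot = new Path(
                "/warehouse/tablespace/external/hive/developer.db/+tmp/test__xxx_9");

        // Recursively print every leftover temp file with its size and mtime.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(tmpRoot, true);
        while (it.hasNext()) {
            FileStatus f = it.next();
            System.out.printf("%s  %d bytes  mtime=%d%n",
                    f.getPath(), f.getLen(), f.getModificationTime());
        }
    }
}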

Does anybody have any idea about this? Thank you in advance.