Temp file isn't committed and uploaded to storage when using AvroFormat
LeeSzewan opened this issue · comments
Background:
Here is the connector config I'm using:
{
"name": "sink__connector",
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"tasks.max": "1",
"store.url": "hdfs://xxxx:8020",
"topics": "test__xxx_9",
"format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
"flush.size": "10000",
"rotate.interval.ms": "600000",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd",
"partition.duration.ms": "600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "XLastUpdated",
"timezone": "Asia/Shanghai",
"locale": "zh",
"hadoop.conf.dir": "/etc/hadoop/conf",
"hadoop.home": "/opt/cloudera/parcels/CDH/lib/hadoop",
"topics.dir": "/warehouse/tablespace/external/hive/developer.db",
"logs.dir": "/tmp",
"hive.integration": "true",
"hive.metastore.uris": "thrift://xxxx:9083",
"hive.home": "/opt/cloudera/parcels/CDH/lib/hive",
"hive.database": "developer",
"hive.table.name": "${topic}",
"hive.conf.dir": "/etc/hive/conf",
"schema.compatibility": "BACKWARD",
"hdfs.authentication.kerberos": "true",
"connect.hdfs.keytab": "${cm-agent:keytab}",
"hdfs.namenode.principal": "hdfs/_@HOSTNAME",
"connect.hdfs.principal": "${cm-agent:ENV:kafka_connect_service_principal}"
}
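To illustrate the config above: a minimal sketch of how a TimeBasedPartitioner-style "path.format" of 'year'=YYYY/'month'=MM/'day'=dd maps a record timestamp to a partition directory, with "timezone": "Asia/Shanghai" (UTC+8). This is not the connector's code; the epoch-millis value and the helper name are assumptions for the example.

```python
from datetime import datetime, timezone, timedelta

def partition_path(record, ts_field="XLastUpdated", tz_offset_hours=8):
    # "timestamp.extractor": "RecordField" reads the timestamp from the
    # record's own XLastUpdated field (assumed here to be epoch millis).
    ts_ms = record[ts_field]
    dt = datetime.fromtimestamp(ts_ms / 1000,
                                tz=timezone(timedelta(hours=tz_offset_hours)))
    # The quotes in 'year'=YYYY mark literal text; the output directory
    # is year=2020/month=12/day=28 and so on.
    return f"year={dt:%Y}/month={dt:%m}/day={dt:%d}"

record = {"XLastUpdated": 1609145400000}  # 2020-12-28 16:50 in UTC+8
print(partition_path(record))  # → year=2020/month=12/day=28
```

This matches the partition directory visible in the log line below (year=2020/month=12/day=28).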
Expected
When the HDFS Sink connector starts buffering records, it writes a temp file at /warehouse/tablespace/external/hive/developer.db/+tmp/xxxx_tmp.avro
According to the information from #566, the temp files are then moved to the final path if one or more of these are true:
- the flush.size number of records has been reached in the temp file
- rotate.interval.ms was reached
- rotate.schedule.interval.ms was reached
- the record schema was changed
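The rotation conditions above can be sketched as a single check. This is a simplified illustration, not the connector's actual implementation; all function and parameter names are hypothetical, with the defaults taken from my config.

```python
# Hypothetical sketch of the commit/rotation decision for a temp file.
def should_rotate(record_count, first_ts_ms, current_ts_ms, schema_changed,
                  flush_size=10000, rotate_interval_ms=600000):
    if record_count >= flush_size:          # "flush.size": "10000"
        return True
    if current_ts_ms - first_ts_ms >= rotate_interval_ms:
        return True                         # "rotate.interval.ms": "600000"
    return schema_changed                   # schema evolution forces commit

# 9999 records, 9 minutes elapsed, same schema -> no rotation yet
print(should_rotate(9999, 0, 540_000, False))  # → False
# 10 minutes elapsed -> rotate
print(should_rotate(9999, 0, 600_000, False))  # → True
```

One caveat worth noting: with rotate.interval.ms, the "current" timestamp comes from the configured timestamp extractor, so with "timestamp.extractor": "RecordField" the check is driven by the timestamps of arriving records rather than a wall-clock timer (rotate.schedule.interval.ms is the wall-clock variant).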
The temp file should be committed and uploaded to storage after 10 minutes, because I set "rotate.interval.ms": "600000", no matter how many records have arrived in the temp file.
Actually
After being open for 10 minutes, the temp file is still in the +tmp path. It seems that the connector doesn't commit and upload this temp file.
Here is the log:
2021-11-15 21:03:25,436 INFO io.confluent.connect.hdfs.avro.AvroRecordWriterProvider: Opening record writer for: hdfs://xxx:8020//warehouse/tablespace/external/hive/developer.db//+tmp/test__xxx_9/year=2020/month=12/day=28/hour=16/minute=30/fc9ad086-ffc0-47e9-ab02-036d962a908d_tmp.avro
The temp file has been open since 2021-11-15 21:03:25. It is now 2021-11-15 21:18, 15 minutes have passed, and the temp file still hasn't been committed.
And here is the source data: there are 11 records. I only got 9 records in HDFS/Hive. The remaining two records are still in the temp file, which hasn't been committed yet.
PS
When I use ParquetFormat, the temp file is committed and uploaded to storage after being open for 10 minutes.
Does anybody have any idea about this? Thank you in advance.