confluentinc / kafka-connect-hdfs

Kafka Connect HDFS connector


Temp file isn't committed and uploaded to storage when using AvroFormat

LeeSzewan opened this issue

Background:

Here is the connector config I'm using:

{
 "name": "sink__connector",
 "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
 "tasks.max": "1",
 "store.url": "hdfs://xxxx:8020",
 "topics": "test__xxx_9",

 "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
 "flush.size": "10000",
 "rotate.interval.ms": "600000",

 "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
 "path.format": "'year'=YYYY/'month'=MM/'day'=dd",
 "partition.duration.ms": "600000",
 "timestamp.extractor": "RecordField",
 "timestamp.field": "XLastUpdated",
 "timezone": "Asia/Shanghai",
 "locale": "zh",

 "hadoop.conf.dir": "/etc/hadoop/conf",
 "hadoop.home": "/opt/cloudera/parcels/CDH/lib/hadoop",
 "topics.dir": "/warehouse/tablespace/external/hive/developer.db",
 "logs.dir": "/tmp",

 "hive.integration": "true",
 "hive.metastore.uris": "thrift://xxxx:9083",
 "hive.home": "/opt/cloudera/parcels/CDH/lib/hive",
 "hive.database": "developer",
 "hive.table.name": "${topic}",
 "hive.conf.dir": "/etc/hive/conf",
 "schema.compatibility": "BACKWARD",

 "hdfs.authentication.kerberos": "true",
 "connect.hdfs.keytab": "${cm-agent:keytab}",
 "hdfs.namenode.principal": "hdfs/_@HOSTNAME",
 "connect.hdfs.principal": "${cm-agent:ENV:kafka_connect_service_principal}"
}
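
For anyone following along, here is a minimal sketch of how the path.format above maps a record timestamp to a partition directory. The connector's TimeBasedPartitioner parses path.format with Joda-Time tokens (where YYYY resolves to the year); the sketch below uses java.time, which needs yyyy instead, and all names in it are illustrative:

import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class PartitionPathSketch {
    public static void main(String[] args) {
        // java.time equivalent of path.format = 'year'=YYYY/'month'=MM/'day'=dd
        DateTimeFormatter fmt = DateTimeFormatter
                .ofPattern("'year'=yyyy/'month'=MM/'day'=dd")
                .withZone(ZoneId.of("Asia/Shanghai"));

        // Stand-in for the XLastUpdated value that the RecordField extractor reads.
        Instant xLastUpdated = Instant.parse("2020-12-28T08:30:00Z");

        // Prints: year=2020/month=12/day=28
        System.out.println(fmt.format(xLastUpdated));
    }
}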

Expected

When the HDFS sink connector starts buffering records, it writes a temp file at /warehouse/tablespace/external/hive/developer.db/+tmp/xxxx_tmp.avro
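
For context, committing a temp file amounts to moving it out of +tmp into the final partition directory; here is a minimal sketch of that rename on HDFS (the paths are placeholders mirroring this issue's layout, not the connector's actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath (e.g. /etc/hadoop/conf).
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical temp file and final destination.
        Path tmp = new Path("/warehouse/tablespace/external/hive/developer.db/+tmp/"
                + "test__xxx_9/year=2020/month=12/day=28/xxxx_tmp.avro");
        Path committed = new Path("/warehouse/tablespace/external/hive/developer.db/"
                + "test__xxx_9/year=2020/month=12/day=28/xxxx.avro");

        // Conceptually, a commit is an atomic rename out of +tmp.
        boolean ok = fs.rename(tmp, committed);
        System.out.println("committed: " + ok);
    }
}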

According to the information in #566:

The temp files are then moved to the final path if one or more of these are true:

  • flush.size records have accumulated in the temp file
  • rotate.interval.ms was reached
  • rotate.schedule.interval.ms was reached
  • record schema was changed

The temp file should be committed and uploaded to storage after 10 minutes, because I set "rotate.interval.ms": "600000", regardless of how many records the temp file contains.
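
For reference, here is a simplified sketch of a record-timestamp-based rotation check, loosely modeled on the connector's TopicPartitionWriter (the names and structure are mine, not the actual implementation). One thing it makes visible: because rotate.interval.ms is compared against extracted record timestamps, the check can only run when a new record arrives.

// Illustrative only; not the connector's actual code.
class RotationCheckSketch {
    private final long rotateIntervalMs;    // e.g. 600000
    private long baseRecordTimestamp = -1;  // timestamp of the first record in the open file

    RotationCheckSketch(long rotateIntervalMs) {
        this.rotateIntervalMs = rotateIntervalMs;
    }

    // Called with each incoming record's extracted timestamp (here, XLastUpdated).
    boolean shouldRotate(long recordTimestamp) {
        if (baseRecordTimestamp < 0) {
            baseRecordTimestamp = recordTimestamp;
        }
        return rotateIntervalMs > 0
                && recordTimestamp - baseRecordTimestamp >= rotateIntervalMs;
    }
}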

Actual

After being open for 10 minutes, the temp file is still under the +tmp path. It seems that the connector doesn't commit and upload this temp file.
Here is the log:

2021-11-15 21:03:25,436 INFO io.confluent.connect.hdfs.avro.AvroRecordWriterProvider: Opening record writer for: hdfs://xxx:8020//warehouse/tablespace/external/hive/developer.db//+tmp/test__xxx_9/year=2020/month=12/day=28/hour=16/minute=30/fc9ad086-ffc0-47e9-ab02-036d962a908d_tmp.avro

The temp file has been open since 2021-11-15 21:03:25.

It is now 2021-11-15 21:18; 15 minutes have passed and the temp file still hasn't been committed:
[screenshot]

And here is the source data; there are 11 records:
[screenshot]
I only got 9 records in HDFS/Hive. The remaining two records are still in the uncommitted temp file:
[screenshot]

PS

When I use ParquetFormat, the temp file is committed and uploaded to storage after being open for 10 minutes.
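
In case it helps with debugging, here is a small sketch (paths are placeholders) for listing whatever is still sitting un-committed under +tmp:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListTmpSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tmpRoot = new Path(
                "/warehouse/tablespace/external/hive/developer.db/+tmp/test__xxx_9");

        // Recursively print every leftover temp file with its size and mtime.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(tmpRoot, true);
        while (it.hasNext()) {
            FileStatus f = it.next();
            System.out.printf("%s  %d bytes  mtime=%d%n",
                    f.getPath(), f.getLen(), f.getModificationTime());
        }
    }
}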

Does anybody have any idea about this? Thank you in advance.