apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.

Home Page: https://hudi.apache.org/

Duplicated records when using insert overwrite

njalan opened this issue · comments

There are multiple commit times in the Hoodie table, and duplicated records exist after using insert overwrite into the target table. The query joins roughly 10 tables.
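For context, here is a minimal sketch of the kind of write being described, assuming a Spark SQL join feeding a Hudi insert_overwrite. The table names, record key, precombine field, and S3 path are hypothetical stand-ins, not the actual job:

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical stand-in for the real query, which joins ~10 tables.
val joined = spark.sql(
  """SELECT a.id, a.ts, b.some_col
    |FROM table_a a
    |JOIN table_b b ON a.id = b.id
    |/* ...roughly 10 tables joined in the real query... */
    |""".stripMargin)

joined.write.format("hudi").
  option("hoodie.datasource.write.operation", "insert_overwrite").
  option("hoodie.datasource.write.recordkey.field", "id").  // hypothetical key field
  option("hoodie.datasource.write.precombine.field", "ts"). // hypothetical precombine field
  option("hoodie.table.name", "target_table").
  mode(SaveMode.Append). // Hudi insert_overwrite is issued with Append save mode
  save("s3://bucket/warehouse/target_table")                // hypothetical path
```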

Environment Description

  • Hudi version: 0.9

  • Spark version: 3.0.1

  • Hive version: 3.2

  • Hadoop version: 3.2

  • Storage (HDFS/S3/GCS..): S3

  • Running on Docker? (yes/no): no

@njalan Are you using multiple writers? Can you come up with a reproducible script? You are using a very old Hudi version, though.

@njalan Also, as I understand it, the data you are writing is the output of a 10-table join. So when you do insert_overwrite, does that source data frame contain dups?
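One quick way to check, sketched below using the `joined` frame from the earlier example and assuming `id` is the record key field:

```scala
// Count record keys that appear more than once in the source frame.
// "id" is a stand-in for the actual record key field.
val dupKeys = joined.groupBy("id").count().filter("count > 1")
dupKeys.show(20, false)
println(s"number of duplicated keys: ${dupKeys.count()}")
```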

@ad1happy2go I don't think I am using multiple writers. Is there a parameter that enables multi-writer mode? We checked afterwards, and there are duplicate records. My understanding is that there should be only one commit time in the final table after an insert_overwrite. Why do I see multiple commit times in the final table, one of which is the commit time of the target table from before this overwrite?
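One way to narrow this down is to read the table snapshot and list the distinct values of Hudi's `_hoodie_commit_time` metadata column (the S3 path below is a hypothetical stand-in). Note that the timeline under `.hoodie` retaining earlier instants is expected; the interesting question is whether the snapshot still returns records stamped with a pre-overwrite commit time:

```scala
// Read the latest snapshot and list which commits the visible records came from.
val snapshot = spark.read.format("hudi").load("s3://bucket/warehouse/target_table")
snapshot.select("_hoodie_commit_time").distinct().
  orderBy("_hoodie_commit_time").show(false)
```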

@njalan If the data you are inserting has dups, then insert overwrite will write those dups into the table.
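If the joined output can legitimately carry duplicate keys, one option (a sketch under the same hypothetical field names as above, not necessarily the fix here) is to have Hudi combine records on the record key before the insert, or to drop duplicates in Spark first:

```scala
// Option A: let Hudi dedupe by record key before the insert
// (keeps the row with the largest precombine value per key).
joined.write.format("hudi").
  option("hoodie.datasource.write.operation", "insert_overwrite").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.combine.before.insert", "true").
  option("hoodie.table.name", "target_table").
  mode(SaveMode.Append).
  save("s3://bucket/warehouse/target_table")

// Option B: dedupe in Spark before handing the frame to Hudi.
val deduped = joined.dropDuplicates("id")
```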

Can you please share the timeline with us so we can look further?
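For reference, one way to dump the timeline from a Spark shell is to list the instant files under the table's `.hoodie` directory (the path below is a hypothetical stand-in):

```scala
import org.apache.hadoop.fs.Path

// List the timeline instants (.commit, .replacecommit, etc.) under .hoodie.
val timelinePath = new Path("s3://bucket/warehouse/target_table/.hoodie")
val fs = timelinePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(timelinePath).map(_.getPath.getName).sorted.foreach(println)
```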