apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.

Home Page: https://hudi.apache.org/

Duplicated records when using insert overwrite

njalan opened this issue · comments

There are multiple commit times in the Hoodie table, and duplicated records exist after using insert overwrite into the target table. The query joins roughly 10 tables.
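For context, here is a minimal sketch of the kind of write being described, assuming a Spark SQL join feeding a Hudi insert_overwrite. The table names, record key, precombine field, and S3 path are hypothetical stand-ins, not the actual job:

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical stand-in for the real query, which joins ~10 tables.
val joined = spark.sql(
  """SELECT a.id, a.ts, b.some_col
    |FROM table_a a
    |JOIN table_b b ON a.id = b.id
    |/* ...roughly 10 tables joined in the real query... */
    |""".stripMargin)

joined.write.format("hudi").
  option("hoodie.datasource.write.operation", "insert_overwrite").
  option("hoodie.datasource.write.recordkey.field", "id").  // hypothetical key field
  option("hoodie.datasource.write.precombine.field", "ts"). // hypothetical precombine field
  option("hoodie.table.name", "target_table").
  mode(SaveMode.Append). // Hudi insert_overwrite is issued with Append save mode
  save("s3://bucket/warehouse/target_table")                // hypothetical path
```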

Environment Description

  • Hudi version: 0.9

  • Spark version: 3.0.1

  • Hive version: 3.2

  • Hadoop version: 3.2

  • Storage (HDFS/S3/GCS..): S3

  • Running on Docker? (yes/no): no

@njalan Are you using multiple writers? Can you come up with a reproducible script? You are using a very old Hudi version, though.

@njalan Also, as I understand it, the data you are writing is the output of a 10-table join. So when you do insert_overwrite, does that source data frame contain dups?
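One quick way to check, sketched below using the `joined` frame from the earlier example and assuming `id` is the record key field:

```scala
// Count record keys that appear more than once in the source frame.
// "id" is a stand-in for the actual record key field.
val dupKeys = joined.groupBy("id").count().filter("count > 1")
dupKeys.show(20, false)
println(s"number of duplicated keys: ${dupKeys.count()}")
```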

@ad1happy2go I don't think I am using multiple writers. Is there a parameter that enables multi-writer mode? We checked afterwards, and there are duplicate records. My understanding is that there should be only one commit time in the final table after an insert_overwrite. Why do I see multiple commit times in the final table, one of which is the commit time of the target table from before this overwrite?
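One way to narrow this down is to read the table snapshot and list the distinct values of Hudi's `_hoodie_commit_time` metadata column (the S3 path below is a hypothetical stand-in). Note that the timeline under `.hoodie` retaining earlier instants is expected; the interesting question is whether the snapshot still returns records stamped with a pre-overwrite commit time:

```scala
// Read the latest snapshot and list which commits the visible records came from.
val snapshot = spark.read.format("hudi").load("s3://bucket/warehouse/target_table")
snapshot.select("_hoodie_commit_time").distinct().
  orderBy("_hoodie_commit_time").show(false)
```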

@njalan If the data you are inserting has dups, then insert overwrite will write those dups into the table.
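If the joined output can legitimately carry duplicate keys, one option (a sketch under the same hypothetical field names as above, not necessarily the fix here) is to have Hudi combine records on the record key before the insert, or to drop duplicates in Spark first:

```scala
// Option A: let Hudi dedupe by record key before the insert
// (keeps the row with the largest precombine value per key).
joined.write.format("hudi").
  option("hoodie.datasource.write.operation", "insert_overwrite").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.combine.before.insert", "true").
  option("hoodie.table.name", "target_table").
  mode(SaveMode.Append).
  save("s3://bucket/warehouse/target_table")

// Option B: dedupe in Spark before handing the frame to Hudi.
val deduped = joined.dropDuplicates("id")
```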

Can you please share the timeline with us so we can look further?
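For reference, one way to dump the timeline from a Spark shell is to list the instant files under the table's `.hoodie` directory (the path below is a hypothetical stand-in):

```scala
import org.apache.hadoop.fs.Path

// List the timeline instants (.commit, .replacecommit, etc.) under .hoodie.
val timelinePath = new Path("s3://bucket/warehouse/target_table/.hoodie")
val fs = timelinePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(timelinePath).map(_.getPath.getName).sorted.foreach(println)
```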