linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.


Overwrite mode with the partitionBy option overwrites the entire dataset instead of only the matching partitions

sketchmind opened this issue · comments

Spark 3.0.1, Hadoop 3.2 on YARN.
The parquet and Delta Lake sources work as expected, but tfrecord does not.

Thanks for trying out spark-tfrecord.
Can you provide a code snippet showing the two different behaviors, parquet vs. tfrecord?

@sheldon1iu, in my experience, adding the following Spark configuration does the trick:

'spark.sql.sources.partitionOverwriteMode': 'dynamic',
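To illustrate why this setting matters, here is a plain-Python sketch (not Spark code; `overwrite` is a hypothetical helper) of the difference between Spark's static and dynamic partition-overwrite semantics:

```python
def overwrite(table, new_rows, mode):
    """Mimic Spark's partitionOverwriteMode on a dict of partition -> rows."""
    if mode == "static":
        # static (the default): the whole table is replaced,
        # so partitions absent from the incoming data are lost
        return dict(new_rows)
    elif mode == "dynamic":
        # dynamic: only partitions present in the incoming data are replaced
        merged = dict(table)
        merged.update(new_rows)
        return merged
    raise ValueError(f"unknown mode: {mode}")

table = {"2020-12-08": [0, 1, 2]}        # existing partition on disk
incoming = {"2020-12-09": [0, 1, 2]}     # new partition being written

print(overwrite(table, incoming, "static"))   # only 2020-12-09 survives
print(overwrite(table, incoming, "dynamic"))  # both partitions are kept
```

In static mode the second write in the loop below wipes out the 2020-12-08 partition, which matches the behavior reported in this issue.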

@junshi15

my code:

# -*- coding:utf-8 -*-

from pyspark.sql import SparkSession, functions as F

# sparkConf is defined on the server; the configuration for both examples is the same
spark = SparkSession \
    .builder \
    .appName("spark-tfrecord-test") \
    .enableHiveSupport() \
    .getOrCreate()

# make sample data
df = spark.range(3)

for dt in ['2020-12-08', '2020-12-09']:
    sample = df.withColumn("dt", F.lit(dt))
    print(f"show dataframe on date {dt}:")
    sample.show()

    # save the sample dataframe using the tfrecord source
    sample.repartition(1) \
        .write \
        .partitionBy("dt") \
        .format("tfrecord") \
        .option("recordType", "SequenceExample") \
        .mode("overwrite") \
        .save("hdfs:///data/tfrecord/spark_tfrecord_test")

    # save the sample dataframe using the delta source
    sample.repartition(1) \
        .write \
        .partitionBy("dt") \
        .format("delta") \
        .mode("overwrite") \
        .save("hdfs:///data/delta/spark_tfrecord_test")
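For reference, the dynamic-overwrite setting suggested above can be applied directly on the session builder. This is a sketch based on the snippet in this thread; only the `config` line is new:

```python
from pyspark.sql import SparkSession

# Same builder as above, with dynamic partition overwrite enabled so that
# .mode("overwrite") combined with .partitionBy("dt") replaces only the
# partitions present in the incoming data, not the whole output path.
spark = SparkSession \
    .builder \
    .appName("spark-tfrecord-test") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .enableHiveSupport() \
    .getOrCreate()
```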

result: (screenshots omitted)

@zetaatlyft You are right. After applying your setting, it works as expected. Thanks a lot!

Thanks @zetaatlyft , I learned a new trick.
@sheldon1iu, I am closing this issue since it is resolved.