Overwrite mode with partitionBy option overwrites the entire dataset instead of only the targeted partition
sketchmind opened this issue
Spark 3.0.1, Hadoop 3.2 on YARN.
The parquet and delta sources work as expected, but tfrecord does not: overwriting with partitionBy replaces the whole output directory rather than just the written partition.
Thanks for trying out spark-tfrecord.
Can you provide a code snippet showing the two different behaviors, parquet vs. tfrecord?
@sheldon1iu From my experience, adding the following Spark configuration does the trick:
'spark.sql.sources.partitionOverwriteMode': 'dynamic'
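For reference, here is a minimal sketch of where that setting can go (the app name is just a placeholder); both forms use standard Spark APIs:

from pyspark.sql import SparkSession

# Option 1: set it when building the session
spark = SparkSession \
    .builder \
    .appName("partition-overwrite-demo") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .getOrCreate()

# Option 2: set it on an existing session before writing
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

With "dynamic", an overwrite only replaces the partitions present in the incoming DataFrame; with the default "static", Spark first deletes every partition matching the write, which would explain the behavior reported here.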
My code:
# -*- coding:utf-8 -*-
from pyspark.sql import SparkSession, functions as F
# sparkConf is defined on the server; the configuration is the same for both examples
spark = SparkSession \
    .builder \
    .appName("spark-tfrecord-test") \
    .enableHiveSupport() \
    .getOrCreate()
# make sample data
df = spark.range(3)
for dt in ['2020-12-08', '2020-12-09']:
    sample = df.withColumn("dt", F.lit(dt))
    print(f"show dataframe on date {dt}:")
    sample.show()

    # save the sample dataframe with the tfrecord source
    sample.repartition(1) \
        .write \
        .partitionBy("dt") \
        .format("tfrecord") \
        .option("recordType", "SequenceExample") \
        .mode("overwrite") \
        .save("hdfs:///data/tfrecord/spark_tfrecord_test")

    # save the sample dataframe with the delta source
    sample.repartition(1) \
        .write \
        .partitionBy("dt") \
        .format("delta") \
        .mode("overwrite") \
        .save("hdfs:///data/delta/spark_tfrecord_test")
@zetaatlyft You are right. After applying your setting, it works as expected. Thanks a lot!
Thanks @zetaatlyft, I learned a new trick.
@sheldon1iu, I am closing this issue since it is resolved.