linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.


Overwrite mode with the partitionBy option overwrites the entire dataset instead of only the matching partitions

sketchmind opened this issue · comments

Spark 3.0.1, Hadoop 3.2 on YARN.
The parquet and Delta Lake sources work as expected, but tfrecord does not.

Thanks for trying out spark-tfrecord.
Can you provide a code snippet showing the two different behaviors, parquet vs. tfrecord?

@sheldon1iu, in my experience, adding the following Spark configuration does the trick:

'spark.sql.sources.partitionOverwriteMode': 'dynamic',
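To illustrate why this setting matters, here is a plain-Python sketch (not Spark code; `overwrite` is a hypothetical helper) of the difference between Spark's static and dynamic partition-overwrite semantics:

```python
def overwrite(table, new_rows, mode):
    """Mimic Spark's partitionOverwriteMode on a dict of partition -> rows."""
    if mode == "static":
        # static (the default): the whole table is replaced,
        # so partitions absent from the incoming data are lost
        return dict(new_rows)
    elif mode == "dynamic":
        # dynamic: only partitions present in the incoming data are replaced
        merged = dict(table)
        merged.update(new_rows)
        return merged
    raise ValueError(f"unknown mode: {mode}")

table = {"2020-12-08": [0, 1, 2]}        # existing partition on disk
incoming = {"2020-12-09": [0, 1, 2]}     # new partition being written

print(overwrite(table, incoming, "static"))   # only 2020-12-09 survives
print(overwrite(table, incoming, "dynamic"))  # both partitions are kept
```

In static mode the second write in the loop below wipes out the 2020-12-08 partition, which matches the behavior reported in this issue.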

@junshi15

my code:

# -*- coding:utf-8 -*-

from pyspark.sql import SparkSession, functions as F

# sparkConf is defined on the server; the configuration for both examples is the same
spark = SparkSession \
    .builder \
    .appName("spark-tfrecord-test") \
    .enableHiveSupport() \
    .getOrCreate()

# make sample data
df = spark.range(3)

for dt in ['2020-12-08', '2020-12-09']:
    sample = df.withColumn("dt", F.lit(dt))
    print(f"show dataframe on date {dt}:")
    sample.show()

    # save the sample dataframe using the tfrecord source
    sample.repartition(1) \
        .write \
        .partitionBy("dt") \
        .format("tfrecord") \
        .option("recordType", "SequenceExample") \
        .mode("overwrite") \
        .save("hdfs:///data/tfrecord/spark_tfrecord_test")

    # save the sample dataframe using the delta source
    sample.repartition(1) \
        .write \
        .partitionBy("dt") \
        .format("delta") \
        .mode("overwrite") \
        .save("hdfs:///data/delta/spark_tfrecord_test")
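For reference, the dynamic-overwrite setting suggested above can be applied directly on the session builder. This is a sketch based on the snippet in this thread; only the `config` line is new:

```python
from pyspark.sql import SparkSession

# Same builder as above, with dynamic partition overwrite enabled so that
# .mode("overwrite") combined with .partitionBy("dt") replaces only the
# partitions present in the incoming data, not the whole output path.
spark = SparkSession \
    .builder \
    .appName("spark-tfrecord-test") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .enableHiveSupport() \
    .getOrCreate()
```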

result: (screenshots omitted)

@zetaatlyft You are right. After applying your setting, it works as expected. Thanks a lot!

Thanks @zetaatlyft , I learned a new trick.
@sheldon1iu, I am closing this issue since it is resolved.