AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark


Bytelength

Loganhex2021 opened this issue · comments

commented

Background [Optional]

We are using the Cobrix library to read EBCDIC files in Databricks. There is a validation requirement to check the byte length of each record in the file.

Question

Is there an option to generate the byte length of each record while reading an EBCDIC file?

commented

@yruslan - Could you please let me know if you have any idea how to calculate the byte length of each record in an EBCDIC file?

Do you need the record size for each record, or the file size for each record?

You can get a file name for each record using either

.option("with_input_file_name_col", "input_file_name")

or

df.withColumn("input_file_name", input_file_name())

depending on the type of file (variable-length vs. fixed-length).
You can then use a filesystem API (Hadoop client, etc.) to get the size of each file.
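To illustrate the file-size approach, here is a minimal sketch. The paths and copybook name are hypothetical, and the join assumes the path strings returned by input_file_name() and by the Hadoop listing use the same URI scheme:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.input_file_name

// Read the EBCDIC data with Cobrix and tag each record with its source file.
val df = spark.read
  .format("cobol")
  .option("copybook", "/mnt/copybooks/record.cpy")   // hypothetical copybook path
  .load("/mnt/data/")                                // hypothetical data path
  .withColumn("input_file_name", input_file_name())

// Use the Hadoop FileSystem API to look up the size of each input file.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val fileSizes = fs.listStatus(new Path("/mnt/data/"))
  .map(status => (status.getPath.toString, status.getLen))
  .toSeq
val sizesDf = spark.createDataFrame(fileSizes).toDF("input_file_name", "file_size_bytes")

// Attach the file size to every record coming from that file.
val withFileSize = df.join(sizesDf, Seq("input_file_name"), "left")
```

This gives a per-record file size, not a per-record byte length, which is why the follow-up question below matters.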

commented

Thanks @yruslan, I need the record size for each record.

commented

@yruslan, could you please help here?

Hi, sorry for the late reply. Currently, this is not supported. I've added this to feature requests.
We can make

.option("generate_record_id", "true")

generate the record length as well.
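For context, here is a minimal sketch of how that could look from the user's side once implemented. The existing generate_record_id option adds File_Id and Record_Id columns; the Record_Byte_Length column shown here is purely hypothetical and only illustrates the requested feature:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical sketch: assumes a future Cobrix version where
// .option("generate_record_id", "true") also emits a record-length column.
// The column name Record_Byte_Length is an assumption, not current API.
val df = spark.read
  .format("cobol")
  .option("copybook", "/mnt/copybooks/record.cpy")   // hypothetical copybook path
  .option("generate_record_id", "true")
  .load("/mnt/data/ebcdic_file.dat")                 // hypothetical data path

// Validation: flag records whose byte length differs from the expected size.
val expectedLength = 120                             // assumed record size in bytes
val invalidRecords = df.filter(col("Record_Byte_Length") =!= expectedLength)
```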