AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

IndexBuilder step running in single thread for ASCII variable length files

saikumare-a opened this issue

Background

Platform: Azure Databricks, with the file on Azure Storage
Cluster: 1 driver (16 cores) + 3 workers (16 cores each)
Spark API: PySpark

Cobrix:
using Cobrix: spark-cobol_2.12

Code:
df = spark.read.format("cobol") \
    .option("copybook", copybook_path) \
    .option("encoding", "ascii") \
    .option("record_format", "D2") \
    .load(src_file_path)

df.write.parquet("output_path")

Scenario:
We have a 300+ GB variable-width, variable-length, multi-segment file with more than 2000 columns. The file contains 13+ million records and is a text file (CRLF / LF are used to split records).

I am trying to write it out as a Parquet file.

  1. The indexBuilder stage runs in a single partition (1 core of a single worker node) and takes more than 2 hours.
  2. After the index is built, writing the Parquet file completes in 30 minutes using multiple partitions and all worker cores/threads.

The data in the Parquet file is correct.
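
As a side note, the parallelism of the resulting read can be checked with standard Spark APIs (a small diagnostic sketch):

# Partition count of the DataFrame returned by the cobol source;
# this reflects how parallel the read is once the index is built.
print(df.rdd.getNumPartitions())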

Question

How do I parallelize the job across executors for the indexBuilder step as well?

Hi @saikumare-a,

The index building can't be parallelized for variable-length files. This is because we don't know where each record begins and ends given a record number. That's why the index builder is there in the first place: to prepare the file for parallel processing.

If you want to improve performance, your options are:

  1. Use multiple input files, not just one big file. Multiple files are processed in parallel.
  2. Use a fixed-length record format. This way the location of each record is known, so reading can run in parallel (see the sketch below).
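
A minimal sketch of option 2, assuming the records can be padded to a fixed size (the record_length value here is hypothetical; record_format "F" is Cobrix's fixed-length format):

df = spark.read.format("cobol") \
    .option("copybook", copybook_path) \
    .option("encoding", "ascii") \
    .option("record_format", "F") \
    .option("record_length", "2000") \
    .load(src_file_path)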

Hi @yruslan,

Thank you for the reply.

The files are in ASCII encoding and each record is on a single line (CR/LF separated). Each record has variable length with multiple segments (each segment has a variable length, making the overall record variable length).

Is there any other option, given that each record occupies a single line?

For ASCII files there is an extension that can skip indexing.

.option("record_format", "D2")

But it can only be used if record id and segment id generation are not used.

Hi @yruslan,

I tried the "D2" record format (without record id and segment id generation) for ASCII files, but the first stage is still index building.

What's your exact spark.read code snippet?

Hi @yruslan, below is the code snippet we are using:
df = spark.read.format("cobol") \
    .option("copybook", "<copybook_path>") \
    .option("generate_record_id", "false") \
    .option("drop_value_fillers", "false") \
    .option("drop_group_fillers", "false") \
    .option("pedantic", "true") \
    .option("encoding", "ascii") \
    .option("variable_size_occurs", "true") \
    .option("record_format", "D2") \
    .load("<file_path>")

This option is incompatible with D2; it switches the format to D:

.option("variable_size_occurs","true")

I'm not quite sure why it is like that. It should be compatible. Will check.
Meanwhile, please check without this option.
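
That is, the same read with variable_size_occurs removed (a sketch based on the snippet above; placeholder paths kept as-is):

df = spark.read.format("cobol") \
    .option("copybook", "<copybook_path>") \
    .option("generate_record_id", "false") \
    .option("drop_value_fillers", "false") \
    .option("drop_group_fillers", "false") \
    .option("pedantic", "true") \
    .option("encoding", "ascii") \
    .option("record_format", "D2") \
    .load("<file_path>")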

Hi @yruslan,

Could you please elaborate on the change that was made, and which record_format values will see improved performance with this fix?

Both D and D2 record formats should have the performance improvement.

Thank you for the update

Hi @yruslan,

We are getting different results with 2.6.1 and the 2.6.2 snapshot provided in #545.

We noticed that .option("variable_size_occurs", "true") is not working in the 2.6.2 snapshot.

All columns that follow a variable-size array become null.
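
For illustration, a minimal sketch of the kind of layout affected (the copybook is hypothetical, and this assumes the copybook_contents option, which accepts the copybook text as a string):

copybook_contents = """
       01  REC.
           05  CNT       PIC 9(1).
           05  ITEMS     OCCURS 0 TO 3 TIMES DEPENDING ON CNT.
               10  ITEM  PIC X(2).
           05  TRAILER   PIC X(5).
"""

df = spark.read.format("cobol") \
    .option("copybook_contents", copybook_contents) \
    .option("encoding", "ascii") \
    .option("record_format", "D2") \
    .option("variable_size_occurs", "true") \
    .load("<file_path>")

# TRAILER (the column after the variable-size array) comes back null
# on the 2.6.2 snapshot, while 2.6.1 populates it correctly.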

Regards,
Saikumar

Hi, good spot! Will check

I can confirm. Could you please create a separate issue for this?

Also, in the 2.6.2 snapshot:

  1. The "D" record_format still performs index building and gives the correct result.
  2. The "D2" record_format does not perform index building, but gives the wrong result mentioned above.

I can confirm. Could you please create a separate issue for this?

Hi @yruslan,

#553 has been created.