AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

IndexBuilder step running in single thread for ASCII variable length files

saikumare-a opened this issue

Background

Platform: Azure Databricks, with the file on Azure Storage
Cluster: 1 driver (16 cores) + 3 workers (16 cores each)
Spark API: PySpark

Cobrix:
using Cobrix: spark-cobol_2.12

Code:
df = spark.read.format("cobol") \
    .option("copybook", copybook_path) \
    .option("encoding", "ascii") \
    .option("record_format", "D2") \
    .load(src_file_path)

df.write.parquet("output_path")

Scenario:
We have a 300+ GB variable-width, variable-length, multi-segment file with more than 2000 columns. The file contains 13+ million records and is a text file (CRLF / LF are used to split records).

I am trying to write it out as a Parquet file.

  1. The indexBuilder stage runs in a single partition (1 core of a single worker node) and takes more than 2 hours.
  2. After the index is built, writing the Parquet file completes in 30 minutes using multiple partitions and all worker cores/threads.

The data in the Parquet file is correct.
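
As a side note, the parallelism of the resulting read can be checked with standard Spark APIs (a small diagnostic sketch):

# Partition count of the DataFrame returned by the cobol source;
# this reflects how parallel the read is once the index is built.
print(df.rdd.getNumPartitions())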

Question

How do I parallelize the job across executors for the indexBuilder step as well?

Hi @saikumare-a,

The index building can't be parallelized for variable-length files. This is because we don't know where each record begins and ends given a record number. That's why the index builder is there in the first place: to prepare the file for parallel processing.

If you want to improve performance, your options are:

  1. Use multiple input files, not just one big file. Multiple files are processed in parallel.
  2. Use a fixed-length record format. This way the location of each record is known, so reading can run in parallel (see the sketch below).
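
A minimal sketch of option 2, assuming the records can be padded to a fixed size (the record_length value here is hypothetical; record_format "F" is Cobrix's fixed-length format):

df = spark.read.format("cobol") \
    .option("copybook", copybook_path) \
    .option("encoding", "ascii") \
    .option("record_format", "F") \
    .option("record_length", "2000") \
    .load(src_file_path)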

Hi @yruslan,

Thank you for the reply.

The files are in ASCII encoding and each record is on a single line (CR/LF separated). Each record has variable length with multiple segments (each segment has a variable length, making the overall record variable length).

Is there any other option, given that each record occupies a single line?

For ASCII files there is an extension that can skip indexing.

.option("record_format", "D2")

But it can only be used if record id and segment id generation are not used.

Hi @yruslan,

I tried the "D2" record format (without record id and segment id generation) for ASCII files, but the first stage is still index building.

What's your exact spark.read code snippet?

Hi @yruslan, below is the code snippet we are using:
df = spark.read.format("cobol") \
    .option("copybook", "<copybook_path>") \
    .option("generate_record_id", "false") \
    .option("drop_value_fillers", "false") \
    .option("drop_group_fillers", "false") \
    .option("pedantic", "true") \
    .option("encoding", "ascii") \
    .option("variable_size_occurs", "true") \
    .option("record_format", "D2") \
    .load("<file_path>")

This option is incompatible with D2; it switches the format to D:

.option("variable_size_occurs","true")

I'm not quite sure why it is like that. It should be compatible. Will check.
Meanwhile, please check without this option.
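
That is, the same read with variable_size_occurs removed (a sketch based on the snippet above; placeholder paths kept as-is):

df = spark.read.format("cobol") \
    .option("copybook", "<copybook_path>") \
    .option("generate_record_id", "false") \
    .option("drop_value_fillers", "false") \
    .option("drop_group_fillers", "false") \
    .option("pedantic", "true") \
    .option("encoding", "ascii") \
    .option("record_format", "D2") \
    .load("<file_path>")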

Hi @yruslan,

Could you please elaborate on the change that was made, and which record_format values will see improved performance with this fix?

Both D and D2 record formats should have the performance improvement.

Thank you for the update

Hi @yruslan,

We are getting different results with 2.6.1 and the 2.6.2 snapshot provided in #545.

We noticed that .option("variable_size_occurs", "true") is not working in the 2.6.2 snapshot.

All columns that follow a variable-size array become null.
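
For illustration, a minimal sketch of the kind of layout affected (the copybook is hypothetical, and this assumes the copybook_contents option, which accepts the copybook text as a string):

copybook_contents = """
       01  REC.
           05  CNT       PIC 9(1).
           05  ITEMS     OCCURS 0 TO 3 TIMES DEPENDING ON CNT.
               10  ITEM  PIC X(2).
           05  TRAILER   PIC X(5).
"""

df = spark.read.format("cobol") \
    .option("copybook_contents", copybook_contents) \
    .option("encoding", "ascii") \
    .option("record_format", "D2") \
    .option("variable_size_occurs", "true") \
    .load("<file_path>")

# TRAILER (the column after the variable-size array) comes back null
# on the 2.6.2 snapshot, while 2.6.1 populates it correctly.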

Regards,
Saikumar

Hi, good spot! Will check

I can confirm. Could you please create a separate issue for this?

Also, in the 2.6.2 snapshot:

  1. The "D" record_format still performs index building and gives the correct result.
  2. The "D2" record_format does not perform index building, but gives the wrong result mentioned above.

I can confirm. Could you please create a separate issue for this?

Hi @yruslan,

#553 has been created.