AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

spark input_file_name() not working in cobrix

kriswijnants opened this issue · comments

Hi,

Thank you for creating and maintaining Cobrix. It's a tool we discovered recently, and we plan to use it in our cloud data platform for our mainframe project.

Just a small question. We noticed that Spark's input_file_name() function always returns blanks when using Cobrix, in combination with the option("is_record_sequence", "true") option.

```scala
spark.read.format("cobol")
  .option("copybook", "/mnt/inputMDP/BIWA_GUTEX/Copybooks/" + dbutils.widgets.get("version") + "/GAGUSECO_20070115.txt")
  .option("is_record_sequence", "true")
  .load("/mnt/inputMDP/BIWA_GUTEX/Datafiles/" + dbutils.widgets.get("version") + "/GA-GA324001*")
  .withColumn("ISN_Source", input_file_name())
  .createOrReplaceTempView("vw_gutex_GA")
```

Do you notice the same behaviour? Is there any chance to get this working?

Keep up the good work!

Regards,

Kris

Thanks for reporting the issue!

Looks interesting. Will take a look.

I can confirm the issue. Indeed, for variable-record-length files input_file_name() returns an empty string. That is due to the way we handle sparse index creation to parallelize the reading of such files.

It will take a while to fix this properly (we would probably need to create a custom RDD). But we can add a workaround that generates a column with the input file name for each record. That's what we are going to do first. It would look like this:

.option("with_input_file_name_col", "ISN_Source")

Just to double-check: which Spark version are you using?

We are planning to release Cobrix 2.0.0 first, and all further changes will be made there. Note that it will only support Spark 2.4 or above.

Great! Cobrix 2.0.0 is planned to be released this week, and the workaround for this issue can be expected sometime next week.

This should be fixed in the latest snapshot.
Please try:

        <dependency>
            <groupId>za.co.absa.cobrix</groupId>
            <artifactId>spark-cobol_2.11</artifactId>
            <version>2.0.1-SNAPSHOT</version>
        </dependency>

and let me know if the issue is fixed.
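If you use sbt rather than Maven, the equivalent would be roughly this (a sketch; the snapshot resolver URL is an assumption and may need adjusting to your repository setup):

```scala
// build.sbt sketch, assuming snapshots are resolved from Sonatype OSS
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "2.0.1-SNAPSHOT"
```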

Forgot to mention: in order to get input file names for each record of a variable-record-length file, a workaround is used. In your case the option looks like this:

.option("with_input_file_name_col", "ISN_Source")

I'd also recommend using

.option("pedantic", "true")

so that unrecognized options cause errors.
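To quickly confirm the column is populated, something like this can be run after the read (a sketch; df is assumed to be the DataFrame loaded with the two options above):

```scala
import org.apache.spark.sql.functions._

// Expect one distinct value per input file matched by the load path
df.select("ISN_Source").distinct().show(false)

// If only the file name (not the full path) is needed
df.withColumn("ISN_Source_name", regexp_extract(col("ISN_Source"), "([^/]+)$", 1))
  .select("ISN_Source_name")
  .distinct()
  .show(false)
```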

Hi Kris,

Linking a snapshot version requires additional configuration in .m2/settings.xml, and it might be even harder on managed clusters.

Try setting the version to 2.0.1, which was released today.

And please let me know if it worked for you.

Thank you,
Ruslan

## Environment

docker: jupyter/all-spark-notebook:latest + Apache Toree (Scala)

## Issue

When using

.option("file_start_offset", "600")
.option("file_end_offset", "600")

input_file_name() no longer works

### Anonymized extract

%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.3 --transitive

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val sparkBuilder = SparkSession.builder().appName("Example")
val spark = sparkBuilder.getOrCreate()

spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK.txt")
  .option("file_start_offset", "600")
  .option("file_end_offset", "600")
  .load("file:///home/jovyan/data/BRAND/initial_transformed/FILEPATTERN*")
  .withColumn("DPSource", callUDF("get_file_name", input_file_name()))
```

```scala
cobolDataframe
  //.filter("RECORD.ID % 2 = 0") // filter the even values of the nested field 'RECORD_LENGTH'
  .take(20)
  .foreach(v => println(v))
```

Hi Kris,
When you use file offsets, a different reader is used. In this case, use the workaround option instead of input_file_name():

.option("with_input_file_name_col", "DPSource")

Hi Ruslan,

"with_input_file_name_col" seems be intended for "is_record_sequence = true" only.

In this case I have a copybook (fixed length) that does not describe the header and footer.

Possible actions I should take are:

  • get rid of the header and footer in a pre-processing step (a less clean solution, to be avoided)
  • try to rewrite the copybook to accommodate the header and footer (the ideal solution, maybe as it should be), consisting of several record types; I will look into this next (see the sketch below).
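For the second option, here is a rough sketch of what the read could look like once the copybook describes a record-type field. The field name REC-TYPE, the segment values, and the use of the segment_field / segment_filter options are assumptions on my side (based on Cobrix's multisegment support), not something verified against this data; if those options are not available, treat this as pseudocode.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Sketch only: assumes the rewritten copybook exposes a REC-TYPE field that
// distinguishes header ("H"), detail ("D") and trailer ("T") records.
val detailDf = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK_WITH_TYPES.txt") // hypothetical rewritten copybook
  .option("segment_field", "REC_TYPE")   // field that identifies the record type
  .option("segment_filter", "D")         // keep only detail records, dropping header and footer
  .load("file:///home/jovyan/data/BRAND/initial_transformed")
```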

I value your opinion. Mainframe code can be messy, and it is a trade-off between handling source particularities out of the box and keeping the Cobrix code maintainable.

Thanks in advance,

Regards, Bart

A test of your suggestion:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  //.option("is_record_sequence", "true")
  //.option("generate_record_id", "true") // for comparison with the unconverted (Windows) file only
  .option("pedantic", "true")
  //.option("with_input_file_name_col", "DPSourceTemp")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK.txt")
  .option("file_start_offset", "600")
  .option("file_end_offset", "600")
  .option("with_input_file_name_col", "DPSourceTemp")
  .load("file:///home/jovyan/data/BRAND/initial_transformed")
```

The result:

```
Name: java.lang.IllegalArgumentException
Message: Option 'with_input_file_name_col' is supported only when 'is_record_sequence' = true.
StackTrace:
  at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersParser$.validateSparkCobolOptions(CobolParametersParser.scala:467)
  at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersParser$.parse(CobolParametersParser.scala:209)
  at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:56)
  at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
```

Interesting. I will take a look. I think this can be easily fixed so that with_input_file_name_col would work in your case.

Since the incompatibility between with_input_file_name_col and file_start_offset is a separate issue, I opened #252 to continue the discussion there.