AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

spark input_file_name() not working in cobrix

kriswijnants opened this issue · comments

Hi,

Thank you for creating and maintaining Cobrix. It's a tool we discovered recently, and we plan to use it in our cloud data platform for our mainframe project.

Just a small question. We noticed that Spark's input_file_name() function always returns blanks when using Cobrix, in combination with the option("is_record_sequence", "true") option.

```scala
spark.read.format("cobol")
  .option("copybook", "/mnt/inputMDP/BIWA_GUTEX/Copybooks/" + dbutils.widgets.get("version") + "/GAGUSECO_20070115.txt")
  .option("is_record_sequence", "true")
  .load("/mnt/inputMDP/BIWA_GUTEX/Datafiles/" + dbutils.widgets.get("version") + "/GA-GA324001*")
  .withColumn("ISN_Source", input_file_name())
  .createOrReplaceTempView("vw_gutex_GA")
```

Do you notice the same behaviour? Is there any chance to get this working?

Keep up the good work!

Regards,

Kris

Thanks for reporting the issue!

Looks interesting. Will take a look.

I can confirm the issue. Indeed, for variable-record-length files input_file_name() returns an empty string. That is due to the way we handle sparse index creation to parallelize the reading of such files.

It will take a while to fix this properly (we would probably need to create a custom RDD). But we can add a workaround that generates a column with the input file name for each record. That's what we are going to do first. It would look like this:

.option("with_input_file_name_col", "ISN_Source")

Just to double-check: which Spark version are you using?

We are planning to release Cobrix 2.0.0 first, and all further changes will be made there. Note that it will only support Spark 2.4 or above.

Great! Cobrix 2.0.0 is planned to be released this week, and the workaround for this issue can be expected sometime next week.

This should be fixed in the latest snapshot.
Please try:

        <dependency>
            <groupId>za.co.absa.cobrix</groupId>
            <artifactId>spark-cobol_2.11</artifactId>
            <version>2.0.1-SNAPSHOT</version>
        </dependency>

and let me know if the issue is fixed.
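If you use sbt rather than Maven, the equivalent would be roughly this (a sketch; the snapshot resolver URL is an assumption and may need adjusting to your repository setup):

```scala
// build.sbt sketch, assuming snapshots are resolved from Sonatype OSS
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "2.0.1-SNAPSHOT"
```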

Forgot to mention: in order to get input file names for each record of a variable-record-length file, a workaround is used. In your case the option looks like this:

.option("with_input_file_name_col", "ISN_Source")

I'd also recommend using

.option("pedantic", "true")

so that unrecognized options cause errors.
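To quickly confirm the column is populated, something like this can be run after the read (a sketch; df is assumed to be the DataFrame loaded with the two options above):

```scala
import org.apache.spark.sql.functions._

// Expect one distinct value per input file matched by the load path
df.select("ISN_Source").distinct().show(false)

// If only the file name (not the full path) is needed
df.withColumn("ISN_Source_name", regexp_extract(col("ISN_Source"), "([^/]+)$", 1))
  .select("ISN_Source_name")
  .distinct()
  .show(false)
```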

Hi Kris,

Linking a snapshot version requires additional configuration in .m2/settings.xml, and it might be even harder on managed clusters.

Try setting the version to 2.0.1, which was released today.

And please let me know if it worked for you.

Thank you,
Ruslan

## Environment

docker: jupyter/all-spark-notebook:latest + Apache Toree (Scala)

## Issue

When using

.option("file_start_offset", "600")
.option("file_end_offset", "600")

input_file_name() no longer works

### Anonymized extract

%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.3 --transitive

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val sparkBuilder = SparkSession.builder().appName("Example")
val spark = sparkBuilder.getOrCreate()

spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK.txt")
  .option("file_start_offset", "600")
  .option("file_end_offset", "600")
  .load("file:///home/jovyan/data/BRAND/initial_transformed/FILEPATTERN*")
  .withColumn("DPSource", callUDF("get_file_name", input_file_name()))
```

```scala
cobolDataframe
  //.filter("RECORD.ID % 2 = 0") // filter the even values of the nested field 'RECORD_LENGTH'
  .take(20)
  .foreach(v => println(v))
```

Hi Kris,
When you use file offsets, a different reader is used. In this case, use the workaround option instead of input_file_name():

.option("with_input_file_name_col", "DPSource")

Hi Ruslan,

"with_input_file_name_col" seems be intended for "is_record_sequence = true" only.

In this case I have a copybook (fixed length) that does not describe the header and footer.

Possible actions I should take are:

  • get rid of the header and footer in a pre-processing step (a less clean solution, to be avoided)
  • try to rewrite the copybook to accommodate the header and footer (the ideal solution, maybe as it should be), consisting of several record types; I will look into this next (see the sketch below).
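For the second option, here is a rough sketch of what the read could look like once the copybook describes a record-type field. The field name REC-TYPE, the segment values, and the use of the segment_field / segment_filter options are assumptions on my side (based on Cobrix's multisegment support), not something verified against this data; if those options are not available, treat this as pseudocode.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Sketch only: assumes the rewritten copybook exposes a REC-TYPE field that
// distinguishes header ("H"), detail ("D") and trailer ("T") records.
val detailDf = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK_WITH_TYPES.txt") // hypothetical rewritten copybook
  .option("segment_field", "REC_TYPE")   // field that identifies the record type
  .option("segment_filter", "D")         // keep only detail records, dropping header and footer
  .load("file:///home/jovyan/data/BRAND/initial_transformed")
```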

I value your opinion. Mainframe code can be messy, and it is a trade-off between handling source particularities out of the box and keeping the Cobrix code maintainable.

Thanks in advance,

Regards, Bart

A test of your suggestion:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  //.option("is_record_sequence", "true")
  //.option("generate_record_id", "true") // for comparison with the unconverted (Windows) file only
  .option("pedantic", "true")
  //.option("with_input_file_name_col", "DPSourceTemp")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK.txt")
  .option("file_start_offset", "600")
  .option("file_end_offset", "600")
  .option("with_input_file_name_col", "DPSourceTemp")
  .load("file:///home/jovyan/data/BRAND/initial_transformed")
```

The result:

```
Name: java.lang.IllegalArgumentException
Message: Option 'with_input_file_name_col' is supported only when 'is_record_sequence' = true.
StackTrace:
  at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersParser$.validateSparkCobolOptions(CobolParametersParser.scala:467)
  at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersParser$.parse(CobolParametersParser.scala:209)
  at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:56)
  at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
```

Interesting. I will take a look. I think this can be easily fixed so that with_input_file_name_col would work in your case.

Since the incompatibility between with_input_file_name_col and file_start_offset is a separate issue, I opened #252 to continue the discussion there.