AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

NUL characters in US-ASCII getting converted to space character, i.e. " ", instead of empty value.

rohitavantsa opened this issue · comments

We have a US-ASCII fixed-length file.
File contents:
1234   t ----> this row has three spaces in the status field
4567NULNULNULf -----> this row has three NUL characters in the status field
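
For reference, here is a minimal Python sketch (the file name is hypothetical) that produces a file with this layout, assuming the 8-byte records (4-byte ID, 3-byte status, 1-byte flag) are packed back-to-back with no delimiters:

    # Write two fixed-length 8-byte records in US-ASCII.
    with open('sample_us_ascii.dat', 'wb') as f:
        f.write(b'1234   t')           # status field: three spaces (0x20)
        f.write(b'4567\x00\x00\x00f')  # status field: three NULs (0x00)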

Copybook contents:

01 tablename.
   05 record_ID     PIC X(4).
   05 record_status PIC X(3).
   05 record_flag   PIC X(1).

Expected output:

[Row(record_ID='1234', record_status='   ', record_flag='t'),
 Row(record_ID='4567', record_status='', record_flag='f')]

Actual output:
[Row(record_ID='1234', record_status='   ', record_flag='t'),
 Row(record_ID='4567', record_status='   ', record_flag='f')]

We expected an empty value; instead we are getting three whitespace characters. On the on-prem system this field comes through as an empty value. Can you please help us understand why we are seeing this behavior?

@yruslan Can you please help us with this?

Hi, thanks for the issue report. Could you please add:

  • An example US-ASCII file with NUL characters.
  • The code snippet you are using to read the file.

Btw, does this option help remove the extra spaces: .option("string_trimming_policy", "both")?
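
For example (a sketch; the copybook variable and the path are placeholders):

    # Assumes an active SparkSession named 'spark'.
    df = (spark.read.format('cobol')
          .option('copybook_contents', copybook)
          .option('string_trimming_policy', 'both')  # trim leading and trailing spaces
          .load('/path/to/file'))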

Hi @yruslan

string_trimming_policy is set to none in our case, as we need to preserve the spaces while reading the file.

Here are the options we are using to read the file (the copybook text is held in the copybook variable):

spark.read.format('cobol') \
    .option('copybook_contents', copybook) \
    .option('encoding', 'ascii') \
    .option('ebcdic_code_page', 'cp037') \
    .option('string_trimming_policy', 'none') \
    .option('debug_ignore_file_size', 'true') \
    .load('filepath')

Please find the sample file below:
sampleUS-ASCII file.txt

Please open this file in Notepad++ to see the NUL characters.

Expected output:
The row with NULNULNUL should appear as an empty string instead of the '   ' (three spaces) we are currently getting in our dataframe. The on-prem system provides this field as '' (an empty field).

Hi,

Before looking deeper, please try:

  • Removing the 'ebcdic_code_page':'CP037' option, since it applies only to EBCDIC files, and
  • Adding .option("improved_null_detection", "true")

ASCII charset is set using this option:
.option("ascii_charset", "US_ASCII") (UTF_8 is the default)
(you can specify a different charset, of course)
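
Putting these suggestions together, the read could look like this (a sketch; the copybook variable and the path are placeholders):

    df = (spark.read.format('cobol')
          .option('copybook_contents', copybook)
          .option('encoding', 'ascii')                # ASCII instead of EBCDIC
          .option('ascii_charset', 'US_ASCII')        # charset of the file
          .option('string_trimming_policy', 'none')   # preserve spaces
          .option('improved_null_detection', 'true')  # all-0x00 fields become null
          .option('debug_ignore_file_size', 'true')
          .load('/path/to/file'))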

Sure @yruslan, will try this.

Hi @yruslan

We have tried removing the 'ebcdic_code_page':'CP037' option and adding .option("improved_null_detection", "true"), but it is still not working as we expect.

To be more clear:
The NUL character I am referring to is the hex value \x00, which is not being read properly. While reading, we expect an empty field but get a space character instead. The file I provided contains the NUL characters.
You could try it yourself and check whether this is the normal behavior or whether a fix is needed.

Thanks in advance

Currently, all characters lower than 0x20 are replaced by spaces. If all characters in a field are 0x00 and improved_null_detection = true, the field becomes null.

Will check your file. Probably the correct behavior for ASCII would be to not replace characters below 0x20 with spaces, and to always skip 0x00. This is something that needs to be implemented on our side.
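
To illustrate the described behavior in plain Python (a sketch of the logic only, not Cobrix's actual implementation):

    from typing import Optional

    def decode_ascii_field(raw: bytes, improved_null_detection: bool) -> Optional[str]:
        # With improved null detection, a field that is all 0x00 becomes null.
        if improved_null_detection and raw and all(b == 0x00 for b in raw):
            return None
        # Otherwise, every character below 0x20 is replaced by a space.
        return ''.join(chr(b) if b >= 0x20 else ' ' for b in raw)

    decode_ascii_field(b'\x00\x00\x00', True)   # -> None
    decode_ascii_field(b'\x00\x00\x00', False)  # -> '   ' (three spaces)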

Sure thanks.

This should be fixed in this branch:
https://github.com/AbsaOSS/cobrix/tree/bugfix/481-ignore-control-characters

You can test it by building that branch.

Thanks @yruslan, this fix resolves the issue for us.

Great! It will be released as a new version sometime next week.