AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

NUL characters in US-ASCII getting converted to space character, i.e. " ", instead of empty value.

rohitavantsa opened this issue · comments

We have a US-ASCII fixed-length file.
File contents:
1234   t ----> this row has three spaces in the status field
4567NULNULNULf -----> this row has three NUL characters in the status field
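
For reference, here is a minimal Python sketch (the file name is hypothetical) that produces a file with this layout, assuming the 8-byte records (4-byte ID, 3-byte status, 1-byte flag) are packed back-to-back with no delimiters:

    # Write two fixed-length 8-byte records in US-ASCII.
    with open('sample_us_ascii.dat', 'wb') as f:
        f.write(b'1234   t')           # status field: three spaces (0x20)
        f.write(b'4567\x00\x00\x00f')  # status field: three NULs (0x00)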

Copybook contents:

01 tablename.
   05 record_ID     PIC X(4).
   05 record_status PIC X(3).
   05 record_flag   PIC X(1).

Expected output:

[Row(record_ID='1234', record_status='   ', record_flag='t'),
 Row(record_ID='4567', record_status='', record_flag='f')]

Actual output:
[Row(record_ID='1234', record_status='   ', record_flag='t'),
 Row(record_ID='4567', record_status='   ', record_flag='f')]

We expected an empty value; instead we are getting three whitespace characters. On the on-prem system this field comes through as an empty value. Can you please help us understand why we are seeing this behavior?

@yruslan Can you please help us with this?

Hi, thanks for the issue report. Could you please add:

  • An example US-ASCII file with NUL characters.
  • The code snippet you are using to read the file.

Btw, does this option help remove the extra spaces: .option("string_trimming_policy", "both")?
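
For example (a sketch; the copybook variable and the path are placeholders):

    # Assumes an active SparkSession named 'spark'.
    df = (spark.read.format('cobol')
          .option('copybook_contents', copybook)
          .option('string_trimming_policy', 'both')  # trim leading and trailing spaces
          .load('/path/to/file'))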

Hi @yruslan

string_trimming_policy is set to none in our case, as we need to preserve the spaces while reading the file.

Here are the options we are using to read the file (the copybook text is held in the copybook variable):

spark.read.format('cobol') \
    .option('copybook_contents', copybook) \
    .option('encoding', 'ascii') \
    .option('ebcdic_code_page', 'cp037') \
    .option('string_trimming_policy', 'none') \
    .option('debug_ignore_file_size', 'true') \
    .load('filepath')

Please find the sample file below:
sampleUS-ASCII file.txt

Please open this file in Notepad++ to see the NUL characters.

Expected output:
The row with NULNULNUL should appear as an empty string instead of the '   ' (three spaces) we are currently getting in our dataframe. The on-prem system provides this field as '' (an empty field).

Hi,

Before looking deeper, please try:

  • Removing the 'ebcdic_code_page':'CP037' option, since it applies only to EBCDIC files, and
  • Adding .option("improved_null_detection", "true")

ASCII charset is set using this option:
.option("ascii_charset", "US_ASCII") (UTF_8 is the default)
(you can specify a different charset, of course)
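
Putting these suggestions together, the read could look like this (a sketch; the copybook variable and the path are placeholders):

    df = (spark.read.format('cobol')
          .option('copybook_contents', copybook)
          .option('encoding', 'ascii')                # ASCII instead of EBCDIC
          .option('ascii_charset', 'US_ASCII')        # charset of the file
          .option('string_trimming_policy', 'none')   # preserve spaces
          .option('improved_null_detection', 'true')  # all-0x00 fields become null
          .option('debug_ignore_file_size', 'true')
          .load('/path/to/file'))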

Sure @yruslan, will try this.

Hi @yruslan

We have tried removing the 'ebcdic_code_page':'CP037' option and adding .option("improved_null_detection", "true"), but it is still not working as we expect.

To be more clear:
The NUL character I am referring to is the hex value \x00, which is not being read properly. While reading, we expect an empty field but get a space character instead. The file I provided contains the NUL characters.
You could try it yourself and check whether this is the normal behavior or whether a fix is needed.

Thanks in advance

Currently, all characters lower than 0x20 are replaced by spaces. If all characters in a field are 0x00 and improved_null_detection = true, the field becomes null.

Will check your file. Probably the correct behavior for ASCII would be to not replace characters below 0x20 with spaces, and to always skip 0x00. This is something that needs to be implemented on our side.
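
To illustrate the described behavior in plain Python (a sketch of the logic only, not Cobrix's actual implementation):

    from typing import Optional

    def decode_ascii_field(raw: bytes, improved_null_detection: bool) -> Optional[str]:
        # With improved null detection, a field that is all 0x00 becomes null.
        if improved_null_detection and raw and all(b == 0x00 for b in raw):
            return None
        # Otherwise, every character below 0x20 is replaced by a space.
        return ''.join(chr(b) if b >= 0x20 else ' ' for b in raw)

    decode_ascii_field(b'\x00\x00\x00', True)   # -> None
    decode_ascii_field(b'\x00\x00\x00', False)  # -> '   ' (three spaces)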

Sure thanks.

This should be fixed in this branch:
https://github.com/AbsaOSS/cobrix/tree/bugfix/481-ignore-control-characters

You can test it by building that branch.

Thanks @yruslan, this fix resolves the issue for us.

Great! It will be released as a new version sometime next week.