NUL characters in US-ASCII getting converted to a space character (" ") instead of an empty value
rohitavantsa opened this issue · comments
We have a US-ASCII fixed byte-length file.
File contents:
1234   t ----> this row has three spaces
4567NULNULNULf ----> this row has three NUL characters
Copybook contents:
01 tablename
05 record_ID PIC x(3)
05 record_status PIC x(3)
05 record_flag PIC x(1)
Expected output:
[Row(record_ID='1234', record_status='   ', record_flag='t'),
 Row(record_ID='4567', record_status='', record_flag='f')]
Actual output:
[Row(record_ID='1234', record_status='   ', record_flag='t'),
 Row(record_ID='4567', record_status='   ', record_flag='f')]
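To illustrate the difference, here is a plain-Python sketch of how the two sample records split into fields. The 4+3+1 byte layout is an assumption inferred from the sample rows (the copybook above says PIC x(3) for record_ID, but the sample IDs are four characters):

```python
# Hypothetical layout assumed from the sample rows: record_ID(4) + record_status(3) + record_flag(1)
records = [b'1234   t', b'4567\x00\x00\x00f']

for rec in records:
    record_id = rec[0:4].decode('ascii')
    record_status = rec[4:7].decode('ascii')  # b'   ' for row 1, b'\x00\x00\x00' for row 2
    record_flag = rec[7:8].decode('ascii')
    print(repr(record_id), repr(record_status), repr(record_flag))
```

Note that at the byte level the two status fields are genuinely different (spaces vs. NULs); the question is how the reader maps NUL bytes into strings.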
We expected an empty value; instead we are getting three whitespace characters. In the on-prem data this field is an empty value. Can you please help us understand why we are seeing this issue in this scenario?
@yruslan Can you please help us on this.
Hi, thanks for the issue report. Could you please add:
- An example US-ASCII file with NUL characters.
- The code snippet you are using to read the file.
Btw, does this option help remove the extra spaces: .option("string_trimming_policy", "both")?
Hi @yruslan
string_trimming_policy is set to none in our case because we need to preserve the spaces while reading the file.
Here are the options we are using to read the file:
spark.read.format('cobol')
    .option('copybook_contents', copybook_contents)
    .option('encoding', 'ascii')
    .option('ebcdic_code_page', 'CP037')
    .option('string_trimming_policy', 'none')
    .option('debug_ignore_file_size', 'true')
    .load('filepath')
Please find the Sample file below:
sampleUS-ASCII file.txt
Please open this file in Notepad++ to see the NUL characters.
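As an alternative to Notepad++, a hex dump makes the NUL bytes visible directly. A small helper sketch (the literal below is the second sample record; for the attached file you would pass the bytes read from disk):

```python
def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated hex so NUL (00) bytes stand out."""
    return data.hex(' ')

# Example: the second record from the sample file
print(hex_dump(b'4567\x00\x00\x00f'))  # -> '34 35 36 37 00 00 00 66'
```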
Expected Output:
The row with NULNULNUL should appear as an empty string instead of the '   ' (three spaces) we are currently getting in our dataframe. The on-prem system provides this field as '' (an empty field).
Hi,
Before looking deeper, please try:
- Removing 'ebcdic_code_page':'CP037', since it is applicable only to EBCDIC files, and
- Adding .option("improved_null_detection", "true")
The ASCII charset is set using this option:
.option("ascii_charset", "US_ASCII")
(UTF_8 is the default; you can specify a different charset, of course)
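Putting the suggestions together, the reader call would look roughly like this (a sketch only: 'filepath' and copybook_contents are placeholders, and the option set mirrors the advice above):

```python
df = (spark.read.format('cobol')
      .option('copybook_contents', copybook_contents)
      .option('encoding', 'ascii')
      .option('ascii_charset', 'US_ASCII')
      .option('improved_null_detection', 'true')
      .option('string_trimming_policy', 'none')
      .option('debug_ignore_file_size', 'true')
      .load('filepath'))
```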
Sure @yruslan will try this.
Hi @yruslan
We have tried removing the option 'ebcdic_code_page':'CP037' and added .option("improved_null_detection", "true"), but it is still not working as we expect.
To be clearer:
The NUL character I am referring to is the hex value \x00, which is not being read properly. While reading, we expect an empty field but get space characters. The file I have provided contains the NUL characters.
You could try it yourself and check whether this is the normal behavior or whether a fix is needed.
Thanks in advance
Currently, all characters lower than 0x20 are replaced by spaces. If all characters in a field are 0x00 and improved_null_detection = true, the field becomes null.
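A minimal sketch of the behavior described above (not Cobrix's actual code, just the rule restated in Python): control characters below 0x20 become spaces, and with improved_null_detection an all-0x00 field becomes null (None):

```python
def decode_field(raw: bytes, improved_null_detection: bool = False):
    # All-NUL field -> null when improved_null_detection is enabled
    if improved_null_detection and raw and all(b == 0x00 for b in raw):
        return None
    # Otherwise every byte below 0x20 is substituted with a space
    return ''.join(' ' if b < 0x20 else chr(b) for b in raw)

print(repr(decode_field(b'\x00\x00\x00')))        # -> '   ' (three spaces)
print(repr(decode_field(b'\x00\x00\x00', True)))  # -> None
```

This is why the NUL field surfaces as three spaces when null detection does not kick in.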
Will check your file. The correct behavior for ASCII would probably be to not replace low characters with spaces and to always skip 0x00. This is something that needs to be implemented on our side.
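The proposed behavior can be sketched the same way: drop 0x00 bytes entirely instead of substituting spaces, so an all-NUL field yields an empty string (again an illustration, not the actual implementation):

```python
def decode_field_proposed(raw: bytes) -> str:
    # Skip 0x00 bytes entirely; keep all other bytes as-is
    return bytes(b for b in raw if b != 0x00).decode('ascii')

print(repr(decode_field_proposed(b'\x00\x00\x00')))  # -> ''
```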
Sure thanks.
This should be fixed in this branch:
https://github.com/AbsaOSS/cobrix/tree/bugfix/481-ignore-control-characters
You can test it by building that branch.
Thanks @yruslan . This fix is helping us resolve the issue.
Great! It will be released as a new version sometime next week