AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Debug functionality of EBCDIC data

bprasen opened this issue · comments

Hi,
This is not a bug report but a feature request, mainly for debugging. When Cobrix creates a dataframe, it decodes the EBCDIC data according to the data type of each primitive field. Would it be possible to also show the hex value of a column when an option such as add_hex = true is set? This functionality would be for debugging only, to check the data.
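In terms of the DataFrame reader API, the requested behaviour would look roughly like the sketch below. The `add_hex` option name is only the name proposed in this issue, not an existing Cobrix option, and the copybook/data paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cobrix-hex-debug").getOrCreate()

// Standard Cobrix read, plus the debug switch proposed in this issue.
// "add_hex" is hypothetical -- it is the name suggested above, not a real option.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("add_hex", "true")
  .load("/path/to/ebcdic/data")

df.show()   // ideally shows decoded values alongside the raw hex of each field
```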

Interesting idea. This could be helpful to diagnose decoding issues.
We cannot make Spark itself show the original bytes in hex, but we can add additional fields to the output dataframe. For instance, if a schema has ID, FIRST-NAME and LAST-NAME and the debug option is turned on, the schema will contain additional ID_DEBUG, FIRST-NAME_DEBUG and LAST-NAME_DEBUG fields containing the HEX values of the original data before decoding.
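To illustrate what "HEX values of the original data" would mean, the raw bytes of a field can be rendered as a hex string before any EBCDIC decoding, roughly like this (a sketch only, not the actual implementation):

```scala
// Sketch: render a field's raw bytes as a hex string for a *_DEBUG column.
def toHexString(bytes: Array[Byte]): String =
  bytes.map(b => "%02X".format(b & 0xFF)).mkString

// Example: "ABC" in EBCDIC (code page 037) is the bytes 0xC1 0xC2 0xC3.
val raw = Array(0xC1, 0xC2, 0xC3).map(_.toByte)
println(toHexString(raw))   // prints C1C2C3
```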

Please clarify a couple of things about your use case:

  • Do you want to debug a particular column or all columns in the schema?
  • The HEX values should correspond to the original data before conversion to ASCII/Unicode, right?

I was thinking about all the columns, and yes, the HEX values should be the original data before conversion. To be honest, I have also been trying to modify the source code to add that functionality, currently for the FixedLengthNested option only. I can share the code with you if you want, though it may need some standardisation. Thanks for your interest; please let me know your email so that I can send the code for your review.

Great, thanks for the answers! I think this is a helpful feature and we are going to implement it.
You can send your code as a pull request, but it is not necessary. The feature seems pretty straightforward.

🎉 @bprasen, this very helpful feature is now implemented and is part of 2.0.5, released today.
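For reference, enabling the released feature should look roughly like the sketch below. The exact option name and the naming of the generated debug columns are assumptions here and should be checked against the Cobrix README for release 2.0.5.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cobrix-debug").getOrCreate()

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("debug", "true")   // assumed option name for the new debug feature
  .load("/path/to/ebcdic/data")

df.printSchema()
// Expected shape (illustrative): each original field is accompanied by a
// debug column holding the hex of its raw bytes, e.g. ID plus ID_DEBUG,
// following the naming discussed above.
```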