AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Support for 2-byte EBCDIC code pages (cp300, cp1364, cp1388)

BenceBenedek opened this issue · comments

Hi @yruslan,

As you mentioned in a different issue, Cobrix expects a 1-byte code page, but East Asian code pages use a 2-byte format.

Could you add support for the following 2-byte code pages?

https://web.archive.org/web/20141201234940/http://www-01.ibm.com/software/globalization/ccsid/ccsid300.html
https://web.archive.org/web/20141129222534/http://www-01.ibm.com/software/globalization/ccsid/ccsid1364.html
https://web.archive.org/web/20141129205408/http://www-01.ibm.com/software/globalization/ccsid/ccsid1388.html

Many thanks for your help.

Br., Bence

Hi @BenceBenedek,

Thanks for creating a feature request. Definitely keen to support it. Will take a look at the above specs.

Could I also ask you to post an example field definition that has such an encoding? This is just to understand what the 'PIC' clause looks like. Is it 'X(10)' (where 10 declares the byte size) or something different?

Thank you @yruslan for supporting this topic.

We have a fairly complex data set with variable-length records (23 schemas).

As far as I can see, attributes that hold free text range from PIC X(60) to PIC X(4026).

I've dotted out some details just to make sure I don't violate any rules...

So, this is an example for a record type where we have free text which may come in any of the languages of our markets:

       10  FILLER                    REDEFINES   ...-DATAPART.
  ***      ========= ... ... TEXT==========================
         15  ...-REC....
           20  ...-...-TIMESTAMP             PIC X(26).
           20  ...-...-CLTX-TEXT             PIC X(4026).

Does this answer your question, or have I overlooked something?

Many thanks for your help.
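For context, once such a 2-byte code page is supported, loading a file with a copybook like the one above might look roughly like the sketch below (the option names follow the Cobrix README; the paths and the 'cp300' value are assumptions):

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")  // hypothetical path
  .option("ebcdic_code_page", "cp300")          // the requested 2-byte code page
  .load("/path/to/ebcdic/data")                 // hypothetical path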

Sorry for the late reply. Thanks for your answer. It clarifies things a bit. However, in order to implement support for a code page, we need to have a table that says which byte sequence corresponds to each character.

The only specs I found are these: https://en.wikipedia.org/wiki/Code_page#EBCDIC-based_code_pages and https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding

It seems quite complex. We will need some help implementing this in Cobrix. From our side, I can add support for 2-byte code pages, and from your side you can provide the lookup tables. Would that be possible?
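To illustrate what such lookup tables amount to: a 1-byte code page needs a 256-entry table, while a 2-byte code page needs up to 65536 entries indexed by the two-byte sequence. A minimal sketch (the names and the sample entry are assumptions, not Cobrix internals):

// 1-byte code page: one character per possible byte value
val singleByteTable: Array[Char] = new Array[Char](256)

// 2-byte code page: one character per possible two-byte sequence
val doubleByteTable: Array[Char] = new Array[Char](65536)

// Hypothetical entry: EBCDIC bytes 0x40 0x40 decode to the ideographic space
doubleByteTable(0x4040) = '\u3000'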

No worries @yruslan and thank you for pushing this.

I will get back to you when I've managed to get the lookup tables.

Many thanks for your help.

ebcidic_codepage_mapping.txt

Hi, sorry for the late reply.

I've gathered some 1- and 2-byte mapping tables.

They follow a slightly different format compared to what you have in the current Cobrix lib, as they use the Unicode code point instead of the character itself.

Let me know if this is helpful or whether you need something else.
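A code-point-based table like that can be adapted to a character-based one with a one-line conversion. A sketch with hypothetical entries:

// The shared tables store Unicode code points (e.g. 0x4E00) rather than
// literal characters, so each entry is converted with toChar:
val codePoints: Array[Int] = Array(0x0020, 0x4E00, 0x4E8C) // hypothetical values
val mappingTable: Array[Char] = codePoints.map(_.toChar)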

That's great, thanks! Will take a look and add these code pages shortly.

'ccsid300' has been added as 'cp00300' in the current master branch.
Please test it if you can. I get the feeling that I didn't use the mapping table correctly.
I used it as follows (in pseudocode):

val byte1 = bytes(i * 2)
val byte2 = bytes(i * 2 + 1)
// Convert the signed bytes to an unsigned 16-bit table index
val index = (byte1 + 256) % 256 * 256 + (byte2 + 256) % 256

// Here is where the mapping is applied
val ucs2char = ebcdic300(index)

// Append ucs2char to the output string

Thank you @yruslan, will do a test tomorrow.

Let me know if it worked. If it didn't, could you also please provide an example of how the table could be used for converting a mainframe 2-byte sequence into a Java character? That would be very helpful for fixing the conversion.

Sorry for the late reply.

For me it doesn't work; I'll check if I can get any info regarding the 2-byte sequence.

Do you mean that when you use 'cp00300' encoding an error is displayed or that output is incorrect?

@yruslan I mixed up the country codes last time... it was my mistake.

At first glance the conversion looks good, but will do some checks.

+----------------------+
|dealer_text |
+----------------------+
|廱 |
|從徔傖廱 |
|傆傆幝幝幝傆 |
|廱 |
|傆傆幝傆幝幟幜幜幝䕣傆|
|廱 |
|廱 |
|廱 |
|建幝䕣廱 |
|廱 |
+----------------------+

Great to hear! When you confirm it is working as expected, I will add the other 2-byte encodings.

Not 100% sure that the output is correct, so while I'm investigating it further from the input side, I've requested permission to share the relevant Java encoder/decoder classes our company used for the input stream, so you can take a look at our implementation.

2byte_example.zip

Hello @yruslan, I've got permission to share some example classes which were developed in-house back in the day.
Let me know if this is helpful.

Thank you very much! This is exactly what is needed and is extremely helpful. I'll try to fix the decoding logic of Cobrix in the next few days.

The fix is available in the master branch. I have more confidence that it should work as expected, since the characters in the unit tests are based on the spec.

Also, I didn't know about the single-byte and double-byte modes of the encoding. Thanks a lot for sharing the decoding algorithm!
Let me know if it works for you, and then I'll take a look at the other 2-byte code pages.
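For reference, those modes work via shift-out (0x0E) and shift-in (0x0F) control bytes that switch a mixed EBCDIC field between single-byte and double-byte characters. A minimal decoding sketch (the function and table names are assumptions, not Cobrix's actual internals):

def decodeMixed(bytes: Array[Byte], sbcs: Array[Char], dbcs: Array[Char]): String = {
  val sb = new StringBuilder
  var i = 0
  var doubleByte = false
  while (i < bytes.length) {
    val b = bytes(i) & 0xFF
    if (b == 0x0E) { doubleByte = true; i += 1 }        // shift-out: enter 2-byte mode
    else if (b == 0x0F) { doubleByte = false; i += 1 }  // shift-in: back to 1-byte mode
    else if (doubleByte && i + 1 < bytes.length) {
      val index = b * 256 + (bytes(i + 1) & 0xFF)       // unsigned 16-bit table index
      sb.append(dbcs(index))
      i += 2
    } else {
      sb.append(sbcs(b))
      i += 1
    }
  }
  sb.toString
}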

Thanks a lot.

I will do some tests on Monday and will come back with the results.

@yruslan as far as I can tell, this implementation looks good. :)

|エンジンオイル・オイルエレメント交換ワイパーブレード交換 |
|メンテナンスプラスライト加入、EVA反映待ちおよび月初MBJ稼働日3日申請不可のため本日9/6の申請となりました。 |
|お申し出内容:フロントワイパー作動時にバタつく

症状確認、フロントワイパー作動時にワイパージャダーが発生している
フロントガラスの油膜を除去するが改善せず、ワイパーブレードの当たり不良と判断

フロントワイパーブレード交換
作業後、症状の改善を確認 |
| |
|お申し出内容:ブレーキ鳴きがする

入庫時症状確認
リヤブレーキパッド残量:3.0mm 要交換
リヤブレーキディスクローター測定値:19.4mm(限度値:19.4mm) 摩耗限度値のため要交換

リヤブレーキパッド、リヤブレーキディスクローター交換
交換後、症状の改善を確認 |
|ディストロニックのキャリブレーション実行 (作動開始後) |
|左右フロントウインド上下時「バキ」音発生 |
|エンジンチェックランプ点灯 |
|エンジンオイル交換の実行(フィルター含む)フロントウインドウ用シリコンワイパブレードへ2本の交換メインテナンスの追加作業:ダストフィルタの交換メインテナンスの追加作業: エアクリーナエレメントの交換

Amazing! So I can add support for the other 2 encodings.

One question though.

In order to support ccsid300, the following tables are used: unicode300 and ebcdic300. The file with mapping tables (https://github.com/AbsaOSS/cobrix/files/10446793/ebcidic_codepage_mapping.txt) does not contain unicode300; I got that one from EBCDIC300MappingTables.java.

If unicode1364 and unicode1388 are the same as unicode300, then we have enough information to add support for ccsid1364 and ccsid1388. Can you confirm that unicode1364 == unicode1388 == unicode300, or send unicode1364 and unicode1388, please?
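If only one direction of a mapping were available, the other could in principle be derived by inverting it, assuming the mapping is one-to-one. A sketch with assumed names and a hypothetical entry:

// If ebcdicToUnicode maps a two-byte EBCDIC index to a Unicode character,
// the encode-direction table can be obtained by swapping keys and values:
val ebcdicToUnicode: Map[Int, Char] = Map(0x4040 -> '\u3000') // hypothetical entry
val unicodeToEbcdic: Map[Char, Int] = ebcdicToUnicode.map { case (e, u) => (u, e) }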

Good point, I will check it later today or early tomorrow!

Hello @yruslan, I've just dug up the two Java classes for 1364 and 1388. These classes have both the EBCDIC and Unicode mapping tables; sorry for missing this out earlier.

2byte_mapping_tables_1364_1388.zip

Great, thanks a lot! Will integrate them into Cobrix tomorrow.

'cp1364' and 'cp1388' are now available in experimental mode in the new master branch.
'cp300' is now an alias for 'cp00300'.
Thanks again for the contribution! Let me know if it works for you, and then the new version, '2.6.5', will be released soon.
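A quick way to try the new names is to pass them straight to the reader (a sketch; everything except the code page names is hypothetical):

val korean = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")  // hypothetical path
  .option("ebcdic_code_page", "cp1364")         // Korean; use "cp1388" for Chinese
  .load("/path/to/data")                        // hypothetical path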

Sorry for the late reply, I will test it tomorrow and will give you feedback.

@yruslan thank you, for me both Chinese (EBCDIC1388) and Korean (EBCDIC1364) look good!

Amazing! Thank you for the great news!