AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Support for 2-byte EBCDIC code pages (cp300, cp1364, cp1388)

BenceBenedek opened this issue · comments

Hi @yruslan,

As you mentioned in a different issue, Cobrix expects a 1-byte code page, but East Asian code pages use a 2-byte format.

Could you add support for the following 2-byte code pages?

https://web.archive.org/web/20141201234940/http://www-01.ibm.com/software/globalization/ccsid/ccsid300.html
https://web.archive.org/web/20141129222534/http://www-01.ibm.com/software/globalization/ccsid/ccsid1364.html
https://web.archive.org/web/20141129205408/http://www-01.ibm.com/software/globalization/ccsid/ccsid1388.html

Many thanks for your help.

Br., Bence

Hi @BenceBenedek,

Thanks for creating a feature request. Definitely keen to support it. Will take a look at the above specs.

Could I also ask you to post an example field definition that has such an encoding? This is just to understand what the 'PIC' clause looks like. Is it 'X(10)' (where 10 declares the byte size) or something different?

Thank you @yruslan for supporting this topic.

We have a fairly complex data set with variable-length records (23 schemas).

As far as I can see, attributes that hold free text range from PIC X(60) to PIC X(4026).

I've dotted out some details just to make sure I don't violate any rules...

So, this is an example for a record type where we have free text which may come in any of the languages of our markets:

       10  FILLER                    REDEFINES   ...-DATAPART.
  ***      ========= ... ... TEXT==========================
         15  ...-REC....
           20  ...-...-TIMESTAMP             PIC X(26).
           20  ...-...-CLTX-TEXT             PIC X(4026).

Does this answer your question, or have I overlooked something?

Many thanks for your help.
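For context, once such a 2-byte code page is supported, loading a file with a copybook like the one above might look roughly like the sketch below (the option names follow the Cobrix README; the paths and the 'cp300' value are assumptions):

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")  // hypothetical path
  .option("ebcdic_code_page", "cp300")          // the requested 2-byte code page
  .load("/path/to/ebcdic/data")                 // hypothetical path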

Sorry for the late reply. Thanks for your answer. It clarifies things a bit. However, in order to implement support for a code page, we need to have a table that says which byte sequence corresponds to each character.

The only specs I found are these: https://en.wikipedia.org/wiki/Code_page#EBCDIC-based_code_pages and https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding

It seems quite complex. We will need some help implementing this in Cobrix. From our side, I can add support for 2-byte code pages, and from your side you can provide the lookup tables. Would that be possible?
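To illustrate what such lookup tables amount to: a 1-byte code page needs a 256-entry table, while a 2-byte code page needs up to 65536 entries indexed by the two-byte sequence. A minimal sketch (the names and the sample entry are assumptions, not Cobrix internals):

// 1-byte code page: one character per possible byte value
val singleByteTable: Array[Char] = new Array[Char](256)

// 2-byte code page: one character per possible two-byte sequence
val doubleByteTable: Array[Char] = new Array[Char](65536)

// Hypothetical entry: EBCDIC bytes 0x40 0x40 decode to the ideographic space
doubleByteTable(0x4040) = '\u3000'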

No worries @yruslan and thank you for pushing this.

I will get back to you when I've managed to get the lookup tables.

Many thanks for your help.

ebcidic_codepage_mapping.txt

Hi, sorry for the late reply.

I've gathered some 1- and 2-byte mapping tables.

They follow a slightly different format compared to what you have in the current Cobrix lib, as they use the Unicode code point instead of the character itself.

Let me know if this is helpful or whether you need something else.
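A code-point-based table like that can be adapted to a character-based one with a one-line conversion. A sketch with hypothetical entries:

// The shared tables store Unicode code points (e.g. 0x4E00) rather than
// literal characters, so each entry is converted with toChar:
val codePoints: Array[Int] = Array(0x0020, 0x4E00, 0x4E8C) // hypothetical values
val mappingTable: Array[Char] = codePoints.map(_.toChar)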

That's great, thanks! Will take a look and add these code pages shortly.

'ccsid300' has been added as 'cp00300' in the current master branch.
Please test it if you can. I get the feeling that I didn't use the mapping table correctly.
I used it as follows (in pseudocode):

val byte1 = bytes(i * 2)
val byte2 = bytes(i * 2 + 1)
// Convert the signed bytes to an unsigned 16-bit table index
val index = (byte1 + 256) % 256 * 256 + (byte2 + 256) % 256

// Here is where the mapping is applied
val ucs2char = ebcdic300(index)

// Append ucs2char to the output string

Thank you @yruslan, will do a test tomorrow.

Let me know if it worked. If it didn't, could you also please provide an example of how the table could be used for converting a mainframe 2-byte sequence into a Java character? That would be very helpful for fixing the conversion.

Sorry for the late reply.

For me it doesn't work; I'll check if I can get any info regarding the 2-byte sequence.

Do you mean that when you use 'cp00300' encoding an error is displayed or that output is incorrect?

@yruslan I mixed up the country codes last time... it was my mistake.

At first glance the conversion looks good, but will do some checks.

+----------------------+
|dealer_text |
+----------------------+
|廱 |
|從徔傖廱 |
|傆傆幝幝幝傆 |
|廱 |
|傆傆幝傆幝幟幜幜幝䕣傆|
|廱 |
|廱 |
|廱 |
|建幝䕣廱 |
|廱 |
+----------------------+

Great to hear! When you confirm it is working as expected, I will add the other 2-byte encodings.

Not 100% sure that the output is correct, so while I'm investigating it further from the input side, I've requested permission to share the relevant Java encoder/decoder classes our company used for the input stream, so you can take a look at our implementation.

2byte_example.zip

Hello @yruslan, I've got permission to share some example classes which were developed in-house back in the day.
Let me know if this is helpful.

Thank you very much! This is exactly what is needed and is extremely helpful. I'll try to fix the decoding logic of Cobrix in the next few days.

The fix is available in the master branch. I have more confidence that it should work as expected, since the characters in the unit tests are based on the spec.

Also, I didn't know about the single-byte and double-byte modes of the encoding. Thanks a lot for sharing the decoding algorithm!
Let me know if it works for you, and then I'll take a look at the other 2-byte code pages.
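For reference, those modes work via shift-out (0x0E) and shift-in (0x0F) control bytes that switch a mixed EBCDIC field between single-byte and double-byte characters. A minimal decoding sketch (the function and table names are assumptions, not Cobrix's actual internals):

def decodeMixed(bytes: Array[Byte], sbcs: Array[Char], dbcs: Array[Char]): String = {
  val sb = new StringBuilder
  var i = 0
  var doubleByte = false
  while (i < bytes.length) {
    val b = bytes(i) & 0xFF
    if (b == 0x0E) { doubleByte = true; i += 1 }        // shift-out: enter 2-byte mode
    else if (b == 0x0F) { doubleByte = false; i += 1 }  // shift-in: back to 1-byte mode
    else if (doubleByte && i + 1 < bytes.length) {
      val index = b * 256 + (bytes(i + 1) & 0xFF)       // unsigned 16-bit table index
      sb.append(dbcs(index))
      i += 2
    } else {
      sb.append(sbcs(b))
      i += 1
    }
  }
  sb.toString
}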

Thanks a lot.

I will do some tests on Monday and will come back with the results.

@yruslan as far as I can tell, this implementation looks good. :)

|エンジンオイル・オイルエレメント交換ワイパーブレード交換 |
|メンテナンスプラスライト加入、EVA反映待ちおよび月初MBJ稼働日3日申請不可のため本日9/6の申請となりました。 |
|お申し出内容:フロントワイパー作動時にバタつく

症状確認、フロントワイパー作動時にワイパージャダーが発生している
フロントガラスの油膜を除去するが改善せず、ワイパーブレードの当たり不良と判断

フロントワイパーブレード交換
作業後、症状の改善を確認 |
| |
|お申し出内容:ブレーキ鳴きがする

入庫時症状確認
リヤブレーキパッド残量:3.0mm 要交換
リヤブレーキディスクローター測定値:19.4mm(限度値:19.4mm) 摩耗限度値のため要交換

リヤブレーキパッド、リヤブレーキディスクローター交換
交換後、症状の改善を確認 |
|ディストロニックのキャリブレーション実行 (作動開始後) |
|左右フロントウインド上下時「バキ」音発生 |
|エンジンチェックランプ点灯 |
|エンジンオイル交換の実行(フィルター含む)フロントウインドウ用シリコンワイパブレードへ2本の交換メインテナンスの追加作業:ダストフィルタの交換メインテナンスの追加作業: エアクリーナエレメントの交換

Amazing! So I can add support for the other 2 encodings.

One question though.

In order to support ccsid300, the following tables are used: unicode300 and ebcdic300. The file with mapping tables (https://github.com/AbsaOSS/cobrix/files/10446793/ebcidic_codepage_mapping.txt) does not contain unicode300; I got that one from EBCDIC300MappingTables.java.

If unicode1364 and unicode1388 are the same as unicode300, then we have enough information to add support for ccsid1364 and ccsid1388. Can you confirm that unicode1364 == unicode1388 == unicode300, or send unicode1364 and unicode1388, please?
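If only one direction of a mapping were available, the other could in principle be derived by inverting it, assuming the mapping is one-to-one. A sketch with assumed names and a hypothetical entry:

// If ebcdicToUnicode maps a two-byte EBCDIC index to a Unicode character,
// the encode-direction table can be obtained by swapping keys and values:
val ebcdicToUnicode: Map[Int, Char] = Map(0x4040 -> '\u3000') // hypothetical entry
val unicodeToEbcdic: Map[Char, Int] = ebcdicToUnicode.map { case (e, u) => (u, e) }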

Good point, I will check it later today or early tomorrow!

Hello @yruslan, I've just dug up the two Java classes for 1364 and 1388. These classes have both the EBCDIC and Unicode mapping tables; sorry for missing this out earlier.

2byte_mapping_tables_1364_1388.zip

Great, thanks a lot! Will integrate them into Cobrix tomorrow.

'cp1364' and 'cp1388' are now available in experimental mode in the new master branch.
'cp300' is now an alias for 'cp00300'.
Thanks again for the contribution! Let me know if it works for you, and then the new version, '2.6.5', will be released soon.
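A quick way to try the new names is to pass them straight to the reader (a sketch; everything except the code page names is hypothetical):

val korean = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")  // hypothetical path
  .option("ebcdic_code_page", "cp1364")         // Korean; use "cp1388" for Chinese
  .load("/path/to/data")                        // hypothetical path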

Sorry for the late reply, I will test it tomorrow and will give you feedback.

@yruslan thank you, for me both Chinese (EBCDIC1388) and Korean (EBCDIC1364) look good!

Amazing! Thank you for the great news!