Support for UTF-16/32 encoded buffer in ASN1_get_object

Question

pruthig opened this issue 2 months ago · comments

Hi Team,
I am using the API ASN1_get_object to parse an ASN.1 buffer that contains a sequence of two strings. The first string is of type UTF-8 and the second is of type UniversalString. The second string, the UniversalString, has UTF-16 encoded characters. While parsing the ASN.1 buffer, I can retrieve the value of the first string successfully, but I get an empty ("") value for the second string. The following is the signature of the API I am using:
int ASN1_get_object(const unsigned char **ber_in, long *plength, int *ptag, int *pclass, long omax);
As we can see, the first argument of this API is a buffer of type unsigned char, which cannot hold wide characters. Can we add support for UTF-16/32 by enhancing the current implementation, with equivalent APIs that take char16_t and char32_t buffers to store wide-character strings?

David Benjamin · Answer 1 · Tue Jun 04 2024 22:53:39 GMT+0800 (China Standard Time)

It sounds like there are a few points of confusion here:

First, a UniversalString carries UTF-32, not UTF-16.

Second, using a char16_t or char32_t output would actually be forbidden (undefined behavior) in C and C++. There is no guarantee that the UniversalString payload is correctly aligned for a char32_t, to say nothing of strict aliasing requirements. Even setting those requirements aside, UniversalString stores its code points in big-endian byte order, and most machines are little-endian these days.

Rather, the correct way to decode a UniversalString payload is to pass the byte output into a UTF-32BE decoder. Such a decoder is mostly a trivial matter of loading 4-byte big-endian words, though there are some checks it should do for whether each codepoint is valid. None of that is done by ASN1_get_object because ASN1_get_object's job is just to parse an individual TLV, leaving the caller to interpret the contents.
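As a rough illustration of the decoder described above, here is a minimal sketch in C. The names load_be32 and decode_utf32be are hypothetical helpers invented for this example, not part of any library, and the validity checks shown (length is a multiple of 4, no surrogates, nothing above U+10FFFF) are an assumption about what "valid" should mean here.

```c
#include <stddef.h>
#include <stdint.h>

/* Load one 32-bit big-endian word byte by byte. The buffer is never cast
 * to char32_t *, so alignment and strict aliasing are not a concern. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: decode `len` bytes of UTF-32BE into `out`.
 * Returns the number of code points written, or -1 if `len` is not a
 * multiple of 4, a code point is a surrogate or above U+10FFFF, or
 * `out` (capacity `out_cap`) is too small. */
static long decode_utf32be(const unsigned char *in, size_t len,
                           uint32_t *out, size_t out_cap)
{
    size_t n = len / 4;

    if (len % 4 != 0 || n > out_cap)
        return -1;

    for (size_t i = 0; i < n; i++) {
        uint32_t cp = load_be32(in + 4 * i);

        /* Reject values that are not Unicode scalar values. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;
        out[i] = cp;
    }
    return (long)n;
}
```

In this sketch, the caller would pass the content bytes and length of the UniversalString TLV (as obtained from ASN1_get_object) as `in` and `len`, then convert the resulting code points to whatever encoding the application needs.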

FdaSilvaYY · Answer 2 · Wed Jun 05 2024 01:27:14 GMT+0800 (China Standard Time)

A good resource about Unicode, UTF-8 and co.:

https://tonsky.me/blog/unicode/

It is derived from an old article I read 20 years ago.