Support for UTF-16/32 encoded buffer in ASN1_get_object
pruthig opened this issue
Hi Team,
I am using the API `ASN1_get_object` to parse an ASN.1 buffer that contains a sequence of two strings. The first string is of type UTF8String and the second is of type UniversalString. The second string has UTF-16 encoded characters. While parsing the ASN.1 buffer, I can retrieve the value of the first string successfully, but I get an empty ("") value for the second string. The following is the signature of the API I am using:
ASN1_get_object(const unsigned char **ber_in, long *plength, int *ptag, int *pclass, long omax);
As we can see, the first argument of this API is a buffer of type `unsigned char`, which cannot hold wide characters. Can we add support for UTF-16/32 by enhancing the current implementation, with equivalent APIs that take `char16_t` and `char32_t` buffer types to store wide-character strings?
It sounds like there are a few points of confusion here:
First, a UniversalString carries UTF-32, not UTF-16.
Second, using a `char16_t` or `char32_t` output would actually be forbidden (undefined behavior) in C and C++. There is no guarantee that the UniversalString payload is correctly aligned for a `char32_t`, to say nothing of strict aliasing requirements. Even setting those requirements aside, UniversalString stores each code point in big-endian byte order, and most machines are little-endian these days.
Rather, the correct way to decode a UniversalString payload is to pass the byte output into a UTF-32BE decoder. Such a decoder is mostly a matter of loading 4-byte big-endian words, though it should also check that each code point is valid. None of that is done by `ASN1_get_object`, because the job of `ASN1_get_object` is just to parse an individual TLV, leaving the caller to interpret the contents.
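To make the decoder idea concrete, here is a minimal sketch of a UTF-32BE decoder in C. `decode_utf32be` is a hypothetical helper name, not an OpenSSL API, and the validity checks shown (rejecting surrogates and values above U+10FFFF) are my assumption about what "valid" should mean here. It takes the content bytes and length that the caller already has after parsing the TLV.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of OpenSSL: decodes a UniversalString
 * payload (UTF-32BE) into code points. Returns the number of code points
 * written, or -1 on malformed input or if the output array is too small. */
long decode_utf32be(const unsigned char *in, size_t in_len,
                    uint32_t *out, size_t out_cap)
{
    size_t n = 0;

    if (in_len % 4 != 0)
        return -1; /* payload must consist of whole 4-byte units */

    for (size_t i = 0; i < in_len; i += 4) {
        /* Assemble the word byte by byte, so nothing depends on the
         * alignment of the input or the endianness of the host. */
        uint32_t cp = ((uint32_t)in[i] << 24) | ((uint32_t)in[i + 1] << 16) |
                      ((uint32_t)in[i + 2] << 8) | (uint32_t)in[i + 3];

        /* Assumed validity rule: reject values beyond the Unicode range
         * and the UTF-16 surrogate range. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (n == out_cap)
            return -1; /* caller's output array is full */

        out[n++] = cp;
    }
    return (long)n;
}
```

Because the input is read as individual `unsigned char` values, this avoids the alignment and strict aliasing problems mentioned above. A caller would typically size `out` as the content length divided by 4.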
A good resource on Unicode, UTF-8, and related encodings:
https://tonsky.me/blog/unicode/
It is derived from an old article that I read 20 years ago.