gmamaladze / trienet

.NET Implementations of Trie Data Structures for Substring Search, Auto-completion and Intelli-sense. Includes: patricia trie, suffix trie and a trie implementation using Ukkonen's algorithm.

Another Unicode issue

holopoj opened this issue · comments

Ran into an issue with unicode 0x300. This can be reproduced with the below code:

var a = "rosalía castro";
var b = "rosalía";
var t = new UkkonenTrie<int>(3);
t.Add(a, 1);
t.Add(b, 2);
Console.WriteLine(t.Retrieve(a).Count());

This will print 0. Note that the second string added is not a byte-equal prefix of a: their Unicode sequences differ, even though a.StartsWith(b) returns true, presumably because the default comparison is culture-sensitive. In b the accented i is two char values, a plain 'i' followed by the combining accent 0x300, while in a it is a single precomposed character.
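The mismatch can be reproduced directly by comparing the two encodings of "í". A minimal sketch (the combining acute U+0301 is used here for the accent; the same applies to U+0300):

```csharp
using System;
using System.Text;

class Repro
{
    static void Main()
    {
        // Composed form: single code point U+00ED ("í").
        string composed = "rosal\u00EDa";
        // Decomposed form: plain 'i' followed by a combining accent.
        string decomposed = "rosali\u0301a";

        // Culture-sensitive comparison treats canonically equivalent
        // strings as equal...
        Console.WriteLine(string.Equals(composed, decomposed,
            StringComparison.CurrentCulture));

        // ...but ordinal, code-unit-by-code-unit comparison does not,
        // and the trie matches ordinally.
        Console.WriteLine(string.Equals(composed, decomposed,
            StringComparison.Ordinal));

        // Normalizing both sides to the same form restores ordinal equality.
        Console.WriteLine(string.Equals(
            composed.Normalize(NormalizationForm.FormC),
            decomposed.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal));
    }
}
```

Since the trie walks char values ordinally, only the last comparison reflects what it actually sees.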

The proper, fully compatible solution, which would resolve most if not all Unicode issues, is to rewrite all substring handling on top of the StringInfo class so it works with 'real' characters, i.e. graphemes, rather than individual char code units.

However, the public StringInfo API is very uncomfortable. E.g. to iterate over a string's graphemes you have to manually pump a non-generic IEnumerator with MoveNext(); there is no IEnumerable<string> support and thus no foreach support either.
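For reference, the pumping looks like this; wrapping it in an IEnumerable<string> extension (the name Graphemes is made up here) is the usual workaround:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class GraphemeExtensions
{
    // Wraps the non-generic TextElementEnumerator in an IEnumerable<string>
    // so graphemes can be consumed with foreach or LINQ.
    public static IEnumerable<string> Graphemes(this string text)
    {
        var e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
            yield return (string)e.Current;
    }
}

class Demo
{
    static void Main()
    {
        // "rosalía" with a decomposed í: 8 char values, but 7 graphemes.
        string s = "rosali\u0301a";
        Console.WriteLine(s.Length);                               // 8
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 7

        // The 'i' and its combining accent stay together as one element.
        foreach (var g in s.Graphemes())
            Console.Write($"[{g}]");
    }
}
```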


[EDIT]

It looks like this wouldn't be too difficult a change for the Ukkonen trie, if you go about it naively and just replace the regular Substring() calls and Length accesses with StringInfo-driven equivalents.

The downsides are that it would probably murder at least construction performance, and that the Node class would need to hold an IDictionary<string, Edge>, since a grapheme may not fit in a single char. That last bit means an increase in space taken as well, but luckily it's still bounded: Unicode graphemes aren't endlessly long, iirc.

Might be better off by one-time converting all strings into a dedicated data structure operating at the grapheme level though. That would certainly keep code more maintainable.
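A one-time conversion could be as simple as the sketch below (the helper name ToGraphemeArray is illustrative); once a string is an array of graphemes, the trie can index and slice it with plain int arithmetic:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class GraphemeText
{
    // One-time conversion: split a string into its graphemes so that
    // downstream code never has to touch raw char indices again.
    public static string[] ToGraphemeArray(string text)
    {
        var parts = new List<string>();
        var e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
            parts.Add((string)e.Current);
        return parts.ToArray();
    }

    static void Main()
    {
        // Decomposed "rosalía": 8 char values collapse to 7 graphemes.
        var graphemes = ToGraphemeArray("rosali\u0301a");
        Console.WriteLine(graphemes.Length); // 7
    }
}
```

If allocating a string per grapheme is a concern, StringInfo.ParseCombiningCharacters(text) returns just the grapheme start indices into the original string instead.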

commented

For me the thing throws out-of-bounds exceptions whenever I try to construct anything containing special characters, and all my sources are in ISO-8859-1.
So it seems this project is useless in any real-world application unless you're dealing with plain ASCII.

@holopoj ,

Preparing the text for the trie, both before adding and before searching, is a good workaround. I work within the Basic Multilingual Plane, which contains characters for almost all modern languages and a large number of symbols:

/// <summary>
/// Removes diacritics, lowercases, strips surrogate pairs and control
/// characters, and re-normalizes the text, to prepare it for accent- and
/// case-insensitive search. Requires System.Text and System.Globalization.
/// </summary>
/// <param name="text">The raw input text.</param>
/// <returns>The normalized, trie-ready text.</returns>
static string PrepareForTrie(string text)
{
    // Decompose so that accents become separate combining marks.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (char c in normalizedString)
    {
        // Skip anything outside the Basic Multilingual Plane.
        if (char.IsHighSurrogate(c) || char.IsLowSurrogate(c)) continue;

        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark &&
            unicodeCategory != UnicodeCategory.Control)
        {
            stringBuilder.Append(char.ToLower(c));
        }
    }
    // Recompose what is left into canonical form.
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

Now this code works: it prints 2 and 1, even though the second input literal still uses the double-code-point grapheme:

var a = PrepareForTrie("Rosalia de Castro");
var b = PrepareForTrie("rosalía");
var t = new UkkonenTrie<int>(3);
t.Add(a, 1);
t.Add(b, 2);
foreach (var value in t.Retrieve(b))
{
    Console.WriteLine(value);
}