ufcpp / GraphemeSplitter

A C# implementation of the Unicode grapheme cluster breaking algorithm

GraphemeSplitter

Notes

  • This library implements the Unicode 10.0 version of the grapheme cluster boundary algorithm.
  • In .NET 5.0, StringInfo.GetTextElementEnumerator enumerates graphemes correctly, using the Unicode 13.0 algorithm.
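
For comparison, the built-in route mentioned in the second note can be sketched as follows. This is a minimal illustration, not part of this library: the class name TextElements and the method name Split are made up for the example, while StringInfo.GetTextElementEnumerator, MoveNext and GetTextElement are the framework's own API. How emoji sequences are grouped depends on the runtime version.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not part of GraphemeSplitter): collects the text
// elements reported by the framework's StringInfo API into a list.
public static class TextElements
{
    public static List<string> Split(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            // GetTextElement returns the current text element as a string.
            result.Add(e.GetTextElement());
        }
        return result;
    }
}
```

Calling TextElements.Split("a\u0300b") yields two elements, "a\u0300" and "b", because the combining grave accent stays attached to its base letter.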

NuGet package

https://www.nuget.org/packages/GraphemeSplitter/

Install-Package GraphemeSplitter

Sample

using GraphemeSplitter;
using static System.Console;
using static System.String;

public partial class Program
{
    static string Split(string s) => Join(", ", s.GetGraphemes());

    static void Main()
    {
        WriteLine(Split("👨‍👨‍👧‍👦👩‍👩‍👧‍👦👨‍👨‍👧‍👦")); // 👨‍👨‍👧‍👦, 👩‍👩‍👧‍👦, 👨‍👨‍👧‍👦
    }
}

Web Sample:

Razor Page Sample

Implementation

This library basically implements the grapheme cluster boundary rules of UAX #29, Unicode Text Segmentation (http://unicode.org/reports/tr29/).

Example:

type | text | split result
diacritical marks | à̠́̑ḅ̂̃̒c̣̤̃̄d̥̦̅̆ | "à̠́̑", "ḅ̂̃̒", "c̣̤̃̄", "d̥̦̅̆"
variation selector | 葛葛󠄀葛󠄁 | "葛", "葛󠄀", "葛󠄁"
asian syllable | 안녕하세요 | "안", "녕", "하", "세", "요"
family emoji | 👨‍👨‍👧‍👦👩‍👩‍👧‍👦👨‍👨‍👧‍👦 | "👨‍👨‍👧‍👦", "👩‍👩‍👧‍👦", "👨‍👨‍👧‍👦"
emoji skin tone | 👩🏻👱🏼👧🏽👦🏾 | "👩🏻", "👱🏼", "👧🏽", "👦🏾"

The library, however, relaxes the GB10, GB12, and GB13 rules for simplicity.

original:

  • GB10 … (E_Base | EBG) Extend* × E_Modifier
  • GB12 … sot (RI RI)* RI × RI
  • GB13 … [^RI] (RI RI)* RI × RI

implemented:

  • GB10 … (E_Base | EBG) × Extend
  • GB10 … (E_Base | EBG | Extend) × E_Modifier
  • GB12/GB13 … RI × RI
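
The effect of relaxing GB12/GB13 can be illustrated with a small sketch. It is not the library's code: the class and method names are hypothetical, the input is a list of code points rather than a string, and only the regional indicator (RI) rules are modelled, so every pair of code points that these rules do not join is reported as a boundary.

```csharp
using System.Collections.Generic;

// Hypothetical illustration of the RI rules only; not the library's implementation.
public static class RegionalIndicatorSketch
{
    // Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    private static bool IsRI(int cp) => cp >= 0x1F1E6 && cp <= 0x1F1FF;

    // Original GB12/GB13: RIs are joined in pairs, so "RI x RI" applies only
    // when an odd number of consecutive RIs ends just before the position.
    // Returns one flag per adjacent pair: true = boundary, false = no boundary.
    public static List<bool> OriginalBreaks(IReadOnlyList<int> cps)
    {
        var breaks = new List<bool>();
        int riRun = 0; // length of the RI run ending at cps[i - 1]
        for (int i = 1; i < cps.Count; i++)
        {
            riRun = IsRI(cps[i - 1]) ? riRun + 1 : 0;
            bool join = IsRI(cps[i]) && riRun % 2 == 1;
            breaks.Add(!join);
        }
        return breaks;
    }

    // Relaxed rule: any two adjacent RIs are joined, regardless of pairing.
    public static List<bool> RelaxedBreaks(IReadOnlyList<int> cps)
    {
        var breaks = new List<bool>();
        for (int i = 1; i < cps.Count; i++)
        {
            bool join = IsRI(cps[i - 1]) && IsRI(cps[i]);
            breaks.Add(!join);
        }
        return breaks;
    }
}
```

For the code points U+1F1EF, U+1F1F5, U+1F1FA, U+1F1F8 (two flags in a row), OriginalBreaks returns false, true, false, while RelaxedBreaks returns false, false, false, so the relaxed rule keeps both flags in a single cluster.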

The difference is:

sequence | original | implemented
à🏻 (U+61, U+300, U+1F3FB) | × ÷ | × ×
🇯🇵🇺🇸 (U+1F1EF, U+1F1F5, U+1F1FA, U+1F1F8) | × ÷ × | × × ×

(where ÷ and × mean boundary and no boundary, respectively.)
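
The first row of the comparison above can be reproduced with a short sketch of GB10 in its original and relaxed forms. It rests on several assumptions: the names are hypothetical, the property table is a tiny stand-in (only U+0300..U+036F as Extend, U+1F3FB..U+1F3FF as E_Modifier, and U+1F466..U+1F469 as E_Base or EBG), and the only other rule modelled is GB9 (× Extend). It is not the library's implementation.

```csharp
using System.Collections.Generic;

// Hypothetical illustration of GB10 only; not the library's implementation.
public static class Gb10Sketch
{
    private enum Prop { Other, EBaseOrEbg, Extend, EModifier }

    // Tiny stand-in for the real Grapheme_Cluster_Break property table.
    private static Prop Classify(int cp)
    {
        if (cp >= 0x1F3FB && cp <= 0x1F3FF) return Prop.EModifier;  // skin tone modifiers
        if (cp >= 0x0300 && cp <= 0x036F) return Prop.Extend;       // combining diacritical marks
        if (cp >= 0x1F466 && cp <= 0x1F469) return Prop.EBaseOrEbg; // boy, girl, man, woman
        return Prop.Other;
    }

    // Original GB10: (E_Base | EBG) Extend* x E_Modifier.
    // Returns one flag per adjacent pair: true = boundary, false = no boundary.
    public static List<bool> OriginalBreaks(IReadOnlyList<int> cps)
    {
        var breaks = new List<bool>();
        for (int i = 1; i < cps.Count; i++)
        {
            Prop cur = Classify(cps[i]);
            bool join = cur == Prop.Extend; // GB9: never break before Extend
            if (cur == Prop.EModifier)
            {
                // Skip back over Extend*, then require an emoji base.
                int j = i - 1;
                while (j >= 0 && Classify(cps[j]) == Prop.Extend) j--;
                join = j >= 0 && Classify(cps[j]) == Prop.EBaseOrEbg;
            }
            breaks.Add(!join);
        }
        return breaks;
    }

    // Relaxed GB10: (E_Base | EBG | Extend) x E_Modifier, looking only at the
    // immediately preceding code point. "x Extend" is covered by the GB9 check.
    public static List<bool> RelaxedBreaks(IReadOnlyList<int> cps)
    {
        var breaks = new List<bool>();
        for (int i = 1; i < cps.Count; i++)
        {
            Prop prev = Classify(cps[i - 1]);
            Prop cur = Classify(cps[i]);
            bool join = cur == Prop.Extend
                || (cur == Prop.EModifier && (prev == Prop.EBaseOrEbg || prev == Prop.Extend));
            breaks.Add(!join);
        }
        return breaks;
    }
}
```

For U+61, U+300, U+1F3FB, OriginalBreaks returns false, true (× ÷) and RelaxedBreaks returns false, false (× ×). The relaxed form avoids scanning backwards, at the cost of attaching a modifier to a combining mark that does not follow an emoji base.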

Acknowledgements

This library is influenced by

License: MIT License

