Unicode characters vs codepoints

Question

arlyon opened this issue a year ago

Hi! It would be nice if this library specified how it handles multi-codepoint characters or graphemes (🎉). I was comparing it against the doublestar Go library (https://github.com/bmatcuk/doublestar), which seems to handle Unicode, whereas this library evaluates globs at the codepoint level, so certain things don't line up.

Example: a[^b]c matches acc, but not a🔥c. Emoji are a simple example, but there is a large volume of 'regular' Unicode, such as characters from other languages, that could end up in paths. I am willing to contribute (and have started on) a feature-flag toggle that allows for this, since looking for grapheme boundaries will presumably be more performance-intensive than simply going char-for-char.
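
To illustrate the mismatch, here is a minimal sketch in Rust (an assumed language for this thread) that hard-codes the pattern a[^b]c twice: once comparing bytes and once comparing `char`s. The helper names `matches_bytewise` and `matches_charwise` are hypothetical and are not part of the library's API.

```rust
/// Hypothetical helper: matches the pattern `a[^b]c` one byte at a time.
/// The class `[^b]` consumes exactly one byte here.
fn matches_bytewise(path: &str) -> bool {
    let b = path.as_bytes();
    b.len() == 3 && b[0] == b'a' && b[1] != b'b' && b[2] == b'c'
}

/// Hypothetical helper: matches the same pattern one `char`
/// (Unicode scalar value) at a time.
fn matches_charwise(path: &str) -> bool {
    let mut chars = path.chars();
    matches!(
        (chars.next(), chars.next(), chars.next(), chars.next()),
        (Some('a'), Some(c), Some('c'), None) if c != 'b'
    )
}
```

With these definitions, `matches_bytewise("a🔥c")` is false because 🔥 occupies four bytes in UTF-8, while `matches_charwise("a🔥c")` is true. Going `char` by `char` still does not cover graphemes built from several codepoints: `matches_charwise("ae\u{301}c")` (an e followed by a combining acute accent) is false, which is the case a grapheme-aware mode would need to address.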

I would not expect this to work with ranges (to me that should be undefined), though we could define a range match as lowu32 <= var <= highu32.
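
The numeric comparison suggested above could look like the following sketch. It assumes Rust and a hypothetical helper named `in_range`; whether ranges over non-ASCII characters should be supported at all is left open in this thread.

```rust
/// Hypothetical helper: is `c` inside the inclusive class range `lo-hi`,
/// comparing the codepoints as u32 values?
fn in_range(c: char, lo: char, hi: char) -> bool {
    let v = c as u32;
    (lo as u32) <= v && v <= (hi as u32)
}
```

Under this definition `in_range('β', 'α', 'ω')` is true, while `in_range('é', 'a', 'z')` is false because é (U+00E9) sorts after z (U+007A). The ordering is purely numeric and says nothing about alphabetical order in any given language.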

Thanks for the lib!

Alex

Devon Govett · Answer 1 · Fri Apr 07 2023 00:33:20 GMT+0800 (China Standard Time)

Yeah, at the moment it just treats the glob as bytes. I think we could probably do this in a way that isn't too perf-intensive: for example, treating the glob as bytes to find special characters (e.g. *, [, etc.), but then interpreting the contents of a character class as Unicode characters.
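
A rough sketch of that hybrid idea, assuming Rust, might look like this. It is not the library's implementation: `match_simple` and `class_matches` are hypothetical names, and the sketch only handles literals and [...] classes (no *, ?, escapes, or a literal ] inside a class). Scanning for [ and ] by byte is safe because ASCII bytes never occur inside a multi-byte UTF-8 sequence.

```rust
/// Hypothetical helper: does the class body (the text between `[` and `]`)
/// accept `c`? The body is decoded as chars, so ranges compare codepoints.
fn class_matches(body: &str, c: char) -> bool {
    let (negated, body) = match body.strip_prefix(|p: char| p == '^' || p == '!') {
        Some(rest) => (true, rest),
        None => (false, body),
    };
    let items: Vec<char> = body.chars().collect();
    let (mut i, mut found) = (0, false);
    while i < items.len() {
        if i + 2 < items.len() && items[i + 1] == '-' {
            found |= items[i] <= c && c <= items[i + 2]; // range such as a-z
            i += 3;
        } else {
            found |= items[i] == c; // single member
            i += 1;
        }
    }
    found != negated
}

/// Hypothetical sketch: literals are compared byte by byte, and only a
/// character class decodes a full char from the path.
fn match_simple(glob: &str, path: &str) -> bool {
    let (g, p) = (glob.as_bytes(), path.as_bytes());
    let (mut gi, mut pi) = (0, 0);
    while gi < g.len() {
        if g[gi] == b'[' {
            // Locate the closing bracket with a plain byte scan.
            let len = match g[gi + 1..].iter().position(|&b| b == b']') {
                Some(len) => len,
                None => return false, // unterminated class: no match
            };
            let body = &glob[gi + 1..gi + 1 + len];
            // Decode exactly one char from the path for the class.
            let c = match path.get(pi..).and_then(|s| s.chars().next()) {
                Some(c) => c,
                None => return false,
            };
            if !class_matches(body, c) {
                return false;
            }
            pi += c.len_utf8();
            gi += len + 2; // skip `[`, the body, and `]`
        } else {
            if p.get(pi) != Some(&g[gi]) {
                return false;
            }
            gi += 1;
            pi += 1;
        }
    }
    pi == p.len()
}
```

With this sketch, `match_simple("a[^b]c", "a🔥c")` and `match_simple("[α-ω]x", "βx")` both return true, while `match_simple("a[^b]c", "abc")` returns false. The extra decoding cost is only paid inside a class, which is the property the answer above is after.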