devongovett / glob-match

An extremely fast glob matching library in Rust.

Unicode characters vs codepoints

arlyon opened this issue · comments

Hi! It would be nice if this library specified how it handles multi-codepoint characters or graphemes (🎉). I was comparing this against the doublestar Go library (https://github.com/bmatcuk/doublestar), which seems to handle Unicode, whereas this evaluates globs at the codepoint level, so certain things don't line up.

Example: a[^b]c matches acc, but not a🔥c. Of course emoji is a simple example, but there are large volumes of 'regular' Unicode, such as characters from other languages, that could end up in paths. I am willing to contribute (and have started) a feature-flag toggle that allows for this, since looking for grapheme boundaries will presumably be more performance intensive than simply going char-for-char.
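To make the mismatch concrete, here is a minimal sketch that hard-codes the single pattern a[^b]c in two ways. The names `match_bytes` and `match_chars` are hypothetical helpers written for this illustration; they are not part of glob-match or doublestar, and they only assume that a class consumes one byte in the first case and one `char` (Unicode scalar value) in the second.

```rust
/// Byte-level check for the pattern `a[^b]c`.
/// The class `[^b]` consumes exactly one byte of the path.
fn match_bytes(path: &str) -> bool {
    let b = path.as_bytes();
    b.len() == 3 && b[0] == b'a' && b[1] != b'b' && b[2] == b'c'
}

/// Codepoint-level check for the same pattern.
/// The class `[^b]` consumes exactly one `char`, however many bytes it spans.
fn match_chars(path: &str) -> bool {
    let mut it = path.chars();
    matches!(
        (it.next(), it.next(), it.next(), it.next()),
        (Some('a'), Some(x), Some('c'), None) if x != 'b'
    )
}
```

Both helpers accept acc and reject abc. For a🔥c, `match_bytes` returns false and `match_chars` returns true, because 🔥 (U+1F525) is a single codepoint encoded as four UTF-8 bytes. Note that going per `char` still does not cover graphemes made of several codepoints (for example a letter followed by a combining accent); that would need grapheme segmentation, which the Rust standard library does not provide (the unicode-segmentation crate is one option).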

I would not expect this to work with ranges (to me, that should be undefined), though we could have lowu32 <= var <= highu32.

Thanks for the lib!

Alex

Yeah at the moment it just treats the glob as bytes. I think we could probably do this in a way that isn't too perf intensive. For example treating the glob as bytes to find special characters (e.g. *, [, etc.), but then within a character class interpreting as Unicode characters.
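A rough sketch of that idea, under stated assumptions: the glob is scanned as bytes to find *, [ and ], and only the contents of a character class (and the path character it is tested against) are decoded as Unicode scalar values. This is not glob-match's actual implementation; `sketch_match`, `match_at` and `class_contains` are hypothetical names, and the sketch ignores path separators, ?, ** and escapes. Ranges compare by scalar value, which is an assumption rather than something the thread settles.

```rust
/// Hypothetical sketch: literals, `*`, and `[...]` / `[^...]` / `[!...]` classes.
fn sketch_match(glob: &str, path: &str) -> bool {
    match_at(glob, 0, path, 0)
}

fn match_at(glob: &str, gi: usize, path: &str, pi: usize) -> bool {
    let (g, p) = (glob.as_bytes(), path.as_bytes());
    if gi == g.len() {
        return pi == p.len();
    }
    match g[gi] {
        // `*`: try every remaining byte offset (no `/` handling in this sketch).
        b'*' => (pi..=p.len()).any(|next| match_at(glob, gi + 1, path, next)),
        b'[' => {
            // `]` is ASCII, so a byte scan never lands inside a multi-byte sequence.
            let Some(len) = g[gi + 1..].iter().position(|&b| b == b']') else {
                return false; // unterminated class: treated as "no match" here
            };
            let close = gi + 1 + len;
            let mut class = &glob[gi + 1..close];
            let negated = class.starts_with('^') || class.starts_with('!');
            if negated {
                class = &class[1..];
            }
            // Only at this point is the path decoded as a Unicode scalar value.
            if !path.is_char_boundary(pi) {
                return false;
            }
            let Some(c) = path[pi..].chars().next() else {
                return false;
            };
            class_contains(class, c) != negated
                && match_at(glob, close + 1, path, pi + c.len_utf8())
        }
        // Everything else is a literal byte, compared without decoding.
        lit => pi < p.len() && p[pi] == lit && match_at(glob, gi + 1, path, pi + 1),
    }
}

/// Tests `c` against class items; `lo-hi` ranges compare by scalar value.
fn class_contains(class: &str, c: char) -> bool {
    let items: Vec<char> = class.chars().collect();
    let mut i = 0;
    while i < items.len() {
        if i + 2 < items.len() && items[i + 1] == '-' {
            if items[i] <= c && c <= items[i + 2] {
                return true;
            }
            i += 3;
        } else {
            if items[i] == c {
                return true;
            }
            i += 1;
        }
    }
    false
}
```

With this sketch, a[^b]c matches both acc and a🔥c while still rejecting abc. Literal segments never pay for decoding; the cost of walking `char`s is confined to classes, which is the trade-off the comment above describes.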