Consider using RE2 (aka golang `regexp`) for wildcard matches
evankanderson opened this issue · comments
You might be interested to know that the Go regexp library uses finite automata already to implement the regexp
package. From that package:
The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. ... For more information about this property, see
It should be trivial(1) to replace the current shellstyle
matcher with a translation to the regexp
module, and to check for performance differences.
(1) exercise left to the reader
The one place where shellstyle
might be faster than regexp
is if you require start and end matching and allow only one glob in the middle. At that point, shellstyle
becomes a shortcut for startswith("stuff before *") AND endswith("stuff after *") AND len(string) > len(expression)
.
// An RE2::Set represents a collection of regexps that can
// be searched for simultaneously.
// Adds pattern to the set using the options passed to the constructor.
// Returns the index that will identify the regexp in the output of Match(),
// or -1 if the regexp cannot be parsed.
// Indices are assigned in sequential order starting from 0.
// Errors do not increment the index; if error is not NULL, *error will hold
// the error message from the parser.
int Add(const StringPiece& pattern, std::string* error);
Might be relevant? Source
First of all, I believe it's not just a Go thing, more or less every regex library has finite automata inside, isn't that the only sane approach?
RE2::Set at one level sounds much like what Quamina implements. The interesting thing about Quamina is that you can add as many patterns as you want and it doesn't affect the performance very much. Another useful thing about Quamina is that each pattern you add has an identifier and the Match operation tells you which of the patterns you added matched.
Anyhow, I'll have a closer look at the Go regexp library. I wonder if it gives programmatic access to the DFA produced by compiling the regex? If so, that would be incredibly useful. If not, it still might be worthwhile forking regexp in order to steal that compiler.
Personally, over in the IETF JSONPATH Working Group we're working on IRegexp, https://datatracker.ietf.org/doc/draft-ietf-jsonpath-iregexp/ and that's the subset I'd like to try to support in Quamina.
If you read the linked paper, you'll see that both pcre
and Java have implemented features that require backtracking rather than a finite automata approach.
I mostly mentioned re2 (open source as well) and Go's regexp
libraries because they tend to be well-benchmarked and performance tuned already. (And yes, matching should be roughly O(n) worst-case on the length of the input once compilation is complete.)
Right, and of course having backtracking doesn't mean you're not using DFAs. Sometime I will share the funny/horrible story about how the Java regex implementation blew up JRuby on Rails. Anyhow, as noted, I'll take a close look as to whether we can get at Go regexp's automaton-building code. Of course, that software is aimed at answering the question “Does this regex match this string and where does it start?" while Quamina aims at the question “Which of the many patterns in this instance match this string?” but there still may be value to share.