isaacs / minimatch

a glob matcher in javascript

Single asterisk and embedded folders

ruffin-- opened this issue · comments

I'm trying to write a tool that matches the behavior of another tool that claims to use glob format when matching files, but its * handling is different than minimatch's.

Here's where this app explains its * management in its docs:

{ "Includes": ["*test1*"] },

Includes all test [files] that contain test1 in its path. This is in glob format.

That's the same rule I see at this glob test site, namely:

*.{js,jsx,ts,tsx,md,html}

Match js, jsx, ts, tsx, md and html files in the root folder and its descendants [emphasis mine]

In that case it's a combination match, but the simpler case holds too. *.js matches every *.js file, regardless of path depth.

When I try the same thing (use a pattern surrounded by *s) with minimatch, however, I don't get as broad a match. minimatch only matches folder-free patterns.

Here are some tests with minimatch...

var minimatch = require('minimatch');

function test(path, pattern) {
    console.log({
        path,
        pattern,
        match: minimatch(path, pattern, { dot: true }),
    });
}

function testPaths(pattern) {
    var paths = [
        './folder/myFile.test.js', 
        'myFile.test.js', 
        '/path/to/folder/myFile.test.js'
    ];

    paths.forEach((path) => test(path, pattern));
    console.log('\n');
}

testPaths('*test*');
testPaths('**/*test*');

and the results:

// should ALL be true according to other sources
// minimatch only matches the "bare" file name with *test*
{path: './folder/myFile.test.js', pattern: '*test*', match: false}
{path: 'myFile.test.js', pattern: '*test*', match: true}
{path: '/path/to/folder/myFile.test.js', pattern: '*test*', match: false}

// I sort of thought the dot: true setting would give the first a match.
// Otherwise this is what the app & glob test site expect *test* to do.
{path: './folder/myFile.test.js', pattern: '**/*test*', match: false}
{path: 'myFile.test.js', pattern: '**/*test*', match: true}
{path: '/path/to/folder/myFile.test.js', pattern: '**/*test*', match: true}

Is this a conscious choice by minimatch to be different? Or are the other two examples abusing a glob standard? Which is "right"? Or are house rules the norm for glob packages?

So, what you're observing is that "glob format" doesn't actually have a specification, per se. More precisely, most "glob" implementations are a superset of the IEEE Std 1003.1-2001 standard for pattern matching. What follows is my own recollection and understanding, having spent the last 13 years or so maintaining a glob implementation that endeavors to be as "correct" (for its design objective) as possible. If there are errors or omissions below, I'm happy to be corrected, but I'm confident it's mostly accurate.

The de facto format has evolved over time, starting with the original Bourne Shell (/bin/sh on many systems), and extended by the sh-like POSIX shells (ksh, ash, bash, and zsh). Bash and zsh are by far the most popular of these, and have (for the most part) identical behavior with respect to path expansion.

The biggest outliers, traditionally, have been fnmatch(3) and find(1), which do glob matching somewhat irrespective of path parts, and have been a part of the standard. So for example, find . -path '*test*' will find all the paths containing 'test' anywhere in them. (Not just ./testa/foo but even ./foo/bar/baz/test.js!)

However, in bash or zsh, you'll note that echo *test* does not match a file like ./foo/bar/baz/test.js, but will match ./testa and ./xtesty. That is because path expansion in a shell does match in a way that is respective of path parts (ie, the things between the / characters). The way to have a wildcard that matches zero or more path portions is to use ** as an entire path portion on its own. So, for example, echo **/*test*/** (assuming that globstar is enabled in the shell options, and the shell supports globstar) will produce the same results as find . -path '*test*'.
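To make the difference between the two styles concrete, here is a minimal sketch that handles only `*` and literal characters. It is not how minimatch or find are implemented; `findStyleMatch` and `shellStyleMatch` are hypothetical helpers, and the sketch ignores details such as the special treatment of leading dots.

```javascript
// Simplified sketch: supports only `*` and literal characters.
// Helper names are hypothetical; this is not minimatch's implementation.

// Escape characters that have a special meaning in a RegExp.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// find-style: `*` matches any run of characters, including `/`.
function findStyleMatch(path, pattern) {
    const source = pattern.split('*').map(escapeRegExp).join('.*');
    return new RegExp('^' + source + '$').test(path);
}

// Shell-style: `*` matches any run of characters within one path portion,
// so it never crosses a `/`.
function shellStyleMatch(path, pattern) {
    const source = pattern.split('*').map(escapeRegExp).join('[^/]*');
    return new RegExp('^' + source + '$').test(path);
}

const paths = [
    './folder/myFile.test.js',
    'myFile.test.js',
    '/path/to/folder/myFile.test.js',
];

const results = paths.map((path) => ({
    path,
    findStyle: findStyleMatch(path, '*test*'),
    shellStyle: shellStyleMatch(path, '*test*'),
}));

console.log(results);
```

With the pattern `*test*`, the find-style helper accepts all three paths, while the shell-style helper accepts only the bare `myFile.test.js`, which is the same split reported in the question.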

There is nothing "incorrect" about an implementation choosing to interpret globs in the fnmatch/find-style. But it is different, and in my opinion, each has about the same right to call its format "glob" as the others. It depends on the purpose and use cases that they want to solve for.

The goal of node-glob and minimatch is to provide a glob implementation that matches the behavior of the shell. That is, if echo *test* would include the file, then minimatch(path, '*test*') should return true. That means it's got to be respective of path portions; ie, bash-style, not find-style.

When I initially set off down this path of writing and maintaining a bash-style glob implementation in JavaScript, I naively assumed that libglob and fnmatch would give me all I needed. (The initial implementation of node-glob was a compiled module that called out to these libraries.) I quickly ran into the issue that (a) libglob has not been maintained for some time, (b) bash and zsh don't use it, and (c) it's pretty much impossible to get from there to "globs that work the same as the shell". What's worse, the code used by these libraries isn't really factored in a way that makes it particularly easy to just yank it out and use those bits directly, and repeatedly crossing the JS-C boundary had profoundly worse performance than relying on JS RegExp objects, which are highly optimized in the JS VM. This led to the creation of minimatch as a pattern engine for node-glob, written entirely in JavaScript, which translates a glob pattern into an equivalent RegExp (well, really, a set of RegExps, which when applied in a specific way, produce the same results; there are a few obscure edge cases where Minimatch.makeRe() won't give you quite the same results, or quite as efficiently, especially where it regards some of the finer points of globstar behavior, but it's close enough for the vast majority of cases.)
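As a rough illustration of the glob-to-RegExp idea, here is a simplified sketch that builds a single RegExp from a pattern. `globToRegExp` is a hypothetical helper, not minimatch's code: it supports only literal text, `*`, `?`, and `**` as a whole path portion, and it skips the rules for leading dots and for `.` and `..` portions, so it will not reproduce every minimatch result.

```javascript
// Simplified sketch of translating a shell-style glob into one RegExp.
// Supports only literals, `*`, `?`, and `**` as an entire path portion.
// globToRegExp is a hypothetical helper, not part of minimatch.

// Escape characters that have a special meaning in a RegExp.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Translate one path portion (the text between two `/` characters).
function portionToSource(portion) {
    let source = '';
    for (const char of portion) {
        if (char === '*') {
            source += '[^/]*'; // any run of characters, but never a `/`
        } else if (char === '?') {
            source += '[^/]'; // exactly one character, but never a `/`
        } else {
            source += escapeRegExp(char);
        }
    }
    return source;
}

function globToRegExp(pattern) {
    const portions = pattern.split('/');
    let source = '';
    portions.forEach((portion, index) => {
        const isLast = index === portions.length - 1;
        if (portion === '**') {
            // Globstar: zero or more whole path portions.
            source += isLast ? '.*' : '(?:[^/]*/)*';
        } else {
            source += portionToSource(portion) + (isLast ? '' : '/');
        }
    });
    return new RegExp('^' + source + '$');
}

const single = globToRegExp('*test*');
const globstar = globToRegExp('**/*test*');

console.log(single.test('myFile.test.js'));
console.log(single.test('/path/to/folder/myFile.test.js'));
console.log(globstar.test('myFile.test.js'));
console.log(globstar.test('/path/to/folder/myFile.test.js'));
```

Under these assumptions, `*test*` accepts only the slash-free path, while `**/*test*` accepts both, because the globstar portion can stand for zero or more leading directories.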


So, tl;dr -

Is this a conscious choice by minimatch to be different?

It's a conscious choice by minimatch to use modern versions of Bash as a reference implementation, and thus different from libraries that make a different choice, yes.

Or are the other two examples abusing a glob standard?

It is arguably impossible to implement globs without "abusing" the standard (or lack thereof). But I feel like the most responsible option is to make it clear which style of globs a library is targeting, which I try to do here.

Which is "right"?

Whichever one gives you the behavior you are looking for, that's the one that's "right" for your circumstance. Normativity is always subjective. The universe does not care about our preferences; it's on us to determine what is right and wrong for ourselves.

Or are house rules the norm for glob packages?

Within reason, yeah. The best practice imo is to make it clear which glob tradition an implementation is following. If you find that one is not doing that, or claiming that they're "correct" without explaining what that means, then I'd say that's a valid complaint.

Note that globstar is not available in versions of Bash prior to (iirc?) v4, and both globstar and extglob are disabled by default. But extglob and globstar are hella useful, so this library enables them by default unless you specify noextglob/noglobstar, and targets Bash 5.latest as the reference implementation.

@isaacs -- Hey, sorry, was coming back to this to grab the URL to share and noticed I hadn't replied. Shame!

Thank you very much for the extremely informative -- and also at times pretty danged funny -- response. I owe you a 🍺 equivalent at the very least. A much more interesting answer than I expected or deserved.

Thanks again.