pathikrit / better-files

Simple, safe and intuitive Scala I/O

Home Page: https://pathikrit.github.io/better-files/

Cooked Glob and Regex?

jaa127 opened this issue

Would it make sense to support cooked glob and regex in better-files?

At the moment, better-files glob works like this:

     // root_a = basedir / "a"
     // basedir / "a" / "a1" / "t1.txt"
     // basedir / "a" / "a1" / "t2.txt"

     root_a.glob("a1/*.txt").foreach(println)     // finds nothing
     root_a.glob("**/a1/*.txt").foreach(println)  // finds t1, t2

With cooked glob it would be:

     root_a.glob("a1/*.txt").foreach(println)     // finds t1, t2

A cooked glob or regex prepends ("cooks") the base path to the pattern (glob or regex) when the following conditions hold:

  • the pattern is not an absolute path
  • the pattern does not start with a glob or regex special character

The cooked form makes it possible to write more natural globs, because the pattern no longer has to begin with a path-crossing wildcard component such as **/. This is especially important when the globs or regexes are used in configuration files, where non-programmers have to understand how they work.
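The rule above can be sketched with plain java.nio, independent of better-files. This is only an illustration under assumptions: cookGlob, matches and the set of "special" characters are hypothetical names and choices made here, not the SN127 or better-files API, and the example paths assume a Unix-like file system.

```scala
import java.nio.file.{FileSystems, Path, Paths}

object CookedGlobSketch {
  // Characters treated as "glob special" in this sketch (an assumption, not taken from SN127).
  private val globSpecials = Set('*', '?', '[', '{', '\\')

  /** Prepend the base path unless the pattern is absolute or starts with a special character. */
  def cookGlob(base: Path, pattern: String): String = {
    val startsSpecial = pattern.headOption.exists(globSpecials)
    if (pattern.startsWith("/") || startsSpecial) pattern
    else s"${base.toAbsolutePath.normalize}/$pattern"
  }

  /** Match a glob against a path using the JDK's own glob syntax. */
  def matches(glob: String, path: Path): Boolean =
    FileSystems.getDefault.getPathMatcher("glob:" + glob).matches(path)

  def main(args: Array[String]): Unit = {
    val rootA = Paths.get("/basedir/a")
    val t1    = Paths.get("/basedir/a/a1/t1.txt")

    println(matches("a1/*.txt", t1))                  // false: raw pattern vs. absolute path
    println(matches(cookGlob(rootA, "a1/*.txt"), t1)) // true: base path was prepended
    println(cookGlob(rootA, "**/a1/*.txt"))           // **/a1/*.txt (left as-is)
  }
}
```

One caveat of this sketch: a base path that itself contains glob special characters would have to be escaped before being prepended, and a regex variant would need the base quoted (for example with java.util.regex.Pattern.quote).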

If this makes sense for better-files, I can provide a PR for this feature with tests. There is an existing implementation here (I am the author of SN127, so MIT licensing is not a problem):

findFiles with support for cooked globs and regex:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/main/scala/fi/sn127/utils/fs/FileUtils.scala#L204

Glob-cooking:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/main/scala/fi/sn127/utils/fs/FileUtils.scala#L270

Regex-cooking:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/main/scala/fi/sn127/utils/fs/FileUtils.scala#L288

Glob-cooking tests:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/test/scala/fi/sn127/utils/fs/GlobTest.scala

Glob-findFiles tests:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/test/scala/fi/sn127/utils/fs/GlobTest.scala#L236

Glob-findFiles target:
https://github.com/sn127/utils/tree/b116036de96f7b66fba29117ce91168bf4323c45/tests/globtree

And finally, here is an example of how this cooked form is used in an "end product". This is DirSuite, a ScalaTest extension which lets you define your tests as inputs and output references on the filesystem:

https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/testing/src/test/scala/fi/sn127/utils/testing/DirSuiteDemo.scala#L49

Hmm, interesting. Although this certainly seems useful, I have never come across "glob cooking". Is this something that is known outside sn127? If not, I don't think better-files is the right place for this?

If we were to incorporate it into better-files, we have a couple of options:

  1. Add a boolean cook parameter to the File.glob util

OR

  2. Add a new PathMatcherSyntax here: http://pathikrit.github.io/better-files/latest/api/better/files/File$$PathMatcherSyntax$.html

I would suggest the latter.

git's gitignore works in a similar way. With it, you don't have to provide a path-prefix glob if the gitignore file is inside a subdirectory. If it is at the top level, then there has to be a path-prefix glob. So git in some sense "cooks" the current directory into the glob.

On the public-software side of things, this is used by DirSuite, which in turn is used by Abandon.

Why this has been handy so far:

  • It makes configuration settings clearer, because you don't have to count stars at the beginning of each pattern.
  • It is marginally faster, because the regex begins with a fixed string instead of a wildcard, so the matcher doesn't have to scan the whole string (it can exit early). This is probably negligible in real life.

If this lands in better-files, then a new PathMatcherSyntax would definitely make sense. Then it would be clear which one is in use, and it would also be possible to match direct child subdirectories with a wildcard, without matching deeper subdirectories.

 basedir.cookedGlob("*/12/**.txt")

When does that come up? For example, if you shard ISO dates by year, month, and day, and you have to find all items from December (12) over multiple years. With a normal glob, if the pattern starts with a path-crossing wildcard (**), it will also match all days which are "12"; if it doesn't, it won't match anything.
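The December case can be checked with java.nio's PathMatcher alone. This is a hedged sketch: the /data directory layout and file names are made up for illustration, the third pattern is simply the base path written out by hand in front of */12/**.txt (which is what an explicit cookedGlob call is read to mean here), and the paths assume a Unix-like file system.

```scala
import java.nio.file.{FileSystems, Paths}

object ShardedDatesSketch {
  /** Match a glob against a path string using the JDK's glob syntax. */
  def matches(glob: String, path: String): Boolean =
    FileSystems.getDefault.getPathMatcher("glob:" + glob).matches(Paths.get(path))

  // Hypothetical layout: /data/<year>/<month>/<day>/<file>
  val december   = "/data/2016/12/05/x.txt" // month 12: wanted
  val november12 = "/data/2016/11/12/x.txt" // day 12 of month 11: not wanted

  def main(args: Array[String]): Unit = {
    val patterns = Seq(
      "*/12/**.txt",      // raw, no path-crossing prefix: matches neither absolute path
      "**/12/**.txt",     // raw, path-crossing prefix: matches both
      "/data/*/12/**.txt" // base path prepended: matches only the December file
    )
    for (p <- patterns)
      println(s"$p -> december=${matches(p, december)}, november12=${matches(p, november12)}")
  }
}
```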

Maybe this could be thought of like the ls command: you don't have to provide **/*.txt because you are already inside the directory. With better-files syntax this is even more important (imho):

List all *.txt under a1:

   a1.glob("**/*.txt")

vs.

   a1.glob("*.txt")

Based on the above, maybe it could be argued that the cooked glob should be the default, and the non-cooked one could be rawGlob?

@jaa127: Okay, let's put this in better-files and make it the default. Let's avoid the word "cooking" since that seems non-standard.

Since we would be breaking backwards compatibility, this needs to go in v3.

Also, please document this in the README since it would deviate from UNIX/Java's glob behaviour.

That's great, thanks! If there are some oddities, or if we have second thoughts about making this the default, then this can be revisited later, before the release.

Do you have an idea when v3.0.0 should be ready (in days, weeks, months)?

Weeks.

If you want to use it now, you can depend on 2.17.2-SNAPSHOT

Hi, here is the PR. It's good if there is some time before 3.0.0, so that this can be adjusted if needed.

While looking at some external references, I noticed that the new Python glob works the same way as this implementation. Here is the Python glob doc: https://docs.python.org/3/library/glob.html

There is one difference between Python's glob.glob('**/*.txt', recursive=True) and this implementation: Python also lists files in the current directory, even though the pattern contains that slash. Java doesn't do that.

It could be that Python is in fact wrong in that case, because ls **/*.txt works the same way as this implementation.

ls **/*.txt
a/a.txt  a/x.txt  b/b.txt  c/c.txt  c/x.txt
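
The Java side of this comparison can be reproduced with java.nio's PathMatcher on relative paths. This is a small sketch, not the code from the PR: matchesRelative is a hypothetical helper and top.txt is a made-up file name standing in for a file directly in the current directory.

```scala
import java.nio.file.{FileSystems, Paths}

object DoubleStarSlashSketch {
  private val matcher = FileSystems.getDefault.getPathMatcher("glob:**/*.txt")

  /** Match a relative path against the glob; the file system is not touched. */
  def matchesRelative(path: String): Boolean = matcher.matches(Paths.get(path))

  def main(args: Array[String]): Unit = {
    println(matchesRelative("a/a.txt")) // true: there is a directory component before the slash
    println(matchesRelative("top.txt")) // false: the slash in **/ must match a real separator
  }
}
```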