cea-hpc / rbh-find

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

rbh-find

rbh-find is a close twin of find(1).

Installation

Install the RobinHood library first, then download the sources:

git clone https://github.com/cea-hpc/rbh-find.git
cd rbh-find

Build and install with meson and ninja:

meson builddir
ninja -C builddir
sudo ninja -C builddir install

A work in progress

rbh-find is a close twin of find(1). At least it aims at being one. Right now, it is more like the child version of find's twin.

The structure of the project is all done. The rest should be easy enough. Feel free to contribute!

Usage

rbh-find works so much like find that documenting how it works would look a lot like find's man page. Instead, we will document the differences of rbh-find compared to find. That and a few examples should be enough for users to figure things out.

Examples

The following examples all assume you have a backend set up at rbh:mongo:test. [1]

# find every file with a txt extension
rbh-find rbh:mongo:test -type f -name '*.txt'

# find every .git directory
rbh-find rbh:mongo:test -type d -name '.git'

# find files modified today
rbh-find rbh:mongo:test -mtime 0

# find setuid bit
rbh-find rbh:mongo:test -perm /u+s
[1]to set up a backend, have a look at rbh-sync's documentation

The following examples all showcase the use of the -size predicate.

# list all entries with a size of exactly 2 Gigabytes
rbh-find rbh:mongo:test -size 2G
/testA

# list all entries with a size greater than 1 Megabytes
rbh-find rbh:mongo:test -size +1M
/
/testA
/testB
/testC

# list all entries with a size greater than 1 Megabytes and smaller
# than 1 Gigabytes
rbh-find rbh:mongo:test -size +1M -size -1G
/testB

Paths vs. URIs

The most obvious difference between find and rbh-find is the use of URIs instead of paths:

find /scratch -name '*.txt'
rbh-find rbh:mongo:scratch -name '*.txt'

rbh-find queries RobinHood backends rather than locally mounted filesystems. The canonical way to refer to backends and the entries they manage are URIs. Hence rbh-find uses URIs rather than paths.

For more information, please refer to the RobinHood library's documentation on URIs.

FS traversal vs. Backend filtering

gnu-find can be compared to a configurable sorting machine.

For example, when running the following command:

find -type f -name '*.txt' -print

The first thing find does is build a tree -- or rather, a pipeline -- of its command line's predicates (-type f, -name '*.txt') and actions (-print):

                                       true  --------- (always) true  -----
                                          -->| print |--------------->| ø |
               true  -------------------  |  ---------                -----
                  -->| name =~ ".txt$" |--|
----------------  |  -------------------  |  -----
| type == FILE |--|                       -->| ø |
----------------  |  -----            false  -----
                  -->| ø |
              false  -----

Then it traverses the current directory (because "." is implied), and its subdirectories, and their subdirectories, ... And each filesystem entry it encounters goes through the pipeline. Once.

Now, find allows you to place multiple actions on the command line:

find -print -print

This is also converted into a single tree:

--------- (always) true  --------- (always) true  -----
| print |--------------->| print |--------------->| ø |
---------                ---------                -----

And each entry is still only processed once (it is printed twice, but iterated on once).

rbh-find works a little differently. Since it uses RobinHood backends, it can query all the entries that match a set of predicates at once, rather than traverse a tree of directories looking for them. But it cannot ask the backend to run actions on those entries: it has to perform them itself.

The execution flow looks like this:

---------   ----------
| query |-->| action |
---------   ----------

And when there are multiple actions:

-----------   ------------   -----------   ------------
| query-0 |-->| action-0 |-->| query-1 |-->| action-1 |
-----------   ------------   -----------   ------------

Where query-1 is a combination of query-0 and whatever predicates appear between action-0 and action-1.

Another approach would be to fall back to a regular find pipeline after action-0. But this would require reimplementing all the filtering logic of find, and there is no garantee that it would be faster than issuing a new query. So rbh-find does not do it that way.

But what are the consequences of such a choice?

There are three:

  1. for every action, rbh-find sends one query per URI on the command line;
  2. rbh-find's output is not ordered the same way find's is;
  3. rbh-find's actions do not filter out any entries.

An example of the difference in the output ordering:

find -print -print
./a
./a
./a/b
./a/b
./a/b/c
./a/b/c

rbh-find rbh:mongo:test -print -print
./a
./a/b
./a/b/c
./a
./a/b
./a/b/c

The third difference is probably the most problematic. In all the previous examples, we used the action -print which always evaluates to true and so does not filter out any entries. But there are other actions that do exactly that:

# find every file that contains 'string'
find -type f -exec grep -q 'string' {} \; -print

The same query, ran with rbh-find would simply print each file and directory under the current directory. Implementing the same behaviour as find is not impossible: it would simply require keeping track of entries that "failed" actions and exclude them from the next queries. But remembering those entries could prove prohibitively expensive in terms of memory consumption. Moreover the time to build the queries would increase as we exclude more and more entries.

-amin, -cmin, and -mmin

find's -[acm]min predicates do not work quite like -[acm]time in terms of how the time boundaries are computed. There is no apparent reason for this.

rbh-find uses the same method for all 6 predicates which it borrows from find's -[acm]time.

-size

rbh-find's -size predicate works exactly like find's -size, but with the addition of the T size, for Terabytes.

-perm

The implementation is still a work in progress as some differences with GNU find still exist.

rbh-find's -perm predicate works like GNU find's except that GNU find supports '-', '/' and '+' as a prefix for the mode string. The '+' is deprecated and not used by GNU find but does not trigger a parsing error. Whereas, it is a parsing error to use '+' in rbh-find as a prefix. Keep in mind that some symbolic modes start with a '+' such as '+t' which corresponds to the sticky bit. This '+' sign represents the operation to perform as '-' and '=' not the prefix and is the reason for the deprecation of '+' as a prefix.

So looking for all the files with a sticky bit could be done with /+t. And +t would match on file with only the sticky bit set and no other permission.

Extra features

-count

rbh-find defines a -count action that pretty much does what you would expect: count the matching entries.

# count the file with a '.c' or '.h' extension
rbh-find rbh:mongo:test -type f -name '*.c' -o -name '*.h' -count
71 matching entries

The message format is not yet stable. Please do not rely on it.

-sort/-rsort

rbh-find defines the -sort and -rsort options which allow sorting entries based on their name, last access time, ... in ascending and descending order.

rbh-find rbh:mongo:test -sort name
./
./a
./b
./c
rbh-find rbh:mongo:test -rsort name
./c
./b
./a
./

-sort and -rsort affect the actions that they precede, irrespective of logical operators: parentheses, !, -or, and -and.

For example,

rbh-find uri -type f -sort a -name '* .txt' -sort b -o \
  \(-size + 1M -sort c -o -size -1K -sort d \) -print \
  -sort e -print

Is equivalent to: .. code:: bash

rbh-find uri -type f -name '* .txt' -o
(-size +1M -o size -1K ) -sort a -sort b -sort c -sort d -print -sort e -print

Depending on the backend and the field being sorted on, this option may provide orders of magnitude faster results than sorting entries after the fact. That is because, for database-like backends, ordering entries on an indexed field is usually (if not always) an efficient process.

For technical reasons, not every backend supports sorting, and those which do, may not be able to in every situation. For example, at the time of writing, the mongo backend does not support sorting for fragmented URIs:

rbh-find rbh:mongo:test -sort type
./
./dir-0
./dir-1
./dir-0/file-0
./dir-0/file-1
./dir-1/file-2
./dir-2/file-3

rbh-find rbh:mongo:test#dir-0 -sort type
rbh-find:../rbh-find.c:81: filter_fsentries: Operation not supported

In these cases, short of finding a tricky way to achieve the same result:

rbh-find rbh:mongo:test#dir-0 -type d -print -o -type f -print
./
./dir-0
./dir-0/file-0
./dir-0/file-1

You will have to resort to manually sorting the output:

rbh-find rbh:mongo:test#dir-0 -printf "%y %p\0" | sort -zsk1,1 |
    cut -zd' ' -f2- | tr '\0' '\n'
./
./dir-0
./dir-0/file-0
./dir-0/file-1

About

License:GNU Lesser General Public License v3.0


Languages

Language:C 82.4%Language:Shell 14.7%Language:Meson 2.9%