This is a resurrection of an ancient project: a "mostly read-only file management tool". It is intended for keeping a large list of checksums in a database so that duplication, movement, and corruption of files can be detected. In addition to maintaining a single database, it also offers cross-database functionality.
We speak of "observations" to mean an association of a file path and its contents (or, at least, their cryptographic checksum). Most operations on the checksum database pertain to one or more observations.
This program is just a shim around a database; it does not interact with the filesystem much itself. Instead, it should be used in composition with things like find and the GNU coreutils digest programs (e.g. sha512sum), delegating details of filesystem traversal and choice of hash and so on to the user.
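To make the notion of a digest stream concrete, here is a minimal sketch using only find and sha512sum on a scratch directory. The directory layout and variable names are made up for illustration; nothing here touches cdb itself.

```shell
# Build a small scratch tree so the example is self-contained.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/b"
printf 'hello\n' > "$tmp/a/one.txt"
printf 'hello\n' > "$tmp/a/b/two.txt"   # same contents, different path

# Emit one "DIGEST  PATH" line per file; this is the digest stream.
stream=$(find "$tmp" -type f -exec sha512sum {} + | sort -k 2)
printf '%s\n' "$stream"

# Identical contents yield identical digests, so duplicates show up as
# repeated values in the first column.
unique_digests=$(printf '%s\n' "$stream" | cut -d ' ' -f 1 | sort -u | wc -l)

rm -rf "$tmp"
```

Each line of the stream is one observation in the sense above: a path paired with the checksum of its contents.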
This program requires...
- either the Lua 5.3 interpreter or luajit,
- the Lua argparse and penlight libraries, and
- lua-dbi and its lua-dbi-sqlite3 driver.
To reduce clutter, many of the examples here rely on cdb's ability to pull the default database from the $CDB environment variable. If that's not what you want, add --db ${DB} to the invocation of cdb.
cdb init
Add the checksum of a single path to the database. This will create a new checksum and/or a new path identifier as needed and will bind them together:
sha512sum $FILE | cdb addh
Or, for all files under a path:
find $DIR -type f -exec sha512sum {} \+ | cdb addh
If we have a pile of digest files already, each of which contains digests of paths relative to its location, we can generate a database, ${DB2}, from them with the assistance of the cdb-util digest-relativize tool (see below):
find ${DIR} -type f -name SHA512SUMS -print0 | cdb-util drel -1 | cdb addh
Measure the checksum of a path and confirm that the database already held that observation. Reports unexpected files as well as mis-checksummed contents:
sha512sum $FILE | cdb verh
Or for all files under a path:
find $DIR -type f -exec sha512sum {} \; | cdb verh
This processing of digest streams is to be preferred to verifying a digest stream as generated by the database, e.g.:
cdb look \* | sha512sum -c
because the former can be more informative in the case of mismatching digests (specifically, the database can look for other paths that have the reported digest). If it's easier to have the database generate the set of files, that can be done:
cdb look \* --format '$u$z' --nul | xargs -0 sha512sum | cdb verh
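For comparison, the following sketch shows what plain sha512sum -c reports when contents change. It uses only GNU coreutils on a scratch file; the file names are illustrative. On a mismatch all you get is the path and the word FAILED, with no hint about where else those contents might live.

```shell
tmp=$(mktemp -d)
printf 'original\n' > "$tmp/data.txt"
(cd "$tmp" && sha512sum data.txt > SHA512SUMS)   # recorded observation

printf 'corrupted\n' > "$tmp/data.txt"           # simulate corruption

# --quiet suppresses the "OK" lines; a mismatch is reported as FAILED
# and makes sha512sum exit with a non-zero status.
if report=$(cd "$tmp" && LC_ALL=C sha512sum -c --quiet SHA512SUMS 2>/dev/null); then
    status=0
else
    status=$?
fi
printf '%s\nexit status: %s\n' "$report" "$status"

rm -rf "$tmp"
```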
We can augment a database of files by filtering a list of files we have to exclude the list of files we know about. If, however, there is a possibility that some of these files are duplicates of ones already in the database, you may be better off using ingest reflexively.
We can generate the list of files we don't know about using find and cdb filterpath:
find ${DIR} -type f -print0 | \
cdb filterpath -1 -P -p out -0 -f '$u$z' > ${DB}.new-files0
We can then script computing those files' checksums and adding the new reports to the database:
xargs -0 sha512sum > ${DB}.new < ${DB}.new-files0
cdb addh < ${DB}.new
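Conceptually, the filtering step is a set difference: paths on disk minus paths already tracked. The sketch below illustrates that idea with sort and comm on two made-up path lists. It is a stand-in for illustration only, not how cdb filterpath is implemented, and being newline-delimited it is not safe for unusual path names (which is why the pipeline above uses NUL-terminated records).

```shell
tmp=$(mktemp -d)
# Paths found on disk (what find would print) ...
printf '%s\n' /data/a /data/b /data/c | sort > "$tmp/on-disk"
# ... and paths the database already tracks (hypothetical listing).
printf '%s\n' /data/a /data/c | sort > "$tmp/tracked"

# comm -23 keeps lines unique to the first file: on disk but untracked.
new_files=$(comm -23 "$tmp/on-disk" "$tmp/tracked")
printf '%s\n' "$new_files"

rm -rf "$tmp"
```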
For a different approach, we can quickly construct a "just paths" database, which associates all paths with a single digest, from the current state of the file system as follows:
cdb --db ${JPDB} init
find ${DIR} -type f -printf "0 %p\\0" | cdb --db ${JPDB} addh --inul
This database may not seem very useful, but when combined with cdb diff we can quickly find all paths whose checksums are unknown to the database:
cdb diff ${JPDB} --no-headers --flavor=path --which=super --format '$u$z' -0 > ${DB}.new-files0
And then proceed as above.
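To see what the find -printf invocation used to build the "just paths" database actually emits, the sketch below runs the same format string on a scratch directory and turns the NUL terminators into newlines for display. It requires GNU find; the file names are illustrative.

```shell
tmp=$(mktemp -d)
touch "$tmp/x" "$tmp/y"

# Each record is "0 PATH" terminated by a NUL byte; tr makes the NULs
# visible as newlines for display only.
records=$(find "$tmp" -type f -printf '0 %p\0' | tr '\0' '\n' | sort)
printf '%s\n' "$records"

rm -rf "$tmp"
```

Every path is paired with the same placeholder digest, 0, which is why such a database is cheap to build: no file contents are read.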
If we have another database that knows digests for our files, rather than computing digests again, we can extract checksums from ${DB2} and install them into ${DB}:
cdb --db ${DB2} look --inul < ${DB}.new-files0 | cdb --db ${DB} addh
Armed with a "just paths" database as per the above, we can then direct the database to prune tracked paths not in the "just paths" database if the hashes are observed elsewhere:
cdb diff ${JPDB} --flavor=path --which=sub --no-headers --format '$u$z' --nul > ${JPDB}.missing-files0
cdb domv --inul < ${JPDB}.missing-files0
cdb gc > ${DB}.gc
sqlite3 ${DB} < ${DB}.gc
Given a path prefix (possibly empty), report all logged observations below that path of contents that exist in multiple locations (i.e., files with checksum collisions).
Cease to consider a particular path part of the database and remove all observations made of it. Since this application is primarily for data hoarders who tend not to delete things, one should prefer to respond to file moves (see Responding to File Moves) rather than risk removing the last observation of a given hash.
Indicate that some file contents are to be considered a lesser version of some other contents:
cdb addsuper /old/path /new/path
After this command is run, domv will be willing to remove the /old/path entry from the database.
Superseder records can also be added from stdin using addsuperhash (or addsh). This command reads in lines of the form:
old-digest new-digest notes
The notes field extends to the end of the record; if newlines are desired in the recorded notes, use --inul (-1) and separate records by NUL bytes.
Given a digest stream, partition it into hashes already in the database and hashes novel to the database. For the former, optionally generate rm commands, and for the latter, optionally generate mv or cp commands to import into the library. Novel hashes, and their new paths, may optionally be recorded as well, to be subsequently added to the database:
find /source/path -type f -exec sha512sum {} \+ | \
cdb ingest --target /new/path --prune
This will produce a stream of shell commands to copy files given by find into the /new/path directory (using their basename therein). Passing --move generates move rather than copy commands. Passing --prune additionally issues rm commands for source files whose hashes collide with something already in the database.
The --digest-log FILE option will cause ingest to write to FILE every new digest encountered in the stream, associated with its new name in /new/path. This can then be fed back through addhash without needing to recompute digests.
ingest knows how to quote paths for safe handling by POSIX shells (though its mechanism is somewhat crude and not always great for human consumption). However, POSIX shells are willing to forgive control characters in quoted strings while humans and terminals are more likely to make a mess of things. The --escape {posix,extended,human} option will change how ingest quotes such characters.
The ingest command can also be used "reflexively" on the managed collection of files to either add files that are not tracked or prune files that have presence elsewhere in the database. We can enumerate files not tracked using filterpath and compute their checksums as we did in Add Missing Checksums above:
find ${DIR} -type f -print0 | \
cdb filterpath --in-path --predicate=out -0 -1 --format '$u$z' | \
xargs -0 sha512sum > ${DB}.new
We can then prepare to prune duplicates and add unique files:
cdb ingest --prune --inplace --digest-log ${DB}.new2 < ${DB}.new > ${DB}.prune
Add new files to the database with:
cdb addh < ${DB}.new2
Inspect the pruning commands to be run, and then execute them with:
sh < ${DB}.prune
(If you have, or might have, unusual path names, you may be better served with --prune-log rather than --prune. The resulting NUL-terminated list of files can be inspected with cdb-util escape human -0 and run with xargs -0 -- rm --.)

Other Included Utilities
########################
The cdb-util program contains utilities for manipulating digest streams and may grow to include other tools not directly relevant to manipulating cdb databases.
AKA dpre, this command filters a digest stream by adding a prefix to relative paths within. For example, while:
sha512sum *
generates a stream that uses relative names, both of these forms should produce absolute names:
sha512sum $PWD/*
sha512sum * | cdb-util dpre --prefix $PWD
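The effect of prefixing can be approximated with sed, which also makes it easy to check that the two forms agree. This is a rough stand-in for illustration, not the dpre implementation: it assumes plain file names (no newlines or backslashes, which sha512sum would escape) and a prefix free of characters special to sed.

```shell
tmp=$(mktemp -d)
printf 'one\n' > "$tmp/a.txt"
printf 'two\n' > "$tmp/b.txt"

# Relative stream, then prefix every path with the directory name.
# A digest line is 128 hex digits, two spaces, then the path.
prefixed=$(cd "$tmp" && sha512sum -- * | sed "s|^\([0-9a-f]\{128\}  \)|\1$tmp/|")

# Reference: ask sha512sum for absolute names directly.
absolute=$(sha512sum -- "$tmp"/*)

printf '%s\n' "$prefixed"
rm -rf "$tmp"
```

Both variables should hold the same stream, which is the equivalence the two command forms above describe.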
AKA dfex, this command filters a digest stream, limiting it to files that exist. This may be useful if one is ingesting files in stages.
AKA drel, this command is a "recursive digest-prefix": given a stream of names of digest stream files on stdin, this utility opens each and prefixes the paths therein by the path naming the stream. The various streams involved can be made NUL-terminated rather than newline terminated (with escapes) with:
- The usual --nul (-0) continues to affect the output stream (stdout),
- The usual --inul (-1) continues to affect the input stream of digest file names (stdin),
- The new --fnul (-2) indicates that the digest files read internally are NUL-terminated records.
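The relativizing idea can be sketched with a plain shell loop: for each digest file named on input, prefix the paths inside it with the directory that holds the digest file. This is a newline-delimited illustration under the assumption of plain file names, not the drel implementation, and it ignores the escaping and NUL options described above.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/p/q"
printf 'x\n' > "$tmp/p/q/f.txt"
(cd "$tmp/p/q" && sha512sum f.txt > SHA512SUMS)   # digest file with relative paths

# For each digest file, prefix the paths inside with its directory.
relativized=$(find "$tmp" -type f -name SHA512SUMS | while IFS= read -r sums; do
    dir=$(dirname "$sums")
    while IFS= read -r line; do
        # Split "DIGEST  PATH" at the first two-space separator.
        printf '%s  %s/%s\n' "${line%%  *}" "$dir" "${line#*  }"
    done < "$sums"
done)
printf '%s\n' "$relativized"

expected=$(sha512sum "$tmp/p/q/f.txt")
rm -rf "$tmp"
```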
AKA esc, this command maps input records in various ways to make them safe for consumption by shells or similar. This tool is largely a test hook, but is exposed in case it is useful. The --how parameter dictates the transform in question:
- posix escapes strings such that they will be correctly interpreted by POSIX shells, using single quotes whenever possible (except when escaping single quotes, which get escaped with double quotes). This transform will leave non-printing characters in place, including newlines!
- extended escapes strings using $'\xHH' notation, understood by many *NIX shells. Non-printing and non-ASCII bytes are escaped, which can make this somewhat more invasive than might be desired.
- human tries to escape strings using a somewhat messy, but Unicode-aware policy, preserving non-ASCII graphemes where possible, especially when names don't include shell metacharacters.
- digest performs a GNU coreutils digest stream escaping.
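The posix policy described above (single quotes, with embedded single quotes emitted inside double quotes) can be sketched as a small shell function. The posix_quote name is hypothetical and the exact output of cdb-util may differ; note also that command substitution drops trailing newlines, so this sketch is not faithful for names that end in one.

```shell
# Wrap a string in single quotes; each embedded single quote closes the
# quoted run, is emitted inside double quotes, and reopens the run.
posix_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\"'\"'/g")"
}

quoted=$(posix_quote "it's a file")
printf '%s\n' "$quoted"        # 'it'"'"'s a file'

# Round-trip through eval to confirm the shell reads back the original.
eval "roundtrip=$quoted"
printf '%s\n' "$roundtrip"     # it's a file
```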