Option to store uid, gid and perms in cache file

Question

jonjacksonma opened this issue 3 months ago · comments

Hi.
It would be great if qdirstat-cache-writer had an option to store uid, gid and permissions in the cache file. The use case is a large shared filesystem where, after identifying the largest files and folders, you typically want to know who owns them, and where generating the cache file ahead of time via cron is necessary to load the tree in a reasonable time.
thanks,
Jon

Stefan Hundhammer · Answer 1 · Thu Feb 29 2024 23:34:05 GMT+0800 (China Standard Time)

That would mean making the file format incompatible with older versions, which I am very reluctant to do: it is also used by some backup software because it's so simple.

In environments where large shared filesystems are still a thing (I haven't seen many of those over the last 20 or so years), isn't directory ownership typically implicit with the path? Do you really have directory trees where various users create files and directories all over the place? Don't departments, teams and users get some subtree assigned where they have permissions to create their own files and directories?

Stefan Hundhammer · Answer 2 · Thu Feb 29 2024 23:47:38 GMT+0800 (China Standard Time)

The file format specification:

https://github.com/shundhammer/qdirstat/blob/master/doc/cache-file-format.txt

Adding another three fields is pretty trivial, of course. They would be a numeric (!) UID, a numeric GID, and octal permissions. I recall something about a UID <-> user name mapping service for NFS for environments where they may be different on different machines; that would be out of the question.
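
For illustration, the three values described above can be read from the filesystem as follows. This is a minimal Python sketch, not the code used by qdirstat-cache-writer (which is written in Perl); the function name `owner_fields` is hypothetical, and a POSIX system is assumed.

```python
import os
import stat
import tempfile


def owner_fields(path):
    """Return (numeric UID, numeric GID, octal permission string) for path.

    Hypothetical helper for illustration only. lstat() is used so that a
    symlink reports its own ownership rather than that of its target.
    """
    st = os.lstat(path)
    perms = stat.S_IMODE(st.st_mode)  # strip the file-type bits, keep rwx/suid/sgid/sticky
    return st.st_uid, st.st_gid, format(perms, "04o")


# Demonstrate on a temporary file with known permissions.
with tempfile.NamedTemporaryFile() as tmp:
    os.chmod(tmp.name, 0o640)
    uid, gid, perms = owner_fields(tmp.name)
    print(uid, gid, perms)
```

The values stay numeric on purpose: no user or group name lookup is involved, so the result does not depend on any name mapping service.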

The file format already has a version number in its header, which helps to identify which parser to use. But QDirStat would need to retain backwards compatibility with the old file format; that makes the code a bit uglier.
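
Selecting a parser by the version in the header could look roughly like the sketch below. The header shape `[qdirstat 2.0 cache file]` is an assumption made for this example and should be checked against the format specification linked above; `parse_v1`, `parse_v2` and `pick_parser` are hypothetical names, not QDirStat functions.

```python
import re

# ASSUMPTION: the first line looks like "[qdirstat 2.0 cache file]".
# Verify this against the format specification before relying on it.
HEADER_RE = re.compile(r"^\[qdirstat (\d+)\.(\d+) cache file\]")


def parse_v1(lines):
    """Placeholder for a reader of the old format (hypothetical)."""
    return "v1"


def parse_v2(lines):
    """Placeholder for a reader of the new format with UID, GID and permissions (hypothetical)."""
    return "v2"


PARSERS = {1: parse_v1, 2: parse_v2}


def pick_parser(first_line):
    """Choose a parser from the major version number in the header line."""
    match = HEADER_RE.match(first_line.strip())
    if match is None:
        raise ValueError("not a recognized cache file header")
    major = int(match.group(1))
    if major not in PARSERS:
        raise ValueError(f"unsupported cache file version {major}")
    return PARSERS[major]


old = pick_parser("[qdirstat 1.0 cache file]")
new = pick_parser("[qdirstat 2.0 cache file]")
print(old.__name__, new.__name__)
```

Keeping both parsers behind one lookup is one way to retain backwards compatibility without scattering version checks through the reading code.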

Stefan Hundhammer · Answer 3 · Fri Mar 01 2024 05:30:56 GMT+0800 (China Standard Time)

Writing UID, GID and permissions works now in the Perl qdirstat-cache-writer, and the QDirStat binary can read them. Please ~~check out and build the huha-cache-uid branch and~~ do some initial testing.

~~Writing the new format with the QDirStat built-in cache writer will come tomorrow.~~ Done.

Docs for the new file format here. The format has also become a bit prettier and easier to read for humans.

qdirstat-cache-writer now also has two new command line options to enforce the old format V1.0 with -1, and the new format V2.0 with -2 (for completeness; it's the new default).

Stefan Hundhammer · Answer 4 · Fri Mar 01 2024 21:21:30 GMT+0800 (China Standard Time)

This is now merged to master.

Jon Jackson · Answer 5 · Sat Mar 02 2024 05:21:49 GMT+0800 (China Standard Time)

Thanks for adding these options, it will be extremely helpful. The new options -1 and -2 for qdirstat-cache-writer work perfectly. There seems to be an issue with -l though. When the cache file is read into the QDirStat GUI, there are spurious files with names that are 3 or 4 digit numbers and a size of 1.8GB.

Tested with QDirStat 1.9.01-git

Stefan Hundhammer · Answer 6 · Sat Mar 02 2024 07:28:58 GMT+0800 (China Standard Time)

Oops... that was a missing whitespace delimiter in that long format between the type and the name/full path; so both were conflated into what appeared to be one single field like F/foo/bar/myfile, and the subsequent fields on that line all moved up one position.

This is now fixed.
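
The effect of the missing delimiter can be reproduced with a small sketch. The field layout used here (type, path, size, mtime) is a simplified assumption for illustration, and `parse_fields` is a hypothetical helper, not the QDirStat parser.

```python
def parse_fields(line):
    """Split a whitespace-delimited line into named fields.

    ASSUMPTION: simplified layout of type, name, size, then the rest.
    """
    parts = line.split()
    return {"type": parts[0], "name": parts[1], "size": parts[2]}


# Correct line: type and path are separated by whitespace.
good = parse_fields("F /foo/bar/myfile 4096 0x65e1d6ff")

# Broken line: the delimiter between type and path is missing,
# so every following field moves up one position.
bad = parse_fields("F/foo/bar/myfile 4096 0x65e1d6ff")

print(good)
print(bad)
```

In the broken case the real size lands in the name field and the next field is taken as the size, which is consistent with the numeric file names reported above.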

Stefan Hundhammer · Answer 7 · Sat Mar 02 2024 07:32:10 GMT+0800 (China Standard Time)

BTW you can easily look into a cache file, even if it's gzipped: Just use zless, zcat or zgrep. You don't even need to gunzip it.
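
The same kind of inspection also works from a script. A minimal Python sketch using the standard `gzip` module follows; the function name `grep_cache` and the sample file content are made up for the example.

```python
import gzip
import os
import tempfile


def grep_cache(path, needle):
    """Return the lines of a gzipped text file that contain needle (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in fh if needle in line]


# Build a small gzipped sample; the content is invented for this demo.
sample = os.path.join(tempfile.mkdtemp(), "sample.cache.gz")
with gzip.open(sample, "wt", encoding="utf-8") as fh:
    fh.write("D /foo\nF /foo/bar 4096\nF /foo/baz 123\n")

matches = grep_cache(sample, "baz")
print(matches)
```

As with zgrep, the file is decompressed on the fly while reading, so no unpacked copy is written to disk.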

Stefan Hundhammer · Answer 8 · Sat Mar 02 2024 07:40:09 GMT+0800 (China Standard Time)

Fun fact: That whole thing moved the fields of the affected lines one position up, so some other field was interpreted as the size; and time_t of today is around 0x65e1d6ff (seconds since 1970-01-01 00:00) which translates to just about 1.6 GB. :-)
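
The arithmetic can be checked in a few lines of Python. The only assumption is that "GB" here means binary gigabytes (2^30 bytes), which is what yields roughly 1.6.

```python
from datetime import datetime, timezone

# The hex timestamp quoted above, read as a plain integer.
mtime = int("65e1d6ff", 16)

# Interpreted as a time_t: seconds since 1970-01-01 00:00 UTC.
when = datetime.fromtimestamp(mtime, tz=timezone.utc)

# Misinterpreted as a byte count, in binary and in decimal units.
gib = mtime / 2**30
gb = mtime / 10**9

print(mtime)             # 1709299455
print(when.isoformat())  # 2024-03-01T13:24:15+00:00
print(round(gib, 2))     # 1.59
print(round(gb, 2))      # 1.71
```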

Jon Jackson · Answer 9 · Tue Mar 05 2024 06:20:19 GMT+0800 (China Standard Time)

I was wondering how the fields ended up mapped... Thanks for the fix.

This doesn't affect functionality at all, but while testing I noticed that the cache file itself gets listed in the cache with a size of 257 bytes, i.e. its size while it's still being written. Excluding the cache file from the listing could be a cosmetic improvement.

Stefan Hundhammer · Answer 10 · Tue Mar 05 2024 08:22:05 GMT+0800 (China Standard Time)

Well, it's there in that directory, so of course it will be listed. And yes, of course this is just a snapshot in time, and a moment later the size may be different; like with all files on a modern OS.