njsmith / posy


Figuring out our relationship with the filesystem

njsmith opened this issue · comments

Currently we handle all our on-disk storage through the KVFileStore and KVDirStore abstractions. They're both basically key->value mappings, where the value is either a file or a directory respectively. They have per-key locking, and try to implement atomic updates when possible.

But these aren't necessarily the best abstractions for what we need, because I wrote them before knowing how we were going to use them. And also, while I thought they should work on Windows, it turns out there were a few details I was missing (see #4) that makes them pretty fragile, esp. in the presence of operations like filesystem indexing or AV scanning that can randomly open files. And they don't currently have any support for garbage-collecting old data.

So here's a brain dump about what actual KV stores we've ended up with and what properties each one needs.

  • hash_cache: map artifact hash -> artifact (i.e. wheel/sdist/pybi)

    • Holds: blobs
    • Access pattern: write once, must not have partial writes
    • Cleanup: can discard freely, but can't break ongoing reads, which are definitely incremental
    • Locking is useful to avoid redundant work if multiple posy invocations are running simultaneously
  • metadata_cache: maps artifact hash -> core METADATA for that artifact (useful to skip the dance required to pull it out of a remote zip file, and saves locally built metadata from sdists)

    • same properties as hash_cache, except that we always slurp in the whole file in one shot
  • wheel_cache: maps sdist hash -> directory of wheels that we've built from it

    • Holds: directory of named blobs (wheels)
    • Access pattern: each wheel inside is write once, must not have partial writes
    • Cleanup: can discard freely, but do incremental readdir and file reads
    • Locking is especially useful to avoid redundant work if multiple posy invocations are running simultaneously
  • http_cache: maps request info -> request metadata + previous response

    • Holds: blobs
    • Access pattern: read/modify/write. Currently we do streaming reads. Bodies are mostly simple API pages, so tens to hundreds of kilobytes. In the future might include other larger items, like if we decide to support somepkg @ https://.../somepkg.whl. Must not have partial writes.
    • Cleanup: can discard freely, but can't break ongoing reads -- which are currently incremental, but might be able to do one-shot slurp into buffer
  • EnvForest: maps wheel/pybi hash to unpacked tree, or sdist hash to a directory containing unpacked trees

    • Holds: whole complex directory hierarchies
    • Access pattern: write once
    • Cleanup: can only discard items that aren't currently in use by any running environment (eek -- how do we know if an environment is running?)
  • build_store: maps sdist hash -> build scratch space

    • Holds: whole directory hierarchies
    • Access pattern: arbitrary code runs and mutates whatever it wants
    • Cleanup: allocated per-process, so can just discard when done
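
The first three caches share a "write once, no partial writes" shape. A rough sketch of that pattern in Python for illustration (posy itself is Rust; the name `put_once` is made up, and the rename step is exactly the part that #4 showed is fragile on Windows):

```python
import os
import tempfile


def put_once(root: str, key: str, data: bytes) -> str:
    """Write-once insert: write to a temp file, then atomically rename
    into place, so readers never observe a partially written value."""
    os.makedirs(root, exist_ok=True)
    dest = os.path.join(root, key)
    if os.path.exists(dest):
        return dest  # another invocation already wrote it
    fd, tmp = tempfile.mkstemp(dir=root)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # contents on disk before the rename commits them
        os.replace(tmp, dest)  # atomic on POSIX; can be thwarted on Windows
    finally:
        if os.path.exists(tmp):
            os.unlink(tmp)  # clean up only if the rename never happened
    return dest
```

Per-key locking to avoid redundant work would sit around this, as a separate concern.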

https://stackoverflow.com/a/57358387/ has some very smart-sounding comments about what actually works on Windows; one claim is that deleting a large file on NTFS can itself be non-atomic, and a badly timed crash could leave the file truncated instead. There's also the newer and undocumented (but public) FILE_RENAME_FLAG_POSIX_SEMANTICS, which... might do something useful? Might need to experiment to figure out what this thing actually does. [Edit: Turns out you need to read the kernel-level documentation. The answer is that it lets you overwrite a destination file that still has open handles. It doesn't, AFAIK, do anything to help with the case where the source file has open handles.]

I'm worried about data integrity; specifically, we currently trust that hash_cache/metadata_cache/wheel_cache/http_cache/EnvForest are normative, so if a truncated entry ended up there then everything could become wedged permanently until someone manually clears out the caches. That would suck. (On the other hand, lack of durability is fine -- if a value gets lost, or gets truncated but we can detect that it's truncated and discard it, then that's OK; these things can all be reconstructed if needed.)

Blob storage

For the ones that store blobs, we might just want to use a full-fledged transactional store, like sqlite or bdb. Trade-offs:

  • Makes transactional integrity into Someone Else's Problem
  • At least sqlite (in WAL mode) can dramatically reduce the number of fsync's (one per WAL checkpoint, which doesn't even have to happen on every run), if you don't need durability, which we don't
  • Probably suboptimal performance for large files (hash_cache, wheel_cache). For metadata_cache and http_cache sqlite should handle them fine, possibly as BLOBs.
  • Makes insert/modify/delete all safe, but doesn't have locking to prevent redundant work (though this can be done separately)
  • Requires fiddling with SQL or whatever

The main alternative is to do something like we're doing, with one on-disk file per value, which has a few challenges.

For integrity, files require either fsync and then atomic rename, or else some sort of checksum verification so we can detect and discard corrupted values. Neither is super attractive...

fsync can make writes pretty expensive, and on Windows atomic rename can be thwarted by open handles. (I guess you can sleep and retry?) Though Windows does have CreateHardLink, so you could at least link the file into place and then worry about deleting the original tmp file later, opportunistically. (And this can even overwrite an existing file if you use NtSetInformationFile + FILE_LINK_INFORMATION + FILE_LINK_REPLACE_IF_EXISTS + FILE_LINK_POSIX_SEMANTICS.)

Checksums make writes easy and fast, but then when you open the file again you have to read through the whole thing to validate the checksum before you know whether you have a file at all. Fast checksums can be very fast (on my laptop even sha256 goes at ~1.6 GB/s according to openssl speed, and I assume crc32c would be even faster), but that's still an extra human-perceptible lag for multi-hundred-megabyte GPU wheels, and extra I/O. (Which might get hidden by caching, if the whole file fits in cache and the OS doesn't activate drop-behind logic for sequential scans and if we're going to read the whole file anyway, like we usually will for artifacts.)

I guess another option to avoid rename on Windows would be: write the file directly to its final name, fsync, and then put a marker file next to it to record that the main file is complete and trustworthy.

Or, we can combine them: write the file to disk, fsync, and then commit a record in sqlite saying that it's there and valid.

Finally, there's the question of garbage collection: for files, I think this is actually pretty easy? Unix and Windows do both support deleting a file while letting current readers continue (for Windows you need a magic POSIX_SEMANTICS flag but it's there).
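
On Unix this is just ordinary unlink semantics: an open file descriptor keeps the data alive after the name is gone, so GC can delete an entry out from under an in-progress read. (On Windows the equivalent needs the POSIX_SEMANTICS disposition flag mentioned above.) A small Unix-only demonstration:

```python
import os
import tempfile

root = tempfile.mkdtemp()
path = os.path.join(root, "artifact")
with open(path, "wb") as f:
    f.write(b"big wheel")

reader = open(path, "rb")   # an in-progress incremental read
os.unlink(path)             # GC removes the entry...
data = reader.read()        # ...but the reader still sees the full contents
reader.close()
assert data == b"big wheel"
assert not os.path.exists(path)
```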

Tentatively, it seems like we might want to use sqlite for metadata_cache and http_cache, and files for hash_cache and wheel_cache.

Directories

build_store isn't a big deal, because usage is restricted to a single process. We can just create directories and write to them whenever we like. We might specifically want to avoid renaming the directory into place, to avoid Windows issues. If we add multi-threading in the future then we'll probably want some kind of locking. But that's about it.

EnvForest OTOH is... idk, maybe intractable, in two different ways:

  • When unpacking wheels/pybis into it, the only way to guarantee integrity is to fsync every file and directory. Ugh! (Well I guess on Unix we could also call sync(2) but that's gross too.) ...I guess an alternate Overly Clever solution would be to do direct I/O and regain the performance by aggressively dispatching the IO through io_uring/IOCP/threads. Maybe if you use threads to aggressively call fsync on lots of files in parallel it ends up being not so bad? Are modern FSes all clever enough to batch the journal transactions? idk. Maybe we should just give up on integrity here.

    • On the upside (?) though, the write-then-rename code we're currently using is actually unnecessary, because we hold a lock while writing the directory! Which potentially makes Windows support way easier. As long as we're prepared to handle the case where our process dies half-way through writing things out.
  • For GC'ing stuff: we have no reliable way to know whether an entry is in use. I guess when python starts up we could have it take a read-lock on all the entries it's using? But an entry can still be in use even if no python process is running, because it could be lurking in an environment variable in a non-python process (e.g. a shell), and then later it spawns a python that will expect all the entries to be there. I guess in this case the python process could at least detect at startup if anything is missing, and fail noisily? Or if we want to be ridiculously clever, it could invoke posy to fill the cache again before continuing...

Experimentally, it looks like on Windows:

  • FILE_RENAME_FLAG_POSIX_SEMANTICS lets you rename one file over another, even if both the source and destination files have other open handles
  • it also lets you rename a directory... but not if that directory has files inside it that have open handles

Windows behaviour varies by version, so before drawing inferences about locking etc. behaviour, you should test on the oldest version you want to support.

Windows 10: yeah, sadly many enterprises won't be on evergreen Windows 10. There's no commonly documented oldest version to support; I suggest you don't support anything older than the Rust minimum version ([which is still Windows 7](https://doc.rust-lang.org/nightly/rustc/platform-support.html)), but something newer than that, though perhaps not 'latest Windows 10'. I've had surprises where folk have a bug and the cause has been 'they work at a company'.

Some more thoughts:

  • hash cache redundant work: network is relatively cheap, but can fail; is it worth the complexity of locking things that don't exist to prevent concurrent downloads?

File system atomicity is hard. We spent a lot of time on this in bzr.

  • Anything on NFS is a crapshoot. Rather than changes landing in the page cache and a nice orderly write-ahead log, operations proceed in arbitrary order depending on the network protocol and conditions in play. Even close can block :/.
  • Renaming of directories has no consistent semantics. Dir onto empty dir works in some situations, not in others; dir onto full dir I think reliably fails, but that's about it. The presence of open files matters except when it doesn't. Arrgh.
  • Win32 APIs are awful for this; NT kernel APIs are decent, but there's a chunk of work needed to get access to them. Most language runtimes use Win32 APIs.
  • POSIX_SEMANTICS on Windows keeps the replaced file alive similar to Linux, but unlike Linux appears to keep the directory pinned still - it's not unlinked but rather not named; when the last handle goes, then it gets unlinked. Or something.

I think a useful thing to think about is your threat model. Cache poisoning is a classic way to make everyones life hard. How will you deal with a hostile attacker who inserts malicious content at a known key in your cache?

OS caching. Hah! Windows doesn't cache like you might think. Assume that every read will be actual IO, except for directory metadata. (This is down to the very different structure for IO: rather than the page cache owning IO, processes own IO. Then when the process is killed, the IO can all get unwound. This is why virus scanners that are scanning files written by a process can cause the process to suffer IO latency even when there is tonnes of memory etc to buffer the files.) I [did a talk on this](https://www.youtube.com/watch?v=qbKGw8MQ0i8).

Fsync: for rustup we don't fsync most files we write: a machine crash can be recovered from by removing the toolchain and reinstalling it, and machine crashes are very rare. We do rename-into-place [not per file, but per tree], to avoid partial writes being visible to readers. Files like the config and root metadata files we do fsync. This seems to work ok. Rustup is missing some locking, specifically because we don't detect-and-merge concurrent component changes (e.g. add rustfmt and rust-analyzer from separate concurrent invocations of rustup; one will get lost).

Now, less doom and gloom, some suggestions:

  • consider something like the dockerd model : have posy's core be a small API that manages its state, and is a single process. You can put a lot of concurrency complexity behind in-memory data structures, and this could even help you detect 'environment in use' if you structure the API appropriately [e.g. if posyd has to be called into to launch a process in an environment, you can track last-used datestamps].

You probably wouldn't want such a daemon to be a pseudo-init, though that's a possible version of the design.

  • consider WAL journalling : you can mark intents, particularly for deletes, and then process them in an eventually consistent fashion. Deleting large trees can be a bit slow, so you may well want to do that in a non-UI-context anyway. [there are of course multiple crates to help here :) ].

  • for unpack performance, see the video about rustup and Windows Defender; the tl;dr is:

    • use threads for blocking operations
    • stat is a blocking operation (e.g. NFS)
    • readdir is a blocking operation (big trees)
    • CloseHandle/close/File::drop() is also a blocking operation
    • if possible don't unpack HTML / JS files to disk; Windows Defender will make you pay a CPU tax in the first instance, and if you're fast enough, it will throttle IO as its scan backlog builds up... but you can apply to get your binary whitelisted for scan-on-read rather than scan-before-write, once you can demonstrate the problem and its relevance to users of your binary.

(Images) The free dev images are certainly just evergreen. Possibly an MSDN membership would get you older versions to test with. Guido or Steve Dower might be able to arrange something?