andrewchambers / bupstash

Easy and efficient encrypted backups.

Home Page: https://bupstash.io


Data corruption error - (possibly from stale caches)

AndrolGenhald opened this issue

I was doing a test to see if bupstash would work for me, and my restore command simply hangs partway through: the timer keeps counting and it says "fetching files...", but nothing happens.

I must have gotten exceptionally (un)lucky on my first try. I ran a script on a tmpfs that generated, backed up, and restored 1GiB of random data at a time, over a terabyte in total, and the hang never happened again.
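
The stress loop was roughly along these lines (a minimal sketch only; the paths, key file, item names, and iteration count are illustrative, and it assumes /tmp is a tmpfs and that restore lays the snapshot's contents directly under --into):

set -e
mkdir -p /tmp/stress/data /tmp/stress/restore       # assumes /tmp is a tmpfs
bupstash new-key -o /tmp/stress/test.key
bupstash init -r /tmp/stress/repo
for i in $(seq 1 1024); do                           # ~1TiB in 1GiB rounds
    dd if=/dev/urandom of=/tmp/stress/data/blob bs=1M count=1024 status=none
    bupstash put -k /tmp/stress/test.key -r /tmp/stress/repo name=stress-$i /tmp/stress/data
    rm -rf /tmp/stress/restore && mkdir /tmp/stress/restore
    bupstash restore -k /tmp/stress/test.key -r /tmp/stress/repo --into /tmp/stress/restore name=stress-$i
    cmp /tmp/stress/data/blob /tmp/stress/restore/blob   # adjust the path if restore lays the tree out differently
done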

I was able to do a manual binary search on the original failure and whittle the 7GiB down to just 5 data files totaling about 3MiB.

Here's the backup and a script to reproduce it; I've tested it on multiple machines and it fails every time:
deadlock-repro.tar.gz

# After downloading deadlock-repro.tar.gz
tar -xzf deadlock-repro.tar.gz
cd bupstash-deadlock-repro
bupstash restore -k test.key -r backup --into restore name=test

Running strace on it shows it waiting on a futex call.
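
(For anyone who wants to look at the hang themselves, attaching strace to the running process is enough to see it; the pgrep filter below is just one convenient way to find the pid:)

strace -f -p "$(pgrep -n bupstash)"    # attach to the newest bupstash process and follow its threads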

Thank you so much - I will be able to dig into this over the weekend.

I added the critical-bug label because I noticed that bupstash is returning 'data corrupt' on 'bupstash get'. I can't imagine why this would happen - list-contents succeeds, which means something is potentially wrong with a data chunk in your repository.

@AndrolGenhald could you confirm which version of bupstash you are running, which platform you are on, and that the disk your repository is on has free space?

@AndrolGenhald Another question: you mentioned the backup had 5 files, but I see only one in the output of 'list-contents' - could you confirm whether the output of list-contents is incorrect? If possible, could you also send the original reduced files and the invocation used to make the backup?

> I added the critical-bug label because I noticed that bupstash is returning 'data corrupt' on 'bupstash get'. I can't imagine why this would happen - list-contents succeeds, which means something is potentially wrong with a data chunk in your repository.

Oh, that's actually worse than I thought. The obvious answer would be filesystem issues on my end, but I've reproduced it with separate keys and multiple separately inited repositories, although with a separate repository the reproduction isn't as consistent and it fails at a different point each time. I think this probably rules out filesystem corruption.

> @AndrolGenhald could you confirm which version of bupstash you are running, which platform you are on, and that the disk your repository is on has free space?

Freshly built 0.12.0 with cargo build yesterday, on Ubuntu using ext4 with plenty of free space.

> @AndrolGenhald Another question: you mentioned the backup had 5 files, but I see only one in the output of 'list-contents' - could you confirm whether the output of list-contents is incorrect? If possible, could you also send the original reduced files and the invocation used to make the backup?

Sorry, that was a bit unclear; I was referring to the data directory in the repository. There is only one file in the backup itself.

It feels like you might have hit some corner-case interaction with the deduplication that wasn't caught by the existing fuzzing - could you upload the original test file and the steps you use to reproduce?

Other possible culprits are a send-log that has somehow become incorrect, or a race condition in the multithreading.

The reproduction is less consistent when using a new repository, and it tends to fail at a different spot in the total 7GiB each time. I'll see if I can create a smaller reproduction from just a test file and a freshly inited repository tomorrow.

I don't think it's the send-log; I was able to reproduce it without using one.
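
(That is, the puts were along these lines, with the send-log disabled - a minimal sketch, assuming put's --no-send-log flag disables the send-log for that run; the key, repository, and file names are illustrative:)

bupstash put --no-send-log -k test.key -r backup name=test ./data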


I think some of the error handling could be improved a bit. I noticed a couple of times that the backup reported success but the repository was only 6GiB instead of 7GiB, which can't be due to compression since the data is random. I would expect an error if something fails to back up.

Yes - that is absolutely not expected. bupstash is very explicit about aborting on errors, with the exception of files it finds were modified or removed between indexing and reading.

I will update the test suite with any issues we can reproduce and issue a new release immediately once they are fixed.

It actually ended up being easier than I thought. The reproduction seems to depend on both the key and the inited repository, but I managed to reproduce it from a freshly inited backup repository with nothing in it, the key, and a 4MiB file.

corruption-repro-2.tar.gz

tar -xzf corruption-repro-2.tar.gz
cd corruption-repro-2
bupstash put -k test.key -r backup name=test data
bupstash list-contents -k test.key -r backup name=test

I end up with "no stored items with the requested id" and a 48KiB backup directory. It's not always consistent: it has worked correctly a few times, and one time list-contents listed the file but get gave a "No such file or directory" error.

We need to be careful when copying repositories (including with tar), as the send-logs and query cache in ~/.cache/bupstash/ will become out of date. The caches are tied to the id stored in ./backup/meta/gc_generation.

You can test this by using --query-cache ./t.qcache to make a fresh query cache for each test.
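
(Concretely, something like the following - a minimal sketch; the query cache path is arbitrary, and it assumes --query-cache is accepted by the query commands:)

# The caches in ~/.cache/bupstash/ are tied to this id, so a copied repository keeps the id the stale caches were built against
cat ./backup/meta/gc_generation
# Retry the failing query with a throwaway query cache instead of the default one
bupstash list-contents --query-cache ./t.qcache -k test.key -r backup name=test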

I think I will add a safety check to prevent that sort of issue. I am still not sure if this was the cause of your original corruption problem though.

Oh, that might be part of my issue then. I was copying repositories liberally and hadn't thought to check ~/.cache. I think there might still be a reproducible error even with a clean cache, though; I'm pretty sure I've run into issues directly after a bupstash init and bupstash put without ever copying the repository.

I will make sure the cache is invalidated if the inode of the repository changes - that should prevent this from ever being an issue. We can keep investigating if corruption happens again.
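
(For reference, the reason an inode check catches this: a copy gets new inodes even when its contents are byte-identical. A minimal sketch:)

stat -c %i ./backup          # inode of the original repository directory
cp -r ./backup ./backup-copy
stat -c %i ./backup-copy     # different inode, so caches built against the original can be detected as stale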

Related to #288

I've updated the ticket tags and name - I will add critical-bug again if we can reproduce corruption with a fresh repository and --no-send-log, or if we can fully rule out a stale send-log caused by copying. I also want to understand how a stale send-log could cause such a corruption error rather than just a 'chunk not found' type error.

Is it possible that using the same send-log with the main key and a sub-key, or with two different sub-keys, would cause corruption? I only noticed issues after testing out sub-keys, and I don't believe I removed the send-log in between. The caches should probably also be invalidated if they are used with a different key, if that doesn't happen already.
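
(A minimal sketch of keeping send-logs explicitly separate per key while this is investigated, assuming put's --send-log option; the key and send-log file names are hypothetical:)

bupstash put --send-log ./main.sendlog -k main.key -r backup name=test ./data      # main key, its own send-log
bupstash put --send-log ./sub.sendlog -k put-sub.key -r backup name=test ./data    # put sub-key, separate send-log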

It's worth investigating; there is a chance of a bug there - sub-keys are probably not tested as much as just using a single key. In general they should be in a completely different 'hash space' and never have colliding chunks.

I'm mostly convinced that my issues all stem from stale caches, but I'll continue testing next week to see if I can reproduce any issues without them. I think most people would expect stale caches not to cause problems, but with bupstash's security model a put sub-key actually has no way to tell whether the cache is valid, which makes things a bit more difficult.

Hopefully just checking the inode will be enough to prevent people from shooting themselves in the foot, but maybe something should be added to the documentation as well? Another possible issue is filesystems like sshfs, where the inode numbers could change each time the filesystem is mounted. The documentation does already make it clear that you shouldn't use sshfs to mount a bupstash repository, though, and the only real consequence there would be that the cache is useless, which wouldn't cause any corruption, just slower backups.

Very good points - I think the lack of documentation on this was definitely a problem. I also note that if you hit this issue, there might have been ten more people who did and never mentioned it. Ideally users should be totally protected from this sort of misuse, as it's beyond the level of understanding I expect of typical users.

I will have to think a bit about inodes that change; there might be other ways we can detect copies too (for example, the mtime of certain files that should never be modified manually).

I haven't been able to reproduce any corruption issues without using a stale cache, so I'll go ahead and close this. Thanks for all the help!

No problem, I will still add the cache fix to resolve the other ticket. I really appreciate bug reports.