andrewchambers / bupstash

Easy and efficient encrypted backups.

Home Page: https://bupstash.io


Data corruption error - (possibly from stale caches)

AndrolGenhald opened this issue

I was doing a test to see if bupstash would work for me, and my restore command simply hangs partway through: the timer keeps counting and it says "fetching files...", but nothing happens.

I must have gotten exceptionally (un)lucky on my first try. I ran a script on a tmpfs that generated, backed up, and restored 1GiB of random data at a time, over a terabyte in total, and the hang never happened again.
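
The stress loop was roughly along these lines (a minimal sketch only; the paths, key file, item names, and iteration count are illustrative, and it assumes /tmp is a tmpfs and that restore lays the snapshot's contents directly under --into):

set -e
mkdir -p /tmp/stress/data /tmp/stress/restore       # assumes /tmp is a tmpfs
bupstash new-key -o /tmp/stress/test.key
bupstash init -r /tmp/stress/repo
for i in $(seq 1 1024); do                           # ~1TiB in 1GiB rounds
    dd if=/dev/urandom of=/tmp/stress/data/blob bs=1M count=1024 status=none
    bupstash put -k /tmp/stress/test.key -r /tmp/stress/repo name=stress-$i /tmp/stress/data
    rm -rf /tmp/stress/restore && mkdir /tmp/stress/restore
    bupstash restore -k /tmp/stress/test.key -r /tmp/stress/repo --into /tmp/stress/restore name=stress-$i
    cmp /tmp/stress/data/blob /tmp/stress/restore/blob   # adjust the path if restore lays the tree out differently
done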

I was able to do a manual binary search on the original failure and whittle the 7GiB down to just 5 data files totaling about 3MiB.

Here's the backup and a script to reproduce it; I've tested it on multiple machines and it fails every time:
deadlock-repro.tar.gz

# After downloading deadlock-repro.tar.gz
tar -xzf deadlock-repro.tar.gz
cd bupstash-deadlock-repro
bupstash restore -k test.key -r backup --into restore name=test

Running strace on it shows it waiting on a futex call.
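
(For anyone who wants to look at the hang themselves, attaching strace to the running process is enough to see it; the pgrep filter below is just one convenient way to find the pid:)

strace -f -p "$(pgrep -n bupstash)"    # attach to the newest bupstash process and follow its threads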

Thank you so much - I will be able to dig into this over the weekend.

I added the critical-bug label because I noticed that bupstash is returning 'data corrupt' on 'bupstash get'. I can't imagine why this would happen - list-contents succeeds, which means something is potentially wrong with a data chunk in your repository.

@AndrolGenhald could you confirm which version of bupstash you are running, which platform you are on, and that the disk your repository is on has free space?

@AndrolGenhald Another question: you mentioned the backup had 5 files, but I see only one in the output of 'list-contents' - could you confirm whether the output of list-contents is incorrect? If possible, could you also send the original reduced files and the invocation used to make the backup?

> I added the critical-bug label because I noticed that bupstash is returning 'data corrupt' on 'bupstash get'. I can't imagine why this would happen - list-contents succeeds, which means something is potentially wrong with a data chunk in your repository.

Oh, that's actually worse than I thought. The obvious answer would be filesystem issues on my end, but I've reproduced it with separate keys and multiple separately inited repositories, although with a separate repository the reproduction isn't as consistent and it fails at a different point each time. I think this probably rules out filesystem corruption.

> @AndrolGenhald could you confirm which version of bupstash you are running, which platform you are on, and that the disk your repository is on has free space?

Freshly built 0.12.0 with cargo build yesterday, on Ubuntu using ext4 with plenty of free space.

> @AndrolGenhald Another question: you mentioned the backup had 5 files, but I see only one in the output of 'list-contents' - could you confirm whether the output of list-contents is incorrect? If possible, could you also send the original reduced files and the invocation used to make the backup?

Sorry, that was a bit unclear; I was referring to the data directory in the repository. There is only one file in the backup itself.

It feels like you might have hit some corner-case interaction with the deduplication that wasn't caught by the existing fuzzing - could you upload the original test file and the steps you use to reproduce?

Other possible culprits are a send-log that has somehow become incorrect, or a race condition in the multithreading.

The reproduction is less consistent when using a new repository, and it tends to fail at a different spot in the total 7GiB each time. I'll see if I can create a smaller reproduction from just a test file and a freshly inited repository tomorrow.

I don't think it's the send-log; I was able to reproduce it without using one.
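
(That is, the puts were along these lines, with the send-log disabled - a minimal sketch, assuming put's --no-send-log flag disables the send-log for that run; the key, repository, and file names are illustrative:)

bupstash put --no-send-log -k test.key -r backup name=test ./data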


I think some of the error handling could be improved a bit. I noticed a couple of times that the backup reported success but the repository was only 6GiB instead of 7GiB, which can't be due to compression since the data is random. I would expect an error if something fails to back up.

Yes - that is absolutely not expected. bupstash is very explicit about aborting on errors, with the exception of files it finds were modified or removed between indexing and reading.

I will update the test suite with any issues we can reproduce and issue a new release immediately once they are fixed.

It actually ended up being easier than I thought. The reproduction seems to depend on both the key and the inited repository, but I managed to reproduce it from a freshly inited backup repository with nothing in it, the key, and a 4MiB file.

corruption-repro-2.tar.gz

tar -xzf corruption-repro-2.tar.gz
cd corruption-repro-2
bupstash put -k test.key -r backup name=test data
bupstash list-contents -k test.key -r backup name=test

I end up with "no stored items with the requested id" and a 48KiB backup directory. It's not always consistent: it has worked correctly a few times, and one time list-contents listed the file but get gave a "No such file or directory" error.

We need to be careful when copying repositories (including with tar), as the send-logs and query cache in ~/.cache/bupstash/ will become out of date. The caches are tied to the id stored in ./backup/meta/gc_generation.

You can test this by using --query-cache ./t.qcache to make a fresh query cache for each test.
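
(Concretely, something like the following - a minimal sketch; the query cache path is arbitrary, and it assumes --query-cache is accepted by the query commands:)

# The caches in ~/.cache/bupstash/ are tied to this id, so a copied repository keeps the id the stale caches were built against
cat ./backup/meta/gc_generation
# Retry the failing query with a throwaway query cache instead of the default one
bupstash list-contents --query-cache ./t.qcache -k test.key -r backup name=test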

I think I will add a safety check to prevent that sort of issue. I am still not sure if this was the cause of your original corruption problem though.

Oh, that might be part of my issue then. I was copying repositories liberally and hadn't thought to check ~/.cache. I think there might still be a reproducible error even with a clean cache, though; I'm pretty sure I've run into issues directly after a bupstash init and bupstash put without ever copying the repository.

I will make sure the cache is invalidated if the inode of the repository changes - that should prevent this from ever being an issue. We can keep investigating if corruption happens again.
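
(For reference, the reason an inode check catches this: a copy gets new inodes even when its contents are byte-identical. A minimal sketch:)

stat -c %i ./backup          # inode of the original repository directory
cp -r ./backup ./backup-copy
stat -c %i ./backup-copy     # different inode, so caches built against the original can be detected as stale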

Related to #288

I've updated the ticket tags and name - I will add critical-bug again if we can reproduce corruption with a fresh repository and --no-send-log, or if we can fully rule out a stale send-log caused by copying. I also want to understand how a stale send-log could cause such a corruption error rather than just a 'chunk not found' type error.

Is it possible that using the same send-log with the main key and a sub-key, or with two different sub-keys, would cause corruption? I only noticed issues after testing out sub-keys, and I don't believe I removed the send-log in between. The caches should probably also be invalidated if they are used with a different key, if that doesn't happen already.
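
(A minimal sketch of keeping send-logs explicitly separate per key while this is investigated, assuming put's --send-log option; the key and send-log file names are hypothetical:)

bupstash put --send-log ./main.sendlog -k main.key -r backup name=test ./data      # main key, its own send-log
bupstash put --send-log ./sub.sendlog -k put-sub.key -r backup name=test ./data    # put sub-key, separate send-log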

It's worth investigating; there is a chance of a bug there - sub-keys are probably not tested as much as just using a single key. In general they should be in a completely different 'hash space' and never have colliding chunks.

I'm mostly convinced that my issues all stem from stale caches, but I'll continue testing next week to see if I can reproduce any issues without them. I think most people would expect stale caches not to cause problems, but with bupstash's security model a put sub-key actually has no way to tell whether the cache is valid, which makes things a bit more difficult.

Hopefully just checking the inode will be enough to prevent people from shooting themselves in the foot, but maybe something should be added to the documentation as well? Another possible issue is filesystems like sshfs, where the inode numbers could change each time the filesystem is mounted. The documentation does already make it clear that you shouldn't use sshfs to mount a bupstash repository, though, and the only real consequence there would be that the cache is useless, which wouldn't cause any corruption, just slower backups.

Very good points - I think the lack of documentation on this was definitely a problem. I also note that if you hit this issue, there might have been ten more people who did and never mentioned it. Ideally users should be totally protected from this sort of misuse, as it's beyond the level of understanding I expect of typical users.

I will have to think a bit about inodes that change; there might be other ways we can detect copies too (for example, the mtime of certain files that should never be modified manually).

I haven't been able to reproduce any corruption issues without using a stale cache, so I'll go ahead and close this. Thanks for all the help!

No problem, I will still add the cache fix to resolve the other ticket. I really appreciate bug reports.