s3ql / s3ql

a full featured file system for online data storage

fix_block_sizes.py should check all blocks, not just those with size % 512 == 0

hwertz opened this issue · comments

I found that the new file size check added in 3.8.0 has been causing crashes for me. I did have files with the "rounded up to the next 512 bytes" corruption, which fix_block_sizes.py did indeed fix, but I still had intermittent crashes under heavy load afterwards. Since the size check is the only change since 3.7.3 that affects my use, I took a close look there.

I THINK what's happening is a short race condition (in block_cache.py's async def _get_entry) between tmpfh.flush() and file_size = os.fstat(tmpfh.fileno()).st_size. It seems obvious that flush should flush, but I found that both the flush() and fsync() code are wrapped in Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS blocks. The docs say these calls are synchronous, but given the ALLOW_THREADS wrapping and the calling function being asynchronous, I think Python was very occasionally doing the size check before the flush completed.

My proposed fix: I turned off write buffering on tmpfh. In this case tmpfh is written to by shutil.copyfileobj, which already reads in BUFSIZE-sized (64 KB) chunks, so turning off write buffering should not result in comically small writes or anything else that would slow it down.
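
Roughly, the idea is this (a sketch only, not the actual block_cache.py code; the function name and BUFSIZE constant here are just for illustration):

    import os
    import shutil

    BUFSIZE = 64 * 1024  # matches the 64 KB copy chunks mentioned above

    def copy_to_unbuffered_tmp(src_fh, tmp_path):
        # buffering=0 makes every write() go straight to the OS, so there is
        # nothing sitting in a Python-level buffer for a later size check to miss.
        with open(tmp_path, 'wb', buffering=0) as tmpfh:
            shutil.copyfileobj(src_fh, tmpfh, BUFSIZE)
            return os.fstat(tmpfh.fileno()).st_size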

I briefly tested moving the fsync() up so it sits right below the flush() but above the size check; that also appeared to work (I suppose anything happening between flush() and the size check gives flush() time to finish). So if there is a good reason tmpfh must stay buffered (some filesystem besides ext4 reacting badly, or similar), that also seemed to be a viable fix.
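
That alternative ordering, as a sketch (again not the actual s3ql code, just the shape of the idea):

    import os

    def size_after_sync(tmpfh):
        tmpfh.flush()                     # flush Python-level buffers to the OS
        os.fsync(tmpfh.fileno())          # then force OS buffers out to disk
        return os.fstat(tmpfh.fileno()).st_size   # only now record the size
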
Thanks!
--Henry

(Why on earth can't I just attach a .patch? Anyway, here's the patch in a .zip file; it disables tmpfh buffering.)
38-fix.zip

Thank you for reporting this and looking into it!

Do you mean this code? https://github.com/s3ql/s3ql/blob/master/src/s3ql/block_cache.py#L728

I'm afraid this cannot be the source of the crash. Only one thread executes this code at a time (with tmpfh being private to that thread), so there is no way for fstat to be called before flush has returned. The fact that other threads may run while flush is active does not mean that the thread calling flush can just "skip ahead".

I think you will find that even with the patch you will still get crashes. If not, then this is a side effect of the changed timings (unbuffered writes being slower) avoiding a different race condition.

Unfortunately I have no idea right now where the root of the problem might be.

Can you check whether this is always happening for the same block, or for different blocks? And in particular, whether a block that has once triggered the problem always triggers it, or not?

You're right! I started running s3ql_verify --data, and I have all these blocks (from who knows what s3ql version; Ubuntu shipped 3.3.x) where, if the block ends in NULs, those NULs are not counted in the recorded size. I pulled a few data blocks. For instance, one that appears to be a small .jar file is 210 bytes uncompressed, but its size is recorded as 207; it does have 3 NULs at the end, and it doesn't unzip if I remove those NULs, so the recorded size is wrong rather than NULs having been padded onto the end. It's not just a 1-3 byte thing, either; one file I found has a size that's off by 212 bytes.
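
The comparison amounts to something like this (just a sketch: it assumes zlib compression, as in the test above, and that the object header has already been stripped off the payload; check_block is a made-up helper name):

    import zlib

    def check_block(payload: bytes, recorded_size: int) -> None:
        # payload: compressed tail of an s3ql_data_ object, header removed
        data = zlib.decompress(payload)
        trailing_nuls = len(data) - len(data.rstrip(b'\0'))
        print(f'actual={len(data)} recorded={recorded_size} '
              f'trailing NULs={trailing_nuls}')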

I did verify this is not a current bug. I copied a few of the s3ql_data_... files flagged by s3ql_verify into /tmp/ and ran the tail end of each through zlib.decompress. I copied the results back into the original s3ql file system (using cp, cat, "rsync -avP", and "rsync -avPS", in case sparse-file handling was triggering something -- I did use "-S" for some backups to these file systems in the past), then unmounted, mounted, and read them back. This crashes (since the files deduplicate against the original blocks, whose sizes are wrong in the table). Copying them into a different s3ql file system instead, then unmounting, mounting, and reading back, they come back fine.

My plan is to take the "if cur_size % 512 != 0: return False" out of fix_block_sizes.py, so it fixes the size on all my blocks and not just the 512-byte-multiple ones. I'll post back to confirm whether this solves my issue -- it'll be a while, though: I have a 4 TB desktop USB drive, an 8 TB desktop USB drive (only half full, but still), and my 4 TB portable is plugged in there too. All three are at least USB 3, but that's a lot of GBs to pull off spinning rust.
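
Concretely, the guard in question looks something like this (paraphrased rather than copied from fix_block_sizes.py; the scan_all flag is just my illustration of the before/after behaviour):

    def should_check(cur_size: int, scan_all: bool) -> bool:
        # Stock behaviour: skip blocks whose recorded size is not a multiple
        # of 512, i.e. only look for the "rounded up to 512" corruption.
        if not scan_all and cur_size % 512 != 0:
            return False
        # Proposed behaviour (scan_all=True): verify every block, which also
        # catches the older "trailing NULs not counted" corruption.
        return True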

Must say, s3ql is awesome! Having 6 TB of data on a 4 TB disk, with 1 TB still free (on the real disk), is pretty awesome, performance is good, and it sure is resilient. I get the usual relatively low USB reliability on my portable drive (intermittent bad cables, cables knocked out or loose, unclean shutdowns, a few "no reason at all" USB bus resets, and so on) and have not lost a single file other than the obvious "file that was mid-copy". Keep up the good work!

Confirmed. I ran fix_block_sizes.py without the "if cur_size % 512 != 0: return False" so it would scan all blocks (actually, since I'd already run the stock fix_block_sizes.py a few days ago, I changed it to "if cur_size % 512 == 0: return False" so it wouldn't waste time re-scanning the blocks it had just scanned a day or two earlier). It found 20 or 30 bad blocks on two filesystems and around 440 on the other. I rebuilt s3ql-3.8.0 with my "38-fix" patch removed, and I've been putting these filesystems through their paces since -- no crashes! Thank goodness; I could have switched back to 3.7.3, but it would have bothered my sensibilities as a Python programmer to think it was playing fast and loose with returning early from flush() calls and such (edit: and to think there was some obscure race condition floating around in the s3ql code).

All the blocks it fixed were in roughly the first 50-60% of the blocks; I built one of the filesystems in May 2020 (per the dates on s3ql_data_) and pumped several TB of data into it then, using whatever s3ql-3.3.2-or-so version Ubuntu shipped with. (My .debs for s3ql-3.7.2, Trio, and Pyfuse3 are from May 2021, so I was running 3.3.x up until around then.) This suggests the bug was present in some 3.3.x version but not in 3.7.x.

Incidentally, this obviously depends on the proportion of small versus larger files, but in my case I have quite a few larger files in my s3ql filesystems (and so lots of size-divisible-by-512 10 MB blocks). Based on the time my first (512-multiple-only) fix_block_sizes.py run took versus the much faster second (512-byte-multiples-excluded) run, I estimate a full block scan would only add about 10% to the runtime compared to the existing 512-multiple-only scan.

Conclusion: I suggest removing the "if cur_size % 512 != 0: return False" from fix_block_sizes.py, since it then fixes both the more recent size bug and the earlier one.
Thanks!
--Henry

Thanks for investigating! Yes, the patch sounds reasonable. Would you mind submitting a pull request?

I was mistaken; apparently the problem with sizes of files ending in NULs is still present. I just got a crash with a size mismatch on data involving blocks I had just written out (the s3ql_data_ block involved is dated today!). In this case, the block I caught (based on the mount.log message) was an empty VirtualBox snapshot, a 2 MB file with lots of NULs at the end, so the DB-reported size is off by over 500 KB! 8-) Thank goodness fix_block_sizes.py has a --start-with option so I can skip scanning something like 99% of my blocks! (This also means I have an easy test case now: I can pick any VM stored on the s3ql filesystem in VirtualBox, choose "Take snapshot", unmount and mount to clear the cache, and then open the VM (or, I think, even just select it in the list); it reads the 2 MB file back in... or attempts to, and crashes the FS.)

I'm going to take a real close look at the code involved in calculating the values written to blocks.size in the DB, and possibly at the sparse-file handling (I don't know whether VBox is writing 500 KB of NULs, or doing a seek() or truncate() or something else that results in them). I'll get back to you, thanks!

I've found the root cause. I had an empty s3ql filesystem on here (so unmounts, fscks, etc. take no more than a fraction of a second): I could create a VM in VirtualBox, unmount, remount, and as soon as VirtualBox went to touch it, it would read the 2 MB file (which in this case was recorded as something like 1.1 MB) and crash. I think the large number of older bad blocks, and no recent ones, was because I had been using rsync -S ("sparse"), which apparently sparses even a few bytes, but at some point I quit using it (it slowed rsync down a fair bit and is pretty useless when s3ql compresses the space out anyway).
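
For a quicker reproducer than a full VirtualBox VM, writing a little data and then growing the file with truncate() before closing should exercise the same pattern (sketch only; the path is hypothetical):

    import os

    path = '/mnt/s3ql/truncate-test'      # hypothetical file on the s3ql mount

    with open(path, 'wb') as fh:
        fh.write(b'some real data')
        fh.truncate(2 * 1024 * 1024)      # grow to 2 MiB of trailing NULs...
    # ...then close without any further write. After unmount/mount, reading the
    # file back shows whether the recorded block size matches the real 2 MiB.
    print(os.path.getsize(path))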

I'll put in a proper pull request over the next day or two, but in short: in s3ql/block_cache.py, in truncate (around line 96), truncate can shrink a file OR expand it. So I took the "if self.pos < self.size:" and "elif size < self.size:" lines and replaced "<" with "!=". rsync (with -S) and VirtualBox apparently use truncate followed directly by close() for NULs at the end of a file; the code in write means any write after the truncate results in the correct size, but if the file is truncated and then closed, the size was not corrected.
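
As a self-contained sketch of the change (paraphrasing the quoted lines rather than copying block_cache.py verbatim; the class here is heavily simplified):

    import os
    import tempfile

    class CacheEntrySketch:
        # Simplified stand-in for the cache entry: it tracks pos/size itself.
        def __init__(self, fh):
            self.fh = fh
            self.pos = 0
            self.size = 0

        def write(self, buf):
            self.fh.write(buf)
            self.pos += len(buf)
            self.size = max(self.size, self.pos)  # any later write fixes size

        def truncate(self, size=None):
            self.fh.truncate(size)
            if size is None:
                if self.pos != self.size:   # was: self.pos < self.size
                    self.size = self.pos
            elif size != self.size:         # was: size < self.size
                self.size = size

    # With '<', growing the file via truncate() and then closing it would have
    # left self.size at the old, too-small value; with '!=' it gets updated.
    with tempfile.TemporaryFile() as fh:
        entry = CacheEntrySketch(fh)
        entry.write(b'data')
        entry.truncate(2 * 1024 * 1024)
        assert entry.size == 2 * 1024 * 1024
        assert os.fstat(fh.fileno()).st_size == 2 * 1024 * 1024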

Thanks!
--Henry