Write out the DMA buffer partially?

Question

vlovich opened this issue 2 years ago

In normal I/O, if you wanted to make sure your data structure is aligned on some large boundary (e.g. 8 MiB) but you wanted to do smaller writes, you would simply write out however much data you have (aligned to 4 KiB boundaries for Direct I/O) and lseek to the next 8 MiB boundary.
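As a rough sketch of the offset arithmetic described above: round the filled length up to the Direct I/O block size, then skip to the next large boundary. The names `BLOCK`, `STRIDE`, `align_up`, and `plan_write` are hypothetical and are not part of glommio; the 4 KiB block size is assumed from the text and may differ per device.

```rust
/// Direct I/O block size assumed by this sketch (4 KiB).
const BLOCK: u64 = 4 * 1024;
/// Boundary on which every record starts (8 MiB).
const STRIDE: u64 = 8 * 1024 * 1024;

/// Round `n` up to the next multiple of `align`.
/// `align` must be a power of two.
fn align_up(n: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    (n + align - 1) & !(align - 1)
}

/// Given the start offset of a record and the number of bytes the user
/// actually filled, return `(bytes_to_write, offset_of_next_record)`.
fn plan_write(pos: u64, filled: u64) -> (u64, u64) {
    // Only the filled part is written, padded to the block size.
    let write_len = align_up(filled, BLOCK);
    // The next record starts at the following STRIDE boundary,
    // which leaves a hole after the written bytes.
    let next_pos = align_up(pos + write_len, STRIDE);
    (write_len, next_pos)
}
```

For example, with 128 KiB filled at offset 0, `plan_write(0, 128 * 1024)` yields a 128 KiB write and a next offset of 8 MiB.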

I recognize that glommio can't do this today, but I wonder if such functionality could be added. One possible way I'm thinking this could work is that I allocate an 8 MiB DMA buffer, but then if I fill it up only partially (e.g. 128 KiB), I can call .truncate on the DmaBuffer (which would require that the truncated length still has valid alignment) or call a write_at_partially so that I don't write the entire buffer to disk.

Vitali Lovich · Answer 1 · Sat Oct 22 2022 08:05:38 GMT+0800 (China Standard Time)

Exposing trim_to_size might be all that's needed, although I don't know if that'll actually work (haven't tried it out).

Glauber Costa · Answer 2 · Mon Oct 24 2022 23:51:16 GMT+0800 (China Standard Time)

Hello @vlovich.

This should definitely work with some variation of write_at and filesystem flags to extend the size. Are you talking specifically about the stream API?

Vitali Lovich · Answer 3 · Tue Oct 25 2022 02:57:54 GMT+0800 (China Standard Time)

No, direct I/O via DmaFile. I'm managing my own buffers directly, so the stream API isn't really useful for my purposes. Specifically, I have no problem creating holes with write_to. The problem is that I don't know what the actual size of the write will be at the time of allocation.

Basically I:
1. Allocate 8 MiB
2. User input fills up some portion of the 8 MiB
3. I write out the buffer

I want to change step 3 so that if the user only fills up 1 MiB, I only write out 1 MiB instead of 1 MiB of data & 7 MiB of zeroes. The next buffer would be written out at pos + 8 MiB, so the kernel would do all the right things to create an interim hole. I've tested that this part works; the missing piece is the ability to change the size of the write after step 2, since step 1 acquires an 8 MiB buffer.
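The three steps with the modified step 3 can be sketched with plain std positional writes. This is only an illustration under stated assumptions: it uses ordinary buffered I/O through `std::os::unix::fs::FileExt` rather than glommio's DmaFile or O_DIRECT, and `write_filled_prefixes` is a hypothetical helper, not an existing API.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

const MIB: u64 = 1024 * 1024;
const STRIDE: u64 = 8 * MIB;

/// For each entry in `filled`, allocate a full 8 MiB buffer but write only
/// its first `len` bytes, placing buffer `i` at offset `i * STRIDE`.
/// Returns the resulting file length.
fn write_filled_prefixes(path: &Path, filled: &[usize]) -> io::Result<u64> {
    let file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(path)?;
    for (i, &len) in filled.iter().enumerate() {
        // Step 1: the buffer is always allocated at full size.
        let buf = vec![0xAB_u8; STRIDE as usize];
        // Step 2 (user input) is simulated by `len`.
        let pos = i as u64 * STRIDE;
        // Step 3: write only the filled prefix; the gap up to the next
        // STRIDE boundary is never written.
        file.write_all_at(&buf[..len], pos)?;
    }
    Ok(file.metadata()?.len())
}
```

Calling it with two buffers that each hold 1 MiB gives a file whose length is 9 MiB (8 MiB offset plus the 1 MiB written there). Whether the unwritten 7 MiB gap is stored as a hole depends on the filesystem.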

Glauber Costa · Answer 4 · Tue Oct 25 2022 04:07:34 GMT+0800 (China Standard Time)

I think your best bet is a positional write followed (or preceded) by fallocate or ftruncate.

There is no such thing as unallocated space in a file in general from the VFS PoV. Individual filesystems may have optimizations like that, but if a file has a certain size, the filesystem will commit blocks to it.

Whether or not they get zeroed is a different matter, but they usually are - otherwise you would just access bytes from another application that may have released the file.

I'd encourage you to take a look at both ftruncate and fallocate (glommio exposes both) and figure out which works best. fallocate has more specialized modes that may not zero, but they are full of caveats.
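For reference, a minimal sketch of the "positional write followed by ftruncate" variant using only std, where `File::set_len` performs the truncate/extend on Unix. It does not use glommio's API, and `write_then_extend` is a hypothetical helper. It only shows that the extended range reads back as zeroes; it says nothing about how many blocks the filesystem commits for that range.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

/// Write `data` at offset 0, extend the file to `total_len`, and return
/// the new file length together with the bytes of the extended region.
fn write_then_extend(path: &Path, data: &[u8], total_len: u64) -> io::Result<(u64, Vec<u8>)> {
    if total_len < data.len() as u64 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "total_len is smaller than the data written",
        ));
    }
    let file = OpenOptions::new()
        .create(true)
        .read(true)
        .write(true)
        .truncate(true)
        .open(path)?;
    // Positional write of only the data that exists.
    file.write_all_at(data, 0)?;
    // Extend the file without writing the remaining bytes ourselves.
    file.set_len(total_len)?;
    // Read back everything past the written data.
    let mut rest = vec![0xFF_u8; (total_len - data.len() as u64) as usize];
    file.read_exact_at(&mut rest, data.len() as u64)?;
    Ok((file.metadata()?.len(), rest))
}
```

Writing 4 KiB and extending to 64 KiB, for instance, returns a length of 65536 and a 60 KiB region that reads as zeroes.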

Vitali Lovich · Answer 5 · Tue Oct 25 2022 07:07:19 GMT+0800 (China Standard Time)

That doesn't actually help because you're still going to get write amplification to the flash. Most Linux filesystems (XFS, ext4, btrfs AFAIK) will all write a much smaller amount of data to record the hole which would significantly mitigate the amplification.

Vitali Lovich · Answer 6 · Tue Oct 25 2022 07:30:50 GMT+0800 (China Standard Time)

Also, fallocate isn't exposed itself AFAICT. It's only exposed within the crate so that pre_allocate can invoke it.

pre_allocate itself has the surprising (wrong?) behavior that calling it on an existing file ends up erasing whatever is already in there, which isn't what fallocate is supposed to do:

"Any subregion within the range specified by offset and len that did not contain data before the call will be initialized to zero."