treatment of on-disk segments as "what was written by programs" can cause areas of `0` to not be written by `bmaptool copy`

Question

codyps opened this issue 4 years ago

I've run into this while working with bmaptool as a component in wic (the OpenEmbedded image creator) when using ZFS as the filesystem.

The key issue is this:

- SEEK_HOLE tells us where the holes in a file are; SEEK_DATA tells us where the non-holes are.
- Holes correspond to runs of zeros.
- Holes don't necessarily correspond to unwritten regions; they may also correspond to regions that were written with just zeros.
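
As a minimal sketch of those semantics (the helper name `next_data_extent` is made up for illustration and is not part of bmaptool), the following C function asks the filesystem for the next region it reports as data. It can only see what the filesystem chooses to report; it cannot tell zeros that were written apart from a region that was never written.

```c
#define _GNU_SOURCE 1
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of bmaptool): find the next region the
 * filesystem reports as data at or after `from`.
 * Returns 1 and fills [*start, *end) on success, 0 if only a hole remains
 * up to end of file (lseek failed with ENXIO), and -1 on any other error.
 */
int next_data_extent(int fd, off_t from, off_t *start, off_t *end)
{
	/* SEEK_DATA: first offset >= from that is reported as data. */
	off_t data = lseek(fd, from, SEEK_DATA);
	if (data < 0)
		return errno == ENXIO ? 0 : -1;

	/* SEEK_HOLE: first offset >= data that is reported as a hole.
	 * End of file counts as an implicit hole, so this finds an end. */
	off_t hole = lseek(fd, data, SEEK_HOLE);
	if (hole < 0)
		return -1;

	*start = data;
	*end = hole;
	return 1;
}
```

Calling it in a loop, with `from` set to the previous `*end`, walks every reported data extent until it returns 0.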

In my specific case, I've been using zero-filled files to ensure that some partitions (as configured via wic) would be zeroed. Unfortunately, they aren't zeroed on my system (which uses ZFS as the storage filesystem).

Everything may work fine on other filesystems that happen to implement HOLE/DATA information/storage differently. But it seems like bmaptool may be relying on filesystem-specific behavior in how HOLE/DATA is handled.

While examining this issue, I wrote some code that uses SEEK_HOLE and SEEK_DATA to examine the behavior wrt zero writes (and how files are segmented).

On a ZFS filesystem, after one large seek followed by a write of a single zero byte, SEEK_DATA reports the entire file as a hole (via the ENXIO return). Similarly, even if there is data at the start of the file, the zero byte written at the end is not reported as DATA.

sparse-demo.c

#define _GNU_SOURCE 1
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
	unlink("tmp.bin");
	int fd = open("tmp.bin", O_RDWR|O_CREAT, 0666);
	if (fd < 0) {
		fprintf(stderr, "E: could not open tmp.bin: %s\n", strerror(errno));
		return -1;
	}

	{
		unsigned char buf[1] = { 1 };
		ssize_t rs = write(fd, buf, sizeof(buf));
		if (rs < 0) {
			fprintf(stderr, "E: could not write 1 byte (1): %s\n", strerror(errno));
			return -1;
		}
	}

	off_t soffs = lseek(fd, 10ull * 1024 * 1024 * 1024, SEEK_SET);
	if (soffs < 0) {
		fprintf(stderr, "E: could not seek 1: %s\n", strerror(errno));
		return -1;
	}

	{
		unsigned char buf[1] = { 0 };
		ssize_t rs = write(fd, buf, sizeof(buf));
		if (rs < 0) {
			fprintf(stderr, "E: could not write 1 byte (2): %s\n", strerror(errno));
			return -1;
		}
	}

	soffs = lseek(fd, 0, SEEK_SET);
	if (soffs < 0) {
		fprintf(stderr, "E: could not seek to start: %s\n", strerror(errno));
		return -1;
	}

	off_t c_offs = 0;
	for (;;) {
		soffs = lseek(fd, c_offs, SEEK_DATA);
		if (soffs < 0) {
			if (errno == ENXIO) {
				printf("ENXIO\n");
				soffs = lseek(fd, 0, SEEK_END);
				if (soffs < 0) {
					fprintf(stderr, "E: could not find end after ENXIO: %s\n", strerror(errno));
					return -1;
				}
				printf("END: %ju\n", (uintmax_t)soffs);
				return 0;
			} else {
				fprintf(stderr, "E: seek data at offs %ju failed: %s\n", (uintmax_t)c_offs, strerror(errno));
				return -1;
			}
		}

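		/* SEEK_DATA returned the start of the next data region, so
		 * [c_offs, soffs) is the gap skipped before that data. */
		printf("DATA: %ju to %ju\n", (uintmax_t)c_offs, (uintmax_t)soffs);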
		c_offs = soffs;

		soffs = lseek(fd, c_offs, SEEK_HOLE);
		if (soffs < 0) {
			fprintf(stderr, "E: seek hole at %ju failed: %s\n", (uintmax_t)c_offs, strerror(errno));
			return -1;
		}

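		/* SEEK_HOLE returned the start of the next hole, so
		 * [c_offs, soffs) is the extent reported as data. */
		printf("HOLE: %ju to %ju\n", (uintmax_t)c_offs, (uintmax_t)soffs);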
		c_offs = soffs;
		if (soffs == 0) {
			printf("I: zero size hole, done\n");
			break;
		}
	}

	return 0;
}

output from sparse-demo.c on my zfs system

DATA: 0 to 0
HOLE: 0 to 131072
ENXIO
END: 10737418241

Unfortunately, it's somewhat unclear how to resolve this problem. The reliance on how particular filesystems behave wrt HOLE/DATA indication seems fairly core to the design of bmaptool. Perhaps testing for filesystem behavior that would break the assumptions made in bmaptool would be somewhat helpful.

Artem Bityutskiy · Answer 1 · Tue Feb 02 2021 20:14:23 GMT+0800 (China Standard Time)

Hi, there was some ZFS work recently, is this still an issue? Thanks!

Jmesmon Chasel · Answer 2 · Wed Feb 03 2021 22:32:09 GMT+0800 (China Standard Time)

If you're referring to f1cd6ec, which adds checking of zfs_dmu_offset_next_sync, no. This issue exists regardless of the setting of that parameter.

Note that this also isn't necessarily a zfs specific issue. zfs on linux just happens to be the filesystem implementation that exposes the problem.

Artem Bityutskiy · Answer 3 · Wed Feb 03 2021 23:56:50 GMT+0800 (China Standard Time)

Well, I do not remember SEEK_HOLE/SEEK_DATA, and need to research, but what I hear is that bmaptool should not use them, they do not work the way bmaptool assumes. We need to find mapped and unmapped regions. Regions populated with zeroes (zeroes written) must be counted as mapped. FIEMAP works this way, and we assumed SEEK_HOLE is the same, but you are saying this is incorrect.

If you are right, we need to remove SEEK_HOLE support completely...

Jmesmon Chasel · Answer 4 · Fri Feb 05 2021 10:16:28 GMT+0800 (China Standard Time)

I'd be cautious about making that generalization about FIEMAP too: the documentation for FIEMAP indicates it is for retrieving the layout on disk. It does not, as far as I can tell, require that filesystems store runs of zeros (even if written explicitly by some program through the filesystem) as allocated/written extents.
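
For comparison, here is a rough sketch of listing extents through the Linux `FS_IOC_FIEMAP` ioctl. The helper name `print_fiemap_extents` is hypothetical. Nothing in this call guarantees that explicitly written zeros show up as extents; that remains up to the filesystem, and some filesystems do not implement FIEMAP at all.

```c
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: print up to `max` extents that FIEMAP reports for
 * `fd`. Returns the number of extents printed, or -1 if allocation or the
 * ioctl fails (for example when the filesystem lacks FIEMAP support).
 */
int print_fiemap_extents(int fd, unsigned int max)
{
	struct fiemap *fm =
		calloc(1, sizeof(*fm) + max * sizeof(struct fiemap_extent));
	if (!fm)
		return -1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET; /* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;   /* flush dirty data first */
	fm->fm_extent_count = max;         /* room in fm_extents[] */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		free(fm);
		return -1;
	}

	int n = (int)fm->fm_mapped_extents;
	for (int i = 0; i < n; i++) {
		const struct fiemap_extent *e = &fm->fm_extents[i];
		printf("extent %d: logical %llu, length %llu%s\n", i,
		       (unsigned long long)e->fe_logical,
		       (unsigned long long)e->fe_length,
		       (e->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ? " (unwritten)" : "");
	}

	free(fm);
	return n;
}
```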

This is why I suggested that a possible solution could be probing the filesystem on each use to determine whether its behavior allows the use of bmaptool (by checking the assumption that written regions of zeros show up either as DATA ranges via SEEK_HOLE/SEEK_DATA or as allocated extents via FIEMAP).
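
A probe along those lines might look roughly like the sketch below. This is not existing bmaptool code: the function name `zeros_reported_as_data`, the 128 KiB block size, and the write offset are all assumptions chosen for illustration, and it only checks the SEEK_DATA half of the assumption.

```c
#define _GNU_SOURCE 1
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical probe. Creates `path` (which must not exist yet), writes one
 * block of zeros at a non-zero offset, then asks the filesystem whether
 * that block is reported as data. The file is removed again afterwards.
 * Returns 1 if the zeros are reported as data, 0 if they are reported as
 * a hole, and -1 on error.
 */
int zeros_reported_as_data(const char *path)
{
	enum { BLOCK = 128 * 1024 };            /* arbitrary probe size */
	static const unsigned char zeros[BLOCK];
	const off_t offset = 8 * (off_t)BLOCK;  /* arbitrary probe offset */
	int result = -1;

	int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return -1;

	if (pwrite(fd, zeros, BLOCK, offset) != BLOCK)
		goto out;
	/* Flush so the filesystem reports its settled layout. */
	if (fsync(fd) < 0)
		goto out;

	off_t data = lseek(fd, offset, SEEK_DATA);
	if (data == offset)
		result = 1; /* the zero block itself is reported as data */
	else if (data > offset || (data < 0 && errno == ENXIO))
		result = 0; /* the start of the zero block is a hole */
out:
	close(fd);
	unlink(path);
	return result;
}
```

A result of 0 would mean that, on this filesystem, a map built from SEEK_DATA/SEEK_HOLE can skip regions that a program deliberately zeroed.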

Artem Bityutskiy · Answer 5 · Fri Feb 05 2021 17:56:52 GMT+0800 (China Standard Time)

Yeah, I am pretty sure that in ext4 the unmapped regions are always real holes. But you are right, and your idea is good. Maybe we can test different filesystems and just have built-in knowledge about how they behave. It is probably useless to do the test on ext4, for example.
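
The built-in knowledge idea could be sketched as a lookup on the filesystem magic number returned by `statfs`. Everything here is an assumption for illustration: the enum, the function names, and above all the table contents, which only restate what was said in this thread (ext4 believed to be fine, ZFS observed to report written zeros as holes) and have not been verified by testing.

```c
#include <sys/vfs.h>

/* Hypothetical classification of how far reported holes can be trusted. */
enum hole_trust {
	HOLES_UNKNOWN,           /* no built-in knowledge, probe instead */
	HOLES_ASSUMED_UNWRITTEN, /* holes believed to be never-written regions */
	HOLES_MAY_BE_ZEROS       /* holes may hide explicitly written zeros */
};

/* ext2/3/4 share 0xEF53; OpenZFS on Linux reports 0x2FC12FC1. */
#define MAGIC_EXT4 0xEF53L
#define MAGIC_ZFS  0x2FC12FC1L

/* Map a statfs f_type value to the (unverified) table above. */
enum hole_trust classify_fs_magic(long f_type)
{
	switch (f_type) {
	case MAGIC_EXT4:
		return HOLES_ASSUMED_UNWRITTEN;
	case MAGIC_ZFS:
		return HOLES_MAY_BE_ZEROS;
	default:
		return HOLES_UNKNOWN;
	}
}

/* Classify the filesystem that holds `path`; unknown on statfs failure. */
enum hole_trust classify_path(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs) < 0)
		return HOLES_UNKNOWN;
	return classify_fs_magic((long)sfs.f_type);
}
```

Any filesystem that falls into `HOLES_UNKNOWN` would still need a runtime probe, so a table like this can only save work on filesystems whose behavior has actually been checked.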

I do not have time to work on this tool any longer. It needs love, and you or other people are welcome to contribute, and to become maintainers too.