saalfeldlab / n5

Not HDF5

Support removing blocks

hanslovsky opened this issue

Currently, the N5Writer interface does not provide a way to delete a block. Tools that copy or convert data with an overwrite option therefore have to write all blocks of a dataset, even "empty" ones (e.g. all zero); see saalfeldlab/n5-spark#12. Skipping "empty" blocks is particularly useful for large datasets with substantial empty areas, as brought up in saalfeldlab/paintera-conversion-helper#33. I propose adding support for removing blocks at the N5Writer level:

N5Writer.deleteBlock(String, DatasetAttributes, long... blockPosition);
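To make the explicit option concrete, here is a minimal sketch of what such a method could do for a filesystem backend. It is not the N5 implementation: the class BlockDeleteSketch and its methods are hypothetical, and it assumes that each block is stored as one file whose path is the dataset directory followed by one path element per grid coordinate.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of the N5 API.
final class BlockDeleteSketch {

    // Assumed layout: <datasetDir>/<p0>/<p1>/.../<pn>, one file per block.
    static Path blockPath(final Path datasetDir, final long... gridPosition) {
        Path path = datasetDir;
        for (final long p : gridPosition)
            path = path.resolve(Long.toString(p));
        return path;
    }

    // Returns true if a block file existed and was removed,
    // false if there was no block file at that position.
    static boolean deleteBlock(final Path datasetDir, final long... gridPosition) throws IOException {
        final Path path = blockPath(datasetDir, gridPosition);
        // Only regular files are removed, so directories that hold other blocks are left alone.
        return Files.isRegularFile(path) && Files.deleteIfExists(path);
    }
}
```

Returning a boolean lets a conversion tool tell "block removed" apart from "block was never there" without a separate existence check; whether the real interface should do the same is an open design question.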

Another option would be to implicitly delete blocks if a special implementation of the DataBlock is passed to indicate deletion. Unfortunately, we cannot pass null for the DataBlock because the block also holds the blockPosition.
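For comparison, the implicit option could look roughly like the sketch below. All names here (MarkerBlockSketch, Block, BytesBlock, DeletionMarker, Store) are hypothetical stand-ins and not N5 types, and the store is an in-memory map rather than a real backend. The marker carries only the grid position, which is why a dedicated type is needed instead of null.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical types for illustration only; none of these are N5 classes.
final class MarkerBlockSketch {

    interface Block {
        long[] gridPosition();
    }

    // A regular block with a payload.
    static final class BytesBlock implements Block {
        private final long[] position;
        final byte[] data;

        BytesBlock(final long[] position, final byte[] data) {
            this.position = position.clone();
            this.data = data.clone();
        }

        @Override
        public long[] gridPosition() {
            return position.clone();
        }
    }

    // A block without payload: passing it to writeBlock means "delete this position".
    static final class DeletionMarker implements Block {
        private final long[] position;

        DeletionMarker(final long... position) {
            this.position = position.clone();
        }

        @Override
        public long[] gridPosition() {
            return position.clone();
        }
    }

    // In-memory stand-in for a writer backend.
    static final class Store {
        private final Map<List<Long>, byte[]> blocks = new HashMap<>();

        void writeBlock(final Block block) {
            final List<Long> key = key(block.gridPosition());
            if (block instanceof DeletionMarker)
                blocks.remove(key); // implicit deletion
            else if (block instanceof BytesBlock)
                blocks.put(key, ((BytesBlock) block).data);
            else
                throw new IllegalArgumentException("unsupported block type");
        }

        boolean exists(final long... position) {
            return blocks.containsKey(key(position));
        }
    }

    // long[] has identity-based equals, so box it into a List for use as a map key.
    private static List<Long> key(final long[] position) {
        return Arrays.stream(position).boxed().collect(Collectors.toList());
    }
}
```

The drawback is visible in writeBlock: every backend has to remember to special-case the marker type, whereas a separate delete method makes the operation explicit in the interface.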

I personally prefer the explicit option (minor version bump) but I would like to hear opinions before I start working on this.

cc @aschampion @igorpisarev

I like your suggestion about adding a separate method for it, especially since there is no easy way to integrate it into writeBlock() as you pointed out.

This is becoming increasingly important: overwriting existing files is slower than writing new files, e.g. on the SSD in my workstation:

$ dd if=/dev/zero of=test bs=4k count=$((128*2000))
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.17341 s, 894 MB/s
$ dd if=/dev/zero of=test bs=4k count=$((128*2000))
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 58.83 s, 17.8 MB/s
$ rm test
$ dd if=/dev/zero of=test bs=4k count=$((128*2000))
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.16586 s, 899 MB/s

This is consistent every time I repeat it and probably affects N5 writes as well (albeit to a lesser degree). If we can delete blocks, we can always remove blocks before writing.
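The "remove before writing" pattern can be sketched with plain java.nio calls. The class DeleteBeforeWriteSketch and the method replaceFile are hypothetical names, and nothing here has been benchmarked; whether it actually recovers the fast path seen with dd depends on the filesystem.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not part of the N5 API.
final class DeleteBeforeWriteSketch {

    // Removes any existing file first, so the data always goes into a newly created file
    // instead of overwriting the old one in place.
    static void replaceFile(final Path path, final byte[] bytes) throws IOException {
        Files.deleteIfExists(path);
        // CREATE_NEW fails if the file reappeared in the meantime,
        // e.g. because another writer created it concurrently.
        Files.write(path, bytes, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    }
}
```

Note that this is not atomic: a reader can observe the block as missing between the delete and the write, which in-place overwriting avoids.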

Update: This is relevant for ext4