__pack_tensor__ should be at the beginning of the file to avoid seeking through the whole file

Question

dmikushin opened this issue 7 months ago · comments

I have another concern about __pack_tensor__. According to hexedit, the __pack_tensor__ entry is located at the end of the .bl2 file. I think this is an inefficient choice for large files. Suppose I have a 10 GiB .bl2 file. I don't want to read it in its entirety, but knowing its shape is essential for almost any use case. So, in order to read the shape, c-blosc2 would need to fseek() to the end of the file. Of course, seeking is much faster than reading the content, but the file I/O would still need to hop over the inodes of the fragmented representation of a big file in the filesystem. So why not eliminate all this extra load on the filesystem by always placing metadata nodes at the beginning of the file? Is there an industry standard or practice that requires metadata to be placed at the end of the file?
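
To make the access pattern concrete, here is a minimal sketch of reading trailing metadata without touching the payload. It uses a made-up toy layout (payload, then JSON metadata, then an 8-byte length footer), not the actual .bl2 frame layout; the names write_toy_file, read_trailing_meta and demo are hypothetical and exist only for this illustration.

```python
import json
import os
import struct
import tempfile

# Assumption: a toy layout, NOT the real .bl2 format.
# [payload][JSON metadata][8-byte little-endian metadata length]
FOOTER = struct.Struct("<Q")


def write_toy_file(path, payload, meta):
    """Write the payload, then the metadata, then a fixed-size length footer."""
    blob = json.dumps(meta).encode("utf-8")
    with open(path, "wb") as f:
        f.write(payload)
        f.write(blob)
        f.write(FOOTER.pack(len(blob)))


def read_trailing_meta(path):
    """Fetch only the metadata: two seeks relative to the end, payload never read."""
    with open(path, "rb") as f:
        # First seek: read the fixed-size footer to learn the metadata length.
        f.seek(-FOOTER.size, os.SEEK_END)
        (meta_len,) = FOOTER.unpack(f.read(FOOTER.size))
        # Second seek: jump back to the start of the metadata block.
        f.seek(-(FOOTER.size + meta_len), os.SEEK_END)
        return json.loads(f.read(meta_len).decode("utf-8"))


def demo():
    """Round-trip a 1 MB dummy payload and return the metadata read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        write_toy_file(path, b"\x00" * 1_000_000, {"shape": [1000, 1000]})
        return read_trailing_meta(path)
```

In this sketch the reader issues two seeks and two small reads regardless of payload size. The trade-off on the writer side is that a trailer can be appended once the payload is finished, whereas metadata at the start of the file has to be sized or reserved before the payload is written.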

Francesc Alted · Answer 1 · Tue Jan 30 2024 00:01:30 GMT+0800 (China Standard Time)

You are creating this issue in the wrong repo (this kind of metadata is supported on Blosc2 only). Could you try to use the proper https://github.com/Blosc/python-blosc2/issues instead?

Dmitry Mikushin · Answer 2 · Tue Jan 30 2024 00:08:07 GMT+0800 (China Standard Time)

Sorry, somehow I missed the extra little '2' at the end: Blosc/python-blosc2#162