Blosc / python-blosc

A Python wrapper for the extremely fast Blosc compression library

Home Page: https://www.blosc.org/python-blosc/python-blosc.html

__pack_tensor__ should be at the beginning of the file to avoid seeking through the whole file

dmikushin opened this issue

Hi @FrancescAlted ,

I have another concern about __pack_tensor__. According to hexedit, the __pack_tensor__ entry is located at the end of the .bl2 file. I think this is an inefficient choice for large files. Suppose I have a 10 GiB .bl2 file. I don't want to read it entirely, but knowing its shape is essential for almost any use case. So, in order to read the shape, c-blosc2 would need to fseek() to the end of the file. Of course, seeking is much faster than reading the content, but the file I/O would still need to hop over the inodes of the fragmented representation of a big file in the filesystem. So why not eliminate all this extra load on the filesystem by always placing metadata nodes at the beginning of the file? Is there an industry standard or practice that requires metadata to be placed at the end of the file?
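To make the two access patterns in this question concrete, here is a minimal Python sketch. It is not the Blosc2 frame format: the layouts, the 8-byte length field, and the function names are all hypothetical and chosen only for illustration. It contrasts reading a small metadata block stored after the payload (which needs seeks relative to the end of the file) with reading one stored before the payload (which needs none).

```python
import os
import struct

LEN_FMT = "<Q"  # hypothetical 8-byte little-endian length field
LEN_SIZE = struct.calcsize(LEN_FMT)


def write_with_trailer(path, payload, meta):
    """Toy layout: [payload][meta][meta length]."""
    with open(path, "wb") as f:
        f.write(payload)
        f.write(meta)
        f.write(struct.pack(LEN_FMT, len(meta)))


def read_trailer_meta(path):
    """Read the metadata without touching the payload: two seeks, two small reads."""
    with open(path, "rb") as f:
        f.seek(-LEN_SIZE, os.SEEK_END)
        (meta_len,) = struct.unpack(LEN_FMT, f.read(LEN_SIZE))
        f.seek(-(LEN_SIZE + meta_len), os.SEEK_END)
        return f.read(meta_len)


def write_with_header(path, payload, meta):
    """Toy layout: [meta length][meta][payload]."""
    with open(path, "wb") as f:
        f.write(struct.pack(LEN_FMT, len(meta)))
        f.write(meta)
        f.write(payload)


def read_header_meta(path):
    """Read the metadata from the start of the file: no seek at all."""
    with open(path, "rb") as f:
        (meta_len,) = struct.unpack(LEN_FMT, f.read(LEN_SIZE))
        return f.read(meta_len)
```

In both toy layouts the payload itself is never read. The only difference is whether the reader has to reposition to the end of the file first, which is the cost being questioned here.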

You are creating this issue in the wrong repo (this kind of metadata is supported in Blosc2 only). Could you try the proper tracker, https://github.com/Blosc/python-blosc2/issues, instead?

Sorry, somehow I missed the extra little '2' at the end: Blosc/python-blosc2#162