Extraction protocol for arrow files is not defined
radulescupetru opened this issue
Describe the bug
Passing files with the .arrow extension to the data_files argument is very slow, at least when streaming=True.
Steps to reproduce the bug
Basically, loading goes through the _get_extraction_protocol method.
The method first checks a list of known base extensions. arrow is not defined there, so it proceeds to determine the compression with the magic number method, which is slow when dealing with a lot of files stored in S3. Looking at the predefined list below, I don't see arrow in there either, so in the end it returns None:
MAGIC_NUMBER_TO_COMPRESSION_PROTOCOL = {
    bytes.fromhex("504B0304"): "zip",
    bytes.fromhex("504B0506"): "zip",  # empty archive
    bytes.fromhex("504B0708"): "zip",  # spanned archive
    bytes.fromhex("425A68"): "bz2",
    bytes.fromhex("1F8B"): "gzip",
    bytes.fromhex("FD377A585A00"): "xz",
    bytes.fromhex("04224D18"): "lz4",
    bytes.fromhex("28B52FFD"): "zstd",
}
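To make the fallback concrete, here is a minimal sketch of how a magic number lookup over this table could work. The function name detect_compression and the sample header bytes are my own placeholders, not the actual datasets implementation. The point is that the header bytes have to be fetched from every file first, and a header that matches no entry still ends in None.

```python
from typing import Optional

MAGIC_NUMBER_TO_COMPRESSION_PROTOCOL = {
    bytes.fromhex("504B0304"): "zip",
    bytes.fromhex("504B0506"): "zip",  # empty archive
    bytes.fromhex("504B0708"): "zip",  # spanned archive
    bytes.fromhex("425A68"): "bz2",
    bytes.fromhex("1F8B"): "gzip",
    bytes.fromhex("FD377A585A00"): "xz",
    bytes.fromhex("04224D18"): "lz4",
    bytes.fromhex("28B52FFD"): "zstd",
}
# Longest magic number in the table; this many header bytes are enough.
MAX_MAGIC_LENGTH = max(len(magic) for magic in MAGIC_NUMBER_TO_COMPRESSION_PROTOCOL)


def detect_compression(header: bytes) -> Optional[str]:
    """Hypothetical helper: match the first bytes of a file against the table."""
    # Try the longest prefix first, then progressively shorter ones.
    for length in range(MAX_MAGIC_LENGTH, 0, -1):
        protocol = MAGIC_NUMBER_TO_COMPRESSION_PROTOCOL.get(header[:length])
        if protocol is not None:
            return protocol
    return None


# A gzip header is recognized from its first two bytes.
gzip_result = detect_compression(bytes.fromhex("1F8B0800"))
# Placeholder bytes standing in for an uncompressed file: no entry matches.
unknown_result = detect_compression(b"\xff\xff\xff\xff\x00\x00")
```

In the streaming case the header passed in here would come from a remote read, so the cost is one request per file even though the answer for an uncompressed file is always None.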
Expected behavior
My expectation is that arrow would be in the list of known extensions, so the method would return None without going through the magic number check.
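A rough sketch of the behavior I have in mind, assuming an extension check that runs before any bytes are read. All names here (KNOWN_UNCOMPRESSED_EXTENSIONS, get_extraction_protocol, read_header) are hypothetical and only illustrate the idea; they are not the library's actual code, and the extension list is just an example.

```python
from typing import Callable, Optional

# Hypothetical list of extensions known to need no extraction.
KNOWN_UNCOMPRESSED_EXTENSIONS = {"arrow", "csv", "json", "parquet", "txt"}

# Shortened magic number table, enough for the illustration.
MAGIC_NUMBERS = {
    bytes.fromhex("1F8B"): "gzip",
    bytes.fromhex("28B52FFD"): "zstd",
}


def get_extraction_protocol(
    path: str, read_header: Callable[[str], bytes]
) -> Optional[str]:
    """Return the compression protocol, or None if no extraction is needed."""
    extension = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if extension in KNOWN_UNCOMPRESSED_EXTENSIONS:
        # Known uncompressed format: answer without touching the file.
        return None
    # Only unknown extensions pay for the (possibly remote) header read.
    header = read_header(path)
    for magic, protocol in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return protocol
    return None
```

With this ordering, a path ending in .arrow never triggers read_header, so listing many remote files would not cost one request per file.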
Environment info
datasets 2.19.0