kimbauters / ZIMply

I am using zim_core from the version2 branch (as used in https://pypi.org/project/zimply-core/) to read a ZIM file from Kiwix that contains wikiHow content. Here is an example:

https://download.kiwix.org/zim/.hidden/endless/wikihow_en_endless_cars-and-other-vehicles_2021-12.zim

I am attempting to read it with the following code:

from zimply_core import zim_core
zim = zim_core.ZIMClient("wikihow_en_endless_cars-and-other-vehicles_2021-12.zim")
count = zim.get_namespace_count("C")
print(f"Articles count: {count}")

On a Linux system with an ext4 filesystem, it fails with the following traceback:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    zim = zim_core.ZIMClient("wikihow_en_endless_cars-and-other-vehicles_2021-12.zim")
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 1152, in __init__
    self.language + " (ISO639-1), articles: " + str(len(self._zim_file)))
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 734, in __len__
    result = self.get_namespace_range("A" if self.version <= (6, 0) else "C")
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 764, in get_namespace_range
    before = self.read_directory_entry_by_index(start_mid - 1)
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 590, in read_directory_entry_by_index
    directory_values = self._read_directory_entry(offset)
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 565, in _read_directory_entry
    self.file.seek(offset)  # move to the desired offset
OSError: [Errno 22] Invalid argument

I poked at this for a while, adding some messages to keep track of what it's doing.

get_namespace_range M
get_namespace_range: start_low 0
get_namespace_range: start_mid 795
get_namespace_range: start_high 1590
read_directory_entry_by_index 795
_read_directory_entry 8655750
read_directory_entry_by_index 794
_read_directory_entry 8655676
[…]
get_namespace_range: start_low 0
get_namespace_range: start_mid 2
get_namespace_range: start_high 4
read_directory_entry_by_index 2
_read_directory_entry 8605156
read_directory_entry_by_index 1
_read_directory_entry 8605116
get_namespace_range: start_low 0
get_namespace_range: start_mid 0
get_namespace_range: start_high 1
read_directory_entry_by_index 0
_read_directory_entry 8605062
read_directory_entry_by_index -1
_read_directory_entry 121364659855736
[Crash]

From the looks of it, ZIMply is computing a file offset for a negative index, which isn't expected:

ZIMply/zimply/zim_core.py

Lines 530 to 539 in e1077d0

    def _read_offset(self, index, field_name, field_format, length):
        # move to the desired position in the file
        if index != 0xffffffff:
            self.file.seek(self.header_fields[field_name] + int(length * index))
            # and read and return the particular format
            read = self.file.read(length)
            # return unpack("<" + field_format, self.file.read(length))[0]
            return unpack("<" + field_format, read)[0]
        return None

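In this instance, it seeks to self.header_fields[field_name] + int(8 * -1), that is, 8 bytes before the position stored in the header field, and unpacks whatever it finds there as a "Q" (unsigned 64-bit) value. As it turns out, that value is a very big number, and it is then used as a file offset.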
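To make the arithmetic concrete, here is a minimal sketch on an in-memory toy file. The layout, the name POINTER_LIST_START, and the helper read_offset are all made up for illustration and are not ZIMply's code; the large value is simply planted in front of the list to mimic the debug output above.

```python
import io
from struct import pack, unpack

# Hypothetical layout: 16 bytes of unrelated data, followed by a list of
# three 8-byte little-endian pointers that starts at offset 16.
POINTER_LIST_START = 16
unrelated = b"\xff" * 8 + pack("<Q", 121364659855736)
pointers = pack("<3Q", 100, 200, 300)
file = io.BytesIO(unrelated + pointers)


def read_offset(index, length=8, field_format="Q"):
    """Mirror the seek/read/unpack steps of _read_offset on the toy file."""
    file.seek(POINTER_LIST_START + int(length * index))
    return unpack("<" + field_format, file.read(length))[0]


valid = read_offset(0)    # first real pointer in the list
bogus = read_offset(-1)   # the 8 bytes *before* the list, read as a pointer
print(valid, bogus)
```

Nothing fails at this point: the read at index -1 succeeds and quietly returns unrelated bytes. The error only shows up later, when the bogus value is passed to seek.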

Note that whether file.seek(121364659855736) results in an exception depends on the underlying filesystem, which makes this a little tricky to reproduce :)
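For anyone trying to reproduce this, a small probe along these lines may help check whether a given filesystem rejects the seek. It is only a sketch: seek_is_accepted and its directory parameter are hypothetical names, and the result depends entirely on the filesystem backing the chosen directory.

```python
import tempfile

SUSPECT_OFFSET = 121364659855736  # offset taken from the debug output above


def seek_is_accepted(offset, directory=None):
    """Return True if a seek to `offset` succeeds on a temporary file.

    `directory` selects where the temporary file is created, and therefore
    which filesystem is probed; None means the system default temp dir.
    """
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        try:
            handle.seek(offset)
        except OSError:
            return False
        return True


accepted = seek_is_accepted(SUSPECT_OFFSET)
print("seek accepted" if accepted else "seek rejected with OSError")
```

Note that the default temp directory is often on a different filesystem (for example tmpfs) than the one holding the ZIM file, so passing the ZIM file's directory is probably the more meaningful test.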

Thanks for the detailed overview of the problem. It looks like the issue is in the binary search, which can step outside its boundaries. This happens because of the additional check at start_mid - 1, which is needed to peek back and verify whether there is a namespace changeover or whether we need to continue looking. When start_mid == 0, that check asks for index -1.
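The peek-back can be illustrated with a simplified binary search over a plain list of namespace letters. This is not ZIMply's get_namespace_range; find_namespace_start and its bookkeeping are assumptions made up for this sketch, and it only records which indices get requested.

```python
def find_namespace_start(namespaces, target):
    """Binary search for the first index whose namespace equals `target`.

    Returns (start_index, reads), where `reads` lists every index requested.
    """
    reads = []

    def read(index):
        reads.append(index)
        # A Python list silently wraps -1 around to the last item; the real
        # reader instead seeks to 8 bytes before the pointer list.
        return namespaces[index]

    low, high = 0, len(namespaces) - 1
    while low <= high:
        mid = (low + high) // 2
        current = read(mid)
        if current < target:
            low = mid + 1
        elif current > target:
            high = mid - 1
        else:
            # Peek back to see whether `mid` is the namespace changeover.
            before = read(mid - 1)  # mid - 1 is -1 when mid == 0
            if before != target:
                return mid, reads
            high = mid - 1
    return None, reads


entries = ["C", "C", "C", "C", "M", "M", "W"]
start_c, reads_c = find_namespace_start(entries, "C")
start_m, reads_m = find_namespace_start(entries, "M")
print(start_c, reads_c)
print(start_m, reads_m)
```

The search for "M" never leaves the valid range, while the search for "C", whose entries start at index 0, ends up requesting index -1.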

At first glance, the pull request you submitted looks fine. Still, I think it may be better to move the check into the read_directory_entry_by_index definition instead:

        # verify that the index is not negative (0 is a valid index)
        if index < 0:
            raise struct_error  # we never have a valid entry at an invalid index
        # find the offset for the given index

The reason I would put the fix there is that it catches all erroneous calls to read_directory_entry_by_index (it is not unimaginable that a future update would make additional calls to this method), and the code block in the binary search is already written to deal with thrown errors.
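As a rough sketch of that arrangement: the reader raises on a negative index and the caller treats the error as "no entry there". ToyDirectory and namespace_before are hypothetical stand-ins rather than ZIMply's classes, and the sketch assumes struct_error refers to struct.error imported under that name.

```python
from struct import error as struct_error  # assumed origin of struct_error


class ToyDirectory:
    """Hypothetical stand-in for the ZIM file reader; not ZIMply's class."""

    def __init__(self, entries):
        self._entries = entries

    def read_directory_entry_by_index(self, index):
        # verify that the index is not negative (0 is a valid index)
        if index < 0:
            raise struct_error("no directory entry at a negative index")
        return self._entries[index]


def namespace_before(directory, index):
    """Peek at the previous entry; an unreadable entry counts as None."""
    try:
        return directory.read_directory_entry_by_index(index - 1)
    except struct_error:
        return None


directory = ToyDirectory(["C", "C", "M"])
print(namespace_before(directory, 0))  # peek before the first entry
print(namespace_before(directory, 2))  # peek before the "M" entry
```

Putting the guard in the reader means any other caller that passes a negative index gets the same error, instead of a seek to an arbitrary offset.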

Oh yeah, good point that we're already handling an exception, and raising it from read_directory_entry_by_index makes plenty of sense to me as well :) I updated the pull request accordingly.


Zimply tries to seek to an invalid offset with some zim files