kimbauters / ZIMply

An easy to use offline reader for ZIM files right in your browser!

Search index is missing articles with single-word titles

dylanmccall opened this issue

I encountered the following with a Wikipedia zim file and the version2 branch of Zimply:

>>> zim = zimply_core.ZIMFile("wikipedia_simple_endless-kidzsearch_maxi_2022-03.zim", 'utf-8')

# The "Pancake" article is at 187257
>>> zim.read_directory_entry_by_index(187257)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': 'A',
 'revision': 0,
 'clusterNumber': 1403,
 'blobNumber': 105,
 'url': 'Pancake',
 'title': '',
 'index': 187257}

# We get its offset in the file...
>>> pancake_offset = zim._read_url_offset(187257)

# Start reading metadata from the file
>>> super(zimply_core.DirectoryBlock, zim.articleEntryBlock)._unpack_from_file(zim.file, pancake_offset)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': b'A',
 'revision': 0,
 'clusterNumber': 1403,
 'blobNumber': 105}

# The next item is the URL:
>>> read_zero_terminated(zim.file, "utf-8")
'Pancake'

# And then the title:
>>> read_zero_terminated(zim.file, "utf-8")
''

# The "Pancake Day" article redirects to 224476
>>> zim.read_directory_entry_by_index(224476)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': 'A',
 'revision': 0,
 'clusterNumber': 896,
 'blobNumber': 50,
 'url': 'Shrove_Tuesday',
 'title': 'Shrove Tuesday',
 'index': 224476}

# Get the offset for the article
>>> pancake_day_offset = zim._read_url_offset(224476)

# Start reading metadata from the file
>>> super(zimply_core.DirectoryBlock, zim.articleEntryBlock)._unpack_from_file(zim.file, pancake_day_offset)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': b'A',
 'revision': 0,
 'clusterNumber': 896,
 'blobNumber': 50}

# The next item is the URL:
>>> read_zero_terminated(zim.file, "utf-8")
'Shrove_Tuesday'

# And then the title:
>>> read_zero_terminated(zim.file, "utf-8")
'Shrove Tuesday'

I learned that there is an optimization where, if an article's title is the same as its URL, the title field is left blank. Zimply does the right thing here inside ZIMFile.get_article_by_index, but not when building the search index. The code in CreateFTSProcess that goes through the articles uses ZIMFileIterator, which calls unpack_from_file directly and leaves the (sometimes blank) title field as-is. That would explain why articles with single-word titles, where the URL and the title are identical, are missing from the search index.
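
To illustrate the fallback, here is a minimal standalone sketch of reading the URL and title with the empty-title rule applied. It is not Zimply's code: `read_zero_terminated` is re-implemented here just for the example, `read_url_and_title` is a hypothetical helper name, and the byte strings are hand-built stand-ins for the two entries shown above.

```python
import io


def read_zero_terminated(stream, encoding="utf-8"):
    """Read bytes up to (not including) the next NUL byte and decode them.

    Standalone re-implementation for this example only.
    """
    buf = bytearray()
    while True:
        byte = stream.read(1)
        if byte in (b"", b"\x00"):  # stop at end of stream or at the terminator
            break
        buf += byte
    return buf.decode(encoding)


def read_url_and_title(stream, encoding="utf-8"):
    """Hypothetical helper: read the URL, then the title, from an entry."""
    url = read_zero_terminated(stream, encoding)
    title = read_zero_terminated(stream, encoding)
    # A blank title means "same as the URL", so fall back to the URL.
    return url, title or url


# Hand-built stand-ins for the string part of the two directory entries.
pancake = io.BytesIO(b"Pancake\x00\x00")
shrove = io.BytesIO(b"Shrove_Tuesday\x00Shrove Tuesday\x00")

print(read_url_and_title(pancake))  # ('Pancake', 'Pancake')
print(read_url_and_title(shrove))   # ('Shrove_Tuesday', 'Shrove Tuesday')
```

Applying the fallback at the point where the entry is unpacked means every caller, including the iterator used for indexing, would see a non-empty title.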

Thanks for the detailed steps to reproduce and the pull request. You are quite right that this should be moved to unpack_from_file.