Search index is missing articles with single-word titles

Question

dylanmccall opened this issue 3 years ago · comments

I encountered the following with a Wikipedia zim file and the version2 branch of Zimply:

>>> zim = zimply_core.ZIMFile("wikipedia_simple_endless-kidzsearch_maxi_2022-03.zim", 'utf-8')

# The "Pancake" article is at 187257
>>> zim.read_directory_entry_by_index(187257)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': 'A',
 'revision': 0,
 'clusterNumber': 1403,
 'blobNumber': 105,
 'url': 'Pancake',
 'title': '',
 'index': 187257}

# We get its offset in the file...
>>> pancake_offset = zim._read_url_offset(187257)

# Start reading metadata from the file
>>> super(zimply_core.DirectoryBlock, zim.articleEntryBlock)._unpack_from_file(zim.file, pancake_offset)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': b'A',
 'revision': 0,
 'clusterNumber': 1403,
 'blobNumber': 105}

# The next item is the URL:
>>> read_zero_terminated(zim.file, "utf-8")
'Pancake'

# And then the title:
>>> read_zero_terminated(zim.file, "utf-8")
''

# The "Pancake Day" article redirects to 224476
>>> zim.read_directory_entry_by_index(224476)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': 'A',
 'revision': 0,
 'clusterNumber': 896,
 'blobNumber': 50,
 'url': 'Shrove_Tuesday',
 'title': 'Shrove Tuesday',
 'index': 224476}

# Get the offset for the article
>>> pancake_day_offset = zim._read_url_offset(224476)

# Start reading metadata from the file
>>> super(zimply_core.DirectoryBlock, zim.articleEntryBlock)._unpack_from_file(zim.file, pancake_day_offset)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': b'A',
 'revision': 0,
 'clusterNumber': 896,
 'blobNumber': 50}

# The next item is the URL:
>>> read_zero_terminated(zim.file, "utf-8")
'Shrove_Tuesday'

# And then the title:
>>> read_zero_terminated(zim.file, "utf-8")
'Shrove Tuesday'

I learned that there is an optimization: if an article's title is the same as its URL, the title field is left blank. Single-word titles such as "Pancake" hit this case because, with no spaces to replace with underscores, the URL and the title are identical. Zimply does the right thing here inside ZIMFile.get_article_by_index, but not when building the search index. The code in CreateFTSProcess that goes through the articles uses ZIMFileIterator, which calls unpack_from_file directly, and that code leaves the (sometimes blank) title field as-is.
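
To illustrate the fallback described above, here is a minimal sketch that does not use Zimply itself. The helper names with_title_fallback and iter_indexable are hypothetical, and the entries are plain dicts shaped like the output shown earlier; the real ZIMFileIterator may yield something different.

```python
def with_title_fallback(entry):
    """Return a copy of a directory entry whose blank title is replaced by its URL.

    Hypothetical helper: it mirrors the "blank title means same as URL"
    optimization described in this issue, not Zimply's actual code.
    """
    fixed = dict(entry)  # copy so the caller's entry is left untouched
    if not fixed.get("title"):
        fixed["title"] = fixed["url"]
    return fixed


def iter_indexable(entries):
    """Yield entries with titles filled in, ready to be added to a search index."""
    for entry in entries:
        yield with_title_fallback(entry)


# Entries shaped like the directory entries printed above (fields trimmed).
entries = [
    {"url": "Pancake", "title": "", "index": 187257},
    {"url": "Shrove_Tuesday", "title": "Shrove Tuesday", "index": 224476},
]

titles = [e["title"] for e in iter_indexable(entries)]
print(titles)
```

With something like this in the indexing path, the "Pancake" entry would be indexed under its URL rather than under an empty string.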

kimbauters · Answer 1 · Thu Mar 03 2022 17:07:31 GMT+0800 (China Standard Time)

Thanks for the detailed steps to reproduce and the pull request. You are quite right that this should be moved to unpack_from_file.
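
As a rough sketch of what applying the fallback at the unpacking layer could look like, the example below reads a URL and a title as two zero-terminated strings and fills in a blank title on the spot. It is an assumption-laden illustration: this read_zero_terminated is a stand-in written for the example, not Zimply's function of the same name, and unpack_url_and_title is a hypothetical name.

```python
import io


def read_zero_terminated(stream, encoding):
    """Read bytes up to (not including) the next NUL byte and decode them.

    Stand-in for illustration only; not Zimply's implementation.
    """
    buf = bytearray()
    while True:
        byte = stream.read(1)
        if not byte or byte == b"\x00":  # stop at end of stream or terminator
            break
        buf += byte
    return buf.decode(encoding)


def unpack_url_and_title(stream, encoding="utf-8"):
    """Read the URL and title fields, falling back to the URL for a blank title."""
    url = read_zero_terminated(stream, encoding)
    title = read_zero_terminated(stream, encoding)
    return {"url": url, "title": title or url}


# A blank title is stored as just the terminator byte after the URL.
pancake = unpack_url_and_title(io.BytesIO(b"Pancake\x00\x00"))
shrove = unpack_url_and_title(io.BytesIO(b"Shrove_Tuesday\x00Shrove Tuesday\x00"))
print(pancake, shrove)
```

Doing the fallback at this level means every caller, including an iterator used for indexing, sees a non-empty title without having to repeat the check.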