daveshap / PlainTextWikipedia

Convert Wikipedia database dumps into plaintext files

Article titles can contain ":"

Dinoguy1000 opened this issue

Currently, analyze_chunk() removes all titles that contain : under the assumption that these are non-mainspace titles. However, article titles can contain colons, e.g. Batman v Superman: Dawn of Justice or UTC+03:00 (or, on Simple Wikipedia, UTC+08:00 or Avatar: The Last Airbender). Many of these titles are actually redirects to titles without a colon, but all redirects are already removed by this point in the function, so that's immaterial.

That's a good point. I will re-run without that rule and see what we get.

There are a crazy number of non-articles with a colon in the title, so we'll need to go back to the drawing board on how to filter these out while keeping the articles we want.
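One possible direction, sketched below, is to reject a title only when the text before its first colon is a known namespace name, rather than rejecting every title that contains a colon. This is a minimal sketch and not the project's code: the helper name is_probable_article and the NON_ARTICLE_PREFIXES set are hypothetical, and the prefix list is an illustrative subset of English Wikipedia namespaces that would need extending for other wikis. Since the dump's XML also records a namespace number for each page, checking that number would likely be more robust than matching on title text.

```python
# Hypothetical, illustrative subset of non-article namespace names (lowercased).
# A real filter would need the full list for the wiki being processed.
NON_ARTICLE_PREFIXES = {
    "talk", "user", "user talk", "wikipedia", "wikipedia talk",
    "file", "file talk", "mediawiki", "mediawiki talk",
    "template", "template talk", "help", "help talk",
    "category", "category talk", "portal", "portal talk",
    "draft", "draft talk", "module", "module talk",
}


def is_probable_article(title: str) -> bool:
    """Return False only if the title starts with a known namespace prefix."""
    prefix, sep, _rest = title.partition(":")
    if not sep:
        # No colon at all, so the title cannot carry a namespace prefix.
        return True
    # Namespace names are matched case-insensitively here, and underscores
    # are treated as spaces (an assumption of this sketch).
    normalized = prefix.strip().lower().replace("_", " ")
    return normalized not in NON_ARTICLE_PREFIXES


titles = [
    "Batman v Superman: Dawn of Justice",
    "UTC+03:00",
    "Template:Infobox",
    "Category:Films",
]
kept = [t for t in titles if is_probable_article(t)]
```

With this check, the colon-containing article titles from the issue are kept while the Template: and Category: pages are dropped. An article whose title happens to begin with a namespace word followed by a colon would still be misclassified, which is another reason to prefer the namespace number where it is available.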