Article titles can contain ":"
Dinoguy1000 opened this issue
Currently, analyze_chunk() removes all titles that contain ":" under the assumption that these are non-mainspace titles. However, article titles can contain colons, e.g. Batman v Superman: Dawn of Justice or UTC+03:00 (or, on Simple Wikipedia, UTC+08:00 or Avatar: The Last Airbender). Many of these titles are actually redirects to titles without a colon, but all redirects are already removed by this point in the function, so that's immaterial.
That's a good point. I will re-run without that rule and see what we get.
There are a crazy number of non-articles with a colon in the title, so we'll need to go back to the drawing board on how to filter these out while keeping the articles we want.
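
One possible direction, sketched here only as an illustration: instead of dropping every title that contains a colon, drop only titles whose text before the first colon is a known namespace name. The function name is_probably_mainspace and the NON_MAIN_NAMESPACES set are hypothetical and not part of the existing code, and the namespace list is an incomplete, English-Wikipedia-flavored sample; other wikis (including Simple Wikipedia) would need their own list.

```python
# Hypothetical sketch: not the project's actual analyze_chunk() logic.
# The namespace list is an illustrative subset; a real run would need the
# full namespace list (and aliases) for the wiki being processed.
NON_MAIN_NAMESPACES = frozenset({
    "talk", "user", "user talk", "wikipedia", "wikipedia talk",
    "file", "file talk", "mediawiki", "mediawiki talk",
    "template", "template talk", "help", "help talk",
    "category", "category talk", "portal", "portal talk",
    "draft", "draft talk", "module", "module talk",
})


def is_probably_mainspace(title: str) -> bool:
    """Return True unless the text before the first colon names a known namespace."""
    prefix, sep, _rest = title.partition(":")
    if not sep:
        return True  # no colon at all, so nothing to check
    # Namespace names are case-insensitive, and underscores stand in for spaces.
    normalized = prefix.replace("_", " ").strip().lower()
    return normalized not in NON_MAIN_NAMESPACES


titles = [
    "Batman v Superman: Dawn of Justice",
    "UTC+03:00",
    "Avatar: The Last Airbender",
    "Template:Infobox film",
    "Category:2016 films",
    "User_talk:Example",
]
kept = [t for t in titles if is_probably_mainspace(t)]
```

With this sample, the three article titles from the issue are kept and the Template, Category, and User talk titles are dropped. Whether this catches enough of the non-articles seen in the re-run is untested; titles in namespaces missing from the list would still slip through.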