attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps

Option to drop section titles/headers

Matthieu-Tinycoaching opened this issue

Hi,

When extracting a Wikipedia dump to JSON format:
python -m wikiextractor.WikiExtractor -o simpleWikipedia --templates template --json --bytes 200M simplewiki-20220901-pages-articles.xml.bz2

I would like to remove all subsection titles/headers and keep only the textual paragraphs of the corpus (e.g. remove the "The Month" and "April in poetry" titles from this page: https://simple.wikipedia.org/wiki/April).

Is there an option, or a simple fix in the code, to discard these headers/titles?

Thanks!

Hi,

@attardi, any idea how to deal with this?

Thanks!
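
One possible workaround, sketched here without reference to wikiextractor's internals: wikitext marks headings as a title wrapped in matching runs of `=` on their own line (e.g. `== The Month ==`), so those lines can be filtered out of the raw page text before it is cleaned. The function name `drop_wikitext_headings` and the sample text are hypothetical, and where exactly such a filter would be hooked into the extractor is an assumption left to whoever tries it.

```python
import re

# A wikitext heading line: 2 to 6 '=' signs, a title, then the same run of '='.
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def drop_wikitext_headings(wikitext: str) -> str:
    """Return the wikitext with heading lines such as '== The Month ==' removed.

    Hypothetical helper, not part of wikiextractor.
    """
    kept = [line for line in wikitext.splitlines() if not HEADING_RE.match(line)]
    return "\n".join(kept)


# Made-up sample loosely modelled on the page mentioned in the issue.
sample = (
    "April is the fourth month of the year.\n"
    "== The Month ==\n"
    "April comes between March and May.\n"
    "=== April in poetry ===\n"
    "Poets have written about April."
)

print(drop_wikitext_headings(sample))
```

This only handles headings in raw wikitext. If the headers have already been converted to plain lines in the JSON `text` field, a different heuristic would be needed.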