attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps

Various tags such as q, br, ins, del are not filtered out

adno opened this issue

Many elements/tags are left in wikiextractor's output, such as poem, q, ins, del, br, section, onlyinclude, includeonly and math, as well as mathematical equations (with commands such as \mathbf) that are not enclosed in any tags.

  1. Download this dump: https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2
  2. Invoke the following command to list lines that contain the opening tags of these elements:

wikiextractor --no-templates --html-safe '' -o - dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2 | grep '<\(poem\|q\|section\|ins\|del\|math\|onlyinclude\|br\|chem\)\b'

Examples from the output:

<poem>
<poem style="margin-left:2em">
<br>"domestic:" good automatic telephone system
…
Benzene, <chem>C6H6</chem>, …
…
<section end="Big Brother series" />
…
<onlyinclude>
…
<chem>O2{} + 4H+(aq){} + 4 Fe^{2+}(cyt\,c) -> 2H2O{} + 4 Fe^{3+}(cyt\,c) </chem> formula_1
…
</includeonly><section end=Lineups />

(Not all of the tags appear in this particular dump.)
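
To see which of the tags actually occur, and how often, the grep pattern above can be turned into a small counting script. This is only a sketch for triaging the output, not part of wikiextractor: the names `TAGS`, `OPENING_TAG`, `count_opening_tags` and `SAMPLE` are made up for illustration, and the sample lines are copied from the examples above.

```python
import re
from collections import Counter

# Tag names taken from the grep pattern in this report.
TAGS = ("poem", "q", "section", "ins", "del", "math", "onlyinclude", "br", "chem")

# Match an opening tag: "<name" followed by a word boundary, like the grep.
# Closing tags ("</name>") and longer names ("<quote>") do not match.
OPENING_TAG = re.compile(r"<(" + "|".join(TAGS) + r")\b")


def count_opening_tags(lines):
    """Count occurrences of each opening tag across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(OPENING_TAG.findall(line))
    return counts


# A few lines copied from the example output above (hypothetical test input).
SAMPLE = [
    '<poem style="margin-left:2em">',
    '<br>"domestic:" good automatic telephone system',
    "Benzene, <chem>C6H6</chem>, ...",
    "</includeonly><section end=Lineups />",
]

# Print one "tag<TAB>count" line per tag found, most frequent first.
for tag, n in count_opening_tags(SAMPLE).most_common():
    print(f"{tag}\t{n}")
```

To run this on real output, pass `sys.stdin` (after `import sys`) instead of `SAMPLE` and pipe the wikiextractor command into the script.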

There are similar issues with mapframe and score elements (#301) and with table formatting (#298).