Various tags such as q, br, ins, del are not filtered out
adno opened this issue · comments
Adam Nohejl commented
Many elements/tags appear in wikiextractor's output, such as poem, q, ins, del, br, section, onlyinclude, includeonly, math, or mathematical equations (with commands such as \mathbf) not enclosed in any tags.
- Download this dump:
https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2
- Invoke the following command to list lines that contain the opening tags of these elements:
wikiextractor --no-templates --html-safe '' -o - dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2 | grep '<\(poem\|q\|section\|ins\|del\|math\|onlyinclude\|br\|chem\)\b'
Examples from the output:
<poem>
<poem style="margin-left:2em">
<br>"domestic:" good automatic telephone system
…
Benzene, <chem>C6H6</chem>, …
…
<section end="Big Brother series" />
…
<onlyinclude>
…
<chem>O2{} + 4H+(aq){} + 4 Fe^{2+}(cyt\,c) -> 2H2O{} + 4 Fe^{3+}(cyt\,c) </chem> formula_1
…
</includeonly><section end=Lineups />
(Not all of the tags appear in this particular dump.)
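To see which of the tags do appear in a given dump, the grep above could be replaced by a per-tag tally. The following is only a minimal sketch: the helper name count_leftover_tags is hypothetical and not part of wikiextractor, the tag list is copied from the grep pattern, and the regular expression matches opening tags only, as the grep does.

```python
import re
from collections import Counter

# Tag names copied from the grep pattern in the report (assumption: this
# list is illustrative, not a complete list of unfiltered tags).
TAGS = ("poem", "q", "section", "ins", "del", "math", "onlyinclude", "br", "chem")

# Same idea as the grep: "<", one of the tag names, then a word boundary,
# so "<q>" matches but "<quote>" does not. Closing tags ("</...") never match.
OPENING_TAG = re.compile(r"<(" + "|".join(TAGS) + r")\b")


def count_leftover_tags(lines):
    """Count opening tags from TAGS across an iterable of text lines.

    Hypothetical helper for inspecting extractor output; it only counts
    occurrences and does not modify or filter the text.
    """
    counts = Counter()
    for line in lines:
        # findall returns the captured tag name for every match in the line.
        counts.update(OPENING_TAG.findall(line))
    return counts
```

Passing sys.stdin to count_leftover_tags in a small script, with the wikiextractor command from the steps above piped into it, would give the number of opening tags of each kind that survived extraction.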
Adam Nohejl commented