simran2097 / Research-Paper-Miner

WikiExtractor

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.

The tool is written in Python and requires Python 2.7 or Python 3.3+. It needs no additional libraries.

For further information, see the project Home Page or the Wiki.

Wikipedia Cirrus Extractor

cirrus-extractor.py is a version of the script that performs extraction from a Wikipedia Cirrus dump. Cirrus dumps contain text with already expanded templates.

Cirrus dumps are available at: cirrussearch.
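As a rough illustration of why a Cirrus dump needs no template expansion, the sketch below reads records that already carry plain text. It assumes the dump is line-oriented JSON in which an index line is followed by a content line, and that the content has `namespace`, `title` and `text` fields. The layout, the field names, the sample data and the helper `iter_articles` are assumptions made for this example, not taken from cirrus-extractor.py.

```python
import json

# Hypothetical sample: two records, each an index line followed by a content line.
SAMPLE_LINES = [
    '{"index": {"_type": "page", "_id": "12"}}',
    '{"title": "Example", "namespace": 0, "text": "Plain text, templates already expanded."}',
    '{"index": {"_type": "page", "_id": "13"}}',
    '{"title": "Talk:Example", "namespace": 1, "text": "Discussion page."}',
]


def iter_articles(lines):
    """Yield (id, title, text) for main-namespace pages (assumed layout)."""
    it = iter(lines)
    for index_line in it:
        content_line = next(it, None)
        if content_line is None:
            break  # dangling index line without content
        index = json.loads(index_line)
        content = json.loads(content_line)
        if content.get("namespace") == 0:
            yield index["index"]["_id"], content["title"], content["text"]


articles = list(iter_articles(SAMPLE_LINES))
for page_id, title, text in articles:
    print(page_id, title, text)
```

A real dump is compressed and far larger, so the lines would be streamed from the file rather than held in a list.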

Details

WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.
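The preprocessing pass described above can be sketched as a single scan that collects template definitions before any article is processed. This is a minimal sketch, not WikiExtractor's actual code: the sample dump, the function names `local_name` and `collect_templates`, and the use of `xml.etree.ElementTree` are assumptions chosen for illustration. It relies on MediaWiki placing template pages in namespace 10.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump with one template page and one article.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Template:Greeting</title>
    <ns>10</ns>
    <revision><text>Hello, {{{1}}}!</text></revision>
  </page>
  <page>
    <title>Example article</title>
    <ns>0</ns>
    <revision><text>{{Greeting|world}} This is an article.</text></revision>
  </page>
</mediawiki>"""

TEMPLATE_NS = "10"  # MediaWiki namespace number for Template: pages


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}page' to 'page'."""
    return tag.rsplit("}", 1)[-1]


def collect_templates(stream):
    """Scan the dump once and map each template title to its raw wikitext."""
    templates = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        for child in elem.iter():
            name = local_name(child.tag)
            if name in ("title", "ns", "text"):
                fields[name] = child.text or ""
        if fields.get("ns") == TEMPLATE_NS:
            templates[fields["title"]] = fields.get("text", "")
        elem.clear()  # release the page's children once it has been handled
    return templates


templates = collect_templates(io.BytesIO(SAMPLE_DUMP.encode("utf-8")))
print(templates)
```

A second pass over the articles could then look up each `{{Name|...}}` call in this dictionary. Substituting the arguments is the harder part and is not shown here.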
