AndreiRegiani / wikipedia-crawler

Extracts plain-text from Wikipedia articles, ideal to perform linguistic analysis on a specific topic

wikipedia-crawler

Extracts plain text from a series of Wikipedia articles and saves it to a local text file.

The goal is to collect text samples in a specific language on a specific topic, so that the output can be used for computational linguistic analysis (word frequency, distribution, etc.) or to generate wordlists for any language on Wikipedia (294 languages).
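As an illustration of the word-frequency use case, here is a minimal sketch of how the crawler's output could be analysed. It is not part of this repository: the function name `word_frequencies` is hypothetical, and the tokenizer is a simple regular expression rather than anything language-aware.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=10):
    """Return the top_n most common lowercase words in text.

    Hypothetical helper for illustration; not part of wikipedia-crawler.
    """
    # \w+ matches runs of Unicode letters, digits, and underscores.
    # This is a rough tokenizer and does not handle language-specific rules.
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(top_n)


sample = "The cell. The cell is the unit."
print(word_frequencies(sample, top_n=2))
```

To apply this to a real crawl, read the generated file (for example output.txt, assuming it is UTF-8 encoded) and pass its contents to the function.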

Usage:

python3 wikipedia-crawler.py https://en.wikipedia.org/wiki/Biology

Generates output.txt, extracting only a single article. To crawl further articles, add these parameters:

--articles=10 --interval=5 --output=biology.txt 

Generates biology.txt by crawling 10 articles related to Biology. The interval between requests is set to 5 seconds (the default) to avoid overloading Wikipedia's servers. A session log containing all visited URLs is saved as session_biology.txt. Running again with the same output file reuses the same session file.

In this example the initial article is Biology; the crawler then continues extracting related pages: Natural Science, Evolution, ...
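The crawling behaviour described above (an article limit, a pause between requests, and a record of visited URLs) can be sketched roughly as follows. This is an assumption about the general technique, not the script's actual code: the function `crawl`, the `fetch` callable, and the breadth-first ordering are all hypothetical, and `fetch` is injected so the sketch runs without network access.

```python
import time
from collections import deque


def crawl(start_url, fetch, max_articles=10, interval=5, visited=None):
    """Breadth-first crawl sketch (hypothetical, not the script's code).

    fetch(url) is an assumed callable returning (text, linked_urls).
    visited plays the role of the session log: URLs in it are skipped.
    """
    visited = set() if visited is None else visited
    queue = deque([start_url])
    texts = []
    while queue and len(texts) < max_articles:
        url = queue.popleft()
        if url in visited:
            continue  # already extracted in this or an earlier session
        if texts:
            time.sleep(interval)  # pause between consecutive requests
        text, links = fetch(url)
        visited.add(url)
        texts.append(text)
        # Queue linked pages that have not been visited yet.
        queue.extend(link for link in links if link not in visited)
    return texts


# Tiny in-memory "site" standing in for real Wikipedia pages.
pages = {
    "Biology": ("biology text", ["Natural Science", "Evolution"]),
    "Natural Science": ("science text", ["Biology"]),
    "Evolution": ("evolution text", []),
}
print(crawl("Biology", lambda url: pages[url], max_articles=2, interval=0))
```

Passing the same `visited` set to a later call mimics reusing a session file, since pages already recorded there are not extracted again.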

Dependencies:

pip install -r requirements.txt

License: GNU General Public License v3.0


Languages

Language: Python 100.0%