phueb / WikiCount

Count words in Wikipedia articles on multiple machines

WikiCount

Research code to count words in English Wikipedia 2018.

Requirements

  • Access to the file server at the UIUC Learning & Language Lab, where the Wikipedia articles are stored
  • Ludwig - a Python package for parallel execution of jobs

Usage

Use the ludwig CLI to run all jobs, either on your local machine or on remote workers owned by the lab.

ludwig

Each job saves one Python pickle file for each Wikipedia article folder included in the job. Each file contains a list of Python dictionaries, and each dictionary records the number of times words occur in one article.
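As a rough illustration of how one of these pickle files might be consumed, the sketch below loads a file and sums the per-article counts into one total. It assumes each dictionary maps a word to its count in one article; that layout, the function name `merge_article_counts`, and the file name `sample_folder.pkl` are assumptions made for this example and are not taken from the project code.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def merge_article_counts(pickle_path):
    """Sum per-article word counts stored in one pickle file.

    Assumes the file holds a list of dicts, each mapping word -> count
    for one article (an assumed layout, not confirmed by the project).
    """
    with open(pickle_path, "rb") as f:
        article_counts = pickle.load(f)

    total = Counter()
    for word_to_count in article_counts:
        # Counter.update adds counts for words that are already present.
        total.update(word_to_count)
    return total


if __name__ == "__main__":
    # Build a tiny stand-in file so the sketch runs without lab access.
    sample = [{"the": 3, "cat": 1}, {"the": 2, "dog": 4}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample_folder.pkl"  # hypothetical file name
        with open(path, "wb") as f:
            pickle.dump(sample, f)
        print(merge_article_counts(path).most_common(2))
```

Only unpickle files you produced yourself or otherwise trust, since loading a pickle can execute arbitrary code.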

Compatibility

Tested on Ubuntu 18.04 using Python 3.6.
