The unified corpus building environment for Language Models.
langumo is a unified corpus building environment for Language Models.
langumo provides pipelines for building text-based datasets. Constructing datasets requires complicated pipelines (e.g. parsing, shuffling, and tokenization). Moreover, when corpora are collected from different sources, extracting data from the various formats becomes a problem. langumo helps you build a dataset from diverse formats simply, all at once.
- Easy to build, simple to add a new corpus format.
- Fast builds through performance optimizations (even though it is written in Python).
- Supports multiprocessing when parsing corpora.
- Extremely low memory usage.
- All-in-one environment. Never mind the internal procedures!
- No need to write code for a new corpus format. Simply add it to the build configuration instead.
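To illustrate the last bullet, here is a hypothetical sketch of config-driven parser selection. The class names and registry below are invented for illustration and are not langumo's actual internals; they only show why supporting a new format needs a configuration entry rather than new pipeline code:

```python
# Hypothetical sketch of config-driven parser selection -- illustrative
# only, not langumo's actual internals.

class WikipediaParser:
    """Toy stand-in: strips a fake wiki-markup wrapper from a document."""
    def parse(self, raw: str) -> str:
        return raw.replace("'''", "")

class PlainTextParser:
    """Toy stand-in: passes text through with whitespace trimmed."""
    def parse(self, raw: str) -> str:
        return raw.strip()

# Parsers are looked up by the name written in the build configuration,
# so a new format means adding one entry here -- the rest of the
# pipeline never changes.
PARSERS = {
    "WikipediaParser": WikipediaParser,
    "PlainTextParser": PlainTextParser,
}

def parse_input(config_entry: dict, raw: str) -> str:
    parser = PARSERS[config_entry["parser"]]()
    return parser.parse(raw)

print(parse_input({"parser": "WikipediaParser"}, "'''Anarchism''' is..."))
# Anarchism is...
```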
- nltk
- colorama
- pyyaml>=5.3.1
- tqdm>=4.46.0
- tokenizers==0.8.1
- mwparserfromhell>=0.5.4
- kss==1.3.1
langumo can be installed using pip as follows:
$ pip install langumo
You can install langumo from source by cloning the repository and running:
$ git clone https://github.com/affjljoo3581/langumo.git
$ cd langumo
$ python setup.py install
Let's build a Wikipedia dataset. First, install langumo in your virtual environment.
$ pip install langumo
After installing langumo, create a workspace to use for the build.
$ mkdir workspace
$ cd workspace
Before creating the dataset, we need a Wikipedia dump file (the source of the dataset). You can get various versions of Wikipedia dump files from here. In this tutorial, we will use part of a Wikipedia dump file. Download the file with your browser and move it to workspace/src. Or use wget to fetch the file from the terminal:
$ wget -P src https://dumps.wikimedia.org/enwiki/20200901/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
langumo needs a build configuration file which contains the details of the dataset. Create a build.yml file in workspace and write the following:
langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser
  build:
    parsing:
      num-workers: 8 # The number of CPU cores you have.
    tokenization:
      vocab-size: 32000 # The vocabulary size.
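As a quick sanity check, the configuration can be loaded with PyYAML (already one of langumo's dependencies) to confirm the nesting is what you intended. This is just an illustrative check, not part of langumo itself:

```python
import yaml  # PyYAML, listed among langumo's dependencies

config_text = """
langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser
  build:
    parsing:
      num-workers: 8
    tokenization:
      vocab-size: 32000
"""

config = yaml.safe_load(config_text)["langumo"]
print(config["inputs"][0]["parser"])  # langumo.parsers.WikipediaParser
print(config["build"]["tokenization"]["vocab-size"])  # 32000
```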
Now we are ready to create our first dataset. Run langumo!
$ langumo
Then you will see output like the following:
[*] import file from src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
[*] parse raw-formatted corpus file with WikipediaParser
[*] merge 1 files into one
[*] shuffle raw corpus file: 100%|██████████████████████████████| 118042/118042 [00:01<00:00, 96965.15it/s]
[00:00:10] Reading files (256 Mo) ███████████████████████████████████ 100
[00:00:00] Tokenize words ███████████████████████████████████ 418863 / 418863
[00:00:01] Count pairs ███████████████████████████████████ 418863 / 418863
[00:00:02] Compute merges ███████████████████████████████████ 28942 / 28942
[*] export the processed file to build/vocab.txt
[*] tokenize sentences with WordPiece model: 100%|███████████████| 236084/236084 [00:23<00:00, 9846.67it/s]
[*] split validation corpus - 23609 of 236084 lines
[*] export the processed file to build/corpus.train.txt
[*] export the processed file to build/corpus.eval.txt
After building the dataset, workspace will contain the following files:
workspace
├── build
│ ├── corpus.eval.txt
│ ├── corpus.train.txt
│ └── vocab.txt
├── src
│ └── enwiki-20200901-pages-articles1.xml-p1p30303.bz2
└── build.yml
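The build log above reports holding out 23609 of 236084 lines (about 10%) as a validation corpus. A minimal stdlib-only sketch of such a shuffle-and-split step (illustrative, not langumo's internal code):

```python
import random

def train_eval_split(lines, eval_ratio=0.1, seed=42):
    """Shuffle the corpus lines and hold out a fraction for evaluation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_ratio)
    return shuffled[n_eval:], shuffled[:n_eval]

corpus = [f"sentence {i}" for i in range(1000)]
train, eval_ = train_eval_split(corpus)
print(len(train), len(eval_))  # 900 100
```

The two returned lists correspond to what langumo writes out as corpus.train.txt and corpus.eval.txt.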
usage: langumo [-h] [config]
The unified corpus building environment for Language Models.
positional arguments:
config langumo build configuration
optional arguments:
-h, --help show this help message and exit
You can find the langumo documentation on the website.
langumo is Apache-2.0 licensed.