ju-resplande / lista_pln

Lista PLN

Install requirements

    pip install wikiextractor gensim
    pip install -r portuguese_word_embeddings/requirements.txt

Steps

1. Download data

    mkdir data
    wget https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2 --directory-prefix ./data
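
The dump is large, so it can be worth confirming that the download is a readable bz2 archive before starting the preprocessing step. The following is a minimal sketch using only the Python standard library; the helper name `peek_dump` is hypothetical and not part of this repository.

```python
import bz2
from itertools import islice


def peek_dump(path, n_lines=3):
    """Return the first n_lines text lines of a bz2-compressed file.

    Hypothetical helper: only the start of the archive is decompressed,
    so this is a quick sanity check, not a full integrity test.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]
```

For example, `peek_dump("data/ptwiki-latest-pages-articles.xml.bz2")` should return the opening XML lines of the dump if the download completed; a truncated or corrupt file will typically raise an `OSError` or `EOFError` instead.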

2. Preprocess data

    cd data
    wikiextractor ptwiki-latest-pages-articles.xml.bz2

    cd ../1-billion-word-language-modeling-benchmark
    bash scripts/get_data.sh # modified script

    cd ../portuguese_word_embeddings
    python preprocessing.py ../data/text.tokenized/wikipedia.pt.shuffled.sorted.tokenized ../data/wikipedia.pt.nilc
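
Before training, a quick look at the size of the preprocessed corpus can catch an empty or truncated file. The sketch below assumes the output file (`../data/wikipedia.pt.nilc` in the command above) is plain UTF-8 text with one sentence per line and whitespace-separated tokens; that layout is an assumption, and `corpus_stats` is a hypothetical helper, not part of this repository.

```python
from collections import Counter


def corpus_stats(path):
    """Return (sentences, tokens, distinct tokens) for a tokenized text file.

    Assumes one sentence per line and whitespace-separated tokens;
    blank lines are skipped.
    """
    n_sentences = 0
    vocab = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if not tokens:
                continue
            n_sentences += 1
            vocab.update(tokens)
    return n_sentences, sum(vocab.values()), len(vocab)
```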

3. Training

    bash glove.sh
    bash wang2vec.sh
    python fasttext.py
    python word2vec.py
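
Once a model has been trained, its vectors can be inspected without any extra dependencies. This sketch assumes the vectors were saved in the common plain-text format (one `word v1 v2 ...` entry per line, optionally preceded by a `<vocab_size> <dimensions>` header); where each training script writes its output, and in which format, is not stated here. `load_text_vectors` and `cosine` are hypothetical helpers.

```python
import math


def load_text_vectors(path, limit=None):
    """Read word vectors from a plain-text file into a dict of float lists."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            parts = line.split()
            if not parts:
                continue
            # Skip an optional "<vocab_size> <dimensions>" header line.
            if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            vectors[parts[0]] = [float(x) for x in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Passing `limit` keeps only the first entries of the file, which is usually enough for a spot check because these formats tend to list frequent words first. For real use, `gensim`, installed above, provides a loader for the same format.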

4. Evaluation

    bash evaluate.sh glove
    bash evaluate.sh wang2vec
    bash evaluate.sh fasttext
    bash evaluate.sh word2vec
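
What `evaluate.sh` measures is not described here. As a rough illustration of one common intrinsic evaluation for word embeddings, the sketch below scores analogy questions ("a is to b as c is to d") with the vector-offset method: it picks the word whose vector is closest, by cosine similarity, to `b - a + c`. The function names and the in-memory `vectors` dict are assumptions for illustration, not this repository's evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word nearest to b - a + c, excluding a, b and c themselves."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped.
    """
    answerable = [q for q in questions if all(w in vectors for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d
                  for a, b, c, d in answerable)
    return correct / len(answerable)
```

Skipping out-of-vocabulary questions is a design choice: it keeps the score comparable across models with different vocabularies, but it can flatter a model with a small vocabulary, so the number of answerable questions is worth reporting alongside the accuracy.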
