This repo is a modification of the prop2vec code with some additional features on top of it:
- fastText to get word vectors
- preprocessing step to modify the .conllu file to fit it to the fastText model
- added support for WLTMN input
- W: wordform
- L: lemma
- T: morphological tags
- M: morphemes
- N: n-grams
- morphemes are extracted with word2morph, which is slower because it runs a neural network to extract morphemes from a given lemma
- n-grams are added during the preprocessing step (instead of using fastText's default n-grams)
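As a rough illustration of the N property, the sketch below extracts character n-grams in the conventional fastText manner: the word is wrapped in `<` and `>` boundary markers and windows of length `minn` to `maxn` are slid over it. The helper name `char_ngrams` is hypothetical and not taken from this repo; the actual preprocessing code may use different boundaries or ordering. The defaults mirror the `--minn 3 --maxn 6` values used in the training command.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return character n-grams of `word` with fastText-style boundary markers.

    Hypothetical helper for illustration only; not the repo's implementation.
    """
    wrapped = f"<{word}>"  # mark the start and end of the word
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


if __name__ == "__main__":
    print(char_ngrams("дом", minn=3, maxn=4))
```

Generating n-grams at preprocessing time makes them an explicit property of each token, so they can be switched on or off alongside the other properties.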
- To download sample data in .conllu format:
wget https://raw.githubusercontent.com/UniversalDependencies/UD_Armenian-ArmTDP/master/hy_armtdp-ud-train.conllu -P datasets
wget https://raw.githubusercontent.com/UniversalDependencies/UD_Russian-SynTagRus/master/ru_syntagrus-ud-train.conllu -P datasets
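For orientation, a CoNLL-U file stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `#` comment lines and blank lines between sentences. The sketch below shows how the wordform, lemma, and morphological tags could be read from such a file. The function `read_conllu_tokens` and the shape of its output are assumptions made for illustration, not the repo's preprocessing code.

```python
def read_conllu_tokens(lines):
    """Yield a dict with wordform, lemma, POS and features for each regular token.

    Hypothetical reader for illustration; not the repo's implementation.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed lines
        token_id, form, lemma, upos, _xpos, feats = cols[:6]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        yield {
            "wordform": form,
            "lemma": lemma,
            "pos": upos,
            "feats": [] if feats == "_" else feats.split("|"),
        }


if __name__ == "__main__":
    sample = [
        "# text = Мама мыла раму.",
        "1\tМама\tмама\tNOUN\t_\tCase=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_",
        "2\tмыла\tмыть\tVERB\t_\tTense=Past|VerbForm=Fin\t0\troot\t_\t_",
        "",
    ]
    for token in read_conllu_tokens(sample):
        print(token)
```

To run it on the downloaded data, pass an open file handle instead of the sample list, for example `open("datasets/ru_syntagrus-ud-train.conllu", encoding="utf-8")`.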
- To download the preprocessed wiki corpus:
wget https://github.com/MartinXPN/morph2vec/releases/download/v0.2.0/ru-wiki-text.zip -P datasets
unzip datasets/ru-wiki-text.zip -d datasets
- The evaluation data can be obtained from the russe-evaluation corpus
- To prepare the data for training a fastText model:
# [Optional] Preprocess the evaluation data
PYTHONHASHSEED=0 python -m morph2vec.data.preprocess preprocess_eval --input_path datasets/eval-train.txt --output_path datasets/eval-train-processed.txt --locale ru
PYTHONHASHSEED=0 python -m morph2vec.data.preprocess preprocess_eval --input_path datasets/eval-test.txt --output_path datasets/eval-test-processed.txt --locale ru
# Preprocess the conllu corpus
PYTHONHASHSEED=0 python -m morph2vec.data.preprocess preprocess_conllu --input_path datasets/ru_syntagrus-ud-train.conllu --output_path datasets/ru_processed_wltmn.txt --locale ru
# Preprocess the wiki corpus
PYTHONHASHSEED=0 python -m morph2vec.data.preprocess preprocess_wiki datasets/ru-wiki-text.txt --output_path datasets/ru-wiki.wltmn --locale ru
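The commands in this README are all prefixed with `PYTHONHASHSEED=0`. Python randomizes string hashes per process by default, so fixing the seed presumably keeps any logic that depends on the built-in `hash()` reproducible between runs; that motivation is an assumption, as the source does not state it. The sketch below only demonstrates the interpreter behaviour, using a hypothetical helper `string_hash_in_fresh_process`.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_process(text, seed):
    """Start a new interpreter with PYTHONHASHSEED=seed and return hash(text)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    first = string_hash_in_fresh_process("morph2vec", seed=0)
    second = string_hash_in_fresh_process("morph2vec", seed=0)
    # With a fixed seed, separate processes agree on the hash value.
    print(first == second)
```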
- To train a fastText model (training on half of the Russian wiki takes ~4 hours on a 4-core CPU):
PYTHONHASHSEED=0 python -m morph2vec.train \
train_unsupervised --input datasets/ru-wiki.wltmn --model skipgram --props w+l+t+m --lr 0.05 --dim 300 --ws 5 --epoch 5 --minCount 5 --minCountLabel 0 --minn 3 --maxn 6 --neg 5 --wordNgrams 1 --loss ns --bucket 2000000 --thread 15 --lrUpdateRate 100 --t 1e-4 --label __label__ --verbose 2 --pretrainedVectors "" \
save_model --path logs/ru-wltm.bin
- To do a hyperparameter search:
PYTHONHASHSEED=0 python -m morph2vec.hyperparametersearch \
--eval_train_path datasets/eval-train-processed.txt --eval_test_path datasets/eval-test-processed.txt \
search_hyperparameters --nb_trials 500 --input_path datasets/ru_processed_wltmn.txt --props "w+l+t+m+n"
- To evaluate the model:
# Evaluate and save results:
PYTHONHASHSEED=0 python -m morph2vec.evaluation.fasttexteval evaluate --model_path logs/ru.bin --data_path datasets/hj-expanded-rare-all-processed.txt --save_vectors_path logs/hj-pred.txt
# Bootstrapping to get confidence intervals, mean and std of Spearman's correlation
PYTHONHASHSEED=0 python -m morph2vec.evaluation.fasttexteval bootstrap --gold_path datasets/hj-expanded-rare-all-processed.txt --predicted_path logs/hj-pred.txt --bootstrap_count 10000 --confidence_percent 0.95
# Or provide a model instead of the saved word-vector file
PYTHONHASHSEED=0 python -m morph2vec.evaluation.fasttexteval bootstrap --gold_path datasets/hj-expanded-rare-all-processed.txt --model_path logs/ru.bin --bootstrap_count 10000 --confidence_percent 0.95
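To make the bootstrap step more concrete, here is a minimal standard-library sketch of estimating the mean, standard deviation, and a percentile confidence interval of Spearman's correlation by resampling (gold, predicted) pairs with replacement. All function names are hypothetical, and the percentile method is an assumption; the repo's `bootstrap` command may compute its interval differently.

```python
import math
import random
import statistics


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_spearman(gold, predicted, bootstrap_count=1000, confidence=0.95, seed=0):
    """Resample pairs with replacement and summarise the resulting correlations."""
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(bootstrap_count):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([gold[i] for i in idx], [predicted[i] for i in idx])
        if not math.isnan(rho):  # drop degenerate resamples with no variance
            scores.append(rho)
    scores.sort()
    low = scores[int((1 - confidence) / 2 * (len(scores) - 1))]
    high = scores[int((1 + confidence) / 2 * (len(scores) - 1))]
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.stdev(scores),
        "low": low,
        "high": high,
    }


if __name__ == "__main__":
    gold = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.6, 0.75]
    predicted = [0.2, 0.3, 0.5, 0.7, 0.95, 0.1, 0.4, 0.8]
    print(bootstrap_spearman(gold, predicted, bootstrap_count=1000))
```

Fixing `seed` keeps the interval reproducible, in the same spirit as the `PYTHONHASHSEED=0` prefix on the commands.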