ml5js / training-word2vec

How to train your own word2vec model for use with ml5.js

Training

Python Environment

Requirements

pip install -r requirements.txt

Train the model

  1. Clone this repository or download the Python script.
git clone https://github.com/ml5js/training-word2vec/
  2. The script can train on a single text file or on a directory of files. Create a text file, or a folder containing several text files, then run train.py with the name of that file or folder.

Example:

python train.py file.txt
python train.py files/
  3. The script will output a vectors.txt and a vectors.json file. To specify an output file name, use the additional -o argument.
python train.py data.txt -o output.json
  4. The output JSON file can now be used with the ml5.js word2vec examples.
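
Before loading the output in ml5.js, it can help to sanity-check it from Python. The sketch below assumes the JSON file has a top-level "vectors" object that maps each word to a list of numbers. That layout is an assumption, not something this README states, so adjust the key if your file differs. The helper names load_vectors, cosine_similarity, and nearest are illustrative and are not part of train.py.

```python
import json
import math


def load_vectors(path):
    """Load word vectors from a JSON file.

    Assumes (not confirmed by this README) a layout like
    {"vectors": {"word": [0.1, 0.2, ...], ...}}.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["vectors"]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(vectors, word, n=5):
    """Return the n (word, score) pairs most similar to `word`, excluding itself."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

For example, nearest(load_vectors("vectors.json"), "cat") would list the words whose vectors point in the most similar direction to "cat", provided "cat" occurs in your training text. If the neighbours look random, the corpus is probably too small.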

Advanced tokenization

The default tokenizer is very basic. You can ask the script to use NLTK's tokenizer with the --tokenizer argument (written as -t in the command below).

Additionally, the script can remove stop words with the --remove-stop-words flag.

python train.py files/ -t nltk --remove-stop-words
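
To make the two options concrete, here is a rough sketch of what a basic tokenizer and stop-word removal do. It is not the code train.py runs: the regular expression, the tiny STOP_WORDS set, and the function names are all illustrative assumptions. NLTK's tokenizer and stop-word lists are considerably more thorough.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch);
# real lists such as NLTK's are much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def basic_tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


tokens = basic_tokenize("The cat sat on the mat, and the dog barked.")
print(tokens)
print(remove_stop_words(tokens))
```

Removing stop words shrinks the vocabulary and can keep very frequent function words from dominating the contexts each word is trained on.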

License: MIT License

