fastText4j was originally a submodule of Mynlp and is now an independent open-source project. (mynlp is a high-performance, modular, extensible Chinese NLP toolkit.)

A Java implementation of Facebook's fastText. fastText is a library for text representation and classification from facebookresearch. It implements text classification and word embedding learning.



  • Implemented in Java (Kotlin)
  • Well-designed API
  • Compatible with the original C++ model files (including quantized, compressed models)
  • Provides a training API (with almost the same performance)
  • Supports a Java-specific model file format that can be read via mmap, so large model files load with less memory
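To illustrate the mmap point above, here is a minimal sketch of memory-mapped reading with the standard java.nio API. The class and method names (MmapFloatDemo, writeFloats, mapFloats) are hypothetical and the raw big-endian float layout is an assumption for the demo; this is not fastText4j's actual file format or loader.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: a hypothetical class, not part of fastText4j.
final class MmapFloatDemo {

    // Writes the floats to a file as raw big-endian 32-bit values.
    static void writeFloats(Path file, float[] values) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES);
        buf.asFloatBuffer().put(values);
        Files.write(file, buf.array());
    }

    // Maps the file read-only and returns a float view over it.
    // The OS pages data in on demand instead of copying the whole file to the heap.
    static FloatBuffer mapFloats(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asFloatBuffer();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("vectors", ".bin");
        file.toFile().deleteOnExit();
        writeFloats(file, new float[] {0.5f, -1.25f, 3.0f});
        FloatBuffer floats = mapFloats(file);
        System.out.println(floats.get(1)); // prints -1.25
    }
}
```

Because the mapped buffer lives outside the Java heap, a large embedding matrix read this way does not need a heap of the same size.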



compile 'com.mayabot:fastText4j:1.2.2'




1. Train model


  • ModelName.sup supervised
  • skipgram
  • ModelName.cow cbow
// Word representation learning (ModelName.cow selects cbow)
FastText fastText = FastText.train(new File(""), ModelName.cow);

// Text classification

FastText fastText = FastText.train(new File(""), ModelName.sup);

data.txt must be UTF-8 encoded, with one sample per line, and the text must be word-segmented beforehand. Each line contains a string starting with __label__ that represents the classification target, such as __label__正面 ("positive"). A sample can have multiple labels. The label prefix can be customized through the 'label' attribute of TrainArgs.
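The line format described above can be sketched with a small parser. TrainingLine and parse are hypothetical names used only for illustration; they are not part of the fastText4j API, and stripping the prefix from the returned labels is a choice made here for readability.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of the fastText4j API.
final class TrainingLine {
    final List<String> labels = new ArrayList<>();
    final List<String> tokens = new ArrayList<>();

    // Splits a pre-segmented line on whitespace. Items starting with the
    // label prefix become labels (prefix stripped); the rest are tokens.
    static TrainingLine parse(String line, String labelPrefix) {
        TrainingLine result = new TrainingLine();
        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // blank input line
            }
            if (item.startsWith(labelPrefix)) {
                result.labels.add(item.substring(labelPrefix.length()));
            } else {
                result.tokens.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TrainingLine s = parse(
                "__label__sports __label__news the team won the final", "__label__");
        System.out.println(s.labels); // prints [sports, news]
        System.out.println(s.tokens); // prints [the, team, won, the, final]
    }
}
```

Running a check like this over data.txt before training is a cheap way to catch lines with no label or with an unexpected prefix.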

2. Save model

Save the model in the Java format.


3. Load model

public FastText loadModel(String modelPath, boolean mmap)

// load from the Java format
FastText fastText = FastText.loadModel("path/data.model", true);

// load from the C++ format
FastText fastText = FastText.loadFasttextBinModel("path/wiki.bin");

4. Quantizer compression

FastText quantize(FastText fastText, int dsub=2, boolean qnorm=false)

// quantize an already loaded model
FastText quantizerFastText = FastText.quantize(fastText, 2, false);


5. Predict

// predict the top 5 labels for a tokenized text
List<FloatStringPair> predict = fastText.predict(Arrays.asList("fastText在预测标签时使用了非线性激活函数".split(" ")), 5);

6. Nearest neighbor search

List<FloatStringPair> predict = fastText.nearestNeighbor("**",5);


7. Analogies

Given three words A, B and C, return the words nearest to (A - B + C) in semantic distance, together with their similarity scores.

List<FloatStringPair> predict = fastText.analogies("国王","皇后","男",5);
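The (A - B + C) query can be sketched on toy vectors as follows. AnalogySketch and the three-dimensional vectors are made up for illustration, and the sketch ranks by plain cosine similarity without normalizing the inputs first, so it is a simplification and not necessarily what fastText4j does internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical class operating on toy vectors.
final class AnalogySketch {

    // Cosine similarity; returns 0 when either vector has zero length.
    static double cosine(float[] x, float[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return (nx == 0 || ny == 0) ? 0 : dot / Math.sqrt(nx * ny);
    }

    // Returns up to k words closest to (a - b + c), excluding the three inputs.
    static List<String> analogies(Map<String, float[]> vectors,
                                  String a, String b, String c, int k) {
        float[] va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
        float[] query = new float[va.length];
        for (int i = 0; i < query.length; i++) {
            query[i] = va[i] - vb[i] + vc[i];
        }
        List<String> candidates = new ArrayList<>();
        for (String w : vectors.keySet()) {
            if (!w.equals(a) && !w.equals(b) && !w.equals(c)) {
                candidates.add(w);
            }
        }
        // Sort by descending similarity to the query vector.
        candidates.sort(Comparator.comparingDouble(
                (String w) -> -cosine(query, vectors.get(w))));
        return candidates.subList(0, Math.min(k, candidates.size()));
    }

    public static void main(String[] args) {
        Map<String, float[]> v = new LinkedHashMap<>();
        v.put("berlin", new float[] {1f, 1f, 0f});
        v.put("germany", new float[] {0f, 1f, 0f});
        v.put("france", new float[] {0f, 0f, 1f});
        v.put("paris", new float[] {1f, 0f, 1f});
        v.put("river", new float[] {0.2f, 0.1f, 0.1f});
        System.out.println(analogies(v, "berlin", "germany", "france", 1)); // prints [paris]
    }
}
```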

AG News example

Train and predict on the AG News data set with fastText4j:


   Read 5M words
   Number of words:  95812
   Number of labels: 4
   Progress: 100.00% words/sec/thread:  5792774 lr: 0.00000 loss: 0.28018 ETA: 0h 0m 0s
   Train use time 5275 ms
   rate 0.9064473684210527

Parameters of TrainArgs

The parameters are consistent with the C++ version:

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]
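
As an illustration of what -minn and -maxn control, the sketch below extracts character n-grams the way the original fastText describes them: the word is wrapped in "<" and ">" boundary markers and every n-gram with a length between minn and maxn is collected. CharNgramSketch is a hypothetical name, and whether fastText4j matches this behaviour exactly is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of the fastText4j API.
final class CharNgramSketch {

    // Returns the character n-grams of "<word>" with lengths in [minn, maxn].
    // With the defaults minn = 0 and maxn = 0 the result is empty.
    static List<String> charNgrams(String word, int minn, int maxn) {
        // Work on code points so that non-ASCII characters count as one unit.
        int[] cps = ("<" + word + ">").codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int start = 0; start < cps.length; start++) {
            for (int n = Math.max(1, minn); n <= maxn && start + n <= cps.length; n++) {
                // A lone boundary marker carries no information, so skip it.
                if (n == 1 && (start == 0 || start == cps.length - 1)) {
                    continue;
                }
                out.add(new String(cps, start, n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(charNgrams("where", 3, 3)); // prints [<wh, whe, her, ere, re>]
    }
}
```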

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
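
To give a feel for -dsub and -qnorm, the following back-of-the-envelope sketch estimates the storage per vector under product quantization. It assumes 256 centroids per sub-quantizer (one byte per sub-vector) and one extra byte per vector when the norm is quantized separately; it ignores the codebooks themselves. QuantizedSizeSketch is a hypothetical name, and the numbers are estimates under those assumptions, not measurements of fastText4j.

```java
// Illustrative only: a hypothetical helper for rough size estimates.
final class QuantizedSizeSketch {

    // Bytes per vector after product quantization.
    // Assumption: each sub-vector of length dsub is replaced by a 1-byte code.
    static int bytesPerVector(int dim, int dsub, boolean qnorm) {
        int subVectors = (dim + dsub - 1) / dsub; // ceiling division
        return subVectors + (qnorm ? 1 : 0);      // optional 1-byte norm code
    }

    // Bytes per vector when stored as 32-bit floats.
    static int rawBytesPerVector(int dim) {
        return dim * Float.BYTES;
    }

    public static void main(String[] args) {
        int dim = 100;
        int dsub = 2;
        System.out.println(rawBytesPerVector(dim));           // prints 400
        System.out.println(bytesPerVector(dim, dsub, false)); // prints 50
    }
}
```

Under these assumptions, the default dim of 100 with dsub of 2 shrinks each vector from 400 bytes to about 50; a larger dsub compresses further at the cost of accuracy.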


Official pre-trained models

Recent state-of-the-art English word vectors.
Word vectors for 157 languages trained on Wikipedia and Crawl.
Models for language identification and various supervised tasks.


fastText is BSD-licensed. Facebook provides an additional patent grant.


