vinhkhuc / JFastText

Java interface for fastText

Why does JFastText only load models trained with JFastText?

ali3assi opened this issue · comments

commented

Hello,

How can I read a pretrained model? I tried to load the pre-existing .vec and .bin files, but loadModel raises an exception. It looks like the format is incompatible and JFastText only accepts models trained with JFastText.

You can upgrade the fastText within the cpp folder to the released version. Then run mvn clean install. The compiled jar package with dependency will be compatible with newer pre-trained models.

@lidalei I got the errors below when upgrading fastText. Could you take a look when convenient? Thanks.

In file included from /Users/xichen/Desktop/NLP project/JFastText/target/classes/com/github/jfasttext/jniFastTextWrapper.cpp:102:
In file included from /Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper_javacpp.h:13:
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper.cc:83:18: warning:
'getVector' is deprecated: getVector is being deprecated and replaced by
getWordVector. [-Wdeprecated-declarations]
fastText.getVector(vec, word);
^
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fastText/src/fasttext.h:63:3: note:
'getVector' has been explicitly marked deprecated here
FASTTEXT_DEPRECATED(
^
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fastText/src/utils.h:15:50: note:
expanded from macro 'FASTTEXT_DEPRECATED'

define FASTTEXT_DEPRECATED(msg) attribute((deprecated(msg)))

                                             ^

In file included from /Users/xichen/Desktop/NLP project/JFastText/target/classes/com/github/jfasttext/jniFastTextWrapper.cpp:102:
In file included from /Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper_javacpp.h:13:
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper.cc:84:38: error:
'data_' is a protected member of 'fasttext::Vector'
return std::vector<real>(vec.data_, vec.data_ + vec.m_);
^
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fastText/src/vector.h:26:23: note:
declared protected here
std::vector<real> data_;
^
In file included from /Users/xichen/Desktop/NLP project/JFastText/target/classes/com/github/jfasttext/jniFastTextWrapper.cpp:102:
In file included from /Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper_javacpp.h:13:
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper.cc:84:49: error:
'data_' is a protected member of 'fasttext::Vector'
return std::vector<real>(vec.data_, vec.data_ + vec.m_);
^
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fastText/src/vector.h:26:23: note:
declared protected here
std::vector<real> data_;
^
In file included from /Users/xichen/Desktop/NLP project/JFastText/target/classes/com/github/jfasttext/jniFastTextWrapper.cpp:102:
In file included from /Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper_javacpp.h:13:
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper.cc:84:61: error:
no member named 'm_' in 'fasttext::Vector'
return std::vector<real>(vec.data_, vec.data_ + vec.m_);

'getVector' is deprecated and is being replaced by getWordVector. Besides, the Vector class was rewritten: you can no longer access the data_ or m_ members of a vector. Instead, you have to use vector.data() and vector.size(). I'd suggest having a look at my fork: https://github.com/lidalei/JFastText

@lidalei
Thanks.
I just used the code in your fork but got the following error in loadModel. Could you have a look?

Exception in thread "main" java.lang.UnsatisfiedLinkError: com.github.jfasttext.FastTextWrapper$FastTextApi.checkModel(Ljava/lang/String;)Z
at com.github.jfasttext.FastTextWrapper$FastTextApi.checkModel(Native Method)
at com.github.jfasttext.JFastText.loadModel(JFastText.java:29)
at com.github.jfasttext.JFastText.main(JFastText.java:203)

Could you release your code?

commented

Hello @lidalei, I just installed your code: https://github.com/lidalei/JFastText

I took the two jars generated in JFastText/target/ and added them to the build path in Eclipse.

In my TestDriver class I declared:

import com.github.jfasttext.JFastText;

public class TestDriver {

	public static void main(String[]args){
		JFastText jft = new JFastText();
		
		
	}
}

When I run the code, I get the following exception:

Exception in thread "main" java.lang.UnsatisfiedLinkError: no jniFastTextWrapper in java.library.path
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
	at java.lang.Runtime.loadLibrary0(Runtime.java:870)
	at java.lang.System.loadLibrary(System.java:1122)
	at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1191)
	at org.bytedeco.javacpp.Loader.load(Loader.java:953)
	at org.bytedeco.javacpp.Loader.load(Loader.java:854)
	at com.github.jfasttext.FastTextWrapper.<clinit>(FastTextWrapper.java:11)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.bytedeco.javacpp.Loader.load(Loader.java:913)
	at org.bytedeco.javacpp.Loader.load(Loader.java:854)
	at com.github.jfasttext.FastTextWrapper$FastTextApi.<clinit>(FastTextWrapper.java:442)
	at com.github.jfasttext.JFastText.<init>(JFastText.java:23)
	at TestDriver.main(TestDriver.java:6)

Any idea how to solve this issue please?

You only need jfasttext-0.1.0-jar-with-dependencies.jar, which is generated by running mvn clean install.

By the way, you should clone fastText into the subfolder 'src/main/cpp/fastText' in order to compile the native library. @TamouzeAssi @xikunlun001 https://github.com/lidalei/fastText

commented

Unfortunately, it does not work on Windows.

commented

The problem still exists. We are trying to load a pre-trained model. When we read it with jft.loadModel(path/to/pretrained_model),
we get the following exception:

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Model file's format is not compatible with this JFastText version!

Note that we took the new fork of JFastText, deleted the files in src/cpp/fasttext, cloned fastText there again, and then ran mvn clean install.

Any idea how to solve this problem?

@TamouzeAssi "Model file's format is not compatible with this JFastText version" means you should train your model with the corresponding fastText or JFastText. Don't use pip to install fasttext; that package is not official. Follow this to install the Python binding: https://github.com/facebookresearch/fastText/tree/master/python.

commented

@lidalei First, thank you for your help. I will try to clone fastText from the link you mentioned into the cpp subfolder of JFastText. Please correct me if I'm wrong.

I want to use a model pretrained by the fastText library, such as wiki.en. This model was trained with fastText, not with JFastText. I don't want to train again, for several reasons.
Thank you.

You can download word embeddings from https://fasttext.cc/docs/en/pretrained-vectors.html. I haven't tried them but believe they work. JFastText relies on fastText. If JFastText complains, it means the model was trained with a fastText version that is not compatible with the fastText that JFastText is using.
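One way to narrow down a format complaint is to look at the first bytes of the .bin file. The sketch below reads two little-endian int32 values and compares the first against the magic number that fastText's C++ source appears to write at the start of its binary models. The constant 793712314, the little-endian byte order, and the class and method names are assumptions for illustration, not part of the JFastText API; check the constant against the fastText sources you actually compiled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

final class ModelHeaderCheck {
    // Assumption: value of FASTTEXT_FILEFORMAT_MAGIC_INT32 in fastText's C++ source.
    static final int ASSUMED_MAGIC = 793712314;

    /** Reads the first two int32 values of the stream as {magic, version}. */
    static int[] readHeader(InputStream in) throws IOException {
        byte[] buf = new byte[8];
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new IOException("File is shorter than 8 bytes");
            }
            off += n;
        }
        // Assumption: the model was written on a little-endian machine.
        ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        return new int[] {bb.getInt(), bb.getInt()};
    }

    /** True when the first int32 equals the assumed fastText magic number. */
    static boolean looksLikeFastTextBinary(int[] header) {
        return header[0] == ASSUMED_MAGIC;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the model file to inspect.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int[] h = readHeader(in);
            System.out.println("magic ok: " + looksLikeFastTextBinary(h)
                    + ", version field: " + h[1]);
        }
    }
}
```

A text .vec file fails the magic comparison, which is a quick way to tell the two kinds of file apart before calling loadModel.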

commented

@lidalei Sorry, but it is still not working. The same exception is raised when I try to load word embeddings from l.

I cloned your fork of JFastText, deleted the folder cpp/fastText, cloned it again from the location you gave, and then ran mvn clean install.
The exception still occurs.

Can you please describe the steps, or try to load a pretrained model using JFastText yourself?

@TamouzeAssi I guess you were trying to load a word embedding. JFastText cannot load that! Try loading a model from https://fasttext.cc/docs/en/language-identification.html.

@TamouzeAssi If that does not work, try using the Python interface of fastText to load your model.

commented

@lidalei:
The model lid.176.bin from https://fasttext.cc/docs/en/language-identification.html can be loaded in JFastText without any error.

But with wiki.en from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, loading fails with the incompatible file format exception.

By the way, do you have a good reference for a model trained on Wikipedia? I'm looking to use the vector embeddings so that I can cover OOV words.

@TamouzeAssi There is no problem with this. Pretrained vectors are just a word embedding, which represents each word as a vector. They do not perform any classification task; a classifier is built on top of the word embedding. For example, you can represent a sentence as the mean of its words' vectors and train a classifier to classify an unknown sentence.
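A minimal sketch of the sentence-as-mean-vector idea, assuming the word vectors have already been loaded into a plain Map. The class name, method name, and the toy two-dimensional vectors are hypothetical; nothing here calls JFastText itself.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SentenceVectors {
    /**
     * Averages the vectors of the tokens found in the embedding table.
     * Tokens without a vector are skipped; if no token is known, the
     * result is the zero vector.
     */
    static float[] meanVector(List<String> tokens, Map<String, float[]> embeddings, int dim) {
        float[] mean = new float[dim];
        int found = 0;
        for (String token : tokens) {
            float[] v = embeddings.get(token);
            if (v == null) {
                continue; // out-of-vocabulary token
            }
            for (int i = 0; i < dim; i++) {
                mean[i] += v[i];
            }
            found++;
        }
        if (found > 0) {
            for (int i = 0; i < dim; i++) {
                mean[i] /= found;
            }
        }
        return mean;
    }

    public static void main(String[] args) {
        // Toy embedding table, for illustration only.
        Map<String, float[]> embeddings = new HashMap<>();
        embeddings.put("good", new float[] {1f, 0f});
        embeddings.put("movie", new float[] {3f, 2f});
        float[] s = meanVector(Arrays.asList("good", "movie", "unknownword"), embeddings, 2);
        System.out.println(Arrays.toString(s));
    }
}
```

The resulting array can then be fed to whatever classifier you train on top of the embeddings.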

I don't have a model for you. It really depends on your task. What do you want to achieve?

commented

@lidalei For the moment I just want to compute the similarity between two sentences that contain some noise (typos). I used word2vec but got bad results because of OOV words, so I switched to fastText in order to use the subword information.
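To illustrate why subword information helps with typos, here is a rough sketch of the two pieces involved: extracting character n-grams from a word wrapped in `<` and `>` boundary markers (the scheme fastText uses for its subword units), and cosine similarity between two vectors. The class and method names are hypothetical, and the n-gram extraction is simplified: it works on Java chars, whereas fastText operates on UTF-8 bytes and hashes the n-grams into buckets.

```java
import java.util.ArrayList;
import java.util.List;

final class SubwordSketch {
    /** Character n-grams of "<word>" for every n in [minN, maxN]. */
    static List<String> charNgrams(String word, int minN, int maxN) {
        String wrapped = "<" + word + ">";
        List<String> ngrams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= wrapped.length(); i++) {
                ngrams.add(wrapped.substring(i, i + n));
            }
        }
        return ngrams;
    }

    /** Cosine similarity; returns 0 when either vector has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // A misspelling still shares most of its trigrams with the correct word.
        List<String> shared = new ArrayList<>(charNgrams("similarity", 3, 3));
        shared.retainAll(charNgrams("similarty", 3, 3));
        System.out.println("shared trigrams: " + shared);
    }
}
```

Because a misspelled word shares many n-grams with its correct form, a model that sums n-gram vectors can still place the two close together, which word2vec cannot do for a word it has never seen.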

commented

@lidalei i was used gensim to get the word2vec model but i developped my algo in java, and gensim can be used with java even we use jython language.

@TamouzeAssi I will add the function to my JFastText repo and tell you as soon as it is done.

@TamouzeAssi That won't be ready soon, though. In the meantime, I'd suggest you check

void FastText::loadVectors(std::string filename) {
  std::ifstream in(filename);
  std::vector<std::string> words;
  std::shared_ptr<Matrix> mat; // temp. matrix for pretrained vectors
  int64_t n, dim;
  if (!in.is_open()) {
    throw std::invalid_argument(filename + " cannot be opened for loading!");
  }
  in >> n >> dim;
  if (dim != args_->dim) {
    throw std::invalid_argument(
        "Dimension of pretrained vectors (" + std::to_string(dim) +
        ") does not match dimension (" + std::to_string(args_->dim) + ")!");
  }
  mat = std::make_shared<Matrix>(n, dim);
  for (size_t i = 0; i < n; i++) {
    std::string word;
    in >> word;
    words.push_back(word);
    dict_->add(word);
    for (size_t j = 0; j < dim; j++) {
      in >> mat->at(i, j);
    }
  }
  in.close();

  dict_->threshold(1, 0);
  input_ = std::make_shared<Matrix>(dict_->nwords()+args_->bucket, args_->dim);
  input_->uniform(1.0 / args_->dim);

  for (size_t i = 0; i < n; i++) {
    int32_t idx = dict_->getId(words[i]);
    if (idx < 0 || idx >= dict_->nwords()) continue;
    for (size_t j = 0; j < dim; j++) {
      input_->at(idx, j) = mat->at(i, j);
    }
  }
}

and write some Java code to read pretrained vectors.
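A minimal Java sketch of such a reader, following the same file layout the C++ above parses: a header line with the number of words and the dimension, then one line per word with its values. The class and method names are hypothetical. Unlike loadVectors, it only builds a word-to-vector lookup table and does not touch any fastText dictionary or bucket matrix.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

final class VecReader {
    /** Parses an "n dim" header followed by up to n lines of "word v1 ... vdim". */
    static Map<String, float[]> read(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        String header = in.readLine();
        if (header == null) {
            throw new IOException("Empty vector file");
        }
        String[] h = header.trim().split("\\s+");
        if (h.length != 2) {
            throw new IOException("Expected 'n dim' header, got: " + header);
        }
        int n = Integer.parseInt(h[0]);
        int dim = Integer.parseInt(h[1]);

        Map<String, float[]> vectors = new LinkedHashMap<>();
        String line;
        int row = 0;
        while (row < n && (line = in.readLine()) != null) {
            row++;
            String[] parts = line.trim().split("\\s+");
            if (parts.length != dim + 1) {
                // row + 1 accounts for the header line
                throw new IOException("Line " + (row + 1) + ": expected a word and "
                        + dim + " values");
            }
            float[] v = new float[dim];
            for (int j = 0; j < dim; j++) {
                v[j] = Float.parseFloat(parts[j + 1]);
            }
            vectors.put(parts[0], v);
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .vec text file.
        try (Reader r = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            Map<String, float[]> vectors = read(r);
            System.out.println("loaded " + vectors.size() + " vectors");
        }
    }
}
```

Keep in mind that a .vec file holds whole-word vectors only, so a lookup table built this way returns nothing for an OOV word; the subword n-gram vectors live in the .bin model.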

commented

@lidalei Thank you, I will try to write similar code. By the way, please let me know when you add the function to your JFastText repo.

val fasttext = new JFastText()
fasttext.loadModel("/home/work/XX/model/model.bin")

java.lang.IllegalArgumentException: Model file doesn't exist!
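This message suggests that the path could not be found from the JVM's point of view, for example because it is relative to a different working directory or lives on another machine. A small sketch for checking the path before handing it to loadModel is shown below; the class and method names are hypothetical and nothing here calls JFastText.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ModelPathCheck {
    /** Returns a short diagnosis of the path, prefixed with "ok: " when usable. */
    static String describe(Path p) {
        Path abs = p.toAbsolutePath().normalize();
        if (!Files.exists(abs)) {
            return "missing: " + abs;
        }
        if (Files.isDirectory(abs)) {
            return "is a directory: " + abs;
        }
        if (!Files.isReadable(abs)) {
            return "not readable: " + abs;
        }
        return "ok: " + abs;
    }

    public static void main(String[] args) {
        // args[0] is the model path you intend to pass to loadModel.
        System.out.println(describe(Paths.get(args[0])));
    }
}
```

Printing the absolute path makes it easier to see which file the process is really looking for.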