oborchers / Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!

Infer only returns embedding of one sentence

rmporsch opened this issue · comments

Given a list of input tuples of the form Tuple[List[str], int], I initially expected infer to return a numpy matrix of size (n, vector_size).
I suspect this is due to the following line:

output = zeros((statistics["max_index"], self.sv.vector_size), dtype=REAL)

Should it be something like this?

output = zeros((statistics["total_sentences"], self.sv.vector_size), dtype=REAL)

Reproducible example from the tutorial:

import gensim.downloader as api
from fse import IndexedList
from fse.models import SIF

data = api.load("quora-duplicate-questions")
glove = api.load("glove-wiki-gigaword-100")

sentences = []
for d in data:
    # Blow up the data a bit by replicating each sentence.
    for i in range(8):
        sentences.append(d["question1"].split())
        sentences.append(d["question2"].split())
s = IndexedList(sentences)

model = SIF(glove, workers=2)
model.train(s)
tmp = ("Hello my friends".split(), 0)
model.infer([tmp, tmp])

Thank you! As soon as I've got some free time I will deal with it!

It's a feature, not a bug :-)

With the code above you create a 2-to-1 mapping. fse supports n-to-m mappings, which sum multiple sentence results onto the same output vector. This behavior depends on the index you pass:
passing two sentences that both carry index 0 results in a 2-to-1 mapping.

2-to-1 mapping -> (1, 100):

tmp = ("Hello my friends".split(), 0)
model.infer([tmp, tmp]).shape

1-to-1 mapping -> (2, 100):

tmp1 = ("Hello my friends".split(), 0)
tmp2 = ("Hello my friends".split(), 1)
model.infer([tmp1, tmp2]).shape
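To illustrate why the output row count follows the maximum index rather than the number of input tuples, here is a minimal numpy sketch of the index-based accumulation described above. The function name accumulate and its exact semantics are hypothetical, for illustration only; they are not part of the fse API.

```python
import numpy as np

def accumulate(sentence_vecs, indices):
    # Hypothetical sketch: each (vector, index) pair is summed onto the
    # output row given by its index, so the output has max(indices) + 1
    # rows, not len(sentence_vecs) rows.
    out = np.zeros((max(indices) + 1, sentence_vecs.shape[1]), dtype=np.float32)
    for vec, idx in zip(sentence_vecs, indices):
        out[idx] += vec
    return out

vecs = np.ones((2, 100), dtype=np.float32)

# 2-to-1 mapping: both sentences share index 0 -> one output row
two_to_one = accumulate(vecs, [0, 0])   # shape (1, 100)

# 1-to-1 mapping: distinct indices -> one row per sentence
one_to_one = accumulate(vecs, [0, 1])   # shape (2, 100)
```

This mirrors the shapes reported above: duplicate indices collapse onto a single row, while distinct indices yield one row per sentence.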