Infer only returns embedding of one sentence
rmporsch opened this issue · comments
Robert Porsch commented
Given a list of input tuples of the form Tuple[List[str], int],
I initially expected infer to return a numpy matrix of size (n, vector_size). Instead, only a single embedding is returned.
I suspect this is due to the following line:
Should it be something like this?
output = zeros((statistics["total_sentences"], self.sv.vector_size), dtype=REAL)
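For context, here is a minimal sketch of the allocation the proposal describes. The names `statistics` and `REAL` follow the snippet above; everything else (the helper function, the sizes) is made up for illustration and is not fse's actual code:

```python
import numpy as np

REAL = np.float32  # assumed to match fse's REAL dtype alias

def allocate_output(total_sentences: int, vector_size: int) -> np.ndarray:
    # One output row per input sentence, as the proposed fix suggests,
    # rather than a single (1, vector_size) row.
    statistics = {"total_sentences": total_sentences}
    return np.zeros((statistics["total_sentences"], vector_size), dtype=REAL)

out = allocate_output(4, 100)
print(out.shape)  # (4, 100)
```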
Reproducible example from the tutorial:
import gensim.downloader as api
from fse import IndexedList
from fse.models import SIF

data = api.load("quora-duplicate-questions")
glove = api.load("glove-wiki-gigaword-100")

sentences = []
for d in data:
    # Let's blow up the data a bit by replicating each sentence.
    for i in range(8):
        sentences.append(d["question1"].split())
        sentences.append(d["question2"].split())

s = IndexedList(sentences)
model = SIF(glove, workers=2)
model.train(s)

tmp = ("Hello my friends".split(), 0)
model.infer([tmp, tmp])
Oliver Borchers commented
Thank you! As soon as I've got some free time I will deal with it!
Oliver Borchers commented
It's a feature, not a bug :-)
With the code above you create a 2-to-1 mapping. fse supports n-to-m mappings, which sum multiple results onto the same output vector. This behavior depends on the index you pass with each sentence.
Passing two sentences that both carry index 0 therefore results in a 2-to-1 mapping.
2-to-1 mapping -> (1, 100):
tmp = ("Hello my friends".split(), 0)
model.infer([tmp, tmp]).shape
1-to-1 mapping -> (2, 100):
tmp1 = ("Hello my friends".split(), 0)
tmp2 = ("Hello my friends".split(), 1)
model.infer([tmp1, tmp2]).shape
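The index-based aggregation can be illustrated with plain numpy. This is a sketch of the mapping semantics only, not fse's actual implementation: each sentence vector is summed into the output row named by its index, so two inputs sharing index 0 fill a single row.

```python
import numpy as np

def aggregate(sentence_vecs: np.ndarray, indices: list) -> np.ndarray:
    """Sum each sentence vector into the output row given by its index
    (illustrative sketch of fse's n-to-m mapping, not its real code)."""
    n_rows = max(indices) + 1
    out = np.zeros((n_rows, sentence_vecs.shape[1]), dtype=np.float32)
    # np.add.at accumulates correctly even when indices repeat.
    np.add.at(out, indices, sentence_vecs)
    return out

vecs = np.ones((2, 100), dtype=np.float32)

# 2-to-1 mapping: both sentences share index 0 -> one output row
print(aggregate(vecs, [0, 0]).shape)  # (1, 100)

# 1-to-1 mapping: distinct indices -> one row per sentence
print(aggregate(vecs, [0, 1]).shape)  # (2, 100)
```

This mirrors the two shapes shown above: repeated indices collapse onto one vector, distinct indices keep one vector per sentence.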