sebischair / Lbl2Vec

Lbl2Vec learns jointly embedded label, document and word vectors to retrieve documents with predefined topics from an unlabeled document corpus.

Home Page: https://wwwmatthes.in.tum.de/pages/naimi84squl1/Lbl2Vec-An-Embedding-based-Approach-for-Unsupervised-Document-Retrieval-on-Predefined-Topics

ValueError: cannot compute similarity with no input

TechyNilesh opened this issue

Hi Team,

I am getting the following error while running model fit:

2022-04-08 14:19:04,344 - Lbl2Vec - INFO - Train document and word embeddings
2022-04-08 14:19:09,992 - Lbl2Vec - INFO - Train label embeddings

ValueError Traceback (most recent call last)
in

~/SageMaker/lbl2vec/lbl2vec.py in fit(self)
248 # get doc keys and similarity scores of documents that are similar to
249 # the description keywords
--> 250 self.labels[['doc_keys', 'doc_similarity_scores']] = self.labels['description_keywords'].apply(lambda row: self._get_similar_documents(
251 self.doc2vec_model, row, num_docs=self.num_docs, similarity_threshold=self.similarity_threshold, min_num_docs=self.min_num_docs))
252

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
4211 else:
4212 values = self.astype(object)._values
-> 4213 mapped = lib.map_infer(values, f, convert=convert_dtype)
4214
4215 if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

~/SageMaker/lbl2vec/lbl2vec.py in (row)
249 # the description keywords
250 self.labels[['doc_keys', 'doc_similarity_scores']] = self.labels['description_keywords'].apply(lambda row: self._get_similar_documents(
--> 251 self.doc2vec_model, row, num_docs=self.num_docs, similarity_threshold=self.similarity_threshold, min_num_docs=self.min_num_docs))
252
253 # validate that documents to calculate label embeddings from are found

~/SageMaker/lbl2vec/lbl2vec.py in _get_similar_documents(self, doc2vec_model, keywords, num_docs, similarity_threshold, min_num_docs)
625 for word in cleaned_keywords_list]
626 similar_docs = doc2vec_model.dv.most_similar(
--> 627 positive=keywordword_vectors, topn=num_docs)
628 except KeyError as error:
629 error.args = (

~/anaconda3/envs/python3/lib/python3.6/site-packages/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, restrict_vocab, indexer)
775 all_keys.add(self.get_index(key))
776 if not mean:
--> 777 raise ValueError("cannot compute similarity with no input")
778 mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
779

ValueError: cannot compute similarity with no input

The keywords 'Crack', 'Broken' and 'Breakage' were not learned by the model, so it does not know them. They were probably all of the keywords for one of your classes, and since none of them is known, no label vector can be computed from them. That causes the error.

There are several possible reasons. The simplest explanation is that you used capitalized keywords while the model only knows lowercase words. In that case, just convert your keywords to lowercase.
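A minimal sketch of the lowercase fix, assuming the keywords are kept as one list of strings per label. The helper name `normalize_keywords` is hypothetical and not part of Lbl2Vec; whether lowercasing is sufficient depends on how the training documents were tokenized.

```python
def normalize_keywords(keywords_per_label):
    """Lowercase and strip every keyword, dropping empty strings.

    keywords_per_label: list of keyword lists, one list per label.
    """
    return [
        [kw.strip().lower() for kw in keywords if kw.strip()]
        for keywords in keywords_per_label
    ]


# Keywords taken from the issue above; they form a single label here.
keywords_per_label = [["Crack", "Broken", "Breakage"]]
normalized = normalize_keywords(keywords_per_label)
print(normalized)
```

The normalized lists can then be used in place of the original keyword lists before the model is fitted.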

Another explanation is that the keywords do not appear in your training corpus, or appear too rarely. In that case, I suggest trying different keywords or adding more training data so that the model can learn those keywords.
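To see which keywords the model actually knows, one option is to compare them against the learned vocabulary. The sketch below uses a plain dictionary as a stand-in vocabulary so that it runs on its own; the helper name `split_known_unknown` and the example counts are made up for illustration. With a gensim 4 Doc2Vec model, the mapping `doc2vec_model.wv.key_to_index` should be usable as the vocabulary instead.

```python
def split_known_unknown(keywords, vocabulary):
    """Split keywords into those present in the vocabulary and those missing."""
    known = [kw for kw in keywords if kw in vocabulary]
    unknown = [kw for kw in keywords if kw not in vocabulary]
    return known, unknown


# Hypothetical vocabulary (word -> corpus count), standing in for the
# vocabulary of a trained model such as doc2vec_model.wv.key_to_index.
vocabulary = {"crack": 12, "damage": 40, "surface": 25}

known, unknown = split_known_unknown(["crack", "broken", "breakage"], vocabulary)
print("known:", known)
print("unknown:", unknown)
```

Keywords that end up in `unknown` are candidates for replacement, or a sign that the corpus needs more documents containing them.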

Is it possible to skip those terms that aren't in the document?

Unknown keywords are already skipped by default when the label vector is computed. But when all keywords of a label are unknown to the model, none are left for computing the label vector. That is probably what caused the error.
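One way to guard against this case is to check every label before fitting and set aside labels that have no known keyword at all. This is only a sketch under assumptions: `filter_labels`, the label names, and the stand-in vocabulary are hypothetical, and Lbl2Vec itself may expect a different input layout.

```python
def filter_labels(label_keywords, vocabulary):
    """Keep only labels that have at least one keyword in the vocabulary.

    Returns (usable, dropped): usable maps each kept label to its known
    keywords, dropped lists the labels with no known keyword.
    """
    usable, dropped = {}, []
    for label, keywords in label_keywords.items():
        known = [kw for kw in keywords if kw in vocabulary]
        if known:
            usable[label] = known
        else:
            dropped.append(label)
    return usable, dropped


# Hypothetical labels and vocabulary for illustration only.
label_keywords = {
    "breakage": ["crack", "broken", "breakage"],
    "corrosion": ["rust", "corroded"],
}
vocabulary = {"crack", "damage", "surface"}

usable, dropped = filter_labels(label_keywords, vocabulary)
print("usable:", usable)
print("dropped:", dropped)
```

Labels in `dropped` would trigger the "no input" error if passed on unchanged, so they need new keywords or more training data first.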