MaartenGr / KeyBERT

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/

Why and how the same model for doc_embeddings and word_embeddings?

Atharvalite opened this issue

BERT-based (or any transformer-based) models output contextualized embeddings, which makes sense for generating document embeddings. But to get word_embeddings, the same model is used, and the input passed to it is just a list of raw candidate words with no surrounding context. How will the word_embeddings hold any semantic meaning in that case?

The BaseEmbedder class does provide the option to add a separate word_embedding model, but the "embed" method has no way to differentiate between a list of documents and a list of words.
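For illustration, here is a minimal sketch of what I understand the pipeline to do (the model name, example text, and similarity step are my assumptions, not KeyBERT's exact internals): the same sentence-transformer encodes both the document and the context-free candidate words, so both sets of vectors live in the same space and can be compared with cosine similarity.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Simplified sketch, not KeyBERT's actual code: embed the document and
# the bare candidate words with the same model.
model = SentenceTransformer("all-MiniLM-L6-v2")

doc = "KeyBERT extracts keywords by comparing document and word embeddings."
candidates = ["keyword extraction", "embeddings", "banana"]

doc_embedding = model.encode([doc])         # shape (1, 384)
word_embeddings = model.encode(candidates)  # shape (3, 384)

# Candidates whose embeddings are closest to the document embedding
# are treated as the most representative keywords.
scores = cosine_similarity(doc_embedding, word_embeddings)[0]
print(sorted(zip(candidates, scores), key=lambda x: -x[1]))
```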

It depends on several things, including the tokenization scheme and the training data, but in general these models are quite capable of creating word embeddings even without contextual information at inference time. As you might notice, especially when combined with MMR (which takes the relationships between words into account to a certain extent), this already produces quite good results.
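As a quick example of the MMR option mentioned above (the example document is made up; `use_mmr` and `diversity` are KeyBERT's documented parameters):

```python
from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

doc = (
    "Supervised learning is the machine learning task of learning a function "
    "that maps an input to an output based on example input-output pairs."
)

# use_mmr re-ranks candidates with Maximal Marginal Relevance:
# diversity trades off relevance to the document against redundancy
# among the selected keywords.
keywords = kw_model.extract_keywords(doc, use_mmr=True, diversity=0.5)
print(keywords)
```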

The BaseEmbedder indeed started out with the additional option to pass a word embedding model, but since both models would need to produce embeddings in the same vector space to be comparable, this turned out to be difficult to implement. You cannot really (or easily) compare the output embeddings of two different embedding models using distance functions. What has been on the list for a while is extracting the token embeddings from sentence-transformers before they are pooled into a sentence embedding, but that again depends on the underlying model.
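For reference, a minimal sketch of pulling the pre-pooling token embeddings out of sentence-transformers via its `output_value` option; mapping the subword tokens back to candidate words is the part that depends on the underlying model and is not shown here.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
doc = "Minimal keyword extraction with BERT."

# output_value="token_embeddings" returns one tensor per input text,
# shaped (num_subword_tokens, hidden_dim), i.e. the embeddings before
# mean pooling collapses them into a single sentence embedding.
token_embeddings = model.encode([doc], output_value="token_embeddings")
print(token_embeddings[0].shape)
```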

Any suggestions for implementations are appreciated!