Calculation and loading embedding model in metrics.
hyades910739 opened this issue
- OCTIS version: 1.11.0
Description
Hi, there are a couple of issues I want to discuss, mainly focusing on the classes `WECoherencePairwise` and `WECoherenceCentroid`, and on metrics that use embeddings.
If these issues sound reasonable, I am happy to create a PR for them. Thanks! :)
1. Calculation:
In the `score` function, the L1 normalization of the word embeddings and of the centroid (`t`) seems unnecessary: since cosine similarity is used later on, the score is identical without the normalization.
(See L122 and L196 in https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/coherence_metrics.py)
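To illustrate why the normalization is a no-op (a minimal sketch with made-up vectors, not OCTIS code): cosine similarity is invariant to positive rescaling of its inputs, and L1 normalization is just such a rescaling.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of L2 norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l1_normalize(v):
    # L1 normalization: divide a vector by the sum of its absolute values.
    return v / np.abs(v).sum()

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Rescaling either input leaves the cosine similarity unchanged,
# so L1-normalizing before the comparison does not affect the score.
assert np.isclose(cosine(a, b), cosine(l1_normalize(a), l1_normalize(b)))
```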
2.1 Default word embedding model:
Take `WECoherenceCentroid` for example: it loads `word2vec-google-news-300` as the default model. Since the file is large (1662 MB), downloading it takes a long time, which may not be desirable as default behavior. Also, some research (see Table 2 in [1]) shows that GloVe (`glove-wiki-gigaword-50`, 65 MB) may perform better than Word2Vec.
So I think it may be a good idea to switch the default model to `glove-wiki-gigaword-50`, or another model with a smaller file size.
([1] Ding, Ran, Ramesh Nallapati, and Bing Xiang. "Coherence-aware neural topic modeling." arXiv preprint arXiv:1809.02687 (2018).)
2.2 Parameter name and function input:
Also, it seems reasonable to be able to calculate WECoherence with different embedding models. Currently you can use `word2vec_path` to pass any model in word2vec format (e.g., `glove-wiki-gigaword-50` provided by gensim), but the parameter name is really confusing. So if we want to support models other than Word2Vec, we should rename `word2vec_path` to something more general.
2.3 KeyedVectors as function input:
Currently, if you want to use several embedding-related metrics (e.g., `WECoherencePairwise` and `WECoherenceCentroid` at the same time), every metric instance creates its own embedding model, which is unnecessary. To avoid this, we can accept a `KeyedVectors` object as input directly.
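A rough sketch of the intended behavior (`Metric` and `fake_load` are hypothetical stand-ins for an OCTIS metric and `gensim.downloader.load`): accepting a preloaded model lets several metrics share one instance, so the expensive load happens once.

```python
import numpy as np

load_calls = 0

def fake_load(name):
    # Hypothetical stand-in for gensim.downloader.load, counting how
    # often the expensive load actually happens.
    global load_calls
    load_calls += 1
    return {"word": np.zeros(3)}

class Metric:
    # Hypothetical metric: reuse a preloaded model when given one,
    # otherwise load by name (the current per-instance behavior).
    def __init__(self, model=None, model_name=None):
        self._wv = model if model is not None else fake_load(model_name)

# Load once, then share the same object across metric instances.
shared = fake_load("glove-wiki-gigaword-50")
m1 = Metric(model=shared)
m2 = Metric(model=shared)
assert load_calls == 1      # only a single load happened
assert m1._wv is m2._wv     # both metrics use the same object
```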
2.4 Support dict-like embedding objects:
When calculating the metrics, we only need embeddings for a few words in the topics, so there is no need to load the whole embedding model. In this case, we can use a dict-like object to store only the embeddings we need.
In OCTIS, the `KeyedVectors` is only used with the `in` operator to check whether an embedding exists, and with `__getitem__` to retrieve it. So we can simply convert a dict-like object to `KeyedVectors`, making the `KeyedVectors`-related functions compatible with embeddings in dictionary format as input. Also, we only need to change code in `__init__`, without changing complex code like `WordEmbeddingsInvertedRBOCentroid.score`.
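To make this concrete (a minimal sketch; `centroid` and the example words are made up, not OCTIS code), a plain dict already supports the two operations the metrics rely on:

```python
import numpy as np

# A plain dict of word -> vector supports the same two operations the
# metrics use on a gensim KeyedVectors: membership (`word in wv`) and
# lookup (`wv[word]`).
embeddings = {
    "topic": np.array([0.1, 0.2, 0.3]),
    "model": np.array([0.4, 0.5, 0.6]),
}

def centroid(words, wv):
    # Average the embeddings of the words present in wv, mirroring how
    # the metrics only ever call `in` and `__getitem__` on the model.
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0)

# "unknown" is skipped by the membership check, just like an OOV word.
c = centroid(["topic", "model", "unknown"], embeddings)
assert np.allclose(c, [0.25, 0.35, 0.45])
```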
Based on 2.1, 2.2, 2.3, and 2.4, we can change `WECoherenceCentroid.__init__` to the following, and all metrics that use embeddings can load the model via a shared function `_load_keyedvectors`.
```python
from typing import Dict, Optional, Union

import gensim.downloader as api
import numpy as np
from gensim.models import KeyedVectors
from numpy import ndarray


class WECoherenceCentroid(AbstractMetric):
    def __init__(self, topk=10,
                 model: Union[Dict[str, ndarray], KeyedVectors, None] = None,
                 model_name: Optional[str] = None):
        '''
        If model is provided, use it directly. If model is None,
        use model_name (or the default model name).
        '''
        if model is None and model_name is None:
            model_name = 'some_default_model_name'
        self._wv = _load_keyedvectors(model, model_name)


def _load_keyedvectors(model, model_name):
    if model is not None:
        if isinstance(model, KeyedVectors):
            pass
        elif isinstance(model, dict):
            model = _convert_dict_to_keyedvectors(model)
        else:
            raise ValueError()
    elif model_name is not None:
        model = api.load(model_name)
    else:
        raise ValueError()
    return model


def _convert_dict_to_keyedvectors(dic):
    # Infer the vector size from the first entry.
    arr = next(iter(dic.values()))
    assert len(arr.shape) == 1
    vector_size = arr.shape[0]
    kv = KeyedVectors(vector_size)
    kv.key_to_index = {k: idx for idx, k in enumerate(dic.keys())}
    kv.vectors = np.stack(list(dic.values()))
    return kv
```
I have already forked the repo and created two branches to fix this:
- https://github.com/hyades910739/OCTIS/tree/issue89_remove_unnecessary_normalization_in_coherence_metrics
  This one removes the unnecessary normalization code.
- https://github.com/hyades910739/OCTIS/tree/issue89_change_embedding_model_loading_logic_in_metrics
  This one creates a mixin to handle embedding model loading.