MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

Error calculating coherence score for BERTopic model trained on Indic language

sanketshinde0707 opened this issue · comments

  • OCTIS version: 1.13.1
  • Python version: 3.10.12
  • Operating System: Google Colab

Description

I am working with BERTopic and trying to evaluate topic models trained on Marathi (an Indic language) using some metrics. I found evaluation code written by MaartenGr (the author of BERTopic), but unfortunately I was not able to install the dependencies of the setup he describes here (https://github.com/MaartenGr/BERTopic_evaluation/tree/main). The author recommends using OCTIS instead, as it provides more metrics. I tried calculating topic diversity and the NPMI coherence score. Topic diversity is calculated fine, but I keep getting an error when calculating the NPMI score.

Here is my code

from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# This is how the sentence array looks
sentence_array = ['तीन दिवस झाले, पण गाडी अजून सापडली नाही. पोलिसांचा कडक तपास सुरु आहे.', 'डाळी भारतीय थाळीमध्ये सामील असलेले मुख्य भोजन आहेत.']

# This is how the topics look
topics_list = [
['ठाकरे', 'एक', 'भारतीय', 'दिवस', 'शिंदे', 'सांगितले', 'दोन', 'माहिती', 'देण्यात', 'जात'],
['भारतीय', 'शिंदे', 'ठाकरे', 'मुख्यमंत्री', 'उद्धव', 'एक', 'पोलीस', 'धावा', 'दोन', 'सरकार'],
['देण्यात', 'फोन', 'डेटा', 'कॅमेरा', 'स्मार्टफोन', 'सादर', 'डिस्प्ले', 'सेन्सर', 'सपोर्ट', 'बॅटरी']
]

octis_texts = [sentence_array]
npmi = Coherence(texts=octis_texts, topk=10, measure='c_npmi')
octis_output = {"topics": topics_list}
topic_diversity = TopicDiversity(topk=10)

topic_diversity_score = topic_diversity.score(octis_output)
print("Topic diversity: "+str(topic_diversity_score))

npmi_score = npmi.score(octis_output)
print("Coherence: "+str(npmi_score))

Error

This is the full output, including the error:

Topic diversity: 0.8857142857142857
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-68-c000efdb667a>](https://localhost:8080/#) in <cell line: 5>()
      3 print("Topic diversity: "+str(topic_diversity_score))
      4 
----> 5 npmi_score = npmi.score(octis_output)
      6 print("Coherence: "+str(npmi_score))

3 frames
[/usr/local/lib/python3.10/dist-packages/gensim/models/coherencemodel.py](https://localhost:8080/#) in _ensure_elements_are_ids(self, topic)
    452             return np.array(ids_from_ids)
    453         else:
--> 454             raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')
    455 
    456     def _update_accumulator(self, new_topics):

ValueError: unable to interpret topic as either a list of tokens or a list of ids

Can anyone point out what exactly is wrong here, and how I can evaluate BERTopic models trained on Indic languages?
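A likely cause: OCTIS's `Coherence` metric hands `texts` to gensim's `CoherenceModel`, which expects a list of tokenized documents (lists of tokens), not raw sentence strings. With `octis_texts = [sentence_array]`, none of the topic words can be matched against the dictionary gensim builds, hence the `ValueError`. A minimal sketch of the probable fix, assuming plain whitespace tokenization is acceptable for Marathi:

```python
# Sketch: Coherence needs texts as a list of token lists, not raw strings.
# Whitespace splitting is used here for illustration; a dedicated Marathi
# tokenizer may handle punctuation and morphology better.
sentence_array = [
    'तीन दिवस झाले, पण गाडी अजून सापडली नाही. पोलिसांचा कडक तपास सुरु आहे.',
    'डाळी भारतीय थाळीमध्ये सामील असलेले मुख्य भोजन आहेत.',
]

octis_texts = [sentence.split() for sentence in sentence_array]

# Each document is now a list of word tokens, e.g. octis_texts[0]
# starts with ['तीन', 'दिवस', 'झाले,', ...]. The metric would then be
# constructed as before:
# npmi = Coherence(texts=octis_texts, topk=10, measure='c_npmi')
```

Note that every word in `topics` must also occur somewhere in the tokenized corpus; otherwise gensim raises the same `ValueError` because it cannot map the word to a dictionary id.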

Thanks.

Hey, I've encountered the same issue. Have you resolved it yet?