Using my own solutions for evaluation

Question

Hctor99 opened this issue 2 years ago

Hello,

I would like to evaluate topic models on coherence, significance, diversity, etc. by providing topic solutions (the top 10 words for each of k topics) directly rather than running the topic modeling technique through OCTIS.

I can create my own preprocessed dataset and provide the vocabulary.txt used for training. I have two questions:

Is there a way to pass those lists of words to OCTIS, and if yes, which format should I use? And how do I load them?

If I want the internal evaluation to be on the full dataset, should the partition column consist of “val” only?

Thanks in advance!

Silvia Terragni · Answer 1 · Tue May 24 2022 14:26:12 GMT+0800 (China Standard Time)

Hello!
The output of a topic model in OCTIS is a dictionary with these fields:

topics: the list of the most significant words for each topic (list of lists of strings).
topic-word-matrix: an NxV matrix of weights where N is the number of topics and V is the vocabulary length. This represents the distribution of each topic over the vocabulary.
topic-document-matrix: an NxD matrix of weights where N is the number of topics and D is the number of documents in the corpus. This represents the distribution of the topics for each document.
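
As a minimal sketch of how hand-made topics could be wrapped in this layout, the snippet below builds the "topics" field from plain word lists and checks them against a vocabulary. The helper names load_vocabulary and build_model_output are hypothetical and not part of OCTIS, and the one-word-per-line vocabulary format is an assumption.

```python
def load_vocabulary(lines):
    """Return the vocabulary as a list, assuming one word per non-empty line."""
    return [line.strip() for line in lines if line.strip()]


def build_model_output(topics, vocabulary, topk=10):
    """Wrap hand-made topics in a dictionary with a 'topics' field.

    Hypothetical helper: it only validates the input and truncates each
    topic to its first `topk` words.
    """
    vocab = set(vocabulary)
    for i, words in enumerate(topics):
        if len(words) < topk:
            raise ValueError(f"topic {i} has {len(words)} words, expected at least {topk}")
        unknown = [w for w in words if w not in vocab]
        if unknown:
            raise ValueError(f"topic {i} has words missing from the vocabulary: {unknown}")
    return {"topics": [list(words[:topk]) for words in topics]}


# Toy data; in practice the lines would come from open("vocabulary.txt").
vocabulary = load_vocabulary(["apple\n", "banana\n", "car\n", "road\n", "\n"])
my_topics = [["apple", "banana"], ["car", "road"]]
model_output = build_model_output(my_topics, vocabulary, topk=2)
```

The resulting dictionary only carries the "topics" field, which is enough for metrics that look at the top words alone. The two matrices would have to be added separately for metrics that need them.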

You said that you want to compute coherence, significance and diversity, so you would probably need the first two fields ("topics" and "topic-word-matrix" - this one is for some significance metrics).

For example, to compute the topic diversity, you would write something like this:

from octis.evaluation_metrics.diversity_metrics import TopicDiversity

model_output = {'topics': [['word_0', 'word_1', 'word_2'], ['word_3', 'word_4', 'word_5']]}
metric = TopicDiversity(topk=3) # Initialize metric
topic_diversity_score = metric.score(model_output) # Compute score of the metric

Some topic significance metrics consider the word-topic and/or document-topic distributions, so in that case you need to fill the dictionary with the required fields. For example, KL-uniform uses the word-topic distribution.
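
To illustrate why such a metric needs the "topic-word-matrix" field, here is a rough sketch of a KL-from-uniform score written with the standard library only. It is not the OCTIS implementation: kl_from_uniform is a hypothetical name, and the matrix is a nested list here for simplicity, whereas OCTIS most likely expects a NumPy array.

```python
import math


def kl_from_uniform(topic_word_matrix):
    """Mean KL divergence between each topic's word distribution and the uniform one."""
    divergences = []
    for row in topic_word_matrix:
        total = sum(row)
        uniform = 1.0 / len(row)  # uniform probability over the vocabulary
        kl = 0.0
        for weight in row:
            p = weight / total  # normalise so the row sums to 1
            if p > 0:  # terms with p == 0 contribute nothing
                kl += p * math.log(p / uniform)
        divergences.append(kl)
    return sum(divergences) / len(divergences)


# Toy output with N=2 topics over a vocabulary of V=4 words (made-up weights).
model_output = {
    "topics": [["word_0", "word_1"], ["word_2", "word_3"]],
    "topic-word-matrix": [
        [0.25, 0.25, 0.25, 0.25],  # flat topic: no divergence from uniform
        [0.70, 0.10, 0.10, 0.10],  # peaked topic: positive divergence
    ],
}
score = kl_from_uniform(model_output["topic-word-matrix"])
```

A topic that spreads its weight evenly over the vocabulary contributes zero, while a peaked topic raises the score, which is why the top-word lists alone are not enough for this kind of metric.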

Topic coherence requires a dataset to compute the co-occurrences between the words. So you would need to provide a dataset in the form of a list of documents (list of lists of strings) (you can find the parameters of the coherence here). So in that case, you don't need to use OCTIS's Dataset class or specify any partition.
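
The role of the corpus can be sketched with a simplified, document-level NPMI coherence. This is an illustration only, assuming whole-document co-occurrence; the coherence measures used by OCTIS may count co-occurrences differently (for instance with sliding windows), and npmi_coherence is a hypothetical name.

```python
import math
from itertools import combinations


def npmi_coherence(topic, texts):
    """Average document-level NPMI over all word pairs of one topic."""
    docs = [set(doc) for doc in texts]
    n_docs = len(docs)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in doc for w in words) for doc in docs) / n_docs

    scores = []
    for w1, w2 in combinations(topic, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: lowest NPMI
        elif p12 == 1:
            scores.append(0.0)  # degenerate case (0/0), treated as 0 here
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


# The corpus is a list of tokenized documents (list of lists of strings).
texts = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["car", "road"],
    ["car", "road", "traffic"],
]
coherent = npmi_coherence(["cat", "dog"], texts)
incoherent = npmi_coherence(["cat", "car"], texts)
```

Words that always appear together score high and words that never share a document score low, so the same list of tokenized documents used for training is what the metric needs as input.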

Hope this was useful. Let me know if you have further questions :)

Silvia

Héctor Leos · Answer 2 · Fri Jun 03 2022 03:34:46 GMT+0800 (China Standard Time)

Thank you for your detailed response! It was really helpful.