MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page:https://maartengr.github.io/BERTopic/


`hierarchical_topics()` produces incorrect output when three topics have the same distance

salted-adam opened this issue

Hi there,

I have noticed that the hierarchical_topics(...) method produces incorrect results when three or more topics are separated by the same (tf-idf) distance. Let me illustrate it with an example.

from umap import UMAP
from bertopic import BERTopic

docs = (
    ["banana"] * 300 
    + ["banana apple"] * 300 
    + ["pear"] * 300 
    + ["lemon"] * 300 
    + ["clock"] * 300 
)

model = BERTopic(umap_model=UMAP(random_state=42))
topics, probs = model.fit_transform(docs)
hr = model.hierarchical_topics(docs)
hr

This outputs:
[image: screenshot of the resulting hr DataFrame]

The cluster with Parent_ID == 8 includes topics [1, 2, 3], but topic 3 appears in neither the left nor the right child, nor in any of their children.

Why is it happening?
A flat clustering is created in each iteration. There is no guarantee that a newly formed cluster will contain only two topics, but the code that follows assumes it does. When three topics are equidistant, a single cut can merge all three at once.
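The effect can be reproduced outside BERTopic with SciPy alone. This is a minimal sketch, not the library's code: it assumes the flat structure comes from cutting a linkage matrix at each merge distance with `fcluster`, and the distance matrix `D` is a made-up example in which topics 1, 2 and 3 are mutually equidistant.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical topic-to-topic distances (not taken from the issue):
# topics 1, 2 and 3 are all 0.5 apart, topic 0 is 1.0 away from everything.
D = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.5, 0.5],
    [1.0, 0.5, 0.0, 0.5],
    [1.0, 0.5, 0.5, 0.0],
])

# The linkage matrix itself is strictly binary: one pairwise merge per row.
Z = linkage(squareform(D), method="average")

# Cutting at the distance of the FIRST merge already groups three topics,
# because the second merge happens at exactly the same distance.
flat = fcluster(Z, t=Z[0, 2], criterion="distance")
largest = int(np.bincount(flat).max())

print("merge distances:", Z[:, 2])
print("flat labels at first merge distance:", flat)
print("largest flat cluster size:", largest)
```

Any code that reads the flat clusters at this step and expects exactly two members per new cluster would silently drop the third topic.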

What should be the expected behaviour?
I think that a new intermediate cluster should emerge. Essentially, the structure should look like this:

Parent_ID, Child_Left_ID, Child_Right_ID
     8               1               11
    11               2               3
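One way to get such a strictly binary structure is to read the parent/child pairs directly from the rows of the linkage matrix rather than from flat cuts. The following is a rough sketch under that assumption, not a proposed patch: `binary_merges` is a hypothetical helper, the distance matrix is invented, and the IDs follow SciPy's convention (leaf `i` is topic `i`, row `i` creates cluster `n + i`), so they will not match BERTopic's Parent_ID numbering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def binary_merges(Z):
    """Return (parent, left, right, distance) for every row of a linkage matrix.

    Hypothetical helper: each row of Z merges exactly two clusters, so every
    parent gets exactly two children, even when merge distances are tied.
    """
    n_leaves = Z.shape[0] + 1
    merges = []
    for i, (left, right, dist, _size) in enumerate(Z):
        merges.append((n_leaves + i, int(left), int(right), float(dist)))
    return merges


# Hypothetical distances: topics 1, 2 and 3 are tied at 0.5.
D = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.5, 0.5],
    [1.0, 0.5, 0.0, 0.5],
    [1.0, 0.5, 0.5, 0.0],
])
Z = linkage(squareform(D), method="average")
merges = binary_merges(Z)

for parent, left, right, dist in merges:
    print(f"parent={parent} left={left} right={right} distance={dist}")
```

With tied distances, the three equidistant topics end up under two nested parents (one pair first, then that pair plus the remaining topic), which mirrors the 8 -> (1, 11) and 11 -> (2, 3) layout above. Which two of the tied topics are paired first depends on SciPy's tie-breaking.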