`hierarchical_topics()` produces incorrect output when three topics have the same distance
salted-adam opened this issue · comments
Hi there,
I have noticed that the `hierarchical_topics(...)` method produces incorrect results when three or more topics have the same (tf-idf) distances. Let me illustrate it with an example.
from umap import UMAP
from bertopic import BERTopic
docs = (
["banana"] * 300
+ ["banana apple"] * 300
+ ["pear"] * 300
+ ["lemon"] * 300
+ ["clock"] * 300
)
model = BERTopic(umap_model=UMAP(random_state=42))
topics, probs = model.fit_transform(docs)
hr = model.hierarchical_topics(docs)
hr
The cluster with Parent_ID == 8 includes topics [1, 2, 3], but topic 3 is not mentioned in either the left or the right child, or in any of their children.
Why is it happening?
The flat structure is created in each iteration. There is no guarantee that a new cluster will contain only two topics, but the code that follows presumes that it does.
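The effect can be reproduced with SciPy alone. This is a minimal sketch, not BERTopic's actual code: the distance matrix is made up, and single linkage is chosen only so that the tied distances stay exact (BERTopic's linkage function may differ). It shows that a flat clustering cut at the height of the first merge can already contain three topics, even though that merge joins only two.

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import squareform

# Hypothetical distances between four topics (assumption, not real c-TF-IDF
# output): topics 0, 1 and 2 are all exactly 1.0 apart, topic 3 is 2.0 away.
dist = np.array([
    [0.0, 1.0, 1.0, 2.0],
    [1.0, 0.0, 1.0, 2.0],
    [1.0, 1.0, 0.0, 2.0],
    [2.0, 2.0, 2.0, 0.0],
])
Z = sch.linkage(squareform(dist), method="single")

# Each row of Z is one binary merge; the first row joins exactly two topics.
first_merge_height = Z[0][2]
first_merge_size = int(Z[0][3])

# Cutting the tree at that same height gives the flat clusters. Because the
# second merge has the same height, the cut swallows it as well.
flat = sch.fcluster(Z, t=first_merge_height, criterion="distance")
largest_flat_cluster = int(np.bincount(flat).max())

print("topics joined by the first merge:", first_merge_size)
print("largest flat cluster at that height:", largest_flat_cluster)
```

Any code that assumes the flat cluster at a given merge height holds exactly the two children of that merge will lose a topic in this situation.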
What should be the expected behaviour?
I think that a new cluster should emerge. Essentially, the structure should look like this:
Parent_ID  Child_Left_ID  Child_Right_ID
8          1              11
11         2              3
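One possible way to get such a strictly binary structure is to read the parent/child rows straight from the linkage matrix instead of from a flat clustering. The sketch below is only an illustration under assumptions: `binary_hierarchy` is a hypothetical helper, not part of BERTopic, the distances are made up, and the IDs follow SciPy's convention (parent = number of topics + row index), so they differ from the IDs in the table above.

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import squareform


def binary_hierarchy(Z):
    """Hypothetical helper: one row per merge, each with exactly two children."""
    n_topics = Z.shape[0] + 1
    # Map every node ID (leaf or merged cluster) to the topics it contains.
    members = {i: [i] for i in range(n_topics)}
    rows = []
    for i, merge in enumerate(Z):
        left, right = int(merge[0]), int(merge[1])
        parent = n_topics + i
        members[parent] = sorted(members[left] + members[right])
        rows.append({
            "Parent_ID": parent,
            "Topics": members[parent],
            "Child_Left_ID": left,
            "Child_Left_Topics": members[left],
            "Child_Right_ID": right,
            "Child_Right_Topics": members[right],
            "Distance": float(merge[2]),
        })
    return rows


# Made-up distances with a three-way tie between topics 0, 1 and 2.
dist = np.array([
    [0.0, 1.0, 1.0, 2.0],
    [1.0, 0.0, 1.0, 2.0],
    [1.0, 1.0, 0.0, 2.0],
    [2.0, 2.0, 2.0, 0.0],
])
Z = sch.linkage(squareform(dist), method="single")

hierarchy_rows = binary_hierarchy(Z)
for row in hierarchy_rows:
    print(row)
```

Because every row comes from a single merge, each topic of a parent always shows up in its left or right child, even when the distances are tied.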