MaartenGr / KeyBERT

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/

Changing the sequence of docs passed to KeyBERT returns different results for each doc

Pratik--Patel opened this issue

KeyBERT's results do not appear to be deterministic: changing the order of the documents passed in changes the keywords extracted for some of them. I have put together a reproducible example below.

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
from flair.embeddings import TransformerDocumentEmbeddings
import json

# Candidate keyphrases come from KeyphraseCountVectorizer; document embeddings
# come from a sentence-transformers model wrapped by flair.
vectorizer = KeyphraseCountVectorizer()
embedding_model = TransformerDocumentEmbeddings('sentence-transformers/all-MiniLM-L6-v2')
model = KeyBERT(model=embedding_model)

docs_1 = [
    "an apple a day keeps doctor away",
    "strawberry is a good fruit",
    "microsoft acquired openai",
    "openai provides AI powered tools and APIs",
]

# The same documents, in reverse order
docs_2 = [
    "openai provides AI powered tools and APIs",
    "microsoft acquired openai",
    "strawberry is a good fruit",
    "an apple a day keeps doctor away",
]

result_1 = model.extract_keywords(docs=docs_1, vectorizer=vectorizer, top_n=10)
print(json.dumps(result_1, indent=2))

result_2 = model.extract_keywords(docs=docs_2, vectorizer=vectorizer, top_n=10)
print(json.dumps(result_2, indent=2))

The output is as follows:

result_1 for docs_1

[
  [
    [
      "apple",
      0.8074
    ],
    [
      "doctor",
      0.7767
    ],
    [
      "day",
      0.6991
    ]
  ],
  [
    [
      "strawberry",
      0.9463
    ]
  ],
  [],
  [
    [
      "ai",
      0.8443
    ],
    [
      "apis",
      0.775
    ],
    [
      "tools",
      0.7478
    ]
  ]
]

result_2 for docs_2

[
  [
    [
      "openai",
      0.9083
    ],
    [
      "ai",
      0.8443
    ],
    [
      "tools",
      0.7478
    ]
  ],
  [
    [
      "openai",
      0.9262
    ]
  ],
  [
    [
      "good fruit",
      0.8823
    ]
  ],
  [
    [
      "apple",
      0.8074
    ],
    [
      "doctor",
      0.7767
    ],
    [
      "day",
      0.6991
    ]
  ]
]

As we can see, for the document "strawberry is a good fruit", strawberry is extracted in result_1 whereas good fruit is extracted in result_2. Note also that the third document, "microsoft acquired openai", gets no keyphrases at all in result_1.

Any idea why this might be happening?

Could you try it without KeyphraseCountVectorizer? Perhaps there is something happening with the tokenizer there.

Thanks for the swift reply. Not using KeyphraseCountVectorizer does seem to fix the issue, but the quality of the keyphrases suffers significantly.

Below is the code without KeyphraseCountVectorizer and the corresponding results.

result_1 = model.extract_keywords(docs_1, keyphrase_ngram_range=(1, 3), stop_words='english')

result_1 for docs_1

[
  [
    [
      "apple day",
      0.83
    ],
    [
      "apple day keeps",
      0.8135
    ],
    [
      "apple",
      0.8074
    ],
    [
      "day keeps doctor",
      0.8011
    ],
    [
      "keeps doctor",
      0.7826
    ]
  ],
  [
    [
      "strawberry good fruit",
      0.9904
    ],
    [
      "strawberry good",
      0.9584
    ],
    [
      "strawberry",
      0.9463
    ],
    [
      "fruit",
      0.8886
    ],
    [
      "good fruit",
      0.8823
    ]
  ],
  [
    [
      "microsoft acquired openai",
      1.0
    ],
    [
      "acquired openai",
      0.9409
    ],
    [
      "openai",
      0.9262
    ],
    [
      "microsoft",
      0.8121
    ],
    [
      "microsoft acquired",
      0.8052
    ]
  ],
  [
    [
      "openai provides ai",
      0.972
    ],
    [
      "openai provides",
      0.9226
    ],
    [
      "openai",
      0.9083
    ],
    [
      "ai powered tools",
      0.855
    ],
    [
      "provides ai powered",
      0.8526
    ]
  ]
]

Many of the keyphrases are subsets of a longer keyphrase, and then there are phrases like keeps doctor that are not very meaningful.
Is there a way to include POS features or otherwise improve the quality? Using MMR and Max Sum Distance does not seem to help much; I invoked them roughly as sketched below. Thanks again for your help!
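A sketch of those calls (the diversity and nr_candidates values here are illustrative, not a claim about optimal settings):

# MMR: trade off relevance against diversity among the selected keyphrases
result_mmr = model.extract_keywords(
    docs_1,
    keyphrase_ngram_range=(1, 3),
    stop_words='english',
    use_mmr=True,
    diversity=0.7,
)

# Max Sum Distance: pick top_n mutually dissimilar phrases from a larger candidate pool
result_maxsum = model.extract_keywords(
    docs_1,
    keyphrase_ngram_range=(1, 3),
    stop_words='english',
    use_maxsum=True,
    nr_candidates=20,
    top_n=5,
)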

Currently, the only way to include POS features is to customize the tokenizer of the CountVectorizer; that is where, in effect, the candidate tokens are chosen. This is also the same process that KeyphraseCountVectorizer follows. The ordering issue might simply be a bug in that package, so opening an issue on its repository would be worthwhile.
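As a rough sketch of what that customization could look like, assuming NLTK for the tagging (the tag filter and the pos_tokenizer helper below are illustrative, not something KeyBERT ships):

from sklearn.feature_extraction.text import CountVectorizer
import nltk

nltk.download('punkt')                       # word tokenizer model
nltk.download('averaged_perceptron_tagger')  # POS tagger model

def pos_tokenizer(text):
    # Keep only nouns and adjectives as candidate tokens (illustrative filter).
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith(('NN', 'JJ'))]

# Note: n-grams are built from the filtered token stream, so an n-gram may
# join words that were not adjacent in the original text.
pos_vectorizer = CountVectorizer(tokenizer=pos_tokenizer, ngram_range=(1, 3))
result = model.extract_keywords(docs_1, vectorizer=pos_vectorizer)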

Thanks, will follow it up there.