Running time is too long when using Zero-Shot Classification

Question

syGOAT opened this issue 2 months ago

As a test, I set n_clusters=5 in KMeans and passed in only 20 documents.

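from bertopic import BERTopic
from bertopic.representation import ZeroShotClassification
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

umap_model = UMAP(n_neighbors=20, n_components=15, min_dist=0.0, metric='cosine', random_state=42)
cluster_model = KMeans(n_clusters=5, random_state=42)
vectorizer_model = CountVectorizer(stop_words="english")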

model = BERTopic(embedding_model='/root/autodl-tmp/fhy/bertopic_topic/paraphrase-MiniLM-L6-v2', 
                 umap_model=umap_model,
                 hdbscan_model=cluster_model,
                 vectorizer_model=vectorizer_model,
)
topics, probabilities = model.fit_transform(abstracts[:20], nr_repr_docs=1)

The code above ran for only a few seconds. But the code below:

candidate_topics = ['Material Science: General', 'Physical Chemistry', 'Chemistry: General', ......]
# about 70 elements

representation_model = ZeroShotClassification(candidate_topics, model="./bart-large-mnli")
model.update_topics(abstracts[:20], representation_model=representation_model)

It ran for more than 20 minutes without finishing, which is too long.
With only 3 candidate topics (ZeroShotClassification(candidate_topics[:3], model="./bart-large-mnli")), the code finished after 4 minutes.
So the problem may be that there are too many candidate topics, but I don't think 70 is a lot. Maybe you could optimize the parallel batch processing, if the transformers pipeline itself is not the problem?

Maarten Grootendorst · Answer 1 · Tue Apr 09 2024 19:46:23 GMT+0800 (China Standard Time)

It ran for more than 20 minutes without finishing, which is too long.
With only 3 candidate topics (ZeroShotClassification(candidate_topics[:3], model="./bart-large-mnli")), the code finished after 4 minutes.

It is difficult to say whether this is short or long without knowing a bit more about your environment. Are you using a GPU? If not, then that might explain the problem you are facing.
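One quick way to answer the GPU question is to ask PyTorch directly. The sketch below assumes the transformers pipeline runs on the PyTorch backend; `describe_device` is a hypothetical helper name, not part of BERTopic or transformers.

```python
def describe_device() -> str:
    """Report which device PyTorch can see (hypothetical helper)."""
    try:
        # Assumption: the zero-shot pipeline uses the PyTorch backend.
        import torch
    except ImportError:
        return "torch-missing"
    # torch.cuda.is_available() is True only if a usable CUDA GPU is visible.
    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_device())
```

A result of "cuda" only shows that a GPU is visible to PyTorch. Whether the pipeline actually runs on it depends on how the pipeline was created, so it is worth checking that separately.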

So the problem may be that there are too many candidate topics, but I don't think 70 is a lot. Maybe you could optimize the parallel batch processing, if the transformers pipeline itself is not the problem?

Most likely, the compute time comes from the transformers pipeline, and parallel batch processing is a bit more involved when you are dealing with GPUs. In general, this should be reasonably fast if you are using a GPU.
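Some rough arithmetic helps explain why the number of candidate topics matters. NLI-based zero-shot classification, as done by the transformers zero-shot pipeline, scores each input text against each candidate label as a separate premise/hypothesis pair, so the work grows with the product of the two. The helper name `nli_pairs` is illustrative, and the topic count of 5 is taken from the n_clusters=5 test setup in the question.

```python
def nli_pairs(n_texts: int, n_labels: int) -> int:
    """Number of (text, label) pairs the NLI model has to score."""
    return n_texts * n_labels


# 5 topic descriptions (n_clusters=5), scored against 3 and then 70 labels.
for n_labels in (3, 70):
    print(f"{n_labels} labels -> {nli_pairs(5, n_labels)} model inputs")
```

This only counts model inputs. It says nothing about how long each one takes, which depends on the hardware, the sequence lengths, and the batch size.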

syGOAT · Answer 2 · Wed Apr 10 2024 15:18:46 GMT+0800 (China Standard Time)

@MaartenGr Thank you for the reply! I am using a GPU. I think this is where the problem lies:
https://github.com/MaartenGr/BERTopic/blob/6c9eb6e72a881077ac59c35752d26e391bfe4c49/bertopic/representation/_zeroshot.py#L72C9-L74C104
topic_descriptions had 100 topics (in my case) and a lot of words. It was passed directly to the ZeroShotClassificationPipeline without batching. I think the large amount of data processed by the pipeline at one time led to the long running time. Maybe adding a batch_size argument to ZeroShotClassification would help?

Maarten Grootendorst · Answer 3 · Wed Apr 10 2024 15:40:36 GMT+0800 (China Standard Time)

Which GPU are you using? One or the other makes quite a bit of difference.

I think the large amount of data processed by the pipeline at one time led to the long running time. Maybe adding a batch_size argument to ZeroShotClassification would help?

That could be a possibility, but have you tested whether that is indeed the issue? You can test this by adapting the code you referenced to supply custom batches. Modularity is part of BERTopic's design, so adapting it should be straightforward.
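A minimal sketch of what "supplying custom batches" could look like is shown here. It is not BERTopic's code: `chunks`, `classify_in_batches`, and `fake_classifier` are hypothetical names, and the fake classifier only mimics the list-of-dicts shape that a zero-shot pipeline returns for a list of inputs. In a real test, the transformers zero-shot pipeline would take its place.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def classify_in_batches(
    classifier: Callable[[List[str], List[str]], List[Dict]],
    texts: Sequence[str],
    labels: List[str],
    batch_size: int = 8,
) -> List[Dict]:
    """Call `classifier` once per batch and concatenate the results in order."""
    results: List[Dict] = []
    for batch in chunks(texts, batch_size):
        results.extend(classifier(batch, labels))
    return results


def fake_classifier(batch: List[str], labels: List[str]) -> List[Dict]:
    # Stand-in for a real pipeline: every label gets the same score.
    score = 1.0 / len(labels)
    return [
        {"sequence": text, "labels": list(labels), "scores": [score] * len(labels)}
        for text in batch
    ]


texts = [f"topic description {i}" for i in range(5)]
labels = ["Physical Chemistry", "Chemistry: General"]
results = classify_in_batches(fake_classifier, texts, labels, batch_size=2)
print(len(results))
```

Timing each call inside the loop would show whether a single oversized call is the bottleneck or whether every batch is slow.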

syGOAT · Answer 4 · Fri Apr 12 2024 18:00:42 GMT+0800 (China Standard Time)

@MaartenGr I found the problem. The __call__ method of the Pipeline class in transformers has a batch_size argument:
https://github.com/huggingface/transformers/blob/caa5c65db1f4db617cdac2ad667ba62edf94dd98/src/transformers/pipelines/base.py#L1157C3-L1170C46
We could consider adding it in pipeline_kwargs:
https://github.com/MaartenGr/BERTopic/blob/de7376d3d42960e787a1115e5fe69fb726a7a33d/bertopic/representation/_zeroshot.py#L55C1-L74C104

Maarten Grootendorst · Answer 5 · Sun Apr 14 2024 14:11:16 GMT+0800 (China Standard Time)

Ah, in that case it is already implemented, right? Just do something like this:

representation_model = ZeroShotClassification(candidate_topics, model="./bart-large-mnli", pipeline_kwargs={"batch_size": 32})
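To pick a batch size, it can help to time the same step with a few values. The sketch below is a generic timing harness, not part of BERTopic: `time_batch_sizes` and `fake_run` are hypothetical names, and `fake_run` merely sleeps. In practice, the `run` callable would build ZeroShotClassification with pipeline_kwargs={"batch_size": batch_size} and call model.update_topics as in the snippets above.

```python
import time
from typing import Callable, Dict, Iterable


def time_batch_sizes(
    run: Callable[[int], None], batch_sizes: Iterable[int]
) -> Dict[int, float]:
    """Call `run(batch_size)` for each value and record wall-clock seconds."""
    timings: Dict[int, float] = {}
    for batch_size in batch_sizes:
        start = time.perf_counter()
        run(batch_size)
        timings[batch_size] = time.perf_counter() - start
    return timings


def fake_run(batch_size: int) -> None:
    # Stand-in workload so the sketch runs without a model or GPU.
    time.sleep(0.01 / batch_size)


timings = time_batch_sizes(fake_run, [1, 8, 32])
for batch_size, seconds in timings.items():
    print(f"batch_size={batch_size}: {seconds:.4f}s")
```

The best value depends on the GPU's memory and on how long the topic descriptions are, so 32 is a starting point rather than a universal setting.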

syGOAT · Answer 6 · Sun Apr 14 2024 14:54:12 GMT+0800 (China Standard Time)

@MaartenGr Yes. The problem has been solved. Thank you so much!