princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821


A question about the speed of SimCSE's build_index

Maydaytyh opened this issue

I'd like to ask: when I call model.build_index(sentences), why does inference get slower when I set a larger batch_size (e.g., going from 64 to 2048), even though GPU memory is sufficient? Also, when there are a very large number of sentences to index (tens of millions), the process gets killed after encoding finishes, as shown in the screenshot below; the program crashes once it reaches that point. Is there any way to solve this?
[screenshot: the process is killed after encoding completes]

Thanks!

Hi,

Using a larger batch size does not always lead to better inference speed. As long as GPU utilization is at 100%, there is no need to set a larger batch size. The kill is probably due to exceeding the CPU memory limit (the index you build is so big that it exceeds the available CPU memory). First, you can try a smaller index to see if the crash still happens (see the sketch below). If it doesn't (which suggests this is a memory-related issue), you can try splitting the index into several smaller indexes.
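For example, a quick way to test with a smaller index, assuming the `simcse` tool's `build_index()` signature (the subsample size is arbitrary, and `sentences` stands in for your corpus):

```python
from simcse import SimCSE

model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

# `sentences` stands in for the full corpus (a list of strings).
sentences = ["A woman is reading.", "A man is playing guitar."] * 5_000_000

# Index a small subsample first; if this succeeds while the full corpus
# gets killed, the crash is almost certainly CPU-memory-bound.
# use_faiss=False keeps the index as a plain numpy array of embeddings.
model.build_index(sentences[:100_000], use_faiss=False, batch_size=64)
```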

Thanks for your reply!
The kill does not happen if the number of sentences is small, so it is a memory-related issue. Could you show me how to split the index into several smaller indexes? Is it done inside model.build_index(), or should I just split the dataset files?
Thank you!

The code does not do that splitting for you. You can implement it manually by splitting your array of strings into smaller arrays and encoding them separately, as in the sketch below.
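A minimal sketch of that manual splitting, assuming the `simcse` tool's `encode()` API; the chunk size, file names, and `encode_in_chunks` helper are illustrative:

```python
import numpy as np
from simcse import SimCSE

model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

def encode_in_chunks(sentences, chunk_size=1_000_000, prefix="embeddings_part"):
    """Hypothetical helper: encode a huge corpus chunk by chunk, saving each
    chunk's embeddings to disk so no single array has to hold the whole
    corpus in CPU memory. Returns the number of chunk files written."""
    num_parts = 0
    for part, start in enumerate(range(0, len(sentences), chunk_size)):
        chunk = sentences[start:start + chunk_size]
        # return_numpy=True yields a numpy array instead of a torch tensor.
        embeddings = model.encode(chunk, batch_size=64, return_numpy=True)
        np.save(f"{prefix}_{part}.npy", embeddings)
        num_parts = part + 1
    return num_parts
```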

Thank you, I understand. After building the separate indexes, how should I merge them? I saw the add_to_index function in a previous issue; is that the way to do it? Thanks!

The index is simply saved as a numpy array, so you can just concatenate the arrays. Note that you will eventually still need as much memory as the whole index in order to merge them.
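A minimal merge sketch under the same assumptions as above (the part count and file names match the hypothetical `encode_in_chunks` helper):

```python
import numpy as np

num_parts = 10  # however many chunk files encode_in_chunks wrote

# Load each saved chunk and stack them along the first axis; this step
# needs enough CPU memory to hold the fully merged index at once.
parts = [np.load(f"embeddings_part_{k}.npy") for k in range(num_parts)]
index = np.concatenate(parts, axis=0)
```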

I get it. Thank you!