princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821


A question about the speed of SimCSE's build_index

Maydaytyh opened this issue

I'd like to ask: when I call model.build_index(sentences), why does inference get slower when I set a larger batch_size (e.g., going from 64 to 2048), even though GPU memory is sufficient? Also, when there are a very large number of sentences to index (tens of millions), the process gets killed after encoding finishes, as shown in the screenshot below; the program crashes once it reaches that point. Is there any way to solve this?
[screenshot: the process is killed after encoding completes]

Thanks!

Hi,

Using a larger batch size does not always lead to better inference speed. As long as GPU utilization is at 100%, there is no need to set a larger batch size. The kill is probably due to exceeding the CPU memory limit (the index you build is so big that it exceeds the available CPU memory). First, you can try a smaller index to see if the crash still happens (see the sketch below). If it doesn't (which suggests this is a memory-related issue), you can try splitting the index into several smaller indexes.
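For example, a quick way to test with a smaller index, assuming the `simcse` tool's `build_index()` signature (the subsample size is arbitrary, and `sentences` stands in for your corpus):

```python
from simcse import SimCSE

model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

# `sentences` stands in for the full corpus (a list of strings).
sentences = ["A woman is reading.", "A man is playing guitar."] * 5_000_000

# Index a small subsample first; if this succeeds while the full corpus
# gets killed, the crash is almost certainly CPU-memory-bound.
# use_faiss=False keeps the index as a plain numpy array of embeddings.
model.build_index(sentences[:100_000], use_faiss=False, batch_size=64)
```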

Thanks for your reply!
The kill does not happen if the number of sentences is small, so it is a memory-related issue. Could you show me how to split the index into several smaller indexes? Is it done inside model.build_index(), or should I just split the dataset files?
Thank you!

The code does not do that splitting for you. You can implement it manually by splitting your array of strings into smaller arrays and encoding them separately, as in the sketch below.
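A minimal sketch of that manual splitting, assuming the `simcse` tool's `encode()` API; the chunk size, file names, and `encode_in_chunks` helper are illustrative:

```python
import numpy as np
from simcse import SimCSE

model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

def encode_in_chunks(sentences, chunk_size=1_000_000, prefix="embeddings_part"):
    """Hypothetical helper: encode a huge corpus chunk by chunk, saving each
    chunk's embeddings to disk so no single array has to hold the whole
    corpus in CPU memory. Returns the number of chunk files written."""
    num_parts = 0
    for part, start in enumerate(range(0, len(sentences), chunk_size)):
        chunk = sentences[start:start + chunk_size]
        # return_numpy=True yields a numpy array instead of a torch tensor.
        embeddings = model.encode(chunk, batch_size=64, return_numpy=True)
        np.save(f"{prefix}_{part}.npy", embeddings)
        num_parts = part + 1
    return num_parts
```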

Thank you, I understand. After building the separate indexes, how should I merge them? I saw the add_to_index function in a previous issue; is that the way to do it? Thanks!

The index is simply saved as a numpy array, so you can just concatenate the arrays. Note that you will eventually still need as much memory as the whole index in order to merge them.
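A minimal merge sketch under the same assumptions as above (the part count and file names match the hypothetical `encode_in_chunks` helper):

```python
import numpy as np

num_parts = 10  # however many chunk files encode_in_chunks wrote

# Load each saved chunk and stack them along the first axis; this step
# needs enough CPU memory to hold the fully merged index at once.
parts = [np.load(f"embeddings_part_{k}.npy") for k in range(num_parts)]
index = np.concatenate(parts, axis=0)
```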

I get it. Thank you!