About the speed of SimCSE build_index
Maydaytyh opened this issue
Hi,
Using a larger batch size does not always lead to better inference speed. As long as GPU utilization is at 100%, there is no need to set a large batch size. The process being killed is probably due to exceeding the CPU memory limit (the index you build is so big that it exceeds the available CPU memory). First, try a smaller index to see if the kill still happens. If it doesn't (which suggests this is a memory-related issue), you can try splitting the index into several smaller indexes.
Thanks for your reply!
The process is not killed when the number of sentences is small, so it is indeed a memory-related issue. Can you show me how to split the index into several smaller indexes? Should this be done inside model.build_index(),
or by just splitting the dataset files?
Thank you!
The code does not split the index for you. You can implement this manually by splitting your array of strings into smaller arrays and encoding them separately.
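A minimal sketch of the chunked encoding suggested above. It assumes a `model.encode` method that maps a list of strings to a 2-D numpy array of embeddings; the function name `encode_in_chunks` and the `chunk_size` parameter are illustrative, not part of the SimCSE API.

```python
import numpy as np

def encode_in_chunks(model, sentences, chunk_size=10000):
    """Encode sentences chunk by chunk to bound the size of each encoding call.

    Returns a list of small embedding arrays (one per chunk), which can be
    saved or merged later instead of building one huge index at once.
    """
    parts = []
    for start in range(0, len(sentences), chunk_size):
        chunk = sentences[start:start + chunk_size]
        # Each call only holds one chunk's embeddings in memory.
        parts.append(np.asarray(model.encode(chunk)))
    return parts
```

Each element of the returned list is a small index on its own; saving each part to disk (e.g. with `np.save`) before encoding the next chunk keeps peak memory to roughly one chunk's worth of embeddings.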
Thank you, I understand. After building the separate indexes, how should I merge them? I saw the add_to_index
function in a previous issue; is that the way to do it? Thanks!
The index is simply saved as a numpy array, so you can just concatenate the pieces. Note that you will eventually still need memory as large as the whole index in order to merge them.
I get it. Thank you!