Index 'ef' is lost in index loading
cvillela opened this issue · comments
voyager = 2.0.2
When constructing an index, assigning an "ef" value to it for querying, and saving the index to a file with "save", more specifically:
save(file_like: BinaryIO)
The "ef" parameter gets reset to its default when the index is loaded from that file with index = Index.load(file_obj).
Example:
index = Index(
    space,
    M=M,
    ef_construction=ef_construction,
    random_seed=random_seed,
)
ids = index.add_items(embbedings, ids)
index.ef = 200
print(index.ef) # ---> 200
io_index = io.BytesIO()
index.save(io_index)
index = Index.load(io_index)
print(index.ef) #---> 10
Hey @cvillela, this is indeed a valid issue. We will fix this, but note that we are planning to deprecate this parameter in the future. Having an ef parameter for the index to use as a default query_ef is slightly dangerous because ef cannot be smaller than the number of neighbors (k) requested. Instead of this instance-level parameter, we will likely update the behavior of .query() to set ef equal to k if query_ef isn't passed.
Because of this, I would recommend against using this parameter and instead pass query_ef explicitly to each query call, even if it's just simulating the behavior mentioned above.
Happy to hear any thoughts on the above though!
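For illustration, the recommendation above (pass query_ef on every call and never let it drop below k) could be wrapped in a small helper. This is only a sketch: query_with_explicit_ef and its ef argument are hypothetical names, and FakeIndex is a stand-in that mimics a query(vectors, k=..., query_ef=...) call so the snippet runs without the library installed.

```python
def query_with_explicit_ef(index, vectors, k, ef=None):
    """Query `index`, always passing query_ef explicitly.

    Assumes `index` exposes query(vectors, k=..., query_ef=...).
    `ef` is a hypothetical convenience argument, not a library parameter.
    """
    # ef cannot be smaller than k: default to k, and clamp smaller values up to k.
    query_ef = k if ef is None else max(ef, k)
    return index.query(vectors, k=k, query_ef=query_ef)


class FakeIndex:
    """Stand-in for a real index; it only records the arguments it receives."""

    def query(self, vectors, k=1, query_ef=-1):
        self.last_call = {"k": k, "query_ef": query_ef}
        return [], []
```

With a real index, the same helper would be called as query_with_explicit_ef(index, vectors, k=10, ef=200), which keeps the chosen ef in application code rather than in the saved index.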
Hey @markkohdev, thank you for the reply!
This indeed solves the issue, although I believe it would be useful to have a constant ef exactly for this purpose: sharing index behaviour between different applications with save() and load(). But I understand the design choice and am closing the issue.
Nevertheless, I am having some trouble assessing the exact impact of the M, ef_construction and ef parameters on querying results. The documentation is slightly vague in saying that the mentioned parameters increase recall while taking longer to compute.
Are there any guidelines for choosing these parameters? I assume that, for a fixed index size, increasing the parameters beyond a certain value should not significantly improve query results.
How exactly do the parameters affect the querying? Let's say len(index) == 1000. What would be a good parameter choice for a good recall (80/90% of the optimal using the approximation method)? Could you point me to any documents that touch on this explanation?
Thank you again for the reply and great work on the repo!
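One way to explore the question above empirically, rather than from guidelines, is to compare the approximate results against a brute-force search and compute recall for each parameter setting. The sketch below uses only the standard library. exact_neighbors and recall_at_k are hypothetical helper names, Euclidean distance is an assumption, and ids are assumed to be positions in the data list.

```python
import math


def exact_neighbors(data, query, k):
    """Brute-force k nearest neighbors by Euclidean distance (ids are list positions)."""
    order = sorted(range(len(data)), key=lambda i: math.dist(data[i], query))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)
```

For an index of about 1000 items the brute-force pass is cheap, so the recall of the ids returned by the index can be averaged over a sample of queries while varying M, ef_construction and query_ef one at a time.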