spotify / voyager

🛰️ An approximate nearest-neighbor search library for Python and Java with a focus on ease of use, simplicity, and deployability.

Home Page: https://spotify.github.io/voyager/

Index 'ef' is lost in index loading

cvillela opened this issue

voyager = 2.0.2

When I construct an index, assign it an "ef" value for querying, and save it to a file-like object with "save", more specifically:
save(file_like: BinaryIO)

the "ef" parameter is reset to its default when the index is loaded back from that file with index = Index.load(file_obj).

Example:

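import io

from voyager import Index

# space, M, ef_construction, random_seed, embeddings, and ids are defined elsewhere.
index = Index(
    space,
    M=M,
    ef_construction=ef_construction,
    random_seed=random_seed,
)
ids = index.add_items(embeddings, ids)
index.ef = 200
print(index.ef)  # ---> 200

# Save the index to an in-memory buffer, then rewind it before loading.
io_index = io.BytesIO()
index.save(io_index)
io_index.seek(0)

index = Index.load(io_index)
print(index.ef)  # ---> 10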

Hey @cvillela, this is indeed a valid issue. We will fix this, but note that we are planning to deprecate this parameter in the future. Having an ef parameter on the index that serves as the default query_ef is slightly dangerous because ef cannot be smaller than the number of neighbors (k) requested. Instead of this instance-level parameter, we will likely update the behavior of .query() to set ef equal to k if query_ef isn't passed.

Because of this, I would recommend against using this parameter and instead passing query_ef explicitly to each query call, even if that just simulates the behavior mentioned above.
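A minimal sketch of that recommendation, assuming an index object whose query() accepts k and query_ef as keyword arguments, as discussed in this thread. The helper name query_with_explicit_ef and the min_ef default are hypothetical and chosen only for illustration; the clamp enforces the constraint that ef must not be smaller than k.

```python
def query_with_explicit_ef(index, vector, k, min_ef=100):
    """Query `index`, always passing query_ef explicitly.

    `min_ef` is a hypothetical application-level default, not a library
    setting. The effective value is never allowed to drop below k.
    """
    query_ef = max(k, min_ef)
    return index.query(vector, k=k, query_ef=query_ef)
```

With this wrapper, the search breadth lives in application code rather than in index state, so it does not depend on what save() and load() preserve.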

Happy to hear any thoughts on the above though!

Hey @markkohdev, thank you for the reply!

This indeed solves the issue, although I believe a constant ef would be useful for exactly this purpose: sharing index behaviour between different applications through save() and load(). But I understand the design choice and am closing the issue.
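One possible way to share query settings between applications without relying on index state is to store them in a small sidecar document next to the saved index. The sketch below is only an illustration: the function names and the JSON layout are hypothetical and not part of voyager.

```python
import json


def dump_query_settings(query_ef: int, k: int) -> str:
    """Serialize query settings to JSON, rejecting query_ef smaller than k."""
    if query_ef < k:
        raise ValueError("query_ef must be at least k")
    return json.dumps({"query_ef": query_ef, "k": k})


def load_query_settings(text: str) -> dict:
    """Parse settings written by dump_query_settings and re-check them."""
    settings = json.loads(text)
    if settings["query_ef"] < settings["k"]:
        raise ValueError("query_ef must be at least k")
    return settings
```

An application that loads the index would read these settings and pass query_ef explicitly on every query call.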

Nevertheless, I am having trouble assessing the exact impact of the M, ef_construction, and ef parameters on query results. The documentation is somewhat vague: it says only that these parameters increase recall at the cost of longer computation.

Are there any guidelines for choosing these parameters? I assume that, for a fixed index size, increasing them beyond a certain value no longer improves query results significantly.

How exactly do the parameters affect querying? Let's say len(index) == 1000. What would be a good parameter choice for good recall (80/90% of the optimal, using the approximation method)? Could you point me to any documents that cover this?

Thank you again for the reply and great work on the repo!
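The thread does not give concrete parameter guidelines. One common way to assess their effect is to measure recall empirically: for an index of about 1000 items, exact neighbors can be computed by brute force and compared with the approximate results for each parameter setting. The sketch below shows only the recall computation; the function name recall_at_k is hypothetical, and the two ID lists are assumed to come from the approximate index and from an exact search, respectively.

```python
def recall_at_k(approx_ids, exact_ids) -> float:
    """Average fraction of exact neighbors that the approximate search found.

    `approx_ids` and `exact_ids` each hold one list of neighbor IDs per query.
    """
    if len(approx_ids) != len(exact_ids) or not exact_ids:
        raise ValueError("need the same, non-zero number of queries in both lists")
    per_query = []
    for approx, exact in zip(approx_ids, exact_ids):
        exact_set = set(exact)
        if not exact_set:
            raise ValueError("each query needs at least one exact neighbor")
        hits = len(exact_set & set(approx))
        per_query.append(hits / len(exact_set))
    return sum(per_query) / len(per_query)
```

Running this over a grid of M, ef_construction, and query_ef values, while timing the queries, would show where recall stops improving for a given dataset.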