ZachNagengast / similarity-search-kit

🔎 SimilaritySearchKit is a Swift package providing on-device text embeddings and semantic search functionality for iOS and macOS applications.

Home Page: https://discord.gg/2vBQcF3nU5


Save and load index functions are not public

dounan opened this issue

Thank you for creating this project! I'm looking forward to using this in my personal project.

Out of curiosity, why are the saveIndex and loadIndex functions not public?

These functions are private mainly because they are alpha implementations, and I haven't built them into any examples yet. How would you use them in your project? I'd like to offer several options for saving the index, including saving directly into a Core ML model, but if there's anything specific you're looking for, I can look into adding it.

Ah, that makes sense. I’m still playing around with the library, so I don’t have my full requirements ironed out. One gap I did notice, though, is that I want to save and load multiple indexes, and I want to be able to “load all” indexes from all the saved JSON files in a directory. The current code structure requires me to create the index first, then call load with a directory URL and index name. That means I have to store metadata about the saved indexes separately, which is a bit redundant for my use case.

On a separate note, I also noticed that Files.extractText recurses through directories, which may run into stack limits for deep folder structures. Another option is to use the FileManager enumerator function to deeply enumerate files in a directory.

> Ah, that makes sense. I’m still playing around with the library, so I don’t have my full requirements ironed out. One gap I did notice, though, is that I want to save and load multiple indexes, and I want to be able to “load all” indexes from all the saved JSON files in a directory. The current code structure requires me to create the index first, then call load with a directory URL and index name. That means I have to store metadata about the saved indexes separately, which is a bit redundant for my use case.

I think I understand, but maybe you can confirm or deny: there are multiple saved indexes in JSON format in a folder, and you want to load them all into one index with one call, specifying only the directory and not the file names? In your opinion, would it be better to pass them in as an array of files, or to have it search the full directory and try to load anything in there? It should probably take care not to combine indexes that were set up with different embedding models. This is getting close to some of the future plans I had for the HNSW/Faiss algorithm, which stores all the embeddings in a bunch of different index files and only loads a small portion of them at query time for NN search.
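For reference, the "search the full directory" option could be sketched roughly as below. `savedIndexNames` is a hypothetical helper, not part of SimilaritySearchKit, and it assumes each saved index is a single `.json` file named after the index. It does not check which embedding model produced a file.

```swift
import Foundation

/// Returns the names of saved indexes found directly inside `directory`,
/// derived from the `.json` file names (for example "Notes.json" gives "Notes").
/// Assumption: one index per JSON file, named after the index.
/// `savedIndexNames` is a hypothetical helper, not a SimilaritySearchKit API.
func savedIndexNames(in directory: URL) throws -> [String] {
    // Shallow listing only: subdirectories are not searched.
    let contents = try FileManager.default.contentsOfDirectory(
        at: directory,
        includingPropertiesForKeys: nil,
        options: [.skipsHiddenFiles]
    )
    return contents
        .filter { $0.pathExtension.lowercased() == "json" }
        .map { $0.deletingPathExtension().lastPathComponent }
        .sorted()
}
```

Each returned name could then be passed, together with the directory URL, to the existing load call, so no separate metadata about saved indexes would be needed.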

> On a separate note, I also noticed that Files.extractText recurses through directories, which may run into stack limits for deep folder structures. Another option is to use the FileManager enumerator function to deeply enumerate files in a directory.

Nice recommendation. I was trying to replicate this Rust library, https://lib.rs/crates/dirstat-rs, and the current code gets somewhat close in terms of speed, but I think it's limited by Swift's high-level types. It definitely requires the ability to manage arbitrarily deep folders; in practice, though, I presume it won't need that hardcore functionality when dealing with specific folders that are set up manually. If you feel inspired, it would be great to see your implementation as a PR 🙌
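A minimal sketch of the enumerator approach suggested above, for anyone picking this up. `listFiles` is a hypothetical helper name and is not how Files.extractText is currently written. It only collects file URLs and leaves text extraction to the caller.

```swift
import Foundation

/// Collects every regular file under `root` using FileManager's directory
/// enumerator, so this function itself never recurses.
/// `listFiles` is a hypothetical helper, not a SimilaritySearchKit API.
func listFiles(under root: URL) -> [URL] {
    let keys: [URLResourceKey] = [.isRegularFileKey]
    guard let enumerator = FileManager.default.enumerator(
        at: root,
        includingPropertiesForKeys: keys,
        options: [.skipsHiddenFiles]
    ) else {
        return []
    }

    var files: [URL] = []
    // The enumerator yields URLs from all nested subdirectories in one flat loop.
    for case let url as URL in enumerator {
        let values = try? url.resourceValues(forKeys: Set(keys))
        if values?.isRegularFile == true {
            files.append(url)
        }
    }
    return files
}
```

Because the traversal is a single loop over the enumerator, folder depth does not add frames to the caller's stack the way a self-recursive function would.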

Save and load index is now public with #16