Save and load index functions are not public

Question

dounan opened this issue a year ago · comments

Thank you for creating this project! I'm looking forward to using this in my personal project.

Out of curiosity, why are the saveIndex and loadIndex functions not public?

Zach Nagengast · Answer 1 · Mon Jun 26 2023 23:09:24 GMT+0800 (China Standard Time)

These functions are private mainly because they are still alpha implementations and I haven't built them into any examples yet. How would you use them in your project? I'd like to offer several options for saving the index, including saving directly into a Core ML model, but if there's anything specific you're looking for, I can look into adding it.

Dounan Shi · Answer 2 · Tue Jun 27 2023 11:20:27 GMT+0800 (China Standard Time)

Ah, that makes sense. I'm still playing around with the library, so I don't have my full requirements ironed out. One gap I did notice: I want to save and load multiple indexes, and I want to be able to “load all” indexes from all the saved JSON files in a directory. The current code structure requires me to create the index first, then call load with a directory URL and index name. That means I have to store metadata about the saved indexes separately, which is a bit redundant for my use case.

Dounan Shi · Answer 3 · Tue Jun 27 2023 11:25:26 GMT+0800 (China Standard Time)

On a separate note, I also noticed that Files.extractText recurses through directories, which may run into stack limits for deep folder structures. Another option is to use the FileManager enumerator function to deeply enumerate files in a directory.
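
For illustration, here is a minimal sketch of the enumerator approach mentioned above. It walks the tree iteratively with `FileManager.enumerator(at:includingPropertiesForKeys:options:)`, so the depth of the folder structure does not add to the call stack. The helper name `allFiles(under:)` is hypothetical and not part of the library, and this is not necessarily how `Files.extractText` would be changed.

```swift
import Foundation

/// Returns every regular file under `root`, walking the tree iteratively
/// with FileManager's directory enumerator instead of recursive calls.
/// NOTE: `allFiles(under:)` is a hypothetical helper, not a library API.
func allFiles(under root: URL) -> [URL] {
    let keys: [URLResourceKey] = [.isRegularFileKey]

    // The enumerator descends into subdirectories on its own.
    guard let enumerator = FileManager.default.enumerator(
        at: root,
        includingPropertiesForKeys: keys,
        options: [.skipsHiddenFiles]
    ) else {
        return []
    }

    var files: [URL] = []
    for case let url as URL in enumerator {
        // Keep regular files only; directories are visited but not collected.
        let values = try? url.resourceValues(forKeys: Set(keys))
        if values?.isRegularFile == true {
            files.append(url)
        }
    }
    return files
}
```

The returned URLs could then be passed one by one to whatever text extraction step is in use.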

Zach Nagengast · Answer 4 · Tue Jun 27 2023 23:09:36 GMT+0800 (China Standard Time)

> Ah that makes sense. I’m still playing around with the library so don’t have my full requirements ironed out. However, one gap I did notice was that I want to have multiple indexes that I save and load. However, I want to be able to “load all” indexes from all the saved json files in a directory. The current code structure requires me to create the index first, then call load with a directory URL and index name. This requires me to store metadata about the saved indexes separately which is a bit redundant for my use case.

I think I understand, but maybe you can confirm or deny: there are multiple saved indexes in JSON format in a folder, and you want to load them all into one index with one call, specifying only the directory rather than each file name? In your opinion, would it be better to pass them in as an array of files, or to have it search the full directory and try to load anything in there? It should probably be careful not to combine indexes that were set up with different embedding models. This is getting close to some of the future plans I had for an HNSW/Faiss-style algorithm, which stores all the embeddings in a bunch of different index files and only loads a small portion of them at query time for nearest-neighbor search.
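
As a rough sketch of the "search the full directory" option, the code below decodes every `.json` file in a folder and groups the results by embedding model, so that indexes built with different models are never merged. The `SavedIndex` struct, its field names, and `loadAllIndexes(in:)` are assumptions made for illustration; the library's actual saved format and API may differ.

```swift
import Foundation

// Hypothetical on-disk shape, assumed for this sketch only.
// The library's real JSON layout may differ.
struct SavedIndex: Codable {
    let name: String
    let embeddingModel: String
    let embeddings: [[Float]]
}

/// Decodes every `.json` file directly inside `directory` and groups the
/// results by embedding model so incompatible indexes stay separate.
/// NOTE: `loadAllIndexes(in:)` is a hypothetical helper, not a library API.
func loadAllIndexes(in directory: URL) throws -> [String: [SavedIndex]] {
    let urls = try FileManager.default.contentsOfDirectory(
        at: directory,
        includingPropertiesForKeys: nil,
        options: [.skipsHiddenFiles]
    )

    let decoder = JSONDecoder()
    var byModel: [String: [SavedIndex]] = [:]

    for url in urls where url.pathExtension.lowercased() == "json" {
        // Skip files that are not valid saved indexes instead of
        // failing the whole load.
        guard let data = try? Data(contentsOf: url),
              let index = try? decoder.decode(SavedIndex.self, from: data) else {
            continue
        }
        byModel[index.embeddingModel, default: []].append(index)
    }
    return byModel
}
```

With this shape the caller needs only the directory URL, and the file contents carry the metadata (index name, model) that would otherwise have to be stored separately.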

> On a separate note, I also noticed that Files.extractText recurses through directories, which may run into stack limits for deep folder structures. Another option is to use the FileManager enumerator function to deeply enumerate files in a directory.

Nice recommendation. I was trying to replicate this Rust library, https://lib.rs/crates/dirstat-rs, and the current code gets somewhat close in terms of speed, but I think it's limited by Swift's high-level types. It definitely needs to be able to handle arbitrarily deep folders, although in practice I presume it won't need that hard-core functionality when dealing with specific folders that are set up manually. If you feel inspired, it would be great to see your implementation as a PR 🙌

Zach Nagengast · Answer 5 · Sat Jul 15 2023 14:21:54 GMT+0800 (China Standard Time)

Save and load index is now public with #16