meilisearch / milli

Search engine library for Meilisearch ⚡️

Enhance word splitting strategy

ManyTheFish opened this issue

Today, the word splitting strategy of the query tree is handled by the function split_best_frequency.
This function splits a word into two sub-words, choosing among the possible pairs by looking at the frequency of the less frequent sub-word of each pair.
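
For reference, here is a minimal sketch of that strategy, assuming a simplified Context trait and omitting error handling; apart from split_best_frequency and word_documents_count, all names are illustrative:

// Simplified stand-in for the Context trait of search/query_tree.rs.
trait Context {
    // Number of documents containing `word`, backed by the word_docids database.
    fn word_documents_count(&self, word: &str) -> Option<u64>;
}

// Pick the split whose least frequent sub-word has the highest frequency.
fn split_best_frequency<'a>(ctx: &impl Context, word: &'a str) -> Option<(&'a str, &'a str)> {
    let mut best: Option<(u64, &'a str, &'a str)> = None;

    // Try every possible split point of the word.
    for (i, _) in word.char_indices().skip(1) {
        let (left, right) = word.split_at(i);

        // The frequency of the pair is approximated by its least frequent sub-word.
        let pair_freq = ctx
            .word_documents_count(left)
            .unwrap_or(0)
            .min(ctx.word_documents_count(right).unwrap_or(0));

        if pair_freq != 0 && best.map_or(true, |(freq, _, _)| pair_freq > freq) {
            best = Some((pair_freq, left, right));
        }
    }

    best.map(|(_, left, right)| (left, right))
}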

drawback

However, this frequency computation doesn't faithfully represent the frequency of the pair, because the two sub-words can each be frequent individually without frequently appearing close to each other in documents.

possible enhancement

Inspired by the index.rs#word_documents_count method, a new method word_pair_frequency could be implemented in the trait search/query_tree.rs#Context using the word_pair_proximity_docids database instead of the word_docids one.
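
As a sketch, the trait extension could look something like the following; the word_pair_frequency name comes from this proposal, while the exact signature is an assumption and error handling is again omitted:

// Sketch of the extended Context trait of search/query_tree.rs.
trait Context {
    // Existing: number of documents containing `word` (word_docids database).
    fn word_documents_count(&self, word: &str) -> Option<u64>;

    // Proposed: number of documents where `left` and `right` appear within the
    // given proximity of each other (word_pair_proximity_docids database).
    fn word_pair_frequency(&self, left: &str, right: &str, proximity: u8) -> Option<u64>;
}

split_best_frequency would then rank the candidate splits by the frequency of the pair with the two sub-words side by side, instead of by the minimum of the two word_documents_count values.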

small warnings

Files expected to be modified

Hi @ManyTheFish, @curquiza, this sounds interesting, and I would like to work on it if it hasn't been taken already. I have a few queries; it would be great if you could answer them. Thanks in advance.

  1. Does the function split_best_frequency now need to search for word pairs in word_pair_proximity_docids in addition to the individual word search in word_docids?
  2. What will be the value of proximity when we search for pairs in word_pair_proximity_docids? Since proximity is of type u8, do we need to search all proximity values from 0 to 255 for each word-split combination? That would not be efficient.
  3. For writing test cases for this implementation, do we need to add a test database that mimics word_pair_proximity_docids, similar to the postings used for word_docids in the TestContext?

I'd like to work on this issue

> 1. Does the function split_best_frequency now need to search for word pairs in word_pair_proximity_docids in addition to the individual word search in word_docids?
>
> 2. What will be the value of proximity when we search for pairs in word_pair_proximity_docids? Since proximity is of type u8, do we need to search all proximity values from 0 to 255 for each word-split combination? That would not be efficient.

split_best_frequency should only compute the frequency of the pair of words side by side; for the implementation, you will only need to check word_pair_proximity_docids with a proximity of 1. Something like:

let key = (left_word, right_word, 1);
self.index.word_pair_proximity_docids.get(self.rtxn, &key)

You can more or less copy-paste the word_pair_proximity_docids method from the criteria.
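
Put together, the new Context method could look something like the sketch below; it reuses the index and rtxn fields from the snippet above, while the exact signature and the use of the bitmap length as the pair frequency are assumptions:

// Inside the existing Context implementation of the search module.
fn word_pair_frequency(
    &self,
    left_word: &str,
    right_word: &str,
    proximity: u8,
) -> heed::Result<Option<u64>> {
    let key = (left_word, right_word, proximity);
    // Documents where the pair occurs at this proximity; the length of the
    // returned bitmap is the frequency we are interested in.
    let docids = self.index.word_pair_proximity_docids.get(self.rtxn, &key)?;
    Ok(docids.map(|docids| docids.len()))
}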

> 3. For writing test cases for this implementation, do we need to add a test database that mimics word_pair_proximity_docids, similar to the postings used for word_docids in the TestContext?

No, it's not necessary

Hello @akki1306 and @fuushyn

Thanks for your interest in this project 🔥 You are definitely more than welcome to open a PR for this!
Our contributing guidelines are available here and will guide you through your contribution. If you have any questions, you can also ask them here!

FYI, we prefer not to assign people to our issues, because sometimes people ask to be assigned and never come back, which discourages other volunteer contributors from opening a PR to fix the issue.
We will accept and merge the first PR that correctly and completely implements the issue following our contributing guidelines.

We are looking forward to reviewing your PR 😊

@ManyTheFish I am trying to run cargo test and am intermittently getting the following error for some of the tests. It points to an IO error with an invalid argument; it's not able to create the TempIndex.

thread 'update::word_prefix_pair_proximity_docids::tests::test_update' panicked at 'called Result::unwrap() on an Err value: IoError(Os { code: 22, kind: InvalidInput, message: "Invalid argument" })', milli/src/index.rs:1224:62

If I use cargo test -- --test-threads=1, this error doesn't occur. Also, when running these tests individually, they do not fail.

I am using macOS Monterey 12.6.

Am I missing something while running the tests?

Because the tests run on several threads, you'll have a lot of files open at the same time, so I suggest increasing the open file limit.

ulimit -Sn 10000 should fix your issue 😄

Thanks for the reply @ManyTheFish, I could run the tests successfully after increasing the open file count :). I have raised the PR. Please take a look.