solr lsh word2vec glove information-retrieval search-engine semantic-search vector lucene kmeans kmeans-quantization quantization locality-sensitive-hashing vector-quantization conceptual-search inverted-indexes invertedindex elasticsearch elmo bert

Vectors in Search

Dice.com code for implementing the ideas discussed in the following talks:

'Vectors in Search' - Activate 2018 conference
'Searching with Vectors' - Haystack 2019 conference

This extends my earlier work on 'Conceptual Search', which can be found at https://github.com/DiceTechJobs/ConceptualSearch (including slides and video links). In these talks, I present a number of different approaches for searching vectors at scale using an inverted index. The code implements several approaches to approximate k-nearest neighbor search, including:

LSH (using the Sim Hash)
K-Means Tree
Vector Thresholding

and describes how these ideas can be implemented and queried efficiently within an inverted index.
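
As a rough illustration of the LSH idea above, the following minimal sketch hashes a vector against random hyperplanes and turns each bit into a position-tagged token that an inverted index could store as an ordinary term. The function names, the token format ('position_bit') and the use of 16 bits are assumptions made for this example and are not taken from the repo's code.

```python
import random

def make_hyperplanes(num_bits, dim, seed=42):
    """Draw num_bits random Gaussian hyperplanes, each of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def sim_hash_bits(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    ]

def bits_to_tokens(bits):
    """Tag each bit with its position so it can be indexed as a term (hypothetical format)."""
    return ["{}_{}".format(i, b) for i, b in enumerate(bits)]

# The same hyperplanes must be used at index time and at query time.
planes = make_hyperplanes(num_bits=16, dim=4)
doc_tokens = bits_to_tokens(sim_hash_bits([0.2, -0.5, 0.1, 0.9], planes))
query_tokens = bits_to_tokens(sim_hash_bits([0.25, -0.4, 0.0, 0.8], planes))

# Vectors separated by a small angle tend to share more tokens, so the number
# of matching terms in an OR query acts as an approximate similarity score.
overlap = len(set(doc_tokens) & set(query_tokens))
```

Because each token carries its bit position, counting matching tokens is equivalent to counting agreeing hash bits, which is what lets a plain boolean OR query rank candidates.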

UPDATE: After I talked with Trey Grainger and Erik Hatcher from LucidWorks, they recommended using term frequencies in place of payloads for the solutions where I embed term weights into the index. Payloads incur a significant performance penalty, and switching to term frequencies would also remove the need for a special payload-aware similarity function. The challenge with this approach is negative weights: I assume it is not possible to encode negative term frequencies. This can be worked around by using different tokens for positively and negatively weighted terms, and making similar adjustments at query time (where negative boosts can be applied in Solr as needed).

Lucene Documentation: Lucene Delimited Term Frequency Filter
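
To make the workaround above concrete, here is a minimal sketch that thresholds a vector, quantizes the surviving weights to integer term frequencies, and emits separate tokens for positive and negative weights in a 'token|frequency' form. The 'pos_'/'neg_' prefixes, the threshold, the scale factor and the '|' delimiter are assumptions for illustration; check the delimiter and field configuration against your own Solr schema.

```python
def vector_to_tf_tokens(vector, threshold=0.1, scale=100, delimiter="|"):
    """Encode a dense vector as 'token|tf' pairs, dropping small components.

    Term frequencies must be positive integers, so the sign is moved into
    the token name (hypothetical 'pos_<dim>' / 'neg_<dim>' naming).
    """
    tokens = []
    for i, w in enumerate(vector):
        if abs(w) < threshold:
            continue  # vector thresholding: skip weak dimensions
        tf = max(1, int(round(abs(w) * scale)))
        sign = "pos" if w > 0 else "neg"
        tokens.append("{}_{}{}{}".format(sign, i, delimiter, tf))
    return " ".join(tokens)

def query_term_boosts(query_vector, threshold=0.1):
    """Map each query dimension to boosts on both the positive and negative token.

    A document token with the opposite sign receives a negative boost, so the
    summed score tracks the dot product of the thresholded vectors.
    """
    boosts = {}
    for i, w in enumerate(query_vector):
        if abs(w) < threshold:
            continue
        boosts["pos_{}".format(i)] = w
        boosts["neg_{}".format(i)] = -w
    return boosts

field_value = vector_to_tf_tokens([0.52, -0.03, -0.31, 0.0])
boosts = query_term_boosts([0.4, 0.0, -0.2, 0.05])
```

This assumes a similarity that scores each match roughly as term frequency times query boost; with the default BM25 similarity the term frequency is saturated, so the scores would only loosely follow the dot product.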

There has also been a recent update to Lucene core that is applicable here and, at the time of writing, is soon to make its way into Elasticsearch: Block Max WAND. This produces a significant speed-up for large boolean OR queries where you don't need to know the exact number of results and just care about getting the top-N results as fast as possible. All of the approaches I discuss here generate relatively large OR queries, so this is very relevant. I have also read that the current implementation of minimum-should-match includes similar optimizations, so the same sort of performance gain may already be attainable using appropriate mm settings, something that I was already experimenting with in my code.
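
As a small sketch of what such a query might look like, the snippet below builds the request parameters for a large OR query over hash tokens with a minimum-should-match setting, using Solr's edismax parser. The field name 'vector_tokens', the example tokens and the 50% mm value are hypothetical, and the snippet only constructs the parameters rather than contacting a Solr instance.

```python
from urllib.parse import urlencode

def build_or_query_params(tokens, field="vector_tokens", mm="50%", rows=10):
    """Build edismax parameters for an OR query over vector tokens.

    The field name and mm value are placeholders for illustration; tune mm
    to trade recall against the number of documents that must be scored.
    """
    return {
        "defType": "edismax",
        "q": " ".join(tokens),  # each token becomes an optional clause
        "qf": field,
        "mm": mm,               # require at least this share of clauses to match
        "rows": rows,           # only the top-N results are needed
        "fl": "id,score",
    }

params = build_or_query_params(["0_1", "1_0", "2_1", "3_1"])
query_string = urlencode(params)
```

Raising mm shrinks the candidate set the engine has to score, which is the lever referred to above; whether it matches the Block Max WAND gains for a given index is something to measure rather than assume.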

Directory Structure

python
- Code for implementing the k-means tree, LSH sim hash and vector thresholding algorithms, and for indexing and searching vectors in Solr using these techniques.
solr_plugins
- Java code for implementing the custom similarity classes and payloadEdismax parser described in the talk.
solr_configs
- XML snippets for importing the Solr plugins from the 'solr_vectors_in_search_plugins' Java code.

Implementation Details

Solr Version - 7.5
Python Version - 3.x+ (3.5 used)

Links to Talks

Activate 2018: 'Vectors in Search'
- Slides
- Video
Haystack 2019: 'Searching with Vectors'
- Slides
- Video

Author

Simon Hughes ( Chief Data Scientist, Dice.com )

LinkedIn - https://www.linkedin.com/in/simon-hughes-data-scientist/
Twitter - https://twitter.com/hughes_meister

About

Dice.com repo to accompany the Dice.com 'Vectors in Search' talk by Simon Hughes from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015.

http://www.dice.com


Apache License 2.0

Languages

Python 46.4%, Jupyter Notebook 30.3%, Java 23.3%