opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.

Home Page: https://opendistro.github.io/


[Question] How is the cosine similarity distance calculated when approximate search is used?

yana2301 opened this issue · comments

Hi, could you please provide more details on how the cosine similarity distance is calculated when approximate search is used?

The documentation states the following:

"From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how Elasticsearch scores results, where a greater score equates to a better result. To convert distances to Elasticsearch scores, we take 1 / (1 + distance)"

However, cosine similarity is the cosine of the angle between two vectors, and a higher value means the vectors are more similar. So:

angle = 0° → cosine similarity = 1 → vectors are very similar
angle = 90° → cosine similarity = 0 → vectors are not similar

So if ES score = 1/(1 + cosine similarity), it will be lower for similar vectors and higher for less similar ones.
Also, when I try a simple dataset with approximate search, I get the following results:

query vector = [ 2.3,3.4,1.2]
vector in dataset = [1.5, 2.5, 1.2]

cosine similarity = (2.3*1.5 + 3.4*2.5 + 1.2*1.2) / (sqrt(2.3*2.3 + 3.4*3.4 + 1.2*1.2) * sqrt(1.5*1.5 + 2.5*2.5 + 1.2*1.2)) = 0.99307153144

ES score according to formula = 1/(1+0.99307153144) = 0.50173813845
actual ES score = 0.45368767

Could you please provide the exact formula for how the ES score is calculated?
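
For reference, here is a small standalone sketch (in Java, not taken from the plugin) that reproduces the arithmetic above: the cosine similarity of the two vectors, and the score the documented 1 / (1 + distance) formula would give if the cosine similarity itself were plugged in as the distance.

```java
// Standalone sketch reproducing the arithmetic above; not plugin code.
public class CosineExample {

    // Plain cosine similarity: dot(a, b) / (|a| * |b|)
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {2.3, 3.4, 1.2};
        double[] doc = {1.5, 2.5, 1.2};

        double cos = cosineSimilarity(query, doc);
        System.out.println("cosine similarity = " + cos);                // ~0.9930715

        // Documented conversion, reading the cosine similarity itself as the "distance"
        System.out.println("1 / (1 + cos)     = " + (1.0 / (1.0 + cos))); // ~0.5017381
    }
}
```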

Hi @yana2301, yes, you are correct - the documentation is wrong. I will update it.

So nmslib, for its cosine similarity space, returns 1 - cosine similarity: https://github.com/nmslib/nmslib/blob/master/similarity_search/src/method/hnsw.cc#L80.

Here, in the k-NN plugin, we compute the Elasticsearch score: https://github.com/opendistro-for-elasticsearch/k-NN/blob/main/src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNWeight.java#L113

So what's actually being computed is 1/(1 + (1 - cosine similarity)). Does that make sense?

I will correct the documentation to include this.
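
To spell out the two steps, here is a minimal sketch (the method names are illustrative, not the plugin's actual ones):

```java
// Illustrative only: how the approximate-search score is derived from cosine similarity.
public class ApproximateScoreSketch {

    // Step 1: nmslib's cosine similarity space returns a distance of 1 - cos(a, b).
    static double nmslibCosineDistance(double cosineSimilarity) {
        return 1.0 - cosineSimilarity;
    }

    // Step 2: the plugin converts that distance to an Elasticsearch score with 1 / (1 + distance).
    static double toElasticsearchScore(double distance) {
        return 1.0 / (1.0 + distance);
    }

    public static void main(String[] args) {
        double cos = 0.99307153144; // value computed earlier in this thread
        double distance = nmslibCosineDistance(cos);
        double score = toElasticsearchScore(distance);
        System.out.println("distance = " + distance); // ~0.00693
        System.out.println("score    = " + score);    // ~0.99312, i.e. 1 / (1 + (1 - cos))
    }
}
```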

@jmazanec15 thanks for the response! But still, 1/(1 + (1 - 0.99307153144)) != 0.45368767.

Here are the schema, index data, and request that I used to get these results:

schema.txt
request.txt
bulk.txt

Yes, correct. But the ordering is maintained, and that's what we want from the Elasticsearch score.

We can't actually make the Elasticsearch score exactly cosineSimilarity because Elasticsearch scores cannot be negative. However, for our scoring script, we take 1 + cosineSimilarity: https://github.com/opendistro-for-elasticsearch/k-NN/blob/main/src/main/java/com/amazon/opendistroforelasticsearch/knn/plugin/script/KNNScoringSpace.java#L102. We could do the same for approximate k-NN search and it would be clearer, but the ordering would be the same as with 1/(1 + (1 - cosineSimilarity)), so we have not made that change.
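
The point about ordering can be checked directly; here is a small sketch (illustrative, not plugin code) comparing the two mappings:

```java
// Illustrative: both mappings are monotonically increasing in cosine similarity,
// so they rank documents in the same order.
public class ScoreOrderingSketch {
    public static void main(String[] args) {
        double[] cosines = {-1.0, -0.5, 0.0, 0.5, 0.9, 0.99, 1.0};
        for (double cos : cosines) {
            double approxScore = 1.0 / (1.0 + (1.0 - cos)); // approximate k-NN search
            double scriptScore = 1.0 + cos;                 // k-NN scoring script
            System.out.printf("cos=%5.2f  approx=%.4f  script=%.4f%n", cos, approxScore, scriptScore);
        }
    }
}
```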

@jmazanec15 I've found the reason for this metric calculation - #288.
I'm using ODFE 1.11, and I see that it has been fixed in 1.13. I'm using the Amazon Elasticsearch service, so I cannot migrate to version 1.13 quickly.
Could you please tell me whether there are any workarounds to make approximate search return the score according to the formula 1/(1 + (1 - cosineSimilarity)) in version 1.11?
The reason I need this is that we also specify min_score in the queries so that documents with low scores aren't returned. Our thresholds and vectors are built using the cosine similarity metric. I'm afraid we would have to rebuild the whole model if the actual formula uses L2.

Oh apologies, I did not read the question correctly.

Yes, that fix has been backported to Amazon Elasticsearch Service and ODFE 1.11 - #293.

To make sure your cluster has the fix, make sure you have taken the most recent software update.

Then, you will need to reindex your indices to rebuild the graphs using cosine similarity. Please let us know if there is any issue with this.
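
One practical note on the min_score point above (this is an assumption about your setup, not something the plugin provides): if your thresholds are expressed as a minimum cosine similarity, then once the index is rebuilt with cosine similarity, the equivalent min_score under score = 1/(1 + (1 - cos)) follows directly from the same formula.

```java
// Assumption: thresholds are defined as a minimum cosine similarity.
// Under score = 1 / (1 + (1 - cos)), the equivalent min_score is 1 / (2 - cosThreshold).
public class MinScoreSketch {
    static double toMinScore(double cosThreshold) {
        return 1.0 / (2.0 - cosThreshold);
    }

    public static void main(String[] args) {
        double cosThreshold = 0.8; // hypothetical cosine-similarity cutoff
        System.out.println("min_score = " + toMinScore(cosThreshold)); // ~0.8333
    }
}
```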

Great, thanks for the clarification!