mourjo / akin

Using BM25, find similar documents

Improve matches by using cosine similarity

mourjo opened this issue

Right now, scores are generated by taking one row's terms as the query. When this row has fewer terms than the row being scored against, the unmatched terms in the second row do not impact the score. Hence the outcome is sensitive to the choice of the row used as the query.
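To make the asymmetry concrete, here is a minimal sketch of query-style BM25 scoring. It is not the code in this repository: the function name `bm25_score`, the constants `K1` and `B`, the IDF variant, and the sample rows are all assumptions chosen for illustration. In textbook BM25 the sum runs only over the query's terms, so terms that appear only in the scored row enter the score through length normalisation alone.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # common BM25 defaults (assumed, not taken from the repo)

def bm25_score(query, doc, rows):
    """Score `doc` against `query`; only the query's terms are summed over."""
    n = len(rows)
    avg_len = sum(len(r) for r in rows) / n
    tf = Counter(doc)
    # Length normalisation is the only place the doc's extra terms show up.
    norm = K1 * (1 - B + B * len(doc) / avg_len)
    score = 0.0
    for term in set(query):
        df = sum(1 for r in rows if term in r)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        score += idf * tf[term] * (K1 + 1) / (tf[term] + norm)
    return score

# Hypothetical rows, already tokenised.
rows = [
    ["acme", "corp"],
    ["acme", "corp", "holdings", "international", "ltd"],
    ["globex", "corp"],
]
short, long_ = rows[0], rows[1]

short_as_query = bm25_score(short, long_, rows)
long_as_query = bm25_score(long_, short, rows)

# Same length as `long_`, but with different unmatched terms.
other = ["acme", "corp", "foo", "bar", "baz"]
other_score = bm25_score(short, other, rows)

print(short_as_query, long_as_query, other_score)
```

In this sketch the two directions give different scores for the same pair of rows, and swapping the unmatched terms of the longer row for unrelated ones leaves the score unchanged.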

A better approach might be to build a term vector for each row, with one component per term in the corpus, where each value is the TF-IDF weight of that term in the row. Using these vectors, we can compute cosine similarity to measure how similar or dissimilar the matching row and each candidate row are, which desensitises the outcome to the choice of the row on the left. When a row is selected for finding its best match, unmatched terms on the candidate row then lower that candidate's score, so a fitter match can be found if one exists.
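The proposal could be sketched as follows. This is one possible reading of the idea, not an implementation from this repository: the helper names `tfidf_vectors` and `cosine`, the smoothed IDF formula, and the sample rows are assumptions. Vectors are stored sparsely as dictionaries, which is equivalent to a corpus-wide vector with zeros for absent terms.

```python
import math
from collections import Counter

def tfidf_vectors(rows):
    """Build one sparse TF-IDF vector (term -> weight) per row."""
    n = len(rows)
    df = Counter(term for row in rows for term in set(row))
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0
           for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(row).items()}
            for row in rows]

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical rows, already tokenised.
rows = [
    ["acme", "corp"],
    ["acme", "corp", "holdings", "international", "ltd"],
    ["acme", "corp"],
]
vectors = tfidf_vectors(rows)

sim_exact = cosine(vectors[0], vectors[2])     # identical term sets
sim_extra = cosine(vectors[0], vectors[1])     # candidate has unmatched terms
sim_reversed = cosine(vectors[1], vectors[0])  # same pair, roles swapped

print(sim_exact, sim_extra, sim_reversed)
```

Because the norms of both vectors appear in the denominator, the unmatched terms on the longer row pull its similarity below that of the exact duplicate, and the score is the same whichever row is treated as the one on the left.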