castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

Home Page: http://pyserini.io/

Normalize embeddings when using a custom dense encoder?

orionw opened this issue · comments

Hi there!

Thanks for the great work on Pyserini! I had a naive question that I can't seem to find the answer to.

I'd like to use E5 (https://huggingface.co/intfloat/e5-large, though other models are similar), and its authors recommend normalizing the embeddings. I can't find an option for that in Pyserini, at least not outside a dense/sparse combination. I only want dense encodings, but I need the embeddings normalized to use E5 properly.

I've been using pyserini.encode so far but don't see any options in there for it. Does Pyserini support this?

Hi @orionw,
The AutoDocumentEncoder constructor takes an l2_norm argument:

def __init__(self, model_name, tokenizer_name=None, device='cuda:0', pooling='cls', l2_norm=False):

However, the option is not exposed in pyserini.encode as a CLI argument.
I'll create a pull request to add this.
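
For context, here is a minimal sketch of what L2 normalization does to an embedding and why it matters for models like E5: once both vectors have unit length, their inner product equals their cosine similarity. This is a standard-library illustration only, not Pyserini's implementation; the helper names l2_normalize and dot, and the eps guard, are assumptions made for the example.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit L2 length; eps (hypothetical guard) avoids division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Two toy "embeddings" that point in the same direction but differ in length.
query = l2_normalize([3.0, 4.0])
doc = l2_normalize([6.0, 8.0])

# After normalization the inner product is the cosine similarity,
# so vector length no longer influences the ranking score.
score = dot(query, doc)
print(score)
```

Without normalization, the same two vectors would score 50.0 under the inner product, so longer vectors would rank higher regardless of direction.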

Thanks a bunch @MXueguang!