castorini / pyserini

Hi there!

Thanks for the great work on Pyserini! I had a naive question that I can't seem to find the answer to.

I'd like to use E5 (https://huggingface.co/intfloat/e5-large, though other models are similar), and its authors recommend normalizing the embeddings. I can't find an option for that in Pyserini, except when combining dense and sparse retrieval. I'd like to use dense encodings only, but make sure the embeddings are normalized so that E5 is used properly.

I've been using pyserini.encode so far but don't see any options in there for it. Does Pyserini support this?

Hi @orionw,
AutoDocumentEncoder takes an l2_norm argument at initialisation.

pyserini/pyserini/encode/_auto.py, line 25 in b931e52:

    def __init__(self, model_name, tokenizer_name=None, device='cuda:0', pooling='cls', l2_norm=False):

However, the option is not exposed in pyserini.encode as a CLI argument.
I'll create a pull request to add this.
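
In the meantime, here is a rough sketch of what L2 normalization does to an embedding, in case it helps to see why it matters for E5-style models. It is plain Python and does not call Pyserini; the helper names `l2_normalize` and `dot` are made up for illustration, and the assumption is that l2_norm=True in the constructor quoted above applies the same scaling to each embedding.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy "embeddings": same direction, different magnitudes.
doc = l2_normalize([3.0, 4.0])
query = l2_normalize([6.0, 8.0])

# After normalization the inner product equals the cosine similarity,
# so vectors pointing the same way score (approximately) 1.0.
similarity = dot(doc, query)

# Assumption, not verified here: per the signature quoted above, the
# Pyserini equivalent would be constructing the encoder with l2_norm=True,
# e.g. AutoDocumentEncoder('intfloat/e5-large', l2_norm=True).
```

Once both query and document vectors have unit length, inner-product search ranks results by cosine similarity, which is presumably why the E5 authors recommend normalizing.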

Thanks a bunch @MXueguang!

#1722

Normalize embeddings when using a custom dense encoder?