Normalize embeddings when using a custom dense encoder?
orionw opened this issue
Hi there!
Thanks for the great work on Pyserini! I had a naive question that I can't seem to find the answer to.
I'd like to use E5 (https://huggingface.co/intfloat/e5-large, though other models are similar), and its authors recommend normalizing the embeddings. I can't find an option for that in Pyserini outside of a dense/sparse combination. I'd like to use dense encodings only, but make sure the embeddings are normalized so that E5 is used properly.
I've been using pyserini.encode so far but don't see any option for it there. Does Pyserini support this?
Hi @orionw,
The AutoDocumentEncoder has the argument l2_norm for initialisation.
pyserini/pyserini/encode/_auto.py
Line 25 in b931e52
However, the option is not exposed in pyserini.encode as a CLI argument. I'll create a pull request to add this.
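For readers unsure what the l2_norm option changes, here is a minimal sketch of L2 normalization in plain Python. It is not Pyserini's implementation. The helper names l2_normalize and dot, and the toy vectors, are made up for illustration. The point is that once both vectors have unit length, the inner product used for dense retrieval equals cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (hypothetical helper, not a Pyserini function)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # A zero vector cannot be normalized; return a copy unchanged.
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy "embeddings" that point in the same direction but differ in length.
query = l2_normalize([3.0, 4.0])
doc = l2_normalize([6.0, 8.0])

# After normalization the inner product is the cosine similarity,
# so vector length no longer affects the score.
score = dot(query, doc)
print(score)
```

Without normalization the same two vectors would score 50.0 under the inner product, so the longer document vector would be favoured purely because of its magnitude.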
Thanks a bunch @MXueguang!