Repository from GitHub: https://github.com/oshizo/JapaneseEmbeddingEval

⚠️ IMPORTANT UPDATE (2024/10/8): JMTEB, a leaderboard that evaluates embedding models on a more diverse set of tasks, has been released. We recommend referring to it.

JapaneseEmbeddingEval

  • JSTS/JSICK: Spearman's rank correlation coefficient
    • Cosine similarity was used to calculate the similarity of sentence pairs.
  • MIRACL: top-30 recall
| Model | #dims | #params | JSTS valid-v1.1 | JSICK test | MIRACL dev | Average |
|---|---|---|---|---|---|---|
| BAAI/bge-m3 (dense_vecs) | 1024 | 567M | 0.802 | 0.798 | 0.910 [1] | 0.837 |
| jinaai/jina-embeddings-v3 | 1024 | 572M | 0.819 | 0.782 | 0.862 | 0.821 |
| MU-Kindai/SBERT-JSNLI-base | 768 | 110M | 0.766 | 0.652 | 0.326 | 0.581 |
| MU-Kindai/SBERT-JSNLI-large | 1024 | 337M | 0.774 | 0.677 | 0.278 | 0.576 |
| bclavie/fio-base-japanese-v0.1 [2] | 768 | 111M | 0.863 | 0.894 | 0.718 | 0.825 |
| cl-nagoya/ruri-small | 768 | 67M | 0.821 | 0.833 | 0.791 [1] | 0.815 |
| cl-nagoya/ruri-base | 768 | 111M | 0.833 | 0.823 | 0.846 [1] | 0.834 |
| cl-nagoya/ruri-large | 1024 | 337M | 0.842 | 0.819 | 0.864 [1] | 0.842 |
| cl-nagoya/sup-simcse-ja-base | 768 | 111M | 0.809 | 0.827 | 0.527 | 0.721 |
| cl-nagoya/sup-simcse-ja-large | 1024 | 337M | 0.831 | 0.831 | 0.507 | 0.723 |
| cl-nagoya/unsup-simcse-ja-base | 768 | 111M | 0.789 | 0.790 | 0.487 | 0.689 |
| cl-nagoya/unsup-simcse-ja-large | 1024 | 337M | 0.814 | 0.796 | 0.485 | 0.699 |
| colorfulscoop/sbert-base-ja | 768 | 110M | 0.742 | 0.657 | 0.254 | 0.551 |
| intfloat/multilingual-e5-small | 384 | 117M | 0.789 | 0.814 | 0.847 [1] | 0.817 |
| intfloat/multilingual-e5-base | 768 | 278M | 0.796 | 0.806 | 0.845 [1] | 0.816 |
| intfloat/multilingual-e5-large | 1024 | 559M | 0.819 | 0.794 | 0.883 [1] | 0.832 |
| intfloat/multilingual-e5-large-instruct | 1024 | 559M | 0.832 | 0.822 | 0.876 [1] | 0.844 |
| oshizo/sbert-jsnli-luke-japanese-base-lite | 768 | 133M | 0.811 | 0.726 | 0.497 | 0.678 |
| pkshatech/GLuCoSE-base-ja-v2 | 768 | 133M | 0.809 | 0.849 | 0.879 [1] | 0.846 |
| pkshatech/RoSEtta-base-ja | 768 | 190M | 0.790 | 0.835 | 0.845 [1] | 0.823 |
| pkshatech/GLuCoSE-base-ja | 768 | 133M | 0.818 | 0.757 | 0.692 | 0.755 |
| pkshatech/simcse-ja-bert-base-clcmlp | 768 | 111M | 0.801 | 0.735 | 0.544 | 0.693 |
| API | | | | | | |
| text-embedding-3-large | 3072 | | 0.838 | 0.812 | 0.841 [3] | 0.830 |
| text-embedding-3-small | 1536 | | 0.781 | 0.804 | 0.795 [3] | 0.793 |
| text-embedding-ada-002 | 1536 | | 0.790 | 0.790 | 0.728 [3] | 0.769 |
| textembedding-gecko-multilingual@001 | 768 | | 0.801 | 0.804 | 0.800 [3] | 0.801 |
| LLM | | | | | | |
| intfloat/e5-mistral-7b-instruct | 4096 | 7.3B | 0.836 | 0.836 | 0.885 | 0.852 |
| oshizo/japanese-e5-mistral-7b_slerp | 4096 | 7.3B | 0.846 | 0.842 | 0.886 | 0.858 |
| oshizo/japanese-e5-mistral-1.9b | 4096 | 1.9B | 0.826 | 0.833 | 0.797 | 0.819 |
| ColBERT | | | | | | |
| bclavie/jacolbert_first_100 [4] | 128/token | 111M | | | 0.872 [3] | |
| bclavie/JaColBERTv2 [4] | 128/token | 111M | | | 0.918 [3] | |
| BAAI/bge-m3 (colbert_vecs) | 1024/token | 567M | 0.799 | 0.798 | 0.917 [1] | 0.838 |
| BAAI/bge-m3 (colbert+sparse+dense) | 1024/token [5] | 567M | 0.800 | 0.805 | 0.926 [1] | 0.844 |
| Reranker | | | | | | |
| hotchpotch/japanese-bge-reranker-v2-m3-v1 | - | 567M | | | 0.947 [1] | |
| Sparse Retrieval | | | | | | |
| hotchpotch/japanese-splade-base-v1 | - | 111M | | | 0.925 [1] | |
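
The JSTS/JSICK scores are Spearman rank correlations between the cosine similarity of each sentence pair and the gold similarity label. The following is a minimal sketch of that computation, not the repository's actual code: the function names and the toy vectors are hypothetical, and the embeddings are assumed to have been computed already by whichever model is being evaluated.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def sts_score(embeddings_a, embeddings_b, gold_scores):
    """Correlate per-pair cosine similarities with the gold labels."""
    predicted = [cosine_similarity(u, v) for u, v in zip(embeddings_a, embeddings_b)]
    return spearman(predicted, gold_scores)


# Toy data (hypothetical): three sentence pairs with 2-dimensional embeddings.
emb_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
emb_b = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
gold = [5.0, 3.0, 0.0]
score = sts_score(emb_a, emb_b, gold)  # close to 1.0: both orderings agree
```

Because only the ranks matter, a model is not penalised for producing cosine similarities on a different scale from the gold labels, as long as the ordering of the pairs agrees.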

Datasets

  • JSTS valid-v1.1

  • JSICK test

  • MIRACL dev

    • https://huggingface.co/datasets/miracl/miracl
    • 860 Japanese queries
    • To reduce computation time, the passages to be searched were selected from the 6,953,614 Japanese passages in miracl/miracl-corpus as follows.
      1. positive passage for each query
      2. 300 hard negatives for each query
      • Hard negative mining was performed using intfloat/multilingual-e5-base
      • Because the search pool is restricted in this way, scores for models other than intfloat/multilingual-e5-base can be overestimated, but only in the following case, and we believe the effect is negligible.
        • A negative that intfloat/multilingual-e5-base ranks below the top 300 (and that is therefore missing from the pool) would have been ranked within the top 30 by the evaluated model, pushing the positive down to rank 31 or lower.
    • Some queries have more than 30 potential positive documents in miracl-corpus. For these queries, even a very good model may fail to rank the ground-truth positive documents within the top 30. We estimate that such queries make up about 7% to 10% of the 860 queries. The estimate was obtained by looking up the TyDi QA data for the same query as each MIRACL dev query and checking whether the TyDi QA answer phrase appears in at least 30 of the 300 hard negative documents.
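
The MIRACL setup above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's implementation: it assumes the search pool is the union of every query's positives and hard negatives, it assumes top-30 recall means the fraction of a query's positives found in the top k averaged over queries, and all function names and toy data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_candidate_pool(positives, hard_negatives):
    """Union of all positives and hard negatives (assumed pool definition)."""
    pool = set()
    for qid, positive_ids in positives.items():
        pool.update(positive_ids)
        pool.update(hard_negatives.get(qid, []))
    return pool


def recall_at_k(query_vecs, doc_vecs, positives, pool, k=30):
    """Mean over queries of the fraction of positives ranked in the top k."""
    recalls = []
    for qid, q_vec in query_vecs.items():
        # Sort ids first so that ties in similarity break deterministically.
        ranked = sorted(
            sorted(pool),
            key=lambda doc_id: cosine_similarity(q_vec, doc_vecs[doc_id]),
            reverse=True,
        )
        top_k = set(ranked[:k])
        relevant = set(positives[qid])
        recalls.append(len(top_k & relevant) / len(relevant))
    return sum(recalls) / len(recalls)


def has_many_potential_positives(answer_phrase, negative_texts, threshold=30):
    """Flag a query whose answer phrase occurs in at least `threshold` negatives."""
    hits = sum(1 for text in negative_texts if answer_phrase in text)
    return hits >= threshold


# Toy data (hypothetical): two queries, three documents, 2-dimensional vectors.
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.9, 0.1]}
query_vecs = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
positives = {"q1": ["d1"], "q2": ["d2", "d3"]}
hard_negatives = {"q1": ["d3", "d2"], "q2": ["d1"]}

pool = build_candidate_pool(positives, hard_negatives)
recall_top1 = recall_at_k(query_vecs, doc_vecs, positives, pool, k=1)
recall_top2 = recall_at_k(query_vecs, doc_vecs, positives, pool, k=2)
```

In the real setting k is 30 and each query contributes 300 hard negatives, so `has_many_potential_positives` would be called with the 300 hard negative texts of one query and the answer phrase for that query.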

Footnotes

  1. These models have been fine-tuned using the MIRACL dataset, so the MIRACL task is not an unseen task for them. For detailed information on each model, please refer to the following links: multilingual-e5, BGE-M3, hotchpotch/japanese-bge-reranker-v2-m3-v1, hotchpotch/japanese-splade-base-v1, Ruri, pkshatech/GLuCoSE-base-ja-v2, pkshatech/RoSEtta-base-ja.

  2. According to the blog post about fio-base-japanese-v0.1, the tasks aren't unseen by the model, which makes it hard to directly compare with the other models.

  3. Evaluated on only the first 100 of the 860 queries.

  4. JaColBERT is a retrieval model. It is optimised only for document retrieval tasks, and not for semantic similarity/entailment tasks like JSTS or JSICK.

  5. The embedding dimension is 1024 for dense, one float value per unique token for sparse, and 1024 per token for ColBERT.
