C-MTEB evaluation results for piccolo-large-zh (one headline metric per dataset):

Dataset (split) | Task | Metric | Score
AFQMC (validation) | STS | cos_sim_spearman | 54.17
ATEC (test) | STS | cos_sim_spearman | 54.28
AmazonReviewsClassification (zh, test) | Classification | accuracy | 40.33
BQ (test) | STS | cos_sim_spearman | 62.31
CLSClusteringP2P (test) | Clustering | v_measure | 38.98
CLSClusteringS2S (test) | Clustering | v_measure | 36.04
CMedQAv1 (test) | Reranking | map | 84.79
CMedQAv2 (test) | Reranking | map | 84.89
CmedqaRetrieval (dev) | Retrieval | ndcg_at_10 | 41.98
Cmnli (validation) | PairClassification | max_ap | 84.01
CovidRetrieval (dev) | Retrieval | ndcg_at_10 | 85.04
DuRetrieval (dev) | Retrieval | ndcg_at_10 | 87.97
EcomRetrieval (dev) | Retrieval | ndcg_at_10 | 61.91
IFlyTek (validation) | Classification | accuracy | 44.25
JDReview (test) | Classification | accuracy | 86.10
LCQMC (test) | STS | cos_sim_spearman | 75.81
MMarcoRetrieval (dev) | Retrieval | ndcg_at_10 | 77.83
MassiveIntentClassification (zh-CN, test) | Classification | accuracy | 68.00
MassiveScenarioClassification (zh-CN, test) | Classification | accuracy | 72.08
MedicalRetrieval (dev) | Retrieval | ndcg_at_10 | 59.04
MMarcoReranking (dev) | Reranking | map | 27.27
MultilingualSentiment (validation) | Classification | accuracy | 70.15
Ocnli (validation) | PairClassification | max_ap | 72.92
OnlineShopping (test) | Classification | accuracy | 90.27
PAWSX (test) | STS | cos_sim_spearman | 38.31
QBQTC (test) | STS | cos_sim_spearman | 38.22
STS22 (zh, test) | STS | cos_sim_spearman | 66.66
STSB (test) | STS | cos_sim_spearman | 74.43
T2Reranking (dev) | Reranking | map | 66.96
T2Retrieval (dev) | Retrieval | ndcg_at_10 | 82.47
TNews (validation) | Classification | accuracy | 46.54
ThuNewsClusteringP2P (test) | Clustering | v_measure | 60.58
ThuNewsClusteringS2S (test) | Clustering | v_measure | 52.56
VideoRetrieval (dev) | Retrieval | ndcg_at_10 | 71.18
Waimai (test) | Classification | accuracy | 85.54

piccolo-large-zh

piccolo is a general-purpose Chinese text embedding model trained by the General Model Group at SenseTime Research. Drawing on the training recipes of E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, we collected and crawled 400 million weakly supervised Chinese text pairs from the web and optimized the model with a pairwise (text, text_pos) softmax contrastive loss. In the second stage, we curated 20 million human-labeled Chinese text pairs and fine-tuned the model with a triplet (text, text_pos, text_neg) softmax contrastive loss that incorporates hard negatives. We currently release two model sizes: piccolo-base-zh and piccolo-large-zh.

Metric

We compared piccolo with other open-source embedding models on the C-MTEB benchmark; see the C-MTEB leaderboard for details. Scripts for reproducing our results are provided in the eval folder.

Model Name Model Size (GB) Dimension Sequence Length Average (35) Classification (9) Clustering (4) Pair Classification (2) Reranking (4) Retrieval (8) STS (8)
[piccolo-large-zh] 0.65 1024 512 64.11 67.03 47.04 78.38 65.98 70.93 58.02
[bge-large-zh] 1.3 1024 512 63.96 68.32 48.39 78.94 65.11 71.52 54.98
[piccolo-base-zh] 0.2 768 512 63.66 66.98 47.12 76.61 66.68 71.2 55.9
[bge-large-zh-no-instruct] 1.3 1024 512 63.4 68.58 50.01 76.77 64.9 70.54 53
[bge-base-zh] 0.41 768 512 62.8 67.07 47.64 77.5 64.91 69.53 54.12

Usage

piccolo can be used easily with the sentence-transformers package:

# For s2s (sentence-to-sentence) datasets, use piccolo as follows:
from sentence_transformers import SentenceTransformer

sentences_1 = ["数据1", "数据2"]
sentences_2 = ["数据3", "数据4"]
model = SentenceTransformer('sensenova/piccolo-large-zh')
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# For s2p (sentence-to-passage) datasets, we recommend adding an instruction
# prefix to help the model retrieve passages more accurately:
queries = ['query_1', 'query_2']
passages = ["doc_1", "doc_2"]
q_embeddings = model.encode(["查询:" + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(["结果:" + p for p in passages], normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
print(scores)

Training Detail

pretrain

Pretraining usually does not require a large max length; 128 is recommended. The small max length increases the batch size and speeds up training, which suits large-scale data. For the pretraining loss we use a pairwise softmax contrastive loss with in-batch negatives only, without adding hard negatives. In practice, we trained on 32 40GB A100 GPUs with a per-GPU batch size of 1024.
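The pairwise in-batch contrastive objective described above can be sketched as follows. This is a minimal NumPy illustration, not the actual training code; the temperature value of 0.05 is an assumption.

```python
import numpy as np

def in_batch_contrastive_loss(q, p, temperature=0.05):
    """Pairwise softmax contrastive loss with in-batch negatives.

    q, p: (batch, dim) L2-normalized embeddings; q[i] and p[i] form a
    positive pair, and every p[j] (j != i) acts as a negative for q[i].
    """
    logits = q @ p.T / temperature               # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The "correct class" for row i is column i (its paired positive).
    idx = np.arange(len(q))
    return -log_softmax[idx, idx].mean()

# Toy check: correctly aligned pairs give a lower loss than misaligned ones.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 32))
q /= np.linalg.norm(q, axis=1, keepdims=True)
aligned = in_batch_contrastive_loss(q, q)         # positives identical to queries
shuffled = in_batch_contrastive_loss(q, q[::-1])  # positives misaligned
print(aligned < shuffled)  # True
```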

finetune

For fine-tuning, we usually expand the max length to 512 to accommodate longer text inputs, and we sample more S2P data to strengthen the model's retrieval performance. The fine-tuning loss is a triplet contrastive loss with hard negatives; the number of negatives is usually set to 2-7, and the loss computation follows the improved contrastive loss described in the GTE paper. Note: we set different max lengths for queries and passages, and the query max length is always kept at 64.
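A triplet contrastive loss with explicit hard negatives can be sketched as below. This is an illustrative NumPy version under assumed settings (temperature 0.05); the actual loss follows GTE's improved contrastive loss.

```python
import numpy as np

def triplet_contrastive_loss(q, pos, negs, temperature=0.05):
    """Softmax contrastive loss over one positive and several hard negatives.

    q:    (batch, dim)        L2-normalized query embeddings
    pos:  (batch, dim)        L2-normalized positive embeddings
    negs: (batch, n_neg, dim) L2-normalized hard-negative embeddings
    """
    pos_logit = (q * pos).sum(axis=1, keepdims=True) / temperature   # (batch, 1)
    neg_logits = np.einsum('bd,bnd->bn', q, negs) / temperature      # (batch, n_neg)
    logits = np.concatenate([pos_logit, neg_logits], axis=1)         # column 0 = positive
    logits -= logits.max(axis=1, keepdims=True)
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_softmax[:, 0].mean()

rng = np.random.default_rng(0)
norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
q = norm(rng.normal(size=(4, 32)))
negs = norm(rng.normal(size=(4, 3, 32)))  # 3 hard negatives per query (2-7 in practice)
easy = triplet_contrastive_loss(q, q, negs)           # positive identical to query
hard = triplet_contrastive_loss(q, negs[:, 0], negs)  # positive is itself a negative
print(easy < hard)  # True
```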

Others

一些有用的trick:

  1. 减小显存的方式: fp16 + gradient checkpointing + ZERO STAGE1 (stage2 不支持双塔结构下的gradient checkpointing) 相关issue见: microsoft/DeepSpeed#988
  2. dataset sampler,我们采用了M3E的dataset sampler,用以保证每个batch里的样本均来自于一个dataset,负样本更有价值。
  3. instruction。instruction在我们的实验中对retrieval任务有非常大的性能提升,我们在每个训练样本前都加入'查询: '和'结果: '这样的instruction。

some useful tricks:

  1. The way to reduce memory usage: fp16 + gradient checkpointing + ZERO STAGE1 (stage2 does not support gradient checkpointing under the double-tower structure) For related issues, see: https://github.com/microsoft/DeepSpeed/issues/ 988
  2. Dataset sampler, we use M3E's dataset sampler to ensure that the samples in each batch come from a dataset, and negative samples are more valuable.
  3. instruction. Instruction has greatly improved the performance of the retrieval task in our experiments. We added instructions like 'query: ' and 'result: ' before each training sample.
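Trick 2 can be sketched as a batch sampler that never mixes datasets within one batch. This is a hypothetical minimal version, assuming the input is a dict mapping dataset names to example indices; M3E's actual sampler may differ.

```python
import random

def single_dataset_batches(datasets, batch_size, seed=0):
    """Yield batches of (dataset_name, index) where every batch is drawn
    from a single dataset, so in-batch negatives share one distribution."""
    rng = random.Random(seed)
    batches = []
    for name, indices in datasets.items():
        idx = list(indices)
        rng.shuffle(idx)
        # Drop the trailing partial batch to keep batch sizes uniform.
        for i in range(0, len(idx) - batch_size + 1, batch_size):
            batches.append([(name, j) for j in idx[i:i + batch_size]])
    rng.shuffle(batches)  # interleave datasets across training steps
    return batches

# Usage: two toy "datasets" with 5 and 4 examples, batch size 2.
batches = single_dataset_batches({"lcqmc": range(5), "t2": range(4)}, batch_size=2)
assert all(len({name for name, _ in b}) == 1 for b in batches)  # one dataset per batch
```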

Reference

Here we list the embedding projects and papers we referenced:

  1. M3E. A great Chinese open-source embedding project that collects and organizes many high-quality Chinese datasets; uniem is also a good framework.
  2. Text2vec. Another great Chinese open-source embedding project.
  3. FlagEmbedding. An open-source embedding model from BAAI (Zhiyuan AI); they also collected and organized the C-MTEB benchmark, filling the gap in systematic evaluation of Chinese embeddings.
  4. E5. A paper from Microsoft, with very detailed ablation studies and data processing and filtering details.
  5. GTE. An embedding paper from Alibaba DAMO Academy.

License

piccolo is released under the MIT License and can be used commercially free of charge.

Acknowledgement

piccolo was trained by the General Model Group at SenseTime Research. Jinkin completed the code implementation and model training; Jinkin and CCCCxxx together completed the data collection, processing, and model evaluation. The project is led by Gaomengya and chaorenwu111. We also thank lux0933 and yangkai001 for discussions that provided many useful suggestions.
