ZhuiyiTechnology / simbert

a bert for retrieval and generation


Fine-tuning on the LCQMC dataset degrades performance

elihuan1990 opened this issue · comments

After fine-tuning SimBERT on the LCQMC dataset, the Spearman score on the test set drops by about one point. What is the right way to fine-tune SimBERT?

You can fine-tune it the Sentence-BERT way.
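For reference, here is a minimal sketch of what Sentence-BERT-style fine-tuning on LCQMC could look like with bert4keras: a siamese SimBERT encoder whose pooled vectors u, v feed the SBERT classification head over [u, v, |u−v|]. The paths, maxlen, learning rate, and the `encode_pairs` helper are illustrative assumptions, not code from this repo; at evaluation time you would use the encoder alone and rank sentence pairs by cosine similarity.

```python
# Hedged sketch: Sentence-BERT-style fine-tuning of the SimBERT encoder on LCQMC.
# Paths, maxlen, learning rate and encode_pairs are illustrative assumptions.
from bert4keras.backend import keras, K
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.snippets import sequence_padding
import numpy as np

maxlen = 64
config_path = 'chinese_simbert_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'chinese_simbert_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'chinese_simbert_L-12_H-768_A-12/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Shared SimBERT encoder: maps (token_ids, segment_ids) to the pooled [CLS] vector
encoder = build_transformer_model(config_path, checkpoint_path, with_pool='linear')

# Siamese inputs for the two sentences of an LCQMC pair
t1 = keras.layers.Input(shape=(None,), name='token_ids_1')
s1 = keras.layers.Input(shape=(None,), name='segment_ids_1')
t2 = keras.layers.Input(shape=(None,), name='token_ids_2')
s2 = keras.layers.Input(shape=(None,), name='segment_ids_2')

u = encoder([t1, s1])
v = encoder([t2, s2])

# Sentence-BERT classification objective: softmax over [u, v, |u - v|]
diff = keras.layers.Lambda(lambda x: K.abs(x[0] - x[1]))([u, v])
features = keras.layers.Concatenate()([u, v, diff])
probs = keras.layers.Dense(2, activation='softmax')(features)

train_model = keras.models.Model([t1, s1, t2, s2], probs)
train_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.Adam(2e-5),
    metrics=['accuracy'],
)

def encode_pairs(pairs):
    """pairs: iterable of (text1, text2, label) from LCQMC, label in {0, 1}."""
    X1, S1, X2, S2, Y = [], [], [], [], []
    for text1, text2, label in pairs:
        a, b = tokenizer.encode(text1, max_length=maxlen)
        c, d = tokenizer.encode(text2, max_length=maxlen)
        X1.append(a); S1.append(b); X2.append(c); S2.append(d); Y.append(int(label))
    return ([sequence_padding(X1), sequence_padding(S1),
             sequence_padding(X2), sequence_padding(S2)], np.array(Y))

# train_pairs = [...]  # load the LCQMC train split yourself as (text1, text2, label)
# x_train, y_train = encode_pairs(train_pairs)
# train_model.fit(x_train, y_train, batch_size=32, epochs=2)
# Afterwards, use `encoder` alone and compute the Spearman score over
# cosine similarities of the two sentence vectors.
```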

After training the model with simbert.py and saving best_model.weights, how do I load best_model.weights and test it?
```python
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model
from keras.models import Model
import numpy as np

maxlen = 32  # same value as in simbert.py

config_path = '/home/rca/research/simbert/root/kg/bert/chinese_simbert_L-12_H-768_A-12/bert_config.json'
checkpoint_path = '/home/rca/research/simbert/root/kg/bert/chinese_simbert_L-12_H-768_A-12/bert_model.ckpt'
dict_path = '/home/rca/research/simbert/root/kg/bert/chinese_simbert_L-12_H-768_A-12/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Rebuild the same architecture as simbert.py: pooled sentence vector + UniLM LM head
bert = build_transformer_model(
    config_path,
    checkpoint_path,
    with_pool='linear',
    application='unilm',
    return_keras_model=False,
)
model = Model(inputs=bert.model.inputs, outputs=bert.model.outputs)
# Load the fine-tuned weights saved during training; by_name=True is required here
model.load_weights('./best_model.weights', by_name=True)

test_sentence = "微信和支付宝哪个好?"

def gen_similar_sentences(text, n=10, k=10):
    # gen_synonyms is the generation function defined in simbert.py
    similar_sentences = gen_synonyms(text, n, k)
    return similar_sentences

token_ids, segment_ids = tokenizer.encode(test_sentence, max_length=maxlen)

# outputs[0] is the pooled sentence vector, outputs[1] the per-token vocab distribution
cls_vector, token_probs = model.predict([np.array([token_ids]), np.array([segment_ids])])
output_ids = token_probs[0].argmax(axis=-1)  # greedy readout of the LM head, not real generation

generated_sentence = tokenizer.decode(output_ids)

print(f"Input sentence: {test_sentence}")
print(f"Decoded sentence: {generated_sentence}")
print("Similar sentences:")
similar_sentences = gen_similar_sentences(test_sentence)
for idx, sentence in enumerate(similar_sentences):
    print(f"{idx + 1}. {sentence}")
```
Is this the right way to write it?

My approach is simply `from simbert import gen_synonyms`; that way the model loads the new weights.
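For instance, a small usage sketch; it assumes simbert.py sits on the import path and, as described in the comment above, reloads the saved weights when imported rather than run as a script:

```python
# Importing simbert (instead of running it) loads the saved fine-tuned weights
# and exposes gen_synonyms directly, as described above.
from simbert import gen_synonyms

# n: number of candidates to sample, k: number of top-ranked results to keep
for sentence in gen_synonyms(u'微信和支付宝哪个好?', n=50, k=10):
    print(sentence)
```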