ZhuiyiTechnology / simbert

a bert for retrieval and generation


Fine-tuning on the LCQMC dataset degrades performance

elihuan1990 opened this issue · comments

After fine-tuning SimBERT on the LCQMC dataset, the Spearman score on the test set drops by about one point. What is the right way to fine-tune SimBERT?

You can fine-tune it the Sentence-BERT way.
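For reference, here is a minimal sketch of what Sentence-BERT-style fine-tuning on LCQMC could look like with bert4keras: a siamese SimBERT encoder whose pooled vectors u, v feed the SBERT classification head over [u, v, |u−v|]. The paths, maxlen, learning rate, and the `encode_pairs` helper are illustrative assumptions, not code from this repo; at evaluation time you would use the encoder alone and rank sentence pairs by cosine similarity.

```python
# Hedged sketch: Sentence-BERT-style fine-tuning of the SimBERT encoder on LCQMC.
# Paths, maxlen, learning rate and encode_pairs are illustrative assumptions.
from bert4keras.backend import keras, K
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.snippets import sequence_padding
import numpy as np

maxlen = 64
config_path = 'chinese_simbert_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'chinese_simbert_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'chinese_simbert_L-12_H-768_A-12/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Shared SimBERT encoder: maps (token_ids, segment_ids) to the pooled [CLS] vector
encoder = build_transformer_model(config_path, checkpoint_path, with_pool='linear')

# Siamese inputs for the two sentences of an LCQMC pair
t1 = keras.layers.Input(shape=(None,), name='token_ids_1')
s1 = keras.layers.Input(shape=(None,), name='segment_ids_1')
t2 = keras.layers.Input(shape=(None,), name='token_ids_2')
s2 = keras.layers.Input(shape=(None,), name='segment_ids_2')

u = encoder([t1, s1])
v = encoder([t2, s2])

# Sentence-BERT classification objective: softmax over [u, v, |u - v|]
diff = keras.layers.Lambda(lambda x: K.abs(x[0] - x[1]))([u, v])
features = keras.layers.Concatenate()([u, v, diff])
probs = keras.layers.Dense(2, activation='softmax')(features)

train_model = keras.models.Model([t1, s1, t2, s2], probs)
train_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.Adam(2e-5),
    metrics=['accuracy'],
)

def encode_pairs(pairs):
    """pairs: iterable of (text1, text2, label) from LCQMC, label in {0, 1}."""
    X1, S1, X2, S2, Y = [], [], [], [], []
    for text1, text2, label in pairs:
        a, b = tokenizer.encode(text1, max_length=maxlen)
        c, d = tokenizer.encode(text2, max_length=maxlen)
        X1.append(a); S1.append(b); X2.append(c); S2.append(d); Y.append(int(label))
    return ([sequence_padding(X1), sequence_padding(S1),
             sequence_padding(X2), sequence_padding(S2)], np.array(Y))

# train_pairs = [...]  # load the LCQMC train split yourself as (text1, text2, label)
# x_train, y_train = encode_pairs(train_pairs)
# train_model.fit(x_train, y_train, batch_size=32, epochs=2)
# Afterwards, use `encoder` alone and compute the Spearman score over
# cosine similarities of the two sentence vectors.
```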

After training the model with simbert.py and saving best_model.weights, how do I load best_model.weights and test it?
```python
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model
from keras.models import Model
import numpy as np

maxlen = 32  # same value as in simbert.py

config_path = '/home/rca/research/simbert/root/kg/bert/chinese_simbert_L-12_H-768_A-12/bert_config.json'
checkpoint_path = '/home/rca/research/simbert/root/kg/bert/chinese_simbert_L-12_H-768_A-12/bert_model.ckpt'
dict_path = '/home/rca/research/simbert/root/kg/bert/chinese_simbert_L-12_H-768_A-12/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Rebuild the same architecture as simbert.py: pooled sentence vector + UniLM LM head
bert = build_transformer_model(
    config_path,
    checkpoint_path,
    with_pool='linear',
    application='unilm',
    return_keras_model=False,
)
model = Model(inputs=bert.model.inputs, outputs=bert.model.outputs)
# Load the fine-tuned weights saved during training; by_name=True is required here
model.load_weights('./best_model.weights', by_name=True)

test_sentence = "微信和支付宝哪个好?"

def gen_similar_sentences(text, n=10, k=10):
    # gen_synonyms is the generation function defined in simbert.py
    similar_sentences = gen_synonyms(text, n, k)
    return similar_sentences

token_ids, segment_ids = tokenizer.encode(test_sentence, max_length=maxlen)

# outputs[0] is the pooled sentence vector, outputs[1] the per-token vocab distribution
cls_vector, token_probs = model.predict([np.array([token_ids]), np.array([segment_ids])])
output_ids = token_probs[0].argmax(axis=-1)  # greedy readout of the LM head, not real generation

generated_sentence = tokenizer.decode(output_ids)

print(f"Input sentence: {test_sentence}")
print(f"Decoded sentence: {generated_sentence}")
print("Similar sentences:")
similar_sentences = gen_similar_sentences(test_sentence)
for idx, sentence in enumerate(similar_sentences):
    print(f"{idx + 1}. {sentence}")
```
Is this the right way to write it?

My approach is simply `from simbert import gen_synonyms`; that way the model loads the new weights.
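For instance, a small usage sketch; it assumes simbert.py sits on the import path and, as described in the comment above, reloads the saved weights when imported rather than run as a script:

```python
# Importing simbert (instead of running it) loads the saved fine-tuned weights
# and exposes gen_synonyms directly, as described above.
from simbert import gen_synonyms

# n: number of candidates to sample, k: number of top-ranked results to keep
for sentence in gen_synonyms(u'微信和支付宝哪个好?', n=50, k=10):
    print(sentence)
```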