BrikerMan / Kashgari

Kashgari is a production-level NLP transfer-learning framework built on top of tf.keras for text labeling and text classification; it includes Word2Vec, BERT, and GPT2 language embeddings.

Home Page: http://kashgari.readthedocs.io/


[BUG] Custom model with multiple feature inputs using multiple embeddings: model fit raises an error; which other methods need to be redefined to support this?

Zikangli opened this issue · comments

You must follow the issue template and provide as much information as possible. Otherwise, this issue will be closed.

Check List

Thanks for considering opening an issue. Before you submit your issue, please confirm these boxes are checked.

You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.

Environment

  • OS [e.g. Mac OS, Linux]: linux
  • Python Version: python3.6.12
  • kashgari: 2.0.2

Issue Description

I defined a custom model that needs multiple feature inputs (words, part-of-speech tags, and named-entity categories). The word feature is obtained with BertEmbedding, the other features are initialized with BareEmbedding, and they are concatenated as the model input. The model definition itself is fine; I also registered it in the corresponding tasks/labeling/__init__.py and can call it. The error appears when calling fit.

A test extract of the custom model code is shown below (parameter definitions omitted); it is a sequence-labeling task:
def __init__(self,
             embedding: ABCEmbedding = None,
             posembedding: ABCEmbedding = None,
             nerembedding: ABCEmbedding = None,
             **kwargs):
    super(BiLSTM_TEST_Model, self).__init__()
    self.embedding = embedding
    self.posembedding = posembedding
    self.nerembedding = nerembedding

def build_model_arc(self) -> None:
    output_dim = self.label_processor.vocab_size

    config = self.hyper_parameters
    embed_model = self.embedding.embed_model
    embed_pos = self.posembedding.embed_model
    embed_ner = self.nerembedding.embed_model

    crf = KConditionalRandomField()
    bilstm = L.Bidirectional(L.LSTM(**config['layer_blstm']), name='layer_blstm')
    bilstm_dropout = L.Dropout(**config['layer_dropout'], name='layer_dropout')
    crf_dropout = L.Dropout(**config['layer_dropout'], name='crflayer_dropout')
    crf_dense = L.Dense(output_dim, **config['layer_time_distributed'])

    ## embed tensors for the three feature types
    tensor = embed_model.output           ## word-feature tensor from the BERT embed model
    tensor_inputs = [tensor]
    model_inputs = [embed_model.inputs]
    if embed_pos is not None:
        tensor_inputs.append(embed_pos.output)
        model_inputs.append(embed_pos.inputs)
    if embed_ner is not None:
        tensor_inputs.append(embed_ner.output)
        model_inputs.append(embed_ner.inputs)

    tensor_con = L.concatenate(tensor_inputs, axis=2)    ## concatenate all features as the model input
    bilstm_tensor = bilstm(tensor_con)
    bilstm_dropout_tensor = bilstm_dropout(bilstm_tensor)

    crf_dropout_tensor = crf_dropout(bilstm_dropout_tensor)
    crf_dense_tensor = crf_dense(crf_dropout_tensor)
    output = crf(crf_dense_tensor)

    self.tf_model = keras.Model(inputs=model_inputs, outputs=[output])
    self.crf_layer = crf
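
One detail worth double-checking in the snippet above, separate from the traceback reported below: embed_model.inputs is already a list, and BERT-style embed models usually expose two input tensors (token ids and segment ids), so model_inputs ends up as a list of lists. A minimal sketch of flattening it before building the Keras model, reusing the variables defined above:

    ## sketch only: give keras.Model a flat list of input tensors instead of
    ## the nested [embed.inputs, ...] structure built above
    flat_model_inputs = []
    for embed_inputs in model_inputs:
        if isinstance(embed_inputs, (list, tuple)):
            flat_model_inputs.extend(embed_inputs)   ## e.g. BERT: [token_ids, segment_ids]
        else:
            flat_model_inputs.append(embed_inputs)

    self.tf_model = keras.Model(inputs=flat_model_inputs, outputs=[output])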

The training code extract is as follows:
def trainFunction(.....):
    bert_embed = BertEmbedding('./Data/路径', sequence_length=maxlength)
    pos_embed = BareEmbedding(embedding_size=32)
    ner_embed = BareEmbedding(embedding_size=32)

    selfmodel = BiLSTM_TEST_Model(bert_embed, pos_embed, ner_embed, sequence_length=maxlength)
    history = selfmodel.fit(x_train=(train_x, train_pos_x, train_ner_x), y_train=train_y,
                            x_validate=(valid_x, valid_pos_x, valid_ner_x), y_validate=valid_y,
                            batch_size=batchsize, epochs=12)
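
Before calling fit, a quick sanity check in plain Python (nothing Kashgari-specific, using the variable names from the extract above) can confirm that the three feature lists are parallel:

    assert len(train_x) == len(train_pos_x) == len(train_ner_x) == len(train_y)
    for tokens, pos_tags, ner_tags in zip(train_x, train_pos_x, train_ner_x):
        ## every feature sequence must align token for token with the word sequence
        assert len(tokens) == len(pos_tags) == len(ner_tags)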

Reproduce

The error message is as follows:
File "/venv/lib/python3.6/site-packages/kashgari/tasks/labeling/abc_model.py", line 177, in fit
fit_kwargs=fit_kwargs)
File "/venv/lib/python3.6/site-packages/kashgari/tasks/labeling/abc_model.py", line 208, in fit_generator
self.build_model_generator([g for g in [train_sample_gen, valid_sample_gen] if g])
File "/venv/lib/python3.6/site-packages/kashgari/tasks/labeling/abc_model.py", line 85, in build_model_generator
self.text_processor.build_vocab_generator(generators)
File "/venv/lib/python3.6/site-packages/kashgari/processors/sequence_processor.py", line 84, in build_vocab_generator
count = token2count.get(token, 0)
TypeError: unhashable type: 'list'
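
For reference, the error itself is just Python refusing to use a list as a dictionary key, because lists are unhashable. A minimal standalone reproduction:

token2count = {}
token = ['word_a', 'pos_a']              ## a nested-list "token" instead of a plain string
count = token2count.get(token, 0)        ## raises TypeError: unhashable type: 'list'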

The failing location in Kashgari's build_vocab_generator():
def build_vocab_generator(self,
                          generators: List[CorpusGenerator]) -> None:
    if not self.vocab2idx:
        vocab2idx = self._initial_vocab_dic

        token2count: Dict[str, int] = {}

        for gen in generators:
            for sentence, label in tqdm.tqdm(gen, desc="Preparing text vocab dict"):
                if self.build_vocab_from_labels:
                    target = label
                else:
                    target = sentence
                for token in target:      ## my input is a nested list, so here each token is itself a list and the error is raised
                    count = token2count.get(token, 0)
                    token2count[token] = count + 1

I traced it with the debugger: the x_train I pass to fit has three parts, and the x produced by CorpusGenerator is likewise a nested structure of three lists, so the error is raised inside build_vocab_generator.

Do I need to redefine build_vocab_generator?
Apart from that, my model input needs three embedding models, so does my self.vocab2idx/idx2vocab also have to come in three versions? What else do I need to redefine? I keep getting lost in the debugger. T_T
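
One possible direction for the build_vocab_generator question, strictly as a sketch: if each sample handed to the processor really is a tuple of the three parallel feature sequences, a SequenceProcessor subclass could unwrap a single feature stream before delegating to the stock vocabulary-building code. The class name, the feature_index parameter, and the assumption about the sample structure are all hypothetical, not a documented Kashgari pattern:

from kashgari.processors import SequenceProcessor

class SingleFeatureProcessor(SequenceProcessor):
    """Hypothetical sketch: expose only one of the parallel feature streams
    (words, POS tags, NER tags) to the stock vocab-building logic."""

    def __init__(self, feature_index: int = 0, **kwargs):
        super(SingleFeatureProcessor, self).__init__(**kwargs)
        self.feature_index = feature_index

    def build_vocab_generator(self, generators) -> None:
        def unwrap(gen):
            for sentence, label in gen:
                ## assumes sentence == (tokens, pos_tags, ner_tags); keep one stream
                if sentence and isinstance(sentence[0], (list, tuple)):
                    sentence = sentence[self.feature_index]
                yield sentence, label

        wrapped = [unwrap(gen) for gen in generators]
        super(SingleFeatureProcessor, self).build_vocab_generator(wrapped)

Whether this is enough depends on how fit() packs the three x inputs into the generator samples, which is worth confirming in the debugger first.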

Please help!!

Do the model's text_processor and label_processor also need to be redefined accordingly?
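
On the vocab2idx/text_processor question: since words, POS tags, and NER tags live in separate vocabularies, one way to keep them apart (again only a sketch; it assumes CorpusGenerator from kashgari.generators can be built directly from an (x, y) pair, and it reuses the variable names from the training extract above) is to give each extra feature its own SequenceProcessor with its own vocab2idx/idx2vocab:

from kashgari.generators import CorpusGenerator
from kashgari.processors import SequenceProcessor

pos_processor = SequenceProcessor()    ## vocabulary for POS tags only
ner_processor = SequenceProcessor()    ## vocabulary for NER tags only

## feed each processor only its own feature column; the word feature keeps using
## the processor that comes with BertEmbedding
pos_processor.build_vocab_generator([CorpusGenerator(train_pos_x, train_y)])
ner_processor.build_vocab_generator([CorpusGenerator(train_ner_x, train_y)])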

commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.