wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit

Home Page: https://wenet-e2e.github.io/wenet/


[paraformer] When is ONNX GPU export supported?

willnufe opened this issue · comments

0. [Question] [paraformer] When is ONNX GPU export supported?

1. Version: **wenet-v3.0.1**

2. Attempted to convert Paraformer to ONNX GPU

  1. Based on the forward function below [wenet-main/examples/aishell/paraformer/wenet/paraformer/paraformer.py], I attempted a Paraformer ONNX GPU conversion:
    def forward_paraformer(
        self,
        speech: torch.Tensor,
        speech_lengths: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        res = self._forward_paraformer(speech, speech_lengths)
        return res['decoder_out'], res['decoder_out_lens'], res['tp_alphas']
  2. However, dynamic_axes does not take effect: the exported model can only handle inputs with exactly the same shape as the speech and speech_lengths used at export time. In addition, when the source audio is about one minute long, both the model conversion and the model loading for onnxruntime-gpu inference are very slow.
torch.onnx.export(
            model,
            (speech, speech_lengths),
            model_path,
            export_params=True,
            opset_version=13,
            do_constant_folding=True,
            input_names=["speech", "speech_lengths"],
            output_names=[
                "decoder_out",
                "decoder_out_lens",
                "tp_alphas"
            ],
            dynamic_axes={
                "speech": {
                    0: "B",
                    1: "T"
                },
                "speech_lengths": {
                    0: "B"
                },
                "decoder_out": {
                    0: "B",
                    1: "T_OUT"
                },
                "decoder_out_lens": {
                    0: "B"
                },
                "tp_alphas": {
                    0: "B",
                    1: "T_OUT1"
                },
            },
            verbose=True,
        ) 
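
For what it's worth, a quick way to check whether the exported graph actually kept the dynamic axes is to inspect the input/output shapes reported by onnxruntime: dynamic dimensions show up as the symbolic names given in dynamic_axes, baked-in ones as integers. A minimal sketch, assuming the model was saved to a hypothetical path paraformer.onnx:

import onnxruntime as ort

# Hypothetical path; substitute the model_path used in torch.onnx.export above.
sess = ort.InferenceSession("paraformer.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
for inp in sess.get_inputs():
    # Dynamic dims are reported as strings such as 'B' or 'T'; fixed dims as ints.
    print(inp.name, inp.shape)
for out in sess.get_outputs():
    print(out.name, out.shape)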

Please take a look at the ONNX conversion of the CIF part.

@Mddct

Please take a look at the ONNX conversion of the CIF part.

Yes, the problem is indeed mainly in the CIF part.
I made some attempts and can now successfully export an ONNX GPU model that supports dynamic input, but some issues remain.

  1. wenet-main/examples/aishell/paraformer/wenet/utils/mask.py
# Calling .item() here makes the exported ONNX model unable to support dynamic dimensions;
# inspecting the graph with Netron shows this position becomes a fixed constant.
# (A standalone sketch of this change follows after this list.)
# max_len = max_len if max_len > 0 else lengths.max().item()
max_len = lengths.max()
  2. wenet-main/examples/aishell/paraformer/wenet/paraformer/cif.py
class Cif(nn.Module):
    def forward(self, *args, **kwargs):  # signature abridged; excerpt only
        if target_length is None and self.tail_threshold > 0.0:

            # This part also seems problematic: it raises an int32 / int64 incompatibility error
            # token_num_int = torch.max(token_num).type(torch.int32).item()
            token_num_int = torch.max(token_num).type(torch.int64)
            acoustic_embeds = acoustic_embeds[:, :token_num_int, :]
  3. wenet-main/examples/aishell/paraformer/wenet/paraformer/cif.py
    1. Attempts

      • The body of the CIF function here uses a for loop; if exported directly, the number of loop iterations is fixed to whatever it was at ONNX export time.
      • Adding @torch.jit.script makes dynamic dimensions work, but the resulting model is very slow.
    2. Solution

    3. Remaining problems

      • Inference latency is unstable: for audio files padded to the same length (60 s), repeated runs in a loop take anywhere from 150 ms to 2000 ms.
      • There is some loss in recognition accuracy.
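
Regarding the .item() change in mask.py (item 1 above), here is a minimal standalone sketch of a padding-mask helper in the same spirit; the function name is illustrative, but it shows why keeping max_len as a 0-d tensor instead of a Python int preserves the dynamic time dimension during torch.onnx.export:

import torch


def make_pad_mask_dynamic(lengths: torch.Tensor, max_len: int = 0) -> torch.Tensor:
    """Return a (B, T) bool mask where True marks padding positions."""
    batch_size = lengths.size(0)
    # Keeping max_len as a tensor (lengths.max()) lets the exporter trace a
    # symbolic dimension; calling .item() would bake in a constant T.
    max_len = max_len if max_len > 0 else lengths.max()
    seq_range = torch.arange(0, max_len, dtype=torch.int64, device=lengths.device)
    seq_range_expand = seq_range.unsqueeze(0).expand(batch_size, -1)
    mask = seq_range_expand >= lengths.unsqueeze(-1)
    return mask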

For the first part, we will refactor this function; it also affects things like torch.compile.

For the second part, ONNX does support exporting for loops; I'll sort it out when I have time. You can search for "torch for loop to onnx", or align with the parallel CIF implementation and submit a PR.
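
A toy sketch of the for-loop-to-ONNX route mentioned above: under torch.jit.script, a loop whose trip count is derived from a tensor is exported as an ONNX Loop node instead of being unrolled at trace time (the module and file names here are made up for illustration):

import torch


class RepeatAdd(torch.nn.Module):
    """Toy module with a data-dependent for loop."""

    def forward(self, x: torch.Tensor, n: torch.Tensor) -> torch.Tensor:
        out = torch.zeros_like(x)
        # Under scripting this loop becomes an ONNX Loop node,
        # so the trip count stays dynamic in the exported graph.
        for _ in range(int(n)):
            out = out + x
        return out


scripted = torch.jit.script(RepeatAdd())
torch.onnx.export(
    scripted,
    (torch.randn(2, 8), torch.tensor(4)),
    "repeat_add.onnx",
    opset_version=13,
    input_names=["x", "n"],
    output_names=["out"],
    dynamic_axes={"x": {0: "B", 1: "T"}, "out": {0: "B", 1: "T"}},
)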

@whisper-yu #2515 please help try this for the mask 🙏

For the second part, ONNX does support exporting for loops; I'll sort it out when I have time. You can search for "torch for loop to onnx", or align with the parallel CIF implementation and submit a PR.

Thanks! I'll first try the ONNX for-loop export and see whether it solves the problem.

But could the unstable inference latency also be caused by the CIF part?


It should be. The other components are all Transformer-like, so their inference should be very stable.

@Mddct

It should be. The other components are all Transformer-like, so their inference should be very stable.

I ran some tests, but the results don't quite match what I originally expected:

  • The slow part is actually the decoder (and it fluctuates between fast and slow), while the encoder and predictor latencies are very stable.
    • I don't have much experience here; I considered a resource-contention problem, but that doesn't seem to make sense.
  • There is some accuracy degradation: on our own dataset the character error rate rises from 30% (measured directly through the FunASR interface) to 36%. I suspect the CIF is not aligned, since that is the part I mainly modified.

Test procedure

I exported the encoder, predictor, and decoder as separate ONNX GPU models and then measured their latencies individually.
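
For completeness, the per-module test code in section 2 below assumes three onnxruntime sessions created roughly like this (file names are hypothetical):

import onnxruntime as ort

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
# Hypothetical file names; one session per exported sub-model.
encoder_session = ort.InferenceSession("encoder.onnx", providers=providers)
predictor_session = ort.InferenceSession("predictor.onnx", providers=providers)
decoder_session = ort.InferenceSession("decoder.onnx", providers=providers)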

1. Test results (with different test audio)

  • Latencies are in milliseconds (ms)
    (per-module latency results were attached as an image)

2. Test code:

# Assumed context (not shown in the issue): time, torch and torchaudio are imported;
# extract_fbank, LFR, to_numpy and paraformer_greedy_search come from the surrounding script;
# encoder_session, predictor_session and decoder_session are onnxruntime
# InferenceSession objects created elsewhere (see the sketch above).
def infer_onnx(wav_path, model, tokenizer):

    # pre
    start_0 = time.time()
    wav, sr = torchaudio.load(wav_path)

    # padding
    padding_length = int(60 * sr - wav.shape[1])
    padding = torch.zeros(1, padding_length) + 0.00001
    wav = torch.cat([wav, padding], dim=1)

    data = wav.squeeze()
    data = [data]

    speech, speech_lengths = extract_fbank(data)
    # LFR is not folded into the encoder here because it contains operators that are not supported by ONNX export!!!
    lfr = LFR()
    speech, speech_lengths = lfr(speech, speech_lengths)
    end_0 = time.time()
    total_0 = int((end_0 - start_0) * 1000)


    # encoder
    start_1 = time.time()
    encoder_inputs = {
        "speech": to_numpy(speech),
        "speech_lengths": to_numpy(speech_lengths),
    }
    encoder_out, encoder_out_mask = encoder_session.run(None, encoder_inputs)
    end_1 = time.time()
    total_1 = int((end_1 - start_1) * 1000)


    # predictor
    start_2 = time.time()
    predictor_inputs = {
        "encoder_out": encoder_out,
        "encoder_out_mask": encoder_out_mask,
    }
    acoustic_embed, token_num, tp_alphas = predictor_session.run(None, predictor_inputs)    

    end_2 = time.time()
    total_2 = int((end_2 - start_2) * 1000)




    # decoder
    start_3 = time.time()
    decoder_inputs = {
        "encoder_out": encoder_out,
        "encoder_out_mask": encoder_out_mask,
        "acoustic_embed": acoustic_embed,
        "token_num": token_num,
    }
    decoder_out = decoder_session.run(None, decoder_inputs)
    decoder_out = decoder_out[0]

    end_3 = time.time()
    total_3 = int((end_3 - start_3) *  1000)



    # post
    start_4 = time.time()
    decoder_out = torch.tensor(decoder_out, dtype=torch.float32)
    decoder_out_lens = torch.tensor(token_num, dtype=torch.int32)
    tp_alphas = torch.tensor(tp_alphas, dtype=torch.float32)

    peaks = model.forward_cif_peaks(tp_alphas, decoder_out_lens)
    paraformer_greedy_result = paraformer_greedy_search(
        decoder_out, decoder_out_lens, peaks)
    
    results = {
        "paraformer_greedy_result": paraformer_greedy_result
    }

    for i in range(len(data)):
        for mode, hyps in results.items():
            tokens = hyps[i].tokens
            line = '{}'.format(tokenizer.detokenize(tokens)[0])

    end_4 = time.time()
    total_4 = int((end_4 - start_4) * 1000)

    print(f"[pre]-{total_0} ||[encoder]-{total_1} ||[predictor]-{total_2} ||[decoder]-{total_3} ||[post]-{total_4}")


    return line

3. Decoder ONNX export code

# forward function
class Paraformer(ASRModel):

    # DECODER
    @torch.jit.export
    def forward_decoder(
        self,
        encoder_out: torch.Tensor,
        encoder_out_mask: torch.Tensor,
        acoustic_embed: torch.Tensor,
        token_num: torch.Tensor
    ) -> torch.Tensor:
        
        # decoder
        decoder_out, _, _ = self.decoder(encoder_out, encoder_out_mask,
                                         acoustic_embed, token_num)
        decoder_out = decoder_out.log_softmax(dim=-1)

        return decoder_out  


##############################################################
# decoder ONNX model export
if not os.path.exists(decoder_path):
    print("\n\n[export decoder]")
    model.forward = model.forward_decoder
    torch.onnx.export(
        model,
        (encoder_out, encoder_out_mask, acoustic_embed, token_num),
        decoder_path,
        export_params=True,
        opset_version=13,
        do_constant_folding=True,
        input_names=["encoder_out", "encoder_out_mask", "acoustic_embed", "token_num"],
        output_names=[
            "decoder_out",
        ],
        dynamic_axes={
            "encoder_out": {
                0: "B",
                1: "T_E"
            },
            "encoder_out_mask": {
                0: "B",
                2: 'T_E'
            },
            "acoustic_embed":{
                0: "B",
                1: "T_P"                    
            },
            "token_num":{
                0: "B"                        
            },


            "decoder_out":{
                0: "B",
                1: "T_P"
            },               
        },
        verbose=True,
    )  
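
To help narrow down whether the CER degradation (30% → 36%) comes from the export itself rather than from the CIF changes, one option is a numerical parity check between the PyTorch forward_decoder and the exported ONNX decoder on the same inputs; a minimal sketch reusing the variables from the export snippet above:

import numpy as np
import onnxruntime as ort
import torch

# Parity check between PyTorch and ONNX decoder outputs (same inputs as the export above).
sess = ort.InferenceSession(decoder_path,
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
with torch.no_grad():
    torch_out = model.forward_decoder(encoder_out, encoder_out_mask,
                                      acoustic_embed, token_num)
ort_out = sess.run(None, {
    "encoder_out": encoder_out.cpu().numpy(),
    "encoder_out_mask": encoder_out_mask.cpu().numpy(),
    "acoustic_embed": acoustic_embed.cpu().numpy(),
    "token_num": token_num.cpu().numpy(),
})[0]
np.testing.assert_allclose(torch_out.cpu().numpy(), ort_out, rtol=1e-3, atol=1e-4)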





Is this issue fixed?