k2-fsa / sherpa-onnx

Hello there! I have looked at https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/whisper/export-onnx.py. It is a nice way to use the Whisper model on different platforms through the ONNX format.
I'm trying to convert the model to RKNN format (the model format used on Rockchip devices) so that it can run on the RKNPU, but I have run into some obstacles.

1. I used export-onnx.py to export the encoder and decoder successfully.

2. I then wrote a script to convert the ONNX model to RKNN. In Netron I can see the structure of the ONNX model.
The input appears to have a dynamic shape, [n_audio, 80, T], so I set dynamic_input when exporting to RKNN (code below), and the export succeeded (encoder.rknn).
But when I run it on an RK3568 device, something goes wrong.
Are you interested in converting to RKNN? I hope you can give me some advice.

from rknn.api import RKNN
import os
import onnx
import sys

dynamic_input=[
    [[1,80,3000]]
]

model_path='base-models/base-encoder.onnx'
model = onnx.load(model_path)
onnx.checker.check_model(model)
print("The model is checked!")


# Create RKNN object
rknn = RKNN(verbose=False)

# Pre-process config
print('--> Config model')
rknn.config(target_platform='rk3568',
            dynamic_input=dynamic_input,
            )
print('done')

# Load model
print('--> Loading model')
ret = rknn.load_onnx(model=model_path)
if ret != 0:
    print('Load model failed!')
    exit(ret)
print('done')

# Build model
print('--> Building model')
ret = rknn.build(do_quantization=False)
if ret != 0:
    print('Build model failed!')
    exit(ret)
print('done')


# Export rknn model
print('--> Export rknn model')
ret = rknn.export_rknn('base-encoder-3568-int8.rknn',gen_cpp_demo=True)
if ret != 0:
    print('Export rknn model failed!')
    exit(ret)
print('done')

# Release the RKNN object
rknn.release()

rknn.py:

import cv2
import numpy as np
import platform
from rknnlite.api import RKNNLite
import argparse
import base64
from typing import Tuple

import kaldi_native_fbank as knf
import onnxruntime as ort
import torch
import torchaudio

from test import *

# get current platform Structure
DEVICE_COMPATIBLE_NODE = '/proc/device-tree/compatible'

def get_host():
    # get platform and device type
    system = platform.system()
    machine = platform.machine()
    os_machine = system + '-' + machine
    if os_machine == 'Linux-aarch64':
        try:
            with open(DEVICE_COMPATIBLE_NODE) as f:
                device_compatible_str = f.read()
                # print(device_compatible_str)
                # SAMPLES : embedfire,lubancat-2-v2rockchip,rk3568
                if 'rk3562' in device_compatible_str:
                    host = 'RK3562'
                elif 'rk3576' in device_compatible_str:
                    host = 'RK3576'
                elif 'rk3588' in device_compatible_str:
                    host = 'RK3588'
                else:
                    host = 'RK3566_RK3568'
        except IOError:
            print('Read device node {} failed.'.format(DEVICE_COMPATIBLE_NODE))
            exit(-1)
    else:
        host = os_machine
    return host

Model_path='base-encoder-rk3568.rknn'

sound_file='1s.wav'
Decoder_path='tiny-decoder.onnx'
Encoder_path='tiny-encoder.onnx'
Tokens='tiny-tokens.txt'

mel = compute_features(sound_file)

print(mel.shape)

np_mel=mel.numpy()
print(type(np_mel))
print(np_mel.shape)

#model = OnnxModel(Encoder_path,Decoder_path)
#n_layer_cross_k,n_layer_cross_v=model.run_encoder(mel)

#print(type(n_layer_cross_k),type(n_layer_cross_v))

# print(n_layer_cross_k.shape,n_layer_cross_v.shape)


#host_name=get_host()
#print(host_name)
rknn_lite = RKNNLite()
# load rknn model
ret = rknn_lite.load_rknn(Model_path)
if ret != 0:
    print('Load rknn model failed')
    exit(ret)
print('Done!')

# Init runtime environment
print('--> Init runtime environment')
# Run on RK356x / RK3576 / RK3588 with Debian OS, do not need specify target.
ret = rknn_lite.init_runtime()
if ret != 0:
    print('Init runtime environment failed')
    exit(ret)
print('done')


# Inference
print('--> Running model')
outputs=rknn_lite.inference(inputs=[np_mel])

print(type(outputs))

Please remove

sherpa-onnx/scripts/whisper/export-onnx.py

Line 104 in 69347ff

AudioEncoder.forward = modified_audio_encoder_forward

and

sherpa-onnx/scripts/whisper/export-onnx.py

Lines 416 to 420 in 69347ff

    
    dynamic_axes={
        "mel": {0: "n_audio", 2: "T"},  # n_audio is also known as batch_size
        "n_layer_cross_k": {1: "n_audio", 2: "T"},
        "n_layer_cross_v": {1: "n_audio", 2: "T"},
    },

and

sherpa-onnx/scripts/whisper/export-onnx.py

Lines 540 to 546 in 69347ff

    
    dynamic_axes={
        "tokens": {0: "n_audio", 1: "n_tokens"},
        "in_n_layer_self_k_cache": {1: "n_audio"},
        "in_n_layer_self_v_cache": {1: "n_audio"},
        "n_layer_cross_k": {1: "n_audio", 2: "T"},
        "n_layer_cross_v": {1: "n_audio", 2: "T"},
    },

And retry.
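
For reference, once the dynamic axes are removed, each tensor has a single fixed shape that can be checked in Netron or in the RKNN verbose log. The sketch below computes the shapes one would expect, assuming the standard Whisper dimensions (80 mel bins, 3000 input frames, 1500 encoder frames, 448 text positions) and the axis order implied by the dynamic_axes entries above. The function name, the dimension table, and the self-attention cache layout are illustrative assumptions, not part of export-onnx.py.

```python
# Standard Whisper dimensions (assumed; check them against the checkpoint you export).
N_MELS = 80
N_FRAMES = 3000      # 30 s of audio at 100 feature frames per second
N_AUDIO_CTX = 1500   # the encoder halves the time axis
N_TEXT_CTX = 448     # maximum number of decoder positions

# Decoder depth and width for two checkpoints (illustrative table).
MODEL_DIMS = {
    "tiny": {"n_text_layer": 4, "n_text_state": 384},
    "base": {"n_text_layer": 6, "n_text_state": 512},
}


def expected_static_shapes(name, n_audio=1):
    """Return the fixed tensor shapes expected for a static export."""
    dims = MODEL_DIMS[name]
    cross = (dims["n_text_layer"], n_audio, N_AUDIO_CTX, dims["n_text_state"])
    # Assumed layout for the self-attention cache: axis 1 is n_audio.
    cache = (dims["n_text_layer"], n_audio, N_TEXT_CTX, dims["n_text_state"])
    return {
        "mel": (n_audio, N_MELS, N_FRAMES),
        "n_layer_cross_k": cross,
        "n_layer_cross_v": cross,
        "in_n_layer_self_k_cache": cache,
        "in_n_layer_self_v_cache": cache,
    }


for tensor_name, shape in expected_static_shapes("base").items():
    print(tensor_name, shape)
```

If a shape reported by the converted model differs from this table, that tensor is a reasonable place to start looking.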

Thanks for your attention. I found one part that differs.
I checked my yolov8-pose model, which runs successfully with RKNN, with verbose output enabled when initializing the RKNN runtime. There, all the information for the first layer is complete, but for Whisper the DataFormat of the first layer is missing. I don't know whether this is what causes the abort.

yolov8-pose verbose:

Netron show:

whisper verbose:

whisper Netron show:

whisper uses a 3-d input, not a 4-d.

Yes, I used a random np.ndarray with shape (1, 80, 3000) to test the RKNN model, and it still aborted. I'm wondering whether the missing "DataFormat" in the first layer causes the abort.
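
To rule out an input problem, the array can be validated before it is passed to inference. This is a minimal sketch assuming the model was exported with a fixed [1, 80, 3000] float32 input. `prepare_mel` is a hypothetical helper, and zero-padding the features is an assumption made only to reach the expected shape, so it may not match how the features should be padded for real recognition.

```python
import numpy as np


def prepare_mel(mel, n_mels=80, n_frames=3000):
    """Return a C-contiguous float32 array of shape (1, n_mels, n_frames)."""
    mel = np.asarray(mel, dtype=np.float32)
    if mel.ndim == 2:
        # (n_mels, T) -> (1, n_mels, T)
        mel = mel[np.newaxis, ...]
    if mel.ndim != 3 or mel.shape[0] != 1 or mel.shape[1] != n_mels:
        raise ValueError(f"expected (1, {n_mels}, T), got {mel.shape}")

    t = mel.shape[2]
    if t > n_frames:
        # Trim features that are longer than the fixed window.
        mel = mel[:, :, :n_frames]
    elif t < n_frames:
        # Pad the time axis with zeros (assumption, see the note above).
        mel = np.pad(mel, ((0, 0), (0, 0), (0, n_frames - t)))
    return np.ascontiguousarray(mel)


# Stand-in for the features of a short clip such as 1s.wav.
short_clip = np.random.rand(1, 80, 100)
fixed = prepare_mel(short_clip)
print(fixed.shape, fixed.dtype)
```

The result can then be passed as `inputs=[fixed]`, which removes shape and dtype mismatches from the list of suspects.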

Please try to convert the model layer by layer and also debug it layer by layer.
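
One way to organize the layer-by-layer debugging is to dump intermediate outputs from the ONNX model and from the converted model, then look for the first tensor where they diverge. The sketch below covers only the comparison step; how the per-layer outputs are obtained is left out. `first_divergent_layer` and the 0.99 threshold are illustrative assumptions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two arrays, computed on flattened copies."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # At least one input is all zeros; treat identical inputs as a match.
        return 1.0 if np.array_equal(a, b) else 0.0
    return float(np.dot(a, b) / denom)


def first_divergent_layer(reference, candidate, threshold=0.99):
    """Find the first layer whose output differs between two runs.

    Both arguments map layer name -> output array, in execution order.
    Returns (layer_name, reason), or None if every layer matches.
    """
    for name, ref in reference.items():
        if name not in candidate:
            return name, "missing in candidate"
        out = np.asarray(candidate[name])
        if out.shape != np.shape(ref):
            return name, f"shape {out.shape} != {np.shape(ref)}"
        sim = cosine_similarity(ref, out)
        if sim < threshold:
            return name, f"cosine similarity {sim:.4f}"
    return None


# Synthetic example: the second layer is deliberately corrupted.
rng = np.random.default_rng(0)
ref = {
    "conv1": rng.standard_normal((1, 8, 16)),
    "conv2": rng.standard_normal((1, 8, 8)),
}
bad = {"conv1": ref["conv1"] + 1e-4, "conv2": -ref["conv2"]}
result = first_divergent_layer(ref, bad)
print(result)
```

Cosine similarity is used here because it tolerates small numeric drift between runtimes while still flagging a layer whose output is structurally wrong.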

This issue is out of the scope of sherpa-onnx.

Is the problem solved?


whisper onnx convert to rknn