wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit

Home Page: https://wenet-e2e.github.io/wenet/

Next WeNet Roadmap

robin1001 opened this issue

We will mainly focus on the following two problems in Next WeNet.

  1. NN-based contextual biasing and LM solution. On the one hand, a pure end-to-end model, including contextual biasing and LM, is our final goal. On the other hand, our current contextual biasing and LM have many problems, such as poor rare-word performance in contextual biasing and a complicated LM solution, since FST and token-passing beam search are introduced. We are also looking for new paradigms, such as joint text/audio learning and prompt learning.
  2. Exploration of open-source big models, pretrained models, and multimodal models. We can see the increasing capability, influence, and interest in these models, and we believe they may give a final solution to general AI. It is hard for us to do such things directly due to the lack of research and computation resources. However, we can explore the use of these models in speech recognition applications, as open-source big models + task/private data may be the new paradigm for the next AI.

We are open to other proposals. WeNet is a community-driven project, and we love your feedback and proposals on where we should be heading. Feel free to volunteer if you are interested in trying out some items (they do not have to be on the list).

From Google's recent USM paper, we can see the following three points:

1 Injecting text

2 Simpler pre-training

3 Text to speech intermediate representation

I think these three are the ultimate weapons for speech recognition, whether at the signal level or the text level.

And the community is a good way to cooperate on building the big model or paving the road for the new pipeline.

For 2 (simpler pre-training): BEST-RQ may be a good start: https://github.com/wenet-e2e/wenet/tree/Mddct-bestrq/wenet/ssl/bestrq

@Mddct shows his insight into the general speech recognition task. It's great.

This issue has been automatically closed due to inactivity.