SALMONN: Speech Audio Language Music Open Neural Network

Home Page: https://bytedance.github.io/SALMONN/

Welcome to the repo of SALMONN!

SALMONN is a large language model (LLM) that accepts speech, audio event, and music inputs. It was developed by the Department of Electronic Engineering at Tsinghua University and ByteDance. Rather than handling speech-only or audio-event-only input, SALMONN can perceive and understand all kinds of audio inputs, and as a result it acquires emergent capabilities such as multilingual speech recognition and translation and audio-speech reasoning. This can be regarded as giving the LLM "ears" and cognitive hearing abilities, which makes SALMONN a step towards hearing-enabled artificial general intelligence.

We will open source the code and the model checkpoint soon. Stay tuned!

Structure

SALMONN uses a speech & audio encoder to encode the input into a generic audio representation, then uses an audio-text aligner to map the audio features into the textual space. Finally, the large language model generates its answer from the textual prompt and the auditory tokens.
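The data flow described above can be illustrated with a toy sketch. This is not the SALMONN code, which has not been released yet. Every function name, dimension, and the hand-made frame features below are assumptions made purely for illustration, as is the choice to place the auditory tokens before the text tokens.

```python
import random

FEAT_DIM = 3  # assumed size of each audio feature vector (illustrative only)
LLM_DIM = 4   # assumed size of the LLM embedding space (illustrative only)


def encode_audio(waveform, frame_size=4):
    """Toy stand-in for the speech & audio encoder: one feature vector per frame."""
    frames = [waveform[i:i + frame_size] for i in range(0, len(waveform), frame_size)]
    features = []
    for frame in frames:
        mean = sum(frame) / len(frame)
        energy = sum(x * x for x in frame) / len(frame)
        peak = max(abs(x) for x in frame)
        features.append([mean, energy, peak])
    return features


def make_aligner(seed=0):
    """Toy stand-in for the audio-text aligner: a fixed linear projection
    from FEAT_DIM to LLM_DIM. A real aligner would be learned."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(FEAT_DIM)] for _ in range(LLM_DIM)]

    def align(features):
        return [[sum(w * f for w, f in zip(row, feat)) for row in weights]
                for feat in features]

    return align


def embed_text(prompt):
    """Toy text embedding: one deterministic LLM_DIM vector per word."""
    return [[(sum(map(ord, word)) % (7 + k)) / 10.0 for k in range(LLM_DIM)]
            for word in prompt.split()]


def build_llm_input(waveform, prompt):
    """Assemble the sequence the LLM would condition on.
    Ordering (audio first, then text) is an assumption of this sketch."""
    auditory_tokens = make_aligner()(encode_audio(waveform))
    text_tokens = embed_text(prompt)
    return auditory_tokens + text_tokens


waveform = [0.0, 0.5, -0.5, 1.0, 0.25, -0.25, 0.75, -1.0]
sequence = build_llm_input(waveform, "Describe the audio")
print(len(sequence), len(sequence[0]))  # 2 auditory + 3 text tokens, each of size 4
```

The point of the sketch is only that auditory tokens and text tokens end up in the same embedding space, so the LLM can consume them as a single sequence.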

Demos

Compared with traditional speech and audio processing tasks such as speech recognition and audio captioning, SALMONN leverages the general knowledge and cognitive abilities of the LLM to achieve cognitively oriented audio perception, which dramatically improves the versatility of the model and the range of tasks it can handle. In addition, SALMONN can follow textual commands, and even spoken commands, with a relatively high degree of accuracy. Because SALMONN is trained only on data with textual commands, its ability to follow spoken commands is also a cross-modal emergent ability.

Here are some demos of SALMONN.

Audio Response
asr.wav asr
audiocaption.wav audiocaption
music.wav music
emotion.wav emotion
asr_en2de.wav asr_en2de
keywords.flac keywords
spoken_query.wav spoken_query
audio_story_telling.wav audio_story_telling
spoken_audio_query.wav spoken_audio_query

Team

Team Tsinghua: Wenyi Yu, Changli Tang, Guangzhi Sun, Chao Zhang

Team ByteDance: Xianzhao Chen, Wei Li, Tian Tan, Lu Lu, Zejun Ma

About

License: Apache License 2.0

