🔱 Speech Trident - Awesome Speech LM

In this repository, we survey three crucial areas that contribute to speech/audio large language models: (1) representation learning, (2) neural codecs, and (3) language models.

1.⚡ Speech Representation Models: These models focus on learning structural speech representations, which can then be quantized into discrete speech tokens, often referred to as semantic tokens.

2.⚡ Speech Neural Codec Models: These models are designed to learn speech and audio discrete tokens, often referred to as acoustic tokens, while maintaining reconstruction ability and low bitrate.

3.⚡ Speech Large Language Models: These models are trained on top of speech and acoustic tokens with a language modeling approach. They demonstrate proficiency in speech understanding and speech generation tasks.
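
To make the token terminology above concrete, here is a minimal, dependency-free sketch. It is not taken from any model listed in this repository. The helper names (`nearest`, `semantic_tokens`, `residual_vq`, `bitrate_bps`), the toy codebooks, and the toy feature frames are all hypothetical. It assumes semantic tokens come from nearest-centroid assignment against a k-means-style codebook, and that acoustic tokens come from residual vector quantization, the scheme used by codecs such as SoundStream and EnCodec.

```python
import math


def nearest(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )


def semantic_tokens(frames, codebook):
    """Map each feature frame to one discrete token (nearest centroid)."""
    return [nearest(frame, codebook) for frame in frames]


def residual_vq(vector, codebooks):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second needed to transmit the token stream."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


# Toy 2-D "features" and codebooks (hypothetical values, for illustration only).
frames = [[0.1, 0.0], [0.9, 1.1], [0.2, -0.1]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
rvq_codebooks = [codebook, [[0.0, 0.0], [-0.1, 0.1]]]

print(semantic_tokens(frames, codebook))       # [0, 1, 0]
print(residual_vq([0.9, 1.1], rvq_codebooks))  # [1, 1]
print(bitrate_bps(75, 8, 1024))                # 6000.0
```

The last line uses illustrative settings (75 frames per second, 8 codebooks of 1024 entries each), which work out to 6 kbps. A speech language model would then be trained on token sequences like these, for example with next-token prediction.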

🔱 Contributors

Kai-Wei Chang	Haibin Wu	Wei-Cheng Tseng
Kehan Lu	Chun-Yi Kuan	Hung-yi Lee

🔱 Speech/Audio Language Models

Date	Model Name	Paper Title	Link
2024-04	WavLLM	WavLLM: Towards Robust and Adaptive Speech Large Language Model	paper
2024-02	SLAM-ASR	An Embarrassingly Simple Approach for LLM with Strong ASR Capacity	paper
2024-02	AnyGPT	AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling	paper
2024-02	SpiRit-LM	SpiRit-LM: Interleaved Spoken and Written Language Model	paper
2024-02	BAT	BAT: Learning to Reason about Spatial Sounds with Large Language Models	paper
2024-02	Audio Flamingo	Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities	paper
2024-02	Text Description to speech	Natural language guidance of high-fidelity text-to-speech with synthetic annotations	paper
2024-02	GenTranslate	GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators	paper
2024-02	Base-TTS	BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data	paper
2024-02	--	It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition	paper
2024-01	--	Large Language Models are Efficient Learners of Noise-Robust Speech Recognition	paper
2023-12	Seamless	Seamless: Multilingual Expressive and Streaming Speech Translation	paper
2023-11	Qwen-Audio	Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models	paper
2023-10	LauraGPT	LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	paper
2023-10	SALMONN	SALMONN: Towards Generic Hearing Abilities for Large Language Models	paper
2023-10	UniAudio	UniAudio: An Audio Foundation Model Toward Universal Audio Generation	paper
2023-10	Whispering LLaMA	Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition	paper
2023-09	VoxtLM	VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition/Synthesis and Speech/Text Continuation Tasks	paper
2023-09	LTU-AS	Joint Audio and Speech Understanding	paper
2023-09	SLM	SLM: Bridge the thin gap between speech and text foundation models	paper
2023-09	--	Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting	paper
2023-08	SpeechGen	SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts	paper
2023-08	SpeechX	SpeechX: Neural Codec Language Model as a Versatile Speech Transformer	paper
2023-08	LLaSM	Large Language and Speech Model	paper
2023-08	SeamlessM4T	Massively Multilingual & Multimodal Machine Translation	paper
2023-07	Speech-LLaMA	On decoder-only architecture for speech-to-text and large language model integration	paper
2023-07	LLM-ASR(temp.)	Prompting Large Language Models with Speech Recognition Abilities	paper
2023-06	AudioPaLM	AudioPaLM: A Large Language Model That Can Speak and Listen	paper
2023-05	Spectron	Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM	paper
2023-05	TWIST	Textually Pretrained Speech Language Models	paper
2023-05	Pengi	Pengi: An Audio Language Model for Audio Tasks	paper
2023-05	SoundStorm	Efficient Parallel Audio Generation	paper
2023-05	LTU	Listen, Think, and Understand	paper
2023-05	SpeechGPT	Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities	paper
2023-05	VioLA	Unified Codec Language Models for Speech Recognition, Synthesis, and Translation	paper
2023-05	X-LLM	X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages	paper
2023-03	Google USM	Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages	paper
2023-03	VALL-E X	Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	paper
2023-02	SPEAR-TTS	Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision	paper
2023-01	VALL-E	Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers	paper
2022-12	Whisper	Robust Speech Recognition via Large-Scale Weak Supervision	paper
2022-10	AudioGen	AudioGen: Textually Guided Audio Generation	paper
2022-09	AudioLM	AudioLM: a Language Modeling Approach to Audio Generation	paper
2022-05	Wav2Seq	Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages	paper
2022-04	Unit mBART	Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation	paper
2022-03	d-GSLM	Generative Spoken Dialogue Language Modeling	paper
2021-10	SLAM	SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training	paper
2021-09	p-GSLM	Text-Free Prosody-Aware Generative Spoken Language Modeling	paper
2021-02	GSLM	Generative Spoken Language Modeling from Raw Audio	paper

🔱 Speech/Audio Representation Models

Date	Model Name	Paper Title	Link
2024-01	EAT	Self-Supervised Pre-Training with Efficient Audio Transformer	paper
2023-10	MR-HuBERT	Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction	paper
2023-10	SpeechFlow	Generative Pre-training for Speech with Flow Matching	paper
2023-09	WavLabLM	Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning	paper
2023-08	W2v-BERT 2.0	Massively Multilingual & Multimodal Machine Translation	paper
2023-07	Whisper-AT	Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers	paper
2023-06	ATST	Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks	paper
2023-05	SPIN	Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering	paper
2023-05	DinoSR	Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning	paper
2023-05	NFA	Self-supervised neural factor analysis for disentangling utterance-level speech representations	paper
2022-12	Data2vec 2.0	Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language	paper
2022-12	BEATs	Audio Pre-Training with Acoustic Tokenizers	paper
2022-11	MT4SSL	MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets	paper
2022-08	DINO	Non-contrastive self-supervised learning of utterance-level speech representations	paper
2022-07	Audio-MAE	Masked Autoencoders that Listen	paper
2022-04	MAESTRO	Matched Speech Text Representations through Modality Matching	paper
2022-03	MAE-AST	Masked Autoencoding Audio Spectrogram Transformer	paper
2022-03	LightHuBERT	Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT	paper
2022-02	Data2vec	A General Framework for Self-supervised Learning in Speech, Vision and Language	paper
2021-10	WavLM	WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing	paper
2021-08	W2v-BERT	Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training	paper
2021-07	mHuBERT	Direct speech-to-speech translation with discrete units	paper
2021-06	HuBERT	Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units	paper
2021-03	BYOL-A	Self-Supervised Learning for General-Purpose Audio Representation	paper
2020-12	DeCoAR2.0	DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization	paper
2020-07	TERA	TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech	paper
2020-06	Wav2vec2.0	wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations	paper
2019-10	APC	Generative Pre-Training for Speech with Autoregressive Predictive Coding	paper
2018-07	CPC	Representation Learning with Contrastive Predictive Coding	paper

🔱 Speech/Audio Codec Models

Date	Model Name	Paper Title	Link
2024-05	HILCodec	HILCodec: High Fidelity and Lightweight Neural Audio Codec	paper
2024-04	SemantiCodec	SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound	paper
2024-03	FACodec	NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models	paper
2024-02	Language-Codec	Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models	paper
2024-01	ScoreDec	ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter	paper
2023-11	HierSpeech++	HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis	paper
2023-09	FunCodec	FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec	paper
2023-08	SpeechTokenizer	SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models	paper
2023-06	Descript-audio-codec	High-Fidelity Audio Compression with Improved RVQGAN	paper
2023-05	AudioDec	AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec	paper
2023-05	HiFi-Codec	HiFi-Codec: Group-residual Vector Quantization for High Fidelity Audio Codec	paper
2023-03	LMCodec	LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models	paper
2022-10	EnCodec	High fidelity neural audio compression	paper
2021-07	SoundStream	SoundStream: An End-to-End Neural Audio Codec	paper

🔱 ICASSP 2024 Tutorial Information

I (Kai-Wei Chang) will be giving a talk as part of the ICASSP 2024 tutorial titled "Parameter-Efficient and Prompt Learning for Speech and Language Foundation Models". The talk will cover today's speech/audio large language models.

Tutorial speakers:

Dr. Huck Yang (NVIDIA)
Dr. Pin-Yu Chen (IBM Research)
Prof. Hung-yi Lee (National Taiwan University)
Kai-Wei Chang (National Taiwan University)
Cheng-Han Chiang (National Taiwan University)

See you in Seoul!

🔱 Update: The tutorial was successfully conducted at ICASSP 2024. Thanks to all attendees for their participation. The slides from my presentation are available at https://kwchang.org/talks/. Please feel free to reach out to me for any discussions.

🔱 Related Repository

Citation

If you find this repository useful, please consider citing the following papers.

@article{wu2024codec,
  title={Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author={Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander H and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2402.13071},
  year={2024}
}

@article{wu2024towards,
  title={Towards audio language modeling-an overview},
  author={Wu, Haibin and Chen, Xuanjun and Lin, Yi-Cheng and Chang, Kai-wei and Chung, Ho-Lam and Liu, Alexander H and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2402.13236},
  year={2024}
}
