ga642381 / speech-trident

Awesome speech/audio LLMs, representation learning, and codec models

🔱 Speech Trident - Awesome Speech LM

This repository surveys three areas that underpin speech/audio large language models: (1) representation learning, (2) neural codecs, and (3) language models.

1.⚡ Speech Representation Models: These models learn structural speech representations, which can then be quantized into discrete speech tokens, often referred to as semantic tokens.

2.⚡ Speech Neural Codec Models: These models learn discrete speech and audio tokens, often referred to as acoustic tokens, while maintaining reconstruction quality at a low bitrate.

3.⚡ Speech Large Language Models: These models are trained on top of speech and acoustic tokens using a language modeling approach. They demonstrate proficiency in speech understanding and speech generation tasks.
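
To make "quantized into discrete tokens" and "low bitrate" concrete, here is a minimal sketch rather than the method of any model listed below. It assumes nearest-centroid assignment (as in k-means quantization) as one common way to turn feature frames into token IDs, and it estimates bitrate as frame rate × codebooks per frame × bits per codebook entry. The function names `quantize` and `bitrate_bps`, the toy codebook, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector.

    `frames` and `codebook` are lists of equal-length float vectors.
    The returned indices play the role of discrete speech tokens.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second needed to transmit the token stream.

    Each frame emits `num_codebooks` tokens, and each token costs
    log2(codebook_size) bits.
    """
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# Toy 2-D "features" and a 3-entry codebook (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [1.1, 0.9]]

print(quantize(frames, codebook))   # [0, 1, 2, 1]
print(bitrate_bps(50, 8, 1024))     # 4000.0
```

A language model of the kind described in item 3 would then be trained on token sequences such as `[0, 1, 2, 1]`, in the same way a text language model is trained on word-piece IDs.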

🔱 Contributors


Kai-Wei Chang

Haibin Wu

Wei-Cheng Tseng

Kehan Lu

Chun-Yi Kuan

Hung-yi Lee

🔱 Speech/Audio Language Models

| Date | Model Name | Paper Title | Link |
|---|---|---|---|
| 2024-04 | WavLLM | WavLLM: Towards Robust and Adaptive Speech Large Language Model | paper |
| 2024-02 | SLAM-ASR | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | paper |
| 2024-02 | AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | paper |
| 2024-02 | SpiRit-LM | SpiRit-LM: Interleaved Spoken and Written Language Model | paper |
| 2024-02 | BAT | BAT: Learning to Reason about Spatial Sounds with Large Language Models | paper |
| 2024-02 | Audio Flamingo | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | paper |
| 2024-02 | Text Description to speech | Natural language guidance of high-fidelity text-to-speech with synthetic annotations | paper |
| 2024-02 | GenTranslate | GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators | paper |
| 2024-02 | Base-TTS | BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data | paper |
| 2024-02 | -- | It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | paper |
| 2024-01 | -- | Large Language Models are Efficient Learners of Noise-Robust Speech Recognition | paper |
| 2023-12 | Seamless | Seamless: Multilingual Expressive and Streaming Speech Translation | paper |
| 2023-11 | Qwen-Audio | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | paper |
| 2023-10 | LauraGPT | LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | paper |
| 2023-10 | SALMONN | SALMONN: Towards Generic Hearing Abilities for Large Language Models | paper |
| 2023-10 | UniAudio | UniAudio: An Audio Foundation Model Toward Universal Audio Generation | paper |
| 2023-10 | Whispering LLaMA | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition | paper |
| 2023-09 | VoxtLM | Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks | paper |
| 2023-09 | LTU-AS | Joint Audio and Speech Understanding | paper |
| 2023-09 | SLM | SLM: Bridge the thin gap between speech and text foundation models | paper |
| 2023-09 | -- | Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting | paper |
| 2023-08 | SpeechGen | SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts | paper |
| 2023-08 | SpeechX | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | paper |
| 2023-08 | LLaSM | Large Language and Speech Model | paper |
| 2023-08 | SeamlessM4T | Massively Multilingual & Multimodal Machine Translation | paper |
| 2023-07 | Speech-LLaMA | On decoder-only architecture for speech-to-text and large language model integration | paper |
| 2023-07 | LLM-ASR(temp.) | Prompting Large Language Models with Speech Recognition Abilities | paper |
| 2023-06 | AudioPaLM | AudioPaLM: A Large Language Model That Can Speak and Listen | paper |
| 2023-05 | Spectron | Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM | paper |
| 2023-05 | TWIST | Textually Pretrained Speech Language Models | paper |
| 2023-05 | Pengi | Pengi: An Audio Language Model for Audio Tasks | paper |
| 2023-05 | SoundStorm | Efficient Parallel Audio Generation | paper |
| 2023-05 | LTU | Listen, Think, and Understand | paper |
| 2023-05 | SpeechGPT | Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | paper |
| 2023-05 | VioLA | Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | paper |
| 2023-05 | X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | paper |
| 2023-03 | Google USM | Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages | paper |
| 2023-03 | VALL-E X | Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | paper |
| 2023-02 | SPEAR-TTS | Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision | paper |
| 2023-01 | VALL-E | Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | paper |
| 2022-12 | Whisper | Robust Speech Recognition via Large-Scale Weak Supervision | paper |
| 2022-10 | AudioGen | AudioGen: Textually Guided Audio Generation | paper |
| 2022-09 | AudioLM | AudioLM: a Language Modeling Approach to Audio Generation | paper |
| 2022-05 | Wav2Seq | Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages | paper |
| 2022-04 | Unit mBART | Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation | paper |
| 2022-03 | d-GSLM | Generative Spoken Dialogue Language Modeling | paper |
| 2021-10 | SLAM | SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training | paper |
| 2021-09 | p-GSLM | Text-Free Prosody-Aware Generative Spoken Language Modeling | paper |
| 2021-02 | GSLM | Generative Spoken Language Modeling from Raw Audio | paper |

🔱 Speech/Audio Representation Models

| Date | Model Name | Paper Title | Link |
|---|---|---|---|
| 2024-01 | EAT | Self-Supervised Pre-Training with Efficient Audio Transformer | paper |
| 2023-10 | MR-HuBERT | Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction | paper |
| 2023-10 | SpeechFlow | Generative Pre-training for Speech with Flow Matching | paper |
| 2023-09 | WavLabLM | Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning | paper |
| 2023-08 | W2v-BERT 2.0 | Massively Multilingual & Multimodal Machine Translation | paper |
| 2023-07 | Whisper-AT | Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers | paper |
| 2023-06 | ATST | Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks | paper |
| 2023-05 | SPIN | Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering | paper |
| 2023-05 | DinoSR | Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning | paper |
| 2023-05 | NFA | Self-supervised neural factor analysis for disentangling utterance-level speech representations | paper |
| 2022-12 | Data2vec 2.0 | Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language | paper |
| 2022-12 | BEATs | Audio Pre-Training with Acoustic Tokenizers | paper |
| 2022-11 | MT4SSL | MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets | paper |
| 2022-08 | DINO | Non-contrastive self-supervised learning of utterance-level speech representations | paper |
| 2022-07 | Audio-MAE | Masked Autoencoders that Listen | paper |
| 2022-04 | MAESTRO | Matched Speech Text Representations through Modality Matching | paper |
| 2022-03 | MAE-AST | Masked Autoencoding Audio Spectrogram Transformer | paper |
| 2022-03 | LightHuBERT | Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT | paper |
| 2022-02 | Data2vec | A General Framework for Self-supervised Learning in Speech, Vision and Language | paper |
| 2021-10 | WavLM | WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | paper |
| 2021-08 | W2v-BERT | Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training | paper |
| 2021-07 | mHuBERT | Direct speech-to-speech translation with discrete units | paper |
| 2021-06 | HuBERT | Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | paper |
| 2021-03 | BYOL-A | Self-Supervised Learning for General-Purpose Audio Representation | paper |
| 2020-12 | DeCoAR2.0 | DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization | paper |
| 2020-07 | TERA | TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech | paper |
| 2020-06 | Wav2vec2.0 | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | paper |
| 2019-10 | APC | Generative Pre-Training for Speech with Autoregressive Predictive Coding | paper |
| 2018-07 | CPC | Representation Learning with Contrastive Predictive Coding | paper |

🔱 Speech/Audio Codec Models

| Date | Model Name | Paper Title | Link |
|---|---|---|---|
| 2024-05 | HILCodec | HILCodec: High Fidelity and Lightweight Neural Audio Codec | paper |
| 2024-04 | SemantiCodec | SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound | paper |
| 2024-03 | FACodec | NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models | paper |
| 2024-02 | Language-Codec | Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models | paper |
| 2024-01 | ScoreDec | ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter | paper |
| 2023-11 | HierSpeech++ | HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis | paper |
| 2023-09 | FunCodec | FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec | paper |
| 2023-08 | SpeechTokenizer | Speechtokenizer: Unified speech tokenizer for speech large language models | paper |
| 2023-06 | Descript-audio-codec | High-Fidelity Audio Compression with Improved RVQGAN | paper |
| 2023-05 | AudioDec | Audiodec: An open-source streaming high-fidelity neural audio codec | paper |
| 2023-05 | HiFi-Codec | Hifi-codec: Group-residual vector quantization for high fidelity audio codec | paper |
| 2023-03 | LMCodec | LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models | paper |
| 2022-10 | EnCodec | High fidelity neural audio compression | paper |
| 2021-07 | SoundStream | SoundStream: An End-to-End Neural Audio Codec | paper |

🔱 ICASSP 2024 Tutorial Information

I (Kai-Wei Chang) will be giving a talk as part of the ICASSP 2024 tutorial titled Parameter-Efficient and Prompt Learning for Speech and Language Foundation Models. The talk will cover today's speech/audio large language models.

Tutorial speakers:

  • Dr. Huck Yang (NVIDIA)
  • Dr. Pin-Yu Chen (IBM Research)
  • Prof. Hung-yi Lee (National Taiwan University)
  • Kai-Wei Chang (National Taiwan University)
  • Cheng-Han Chiang (National Taiwan University)

See you in Seoul!

🔱 Update: The tutorial was successfully conducted at ICASSP 2024. Thanks to all attendees for their participation. The slides from my presentation are available at https://kwchang.org/talks/. Please feel free to reach out to me for any discussions.

🔱 Related Repository

Citation

If you find this repository useful, please consider citing the following papers.

@article{wu2024codec,
  title={Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author={Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander H and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2402.13071},
  year={2024}
}
@article{wu2024towards,
  title={Towards audio language modeling-an overview},
  author={Wu, Haibin and Chen, Xuanjun and Lin, Yi-Cheng and Chang, Kai-wei and Chung, Ho-Lam and Liu, Alexander H and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2402.13236},
  year={2024}
}