| 14.11 | Mustango: Toward Controllable Text-to-Music Generation | arXiv | GitHub | Hugging Face |
| 13.11 | Music ControlNet: Multiple Time-varying Controls for Music Generation | arXiv | - | - |
| 02.11 | E3 TTS: Easy End-to-End Diffusion-based Text to Speech | arXiv | - | - |
| 01.10 | UniAudio: An Audio Foundation Model Toward Universal Audio Generation | arXiv | GitHub | - |
| 24.09 | VoiceLDM: Text-to-Speech with Environmental Context | arXiv | GitHub | - |
| 05.09 | PromptTTS 2: Describing and Generating Voices with Text Prompt | arXiv | - | - |
| 14.08 | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | arXiv | - | - |
| 10.08 | AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | arXiv | GitHub | Hugging Face |
| 09.08 | JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models | arXiv | - | - |
| 03.08 | MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies | arXiv | GitHub | - |
| 14.07 | Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts | arXiv | - | - |
| 10.07 | VampNet: Music Generation via Masked Acoustic Token Modeling | arXiv | GitHub | - |
| 22.06 | AudioPaLM: A Large Language Model That Can Speak and Listen | arXiv | - | - |
| 19.06 | Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | PDF | GitHub | - |
| 08.06 | MusicGen: Simple and Controllable Music Generation | arXiv | GitHub | Hugging Face, Colab |
| 06.06 | Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias | arXiv | - | - |
| 01.06 | Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | arXiv | GitHub | - |
| 29.05 | Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | arXiv | - | - |
| 25.05 | MeLoDy: Efficient Neural Music Generation | arXiv | - | - |
| 18.05 | CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training | arXiv | - | - |
| 18.05 | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | arXiv | GitHub | - |
| 16.05 | SoundStorm: Efficient Parallel Audio Generation | arXiv | GitHub (unofficial) | - |
| 03.05 | Diverse and Vivid Sound Generation from Text Descriptions | arXiv | - | - |
| 02.05 | Long-Term Rhythmic Video Soundtracker | arXiv | GitHub | - |
| 24.04 | TANGO: Text-to-Audio generation using instruction tuned LLM and Latent Diffusion Model | PDF | GitHub | Hugging Face |
| 18.04 | NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | arXiv | GitHub (unofficial) | - |
| 10.04 | Bark: Text-Prompted Generative Audio Model | - | GitHub | Hugging Face, Colab |
| 03.04 | AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | arXiv | - | - |
| 08.03 | VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | arXiv | - | - |
| 27.02 | I Hear Your True Colors: Image Guided Audio Generation | arXiv | GitHub | - |
| 08.02 | Noise2Music: Text-conditioned Music Generation with Diffusion Models | arXiv | - | - |
| 04.02 | Multi-Source Diffusion Models for Simultaneous Music Generation and Separation | arXiv | GitHub | - |
| 30.01 | SingSong: Generating musical accompaniments from singing | arXiv | - | - |
| 30.01 | AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | arXiv | GitHub | Hugging Face |
| 30.01 | Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion | arXiv | GitHub | - |
| 29.01 | Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | PDF | - | - |
| 28.01 | Noise2Music | - | - | - |
| 27.01 | RAVE2 [Samples RAVE1] | arXiv | GitHub | - |
| 26.01 | MusicLM: Generating Music From Text | arXiv | GitHub (unofficial) | - |
| 18.01 | Msanii: High Fidelity Music Synthesis on a Shoestring Budget | arXiv | GitHub | Hugging Face, Colab |
| 16.01 | ArchiSound: Audio Generation with Diffusion | arXiv | GitHub | - |
| 05.01 | VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | arXiv | GitHub (unofficial) (demo) | - |