macroustc's repositories
Amphion
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
audino
Open source audio annotation tool for humans
Awesome-Talking-Face
📖 A curated list of resources dedicated to talking face.
Awesome-Text-to-Image
(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.
Awesome-Video-Diffusion-Models
[arXiv] A Survey on Video Diffusion Models
Bert-VITS2
VITS2 backbone with BERT
ChatTTS
ChatTTS is a generative speech model for daily dialogue.
DeepLearningSystem
An introduction to the core principles of deep learning systems.
Diff-Foley
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
diffusers
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
EmotiVoice
EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine
fish-speech
Brand new TTS solution
GPT-SoVITS
Just 1 minute of voice data is enough to train a good TTS model! (few-shot voice cloning)
jepa
PyTorch code and models for V-JEPA self-supervised learning from video.
llm-paper-daily
Daily updated LLM papers. Subscriptions welcome 👏 If you like it, please give it a 🌟
minisora
The Mini Sora project aims to explore the implementation path and future development direction of Sora.
Open-Sora
Building your own video generation model like OpenAI's Sora
Open-Sora-Plan
This project aims to reproduce Sora (OpenAI's T2V model), but we have only limited resources. We sincerely hope the whole open-source community can contribute to this project.
phonemizer
Simple text to phones converter for multiple languages
piper
A fast, local neural text to speech system
Qwen-Audio
The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.
Qwen-VL
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
seamless_communication
Foundational Models for State-of-the-Art Speech and Text Translation
SLAM-LLM
Speech, Language, Audio, Music Processing with Large Language Model
StyleTTS2
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
UniAudio
The Open Source Code of UniAudio
VoiceCraft
Zero-Shot Speech Editing and Text-to-Speech in the Wild
yt-dlp
A feature-rich command-line audio/video downloader