pineking

Qingsong Liu's repositories

AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Alibaba DAMO Academy.

Language: C++ | Apache-2.0 | 010

Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.

Language: Python | MIT | 010

AniPortrait

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

Language: Python | Apache-2.0 | 010

Monkey

Monkey (LMM): a large multimodal model, the "little monkey" from Huazhong University of Science and Technology (HUST)

Language: Python | MIT | 010

catvision

A large multimodal model that performs close to the closed-source Qwen-VL-PLUS on many datasets and significantly outperforms the open-source Qwen-VL-7B-Chat.

Language: Python | 010

ChatTTS

ChatTTS is a generative speech model for daily dialogue.

Language: Jupyter Notebook | NOASSERTION | 000

CMMMU

Language: Python | Apache-2.0 | 010

CosyVoice

An LLM-based TTS model providing full-stack inference, training, and deployment capabilities.

Language: Python | Apache-2.0 | 000

DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in PyTorch

Language: Python | MIT | 000

descript-audio-codec

State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.

MIT | 000

docker_image_pusher

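Uses GitHub Actions to mirror Docker images hosted outside China to a private Alibaba Cloud registry for use by servers in mainland China; free and easy to use.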

Apache-2.0 | 000

dreamtalk

Official implementation for the paper: DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

Language: Python | MIT | 010

E2STR

The official code for the CVPR 2024 paper: Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

Language: Python | Apache-2.0 | 010

fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Language: Python | MIT | 000

FiT

FiT: Flexible Vision Transformer for Diffusion Model

Apache-2.0 | 010

FunCodec

FunCodec is a research-oriented toolkit for audio quantization and downstream applications, such as text-to-speech synthesis, music generation, etc.

Language: Python | MIT | 000

Glyph-ByT5

[ECCV2024] Official inference code for the papers "Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering" and "Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering"

Apache-2.0 | 000

LLM-groundedDiffusion

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models (LLM-grounded Diffusion: LMD)

Language: Python | 010

Medical-SAM2

Medical SAM 2: Segment Medical Images As Video Via Segment Anything Model 2

Language: Python | Apache-2.0 | 000

mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark

Language: Python | Apache-2.0 | 000

MultimodalOCR

On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench)

Language: Python | 010

Open-AnimateAnyone

Unofficial Implementation of Animate Anyone

Language: Python | 010

PhotoMaker

Language: Jupyter Notebook | NOASSERTION | 010

Qwen2-VL-Finetune

An open-source implementation for fine-tuning Qwen2-VL-2B and Qwen2-VL-7B.

Apache-2.0 | 000

RecordRTC

RecordRTC is a WebRTC JavaScript library for recording audio, video, and screen activity. It supports Chrome, Firefox, Opera, Android, and Microsoft Edge. Platforms: Linux, Mac and Windows.

Language: JavaScript | MIT | 000

seed-tts-eval

Language: Python | 000

SenseVoice

Multilingual Voice Understanding Model

MIT | 000

vocos

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Language: Python | MIT | 000

whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

MIT | 000

xtts-api-server

A simple FastAPI Server to run XTTSv2

MIT | 000