pineking

Qingsong Liu's starred repositories

chameleon

Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.

Language: Python | License: NOASSERTION | 158500

mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark

Language: Python | License: Apache-2.0 | 406500

RecordRTC

RecordRTC is a WebRTC JavaScript library for recording audio, video, and screen activity. It supports Chrome, Firefox, Opera, Android, and Microsoft Edge. Platforms: Linux, Mac, and Windows.

Language: JavaScript | License: MIT | 648200

Recorder

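HTML5 JavaScript audio recording in mp3, wav, ogg, webm, amr, g711a, and g711u formats. Supports PC, some Android and iOS browsers, Hybrid Apps (Android and iOS app source code provided), and WeChat. Provides ASR speech-to-text, an H5 voice call and chat example, and DTMF encoding and decoding.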

Language: JavaScript | License: MIT | 463400

pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

Language: Jupyter Notebook | License: MIT | 558600

pipecat

Open Source framework for voice and multimodal conversational AI

Language: Python | License: BSD-2-Clause | 248900

VoiceStreamAI

Near-Realtime audio transcription using self-hosted Whisper and WebSocket in Python/JS

Language: Python | License: MIT | 58100

selfservicekiosk-audio-streaming

A best practice for streaming audio from a browser microphone to Dialogflow or Google Cloud STT by using websockets.

Language: JavaScript | License: Apache-2.0 | 13800

Awesome-Speaker-Diarization

Some comprehensive papers about speaker diarization

16500

insanely-fast-whisper

Language: Jupyter Notebook | License: Apache-2.0 | 703500

Languagecodec

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

Language: Python | License: MIT | 18300

SpeechTokenizer

This is the code for the SpeechTokenizer presented in the SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models. Samples are presented on

Language: Python | License: Apache-2.0 | 37800

AcademiCodec

AcademiCodec: An Open Source Audio Codec Model for Academic Research

Language: Python | 53700

vector-quantize-pytorch

Vector (and Scalar) Quantization, in Pytorch

Language: Python | License: MIT | 220800

audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

Language: Python | License: MIT | 233100

descript-audio-codec

State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.

Language: Python | License: MIT | 104400

DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Language: Python | License: MIT | 553700

DALL-E

PyTorch package for the discrete VAE used for DALL·E.

Language: Python | License: NOASSERTION | 1076600

ChatTTS

A generative speech model for daily dialogue.

Language: Python | License: AGPL-3.0 | 2816100

speech-trident

Awesome speech/audio LLMs, representation learning, and codec models

51200

USLM

Unified Speech Language Model for the paper "SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models" (ICLR 2024)

Language: Python | 12400

FunCodec

FunCodec is a research-oriented toolkit for audio quantization and downstream applications such as text-to-speech synthesis and music generation.

Language: Python | License: MIT | 31900

SpeechGPT

SpeechGPT Series: Speech Large Language Models

Language: Python | License: Apache-2.0 | 106300

MuLan

MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual support to any diffusion model without additional training)

Language: Python | 11100

MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

Language: Python | License: Apache-2.0 | 805400

OneChart

[ACM'MM 2024 Oral] Official code for "OneChart: Purify the Chart Structural Extraction via One Auxiliary Token"

Language: Python | License: Apache-2.0 | 12300

AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Language: C++ | License: Apache-2.0 | 121200

AniPortrait

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

Language: Python | License: Apache-2.0 | 434300

Awesome-Chart-Understanding

A curated list of recent and past chart understanding work based on our survey paper: From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models.

11800

YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection

Language: Python | License: GPL-3.0 | 404300