Rao Ma's starred repositories
CIF-HieraDist
[INTERSPEECH 2023] Knowledge Transfer from Pre-trained Language Models to Cif-based Recognizers via Hierarchical Distillation
CIF-PyTorch
[ICASSP 2020] CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition (A PyTorch implementation of Continuous Integrate-and-Fire mechanism).
ChatGLM-6B
ChatGLM-6B: An Open Bilingual Dialogue Language Model
Awesome-Chinese-LLM
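A curated collection of open-source Chinese large language models, focusing on smaller models that can be privately deployed and trained at low cost, including base models, domain-specific fine-tuning and applications, datasets, and tutorials.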
prepend_acoustic_attack
Prepend universal audio attack segment to mute Whisper
VoiceCraft
Zero-Shot Speech Editing and Text-to-Speech in the Wild
lm-contamination
The LM Contamination Index is a manually created database of contamination evidence for LMs.
libriheavy
Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
llm_interview_note
Notes on the knowledge and interview questions relevant to large language model (LLM) algorithm (application) engineers.
Machine-Learning-Interviews
This repo is meant to serve as a guide for Machine Learning/AI technical interviews.
TED-Multilingual-Parallel-Corpus
TED Parallel Corpora is a growing collection of bilingual parallel corpora, multilingual parallel corpora, and monolingual corpora extracted from TED talks (www.ted.com) for 109 world languages.
Awesome-Video-Datasets
Video datasets
TikTok-Api
The Unofficial TikTok API Wrapper In Python
unified_multilingual_dataset_of_emotional_human_utterances
A unified dataset of multilingual emotional human utterances
mass-dataset
MaSS - Multilingual corpus of Sentence-aligned Spoken utterances
seamless_communication
Foundational Models for State-of-the-Art Speech and Text Translation
minChatGPT
A minimum example of aligning language models with RLHF similar to ChatGPT
long-context-asr
Code for the paper: How Much Context Does My Attention-Based ASR System Need?
comparative-assessment
Framework for using LLMs to grade texts by using pairwise comparisons.
faster-whisper
Faster Whisper transcription with CTranslate2