Li-Kuang Chen's starred repositories
Chinese-BERT-wwm
Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm model series)
ImageOptim-CLI
Make optimisation of images part of your automated build process
roberta_zh
Chinese pre-trained RoBERTa models: RoBERTa for Chinese
data-juicer
A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
TransformerLens
A library for mechanistic interpretability of GPT-style language models
awesome-instruction-learning
Papers and Datasets on Instruction Tuning and Following. ✨✨✨
piicatcher
Scan databases and data warehouses for PII data. Tag tables and columns in data catalogs like Amundsen and Datahub
conversationai-models
A repository to house model building experiments and tools that are part of the Conversation AI effort.
c4-dataset-script
Inspired by Google's C4, a series of Colossal Clean data cleaning scripts for CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.
features-across-time
Understanding how features learned by neural networks evolve throughout training
perspectiveapi-proxy
Example code for an authenticated proxy for requests to the Perspective API