Zichun Yu's repositories
yuzc19.github.io
Personal homepage for Zichun Yu
zcore-tests
Test scripts for zCore OS
dclm
DataComp for Language Models
doremi
PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
galactic
Data cleaning and curation for unstructured text
lit-gpt
Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.
lm-evaluation-harness
A framework for few-shot evaluation of language models.
Megatron-LM
Ongoing research training transformer models at scale
NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
Pai-Megatron-Patch
The official repo of Pai-Megatron-Patch for large-scale LLM and VLM training, developed by Alibaba Cloud.
SemDeDup
Code for "SemDeDup", a simple method for identifying and removing semantic duplicates from a dataset (data pairs which are semantically similar, but not exactly identical).