Shu Li Zheng (nezhazheng)

Company: KingSoft

Location: Bei Jing

Home Page: nezhazheng.com

Shu Li Zheng's starred repositories

datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Language: Python | License: Apache-2.0 | Stargazers: 1616 | Issues: 0

AISystem

AISystem covers AI systems: full-stack, low-level AI technologies including AI chips, AI compilers, and AI inference and training frameworks.

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 8951 | Issues: 0

InsTag

InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning

Stargazers: 130 | Issues: 0

deita

Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]

Language: Python | License: Apache-2.0 | Stargazers: 377 | Issues: 0

dataverse

The Universe of Data. All about data, data science, and data engineering

Language: Python | License: Apache-2.0 | Stargazers: 414 | Issues: 0

llm.c

LLM training in simple, raw C/CUDA

Language: Cuda | License: MIT | Stargazers: 20407 | Issues: 0

grok-1

Grok open release

Language: Python | License: Apache-2.0 | Stargazers: 48969 | Issues: 0

Agent-Pro

The Code Repo for Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization

Language: Python | Stargazers: 66 | Issues: 0

Cradle

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.

Language: Python | License: MIT | Stargazers: 588 | Issues: 0

DouZero

[ICML 2021] DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning | DouDizhu AI

Language: Python | License: Apache-2.0 | Stargazers: 4000 | Issues: 0

rlcard

Reinforcement Learning / AI Bots in Card (Poker) Games - Blackjack, Leduc, Texas, DouDizhu, Mahjong, UNO.

Language: Python | License: MIT | Stargazers: 2743 | Issues: 0

Open-Sora

Open-Sora: Democratizing Efficient Video Production for All

Language: Python | License: Apache-2.0 | Stargazers: 17333 | Issues: 0

Language: Python | License: MIT | Stargazers: 3922 | Issues: 0

trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Language: Python | License: Apache-2.0 | Stargazers: 3071 | Issues: 0

boilerpipe

Work-in-progress migration from Google Code

Language: Java | License: NOASSERTION | Stargazers: 1093 | Issues: 0

jusText

Heuristic based boilerplate removal tool

Language: Python | License: BSD-2-Clause | Stargazers: 692 | Issues: 0

readability

A standalone version of the readability lib

Language: JavaScript | License: NOASSERTION | Stargazers: 8229 | Issues: 0

Html2Article

Main-content extraction from HTML web pages

Language: C# | License: NOASSERTION | Stargazers: 488 | Issues: 0

html-extractor

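An optimized general-purpose algorithm for web page main-content extraction based on the line-block distribution function, implemented in Python

Language: Python | Stargazers: 51 | Issues: 0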

AlphaCodium

Official implementation for the paper: "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering"

Language: Python | License: AGPL-3.0 | Stargazers: 3225 | Issues: 0

ungoliant

:spider: The pipeline for the OSCAR corpus

Language: Rust | License: Apache-2.0 | Stargazers: 152 | Issues: 0

TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Language: Python | License: Apache-2.0 | Stargazers: 1527 | Issues: 0

pythia

The hub for EleutherAI's work on interpretability and learning dynamics

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 2095 | Issues: 0

leptonai

A Pythonic framework to simplify AI service building

Language: Python | License: Apache-2.0 | Stargazers: 2496 | Issues: 0

List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

List of Dirty, Naughty, Obscene, and Otherwise Bad Words

License: CC-BY-4.0 | Stargazers: 2792 | Issues: 0

ai-notes

notes for software engineers getting up to speed on new AI developments. Serves as datastore for https://latent.space writing, and product brainstorming, but has cleaned up canonical references under the /Resources folder.

Language: HTML | License: MIT | Stargazers: 4768 | Issues: 0

amber-data-prep

Data preparation code for Amber 7B LLM

Language: Python | License: Apache-2.0 | Stargazers: 62 | Issues: 0

Language: Python | Stargazers: 125 | Issues: 0

SDV

Synthetic data generation for tabular data

Language: Python | License: NOASSERTION | Stargazers: 2175 | Issues: 0

MNBVC

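MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even text written in "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki pages, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.

License: MIT | Stargazers: 3111 | Issues: 0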