hexixiang / dataharvest

DataHarvest is a toolkit for building datasets for large language models. It provides pipelines for data acquisition, cleaning, and processing, with the goal of producing high-quality training data for Chinese large language models.

