esbatmop

Organizations
doing-data-science

esbatmop's repositories

MNBVC

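MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, including "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, posts, wiki, classical poems, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and so on.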

deduplication_mnbvc

A specialized tool for deduplicating large text files.

Language: Python | License: MIT | Stargazers: 6 | Issues: 0 | Issues: 0
Language: Python | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0
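
To illustrate what text deduplication means here, the following is a minimal sketch of exact-match deduplication by content hash. It is not taken from deduplication_mnbvc; the function name `dedupe_lines`, the whitespace normalization, and the choice of MD5 as the fingerprint are all assumptions made for the example.

```python
import hashlib


def dedupe_lines(lines):
    """Yield each line the first time its stripped content is seen.

    Hypothetical helper: only a 16-byte digest is stored per unique line,
    so memory stays bounded even when individual lines are long.
    """
    seen = set()
    for line in lines:
        # Normalize by stripping surrounding whitespace before hashing.
        key = hashlib.md5(line.strip().encode("utf-8")).digest()
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        yield line


sample = ["你好\n", "hello\n", "你好\n", "  hello  \n", "world\n"]
unique = list(dedupe_lines(sample))
```

Exact hashing only catches identical text after normalization; near-duplicate detection would need a different technique.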

carrot

Free ChatGPT Site List: a collection of free, easy-to-use ChatGPT mirror sites, currently 100+ sites.

Stargazers: 0 | Issues: 0 | Issues: 0

forum_dialogue_mnbvc

Cleaning of forum dialogue corpora.

License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0

github_downloader_mnbvc

GitHub repository downloader.

Language: Python | Stargazers: 0 | Issues: 1 | Issues: 0

githubcode_extractor_mnbvc

Extracts the contents of github-code-zip files and saves them in JSONL format.

Language: Python | Stargazers: 0 | Issues: 1 | Issues: 0
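
As a rough illustration of turning a zip archive into JSONL, the sketch below writes one JSON object per text file in the archive. It is not the code from githubcode_extractor_mnbvc; the function name `zip_to_jsonl` and the record fields `path` and `text` are hypothetical, and skipping files that are not valid UTF-8 is an assumption of this example.

```python
import io
import json
import zipfile


def zip_to_jsonl(zip_bytes, out):
    """Write one JSON line per UTF-8 text file in the archive.

    Hypothetical helper; returns the number of records written.
    """
    count = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            raw = archive.read(info)
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                continue  # assumption: skip binary or non-UTF-8 files
            record = {"path": info.filename, "text": text}
            # ensure_ascii=False keeps Chinese text readable in the output.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Build a small in-memory archive to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("repo/main.py", "print('hi')\n")
    zf.writestr("repo/blob.bin", b"\xff\xfe\x00")

out = io.StringIO()
written = zip_to_jsonl(buf.getvalue(), out)
records = [json.loads(line) for line in out.getvalue().splitlines()]
```

For real archives, `out` could be a file opened in text mode with `encoding="utf-8"` instead of a `StringIO`.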

jsonlbugfix_mnbvc

Fixes bugs in the crawler's JSONL output.

Language: Python | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0
Language: Python | Stargazers: 0 | Issues: 0 | Issues: 0
Language: Python | Stargazers: 0 | Issues: 1 | Issues: 0

pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Language: Python | License: NOASSERTION | Stargazers: 0 | Issues: 0 | Issues: 0

WikiHowQAExtractor-mnbvc

Extract Chinese/English QA Data from WikiHow pages.

Language: Python | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0