There are 20 repositories under the corpus topic.
Large Scale Chinese Corpus for NLP
Search all Chinese NLP datasets, with commonly used English NLP datasets included
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
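Chinese personal name corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, English names. Useful for Chinese word segmentation and person-name entity recognition.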
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Deep learning and deep reinforcement learning research papers and some code
Final Weibo Crawler: scrape anything from Weibo: comments, Weibo contents, followers, anything. The Terminator
Datasets for training Chinese and English chatbot systems
Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
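Company name corpus. Organization name corpus. Company short names, abbreviations, brand terms, enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.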
:helicopter: Insurance industry corpus, chatbot
A very comprehensive parallel corpus of Classical Chinese (ancient prose) and modern Chinese
Large-scale Pre-training Corpus for Chinese 100G
Chatbot in 200 lines of code using TensorLayer
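A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and models specialized for similarity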
ChatGPT Chinese corpus: dialogue corpus, fiction corpus, customer service corpus, for training large models
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Chinese medical information processing benchmark CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
WeChat official account corpus
❤️ Emotional First Aid Dataset: psychological counseling Q&A and chatbot corpus
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
A dataset of millions of news articles scraped from a curated list of data sources.
Chinese NLP corpus of Chinese science fiction: about 4675 Chinese science fiction novels. Chinese science fiction NLP corpus, text corpus, and text database.
We gather Malaysian datasets! https://malaysian-dataset.readthedocs.io/