There are 10 repositories under the nlp-datasets topic.
Curated collection of papers for the NLP practitioner 📖👩🔬
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
Chinese and English NER datasets, English-Chinese machine translation datasets, and a Chinese word segmentation dataset.
Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels. Serves as a natural language processing corpus, text corpus, and text database of Chinese science fiction.
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Resource NLP & Bahasa
TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)
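Manually curated corpus of medical-industry vocabulary and terminology. Can be used to train various NLP models, such as speech recognition and dialogue systems.
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula Science Fiction Website. Serves as a natural language processing corpus, text corpus, and text database of Chinese science fiction.
Chinese character dataset, including information about each character such as stroke count, radical, pinyin, and English definitions/synonyms.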
The release of the FreebaseQA data set (NAACL 2019).
Yorùbá language training text for NLP, ASR and TTS tasks
Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus
A collection of datasets for Ukrainian language
AfriSenti-SemEval Shared Task 12: Sentiment Analysis for African languages: https://afrisenti-semeval.github.io/
WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000+ "why" question-answer-rationale triplets.
Turkish writings dataset that promotes creativity, content, composition, grammar, spelling and punctuation.
Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages
A small test dataset for use with OpenAI's ChatGPT
A list of Romanian NLP Datasets
Question Answering System using BiDAF Model on SQuAD v2.0
Library for generating Russian names
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"
Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data.
English loanwords in Japanese
Datasets with text data for use in NLP, text analysis, information extraction, and ML research.
Arabic Dictionaries