There are 10 repositories under the nlp-datasets topic.
Curated collection of papers for the NLP practitioner 📖👩🔬
Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels. Serves as a natural language processing corpus, text corpus, and text database of Chinese science fiction.
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
Chinese and English NER datasets, an English-Chinese machine translation dataset, and a Chinese word segmentation dataset.
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agents, ACL'24 Best Resource Paper.
Resource NLP & Bahasa
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)
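Chinese character dataset, including information about each character such as stroke count, radical, pinyin, and English definitions/synonyms.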
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula Science Fiction Website. Serves as a natural language processing corpus, text corpus, and text database of Chinese science fiction.
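Manually curated corpus of medical industry vocabulary and terminology. Can be used to train various NLP models, such as speech recognition and dialogue systems.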
Yorùbá language training text for NLP, ASR and TTS tasks
The release of the FreebaseQA data set (NAACL 2019).
A collection of datasets for the Ukrainian language
Extracts transcripts and summaries (abstractive and extractive) from the AMI Meeting Corpus
A list of Romanian NLP Datasets
A compiled dataset of Turkish writings that promotes creativity, content, composition, grammar, spelling, and punctuation.
AfriSenti-SemEval Shared Task 12: Sentiment Analysis for African languages: https://afrisenti-semeval.github.io/
WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000+ "why" question-answer-rationale triplets.
Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages
A small test dataset for use with OpenAI's ChatGPT
Question Answering System using BiDAF Model on SQuAD v2.0
Library for generating Russian names
Arabic Dictionaries
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"
An annotated Chinese metaphor dataset
English loanwords in Japanese
Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data.
A Python library designed for scraping data from the SCP wiki.