nlp-datasets

There are 10 repositories under the nlp-datasets topic.

mihail911 / nlp-library
curated collection of papers for the nlp practitioner 📖👩‍🔬
neural-network dialogue nlp machine-learning neural-machine-translation deep-learning language-model nlp-datasets
1070
guhhhhaa / 4675-scifi
Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels. A Chinese science fiction NLP corpus, text corpus, and text database.
chinese-nlp corpus corpus-data datasets nlp nlp-datasets nlp-machine-learning nlp-resources science-fiction scifi
425
hellohaptik / multi-task-NLP
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
pytorch multitask-learning sentence-classification sequence-labeling entailment ranking intent-classification named-entity-recognition machine-comprehension context-awareness transformers nlp nlp-library nli-tasks nlp-datasets nlp-apis
Language:Python 372
dkulagin / kartaslov
Open linguistic datasets: the КартаСловСент (KartaSlovSent) sentiment lexicon of the Russian language, a semantics dataset, an association graph, and a dataset of spelling errors and typos.
nlp-datasets computational-linguistics datasets russian-specific
370
quincyliang / nlp-public-dataset
Chinese and English NER datasets, English-Chinese machine translation datasets, and a Chinese word segmentation dataset.
machine-learning-dataset nlp-datasets
Language:Python 367
StonyBrookNLP / appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource Paper.
acl-2024 ai-agents ai-apis ai-assistants ai-environment ai-planning autonomous-agents coding-agents function-calling interactive-coding llm llm-agents nlp-datasets nlp-machine-learning tool-usage
Language:Python 306
irfnrdh / Awesome-Indonesia-NLP
Resource NLP & Bahasa
awesome indonesian-language nlp-datasets nlp-resources
269
grammarly / ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
dataset corpus gec grammatical-error-correction ukrainian-language corpus-data corpus-tools natural-language-processing nlp-datasets
Language:Macaulay2 264
liutiedong / goat
a Fine-tuned LLaMA that is Good at Arithmetic Tasks
ai llms nlp-datasets
Language:Jupyter Notebook 178
cjiang2 / VDCNN
Implementation of Very Deep Convolutional Neural Network for Text Classification
convolutional-neural-networks keras keras-tensorflow nlp nlp-datasets tensorflow text-classification vdcnn
Language:Python 172
INK-USC / TriggerNER
TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)
named-entity-recognition dataset nlp-resources nlp-datasets information-extraction sequence-tagging low-resource
Language:Python 172
INK-USC / CommonGen
A Constrained Text Generation Challenge Towards Generative Commonsense Reasoning
natural-language-processing commonsense-reasoning nlg-dataset natural-language-generation language-generation-dataset machine-reasoning deep-learning text-generation nlp-datasets
Language:Python 141
secsilm / zi-dataset
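Chinese character (hanzi) dataset, including information about each character such as stroke count, radical, pinyin, and English definitions/synonyms.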
nlp chinese-nlp chinese-dataset dataset hanzi nlp-datasets
124
guhhhhaa / wula-scifi
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula (乌拉) science fiction website. A Chinese science fiction NLP corpus, text corpus, and text database.
corpus corpus-data nlp nlp-datasets nlp-machine-learning nlp-resources science-fiction scifi chinese-nlp datasets
123
xtea / chinese_medical_words
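Manually curated corpus of medical-industry vocabulary and terminology. Can be used to train NLP models of all kinds, such as speech recognition and dialogue systems.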
nlp chinese-nlp nlp-datasets medical nlp-data-to-text chinese-word-segmentation
122
Niger-Volta-LTI / yoruba-text
Yorùbá language training text for NLP, ASR and TTS tasks
african-languages natural-language-processing diacritization machine-translation training-dataset nlp yoruba tts asr nlp-datasets
Language:Python 81
kelvin-jiang / FreebaseQA
The release of the FreebaseQA data set (NAACL 2019).
freebaseqa freebase kb-qa nlp-datasets question-answering naacl
72
Pzoom522 / HistSumm
Code and data for "Summarising Historical Text in Modern Languages" (EACL 2021)
ancient-languages cross-lingual-summarization eacl2021 historical-text nlp-datasets summariser
Language:Jupyter Notebook 72
fido-ai / ua-datasets
A collection of datasets for Ukrainian language
dataset ukrainian-language nlp text-classification token-classification question-answering nlp-datasets natural-language-processing
Language:Python 56
gcunhase / AMICorpusXML
Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus
nlp-datasets meeting-dataset xml-to-story convert-to-cnn-dm-format
Language:Python 56
AndyTheFactory / romanian-nlp-datasets
A list of Romanian NLP Datasets
nlp nlp-datasets nlp-resources romanian romanian-language nlp-dataset nlp-data
53
selimfirat / bilkent-turkish-writings-dataset
A compiled dataset of Turkish writings that promote creativity, content, composition, grammar, spelling and punctuation.
dataset nlp-datasets creative-writing nlp pdf-conversion bilkent-university turkish turkish-language
Language:Python 51
afrisenti-semeval / afrisent-semeval-2023
AfriSenti-SemEval Shared Task 12: Sentiment Analysis for African languages : https://afrisenti-semeval.github.io/
african-languages africanlp low-resolution-data low-resouce-language low-resource-nlp opinion-mining semeval-sentiment sentiment sentiment-analysis sentiment-classification semeval2023 shared-task shared-tasks dataset datasets nlp-dataset nlp-datasets twitt twitter twitter-sentiment-analysis
Language:Jupyter Notebook 49
matt-seb-ho / WikiWhy
WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000+ "why" question-answer-rationale triplets.
artificial-intelligence dataset explainable-ai iclr2023 machine-learning nlp nlp-datasets open-domain-qa question-answering
Language:Python 48
gkiril / benchie
Comprehensive evaluation framework for Open Information Extraction.
open-information-extraction information-extraction benchmark-framework natural-language-processing natural-language-understanding nlp nlp-datasets dataset
Language:Python 39
bothub-it / bothub
Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages
bothub bots chatbot data database docker ilhasoft issue-tracker multiple-languages nlp nlp-datasets push python sharing-nlp-datasets webapp
Language:Makefile 38
uma-pi1 / OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
open-information-extraction information-extraction corpus corpus-data corpus-tools natural-language-processing natural-language-understanding nlp nlp-resources nlp-datasets wikipedia wiki wikipedia-dump wikipedia-corpus corpus-processing corpora dataset dataset-interface
Language:Java 38
gpt-tester / ChatGPT-test-dataset-01
a small test dataset for use with OpenAI's ChatGPT
chatgpt chatgpt-api openai openai-api go golang gpt-3 gpt3 python rust rust-lang nlp nlp-datasets nlp-machine-learning nlp-parsing
32
ElizaLo / Question-Answering-based-on-SQuAD
Question Answering System using BiDAF Model on SQuAD v2.0
bidaf machine-learning natural-language-processing natural-language-understanding neural-network nlp nlp-datasets nlp-machine-learning python python-3-6 question-answering squad
Language:Python 26
cybermatt / russian-names
Library for generation of russian names
text-processing text-generation nlp-datasets
Language:Python 24
OSINTAI / Arabic-Dictionaries
Arabic Dictionaries
arabic arabic-dictionaries arabic-lexicons data dictionary nlp-datasets txt
24
INK-USC / XCSR
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"
commonsense-reasoning crosslingual-transfer multilingual-models natural-language-understanding nlp-datasets
Language:Python 22
JasonShao55 / Chinese_Metaphor_Explanation
An annotated Chinese metaphor dataset
chinese metaphor nlp nlp-datasets
Language:Python 20
jamesohortle / loanwords_gairaigo
English loanwords in Japanese
japanese english linguistics nlp phonetics linguistics-databases nlp-datasets
Language:Python 18
utahnlp / infotabs-code
Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data.
nlp nlp-datasets nlp-machine-learning acl2020 wikipedia tables semi-structured-data svm roberta transformer nli inference infotabs
Language:Python 18
JadynHax / scpscraper
A Python library designed for scraping data from the SCP wiki.
scp scp-foundation webscraping webscraper python python3 data-collection dataset-generation dataset-creation pypi pypi-package nlp-dataset-creation nlp-datasets training-data-generation
Language:Python 16