trisitc / The-NLP-Pandect

A comprehensive reference for all topics related to Natural Language Processing



This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.


Compendiums and awesome lists on the topic of NLP:

NLP Conferences, Paper Summaries and Paper Compendiums:

Papers and Paper Summaries

NLP Progress and NLP Tasks:

NLP Datasets:

Word and Sentence embeddings:

Notebooks, Scripts and Repositories

Non-English resources and compendiums

Pre-trained NLP models

NLP Year in Review



NLP-only podcasts

Many NLP episodes

Some NLP episodes





General NLU

  • GLUE - General Language Understanding Evaluation (GLUE) benchmark
  • SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
  • decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
  • RACE - ReAding Comprehension dataset collected from English Examinations
  • dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
  • Dynabench - research platform for dynamic data collection and benchmarking


  • WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset

Question Answering

  • SQuAD - Stanford Question Answering Dataset (SQuAD)
  • XQuAD - Cross-lingual Question Answering Dataset (XQuAD)
  • GrailQA - Strongly Generalizable Question Answering (GrailQA)
  • CSQA - Complex Sequential Question Answering

Multilingual and Non-English Benchmarks

  • XTREME - Massively Multilingual Multi-task Benchmark
  • GLUECoS - A benchmark for code-switched NLP
  • IndoNLU Benchmark - collection of resources for training, evaluating, and analyzing NLP for Bahasa Indonesia
  • IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
  • LinCE - Linguistic Code-Switching Evaluation Benchmark

Bio, Law, and other scientific domains

  • BLURB - Biomedical Language Understanding and Reasoning Benchmark
  • BLUE - Biomedical Language Understanding Evaluation benchmark

Transformer Efficiency


  • CodeXGLUE - A benchmark dataset for code intelligence
  • CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
  • MultiNLI - Multi-Genre Natural Language Inference corpus






Cross-lingual Word and Sentence Embeddings

  • vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 546 stars]
  • sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 5123 stars]

Byte Pair Encoding

  • bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 947 stars]
  • subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1702 stars]
  • python-bpe - Byte Pair Encoding for Python [GitHub, 146 stars]
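
For readers new to Byte Pair Encoding, the following is a minimal sketch of the merge-learning loop, assuming a toy word-frequency table as input. It is not the implementation used by bpemb, subword-nmt, or python-bpe; the function names (`learn_bpe`, `get_pair_counts`, `merge_pair`) and the `</w>` end-of-word marker are illustrative choices.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


if __name__ == "__main__":
    toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, vocab = learn_bpe(toy_corpus, num_merges=3)
    print(merges)
    print(vocab)
```

Each iteration merges the most frequent adjacent pair, so frequent character sequences gradually become single subword units. The libraries listed above add details this sketch omits, such as applying learned merges to unseen text.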

Transformer-based Architectures




Other Transformer Variants

Reformer / Linformer / Longformer / Performers
Switch Transformer


Learning Resources
  • Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 3148 stars]
  • GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
  • OpenAI API - API Demo to use GPT-3 for commercial applications
Open-source Efforts
  • GPT-Neo - in-progress GPT-3 open source replication


Distillation, Pruning and Quantization

Automated Summarization

Rule-based NLP

  • LemmInflect - A Python module for English lemmatization and inflection


Best Practices for NLP

Transformer-based Architectures

Embeddings as a Service

NLP Recipes Industrial Applications:

NLP Applications in Bio, Finance, Legal and other industries

Model and Data testing

  • WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 64 stars]
  • Great Expectations - Write tests for your data [GitHub, 4450 stars]
  • CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1371 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1465 stars]


General Speech Recognition

  • wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5776 stars]
  • DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 17384 stars]
  • Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
  • kaldi - Kaldi is a toolkit for speech recognition [GitHub, 10469 stars]
  • awesome-kaldi - resources for using Kaldi [GitHub, 421 stars]
  • ESPnet - End-to-End Speech Processing Toolkit [GitHub, 3796 stars]

Text to Speech

  • FastSpeech - An implementation of FastSpeech based on PyTorch [GitHub, 621 stars]
  • TTS - a deep learning toolkit for Text-to-Speech [GitHub, 1475 stars]



Frameworks for Topic Modeling

  • gensim - framework for topic modeling [GitHub, 12090 stars]
  • Spark NLP [GitHub, 2143 stars]



Text Rank

  • PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1544 stars]
  • textrank - TextRank implementation for Python 3 [GitHub, 1032 stars]

RAKE - Rapid Automatic Keyword Extraction

  • rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 818 stars]
  • yake - Single-document unsupervised keyword extraction [GitHub, 663 stars]
  • RAKE-tutorial - A Python implementation of the Rapid Automatic Keyword Extraction algorithm [GitHub, 351 stars]



NLP and ML Interpretability

  • Language Interpretability Tool (LIT) [GitHub, 2548 stars]
  • WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 266 stars]
  • Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 249 stars]
  • InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 3784 stars]
  • ecco - Tools to visualize and explore NLP language models [GitHub, 789 stars]
  • NLP Profiler - A simple NLP library that allows profiling datasets with text columns [GitHub, 195 stars]
  • transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 357 stars]

Ethics, Bias, and Equality in NLP

Adversarial Attacks for NLP


General Purpose

  • spaCy by Explosion AI [GitHub, 20522 stars]
  • flair by Zalando [GitHub, 10381 stars]
  • AllenNLP by AI2 [GitHub, 10048 stars]
  • stanza (formerly Stanford NLP) [GitHub, 5425 stars]
  • spaCy stanza [GitHub, 523 stars]
  • nltk [GitHub, 9897 stars]
  • gensim - framework for topic modeling [GitHub, 12090 stars]
  • pororo - Platform of neural models for natural language processing [GitHub, 884 stars]
  • NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2671 stars]
  • FARM [GitHub, 1197 stars]
  • gobbli by RTI International [GitHub, 253 stars]
  • headliner - training and deployment of seq2seq models [GitHub, 228 stars]
  • SyferText - A privacy preserving NLP framework [GitHub, 176 stars]
  • DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1093 stars]
  • TextHero - Text preprocessing, representation and visualization [GitHub, 2203 stars]
  • textblob - TextBlob: Simplified Text Processing [GitHub, 7677 stars]
  • AdaptNLP - A high level framework and library for NLP [GitHub, 317 stars]
  • textacy - NLP, before and after spaCy [GitHub, 1677 stars]
  • texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2164 stars]
  • jiant - jiant is an NLP toolkit [GitHub, 1256 stars]

Data Augmentation


  • WildNLP - Text manipulation library to test NLP models [GitHub, 64 stars]
  • snorkel - Framework to generate training data [GitHub, 4622 stars]
  • NLPAug - Data augmentation for NLP [GitHub, 1998 stars]
  • SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 317 stars]
  • faker - Python package that generates fake data for you [GitHub, 12557 stars]
  • textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 442 stars]
  • Parrot - Practical and feature-rich paraphrasing framework [GitHub, 172 stars]

Papers & Blogs

Adversarial NLP Attacks

  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1465 stars]
  • CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 5110 stars]

Non-English oriented

  • textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 82 stars]
  • Kashgari - Transfer Learning with focus on Chinese [GitHub, 2106 stars]
  • Underthesea - Vietnamese NLP Toolkit [GitHub, 846 stars]


  • transformers by HuggingFace [GitHub, 46399 stars]
  • Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 423 stars]
  • haystack - Transformers at scale for question answering & neural search. [GitHub, 1842 stars]

Dialog Systems and Speech

  • DeepPavlov by MIPT [GitHub, 5207 stars]
  • ParlAI by FAIR [GitHub, 7211 stars]
  • rasa - Framework for Conversational Agents [GitHub, 11387 stars]
  • wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5776 stars]
  • ChatterBot - conversational dialog engine for creating chat bots [GitHub, 11168 stars]

Word/Sentence-embeddings oriented

  • MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 2777 stars]
  • vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 546 stars]
  • sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 5123 stars]

Multi-lingual tools

  • polyglot - Multi-lingual NLP Framework [GitHub, 1836 stars]
  • trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 459 stars]

Distributed NLP

Machine Translation

  • COMET - A Neural Framework for MT Evaluation [GitHub, 63 stars]
  • marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 797 stars]
  • argos-translate - Open source neural machine translation in Python [GitHub, 585 stars]
  • Opus-MT - Open neural machine translation models and web services [GitHub, 131 stars]
  • dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 139 stars]

Entity and String Matching

  • PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 318 stars]
  • pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 610 stars]
  • fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 8108 stars]
  • jellyfish - approximate and phonetic matching of strings [GitHub, 1450 stars]
  • textdistance - Compute distance between sequences [GitHub, 1976 stars]
  • DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 315 stars]
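
As a rough illustration of what fuzzy string matching computes, here is a minimal sketch of Levenshtein edit distance with a normalized similarity score. It is a plain dynamic-programming version written for this list, not the code or API of any library above; `levenshtein` and `similarity` are hypothetical helper names.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string `a` into string `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(similarity("kitten", "sitting"))
```

The libraries above typically offer faster implementations and further metrics (phonetic, token-based, sequence-based), so this sketch is only meant to show the core idea.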

Discourse Analysis

  • ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 272 stars]

PII scrubbing

  • scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 238 stars]









  • tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 4548 stars]
  • SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 5069 stars]
  • SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 89 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks
  • WildNLP - Text manipulation library to test NLP models [GitHub, 64 stars]
  • snorkel - Framework to generate training data [GitHub, 4622 stars]
  • NLPAug - Data augmentation for NLP [GitHub, 1998 stars]
  • SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 317 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1465 stars]
Blogs and Tutorials

Named Entity Recognition (NER)

Relation Extraction

  • tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 278 stars]
  • tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 40 stars]
  • tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 58 stars]

Coreference Resolution

Domain Adaptation

Low Resource NLP

Spell Correction

  • NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 167 stars]
  • SymSpellPy - Python port of SymSpell [GitHub, 439 stars]
  • Speller100 by Microsoft [Blog, Feb 2021]

Automata Theory for NLP

  • pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 610 stars]

Obscene words detection

  • LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1391 stars]

Reinforcement Learning for NLP

  • nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 89 stars]

AutoML / AutoNLP

  • AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 512 stars]
  • TPOT - Python Automated Machine Learning tool [GitHub, 8023 stars]
  • Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1228 stars]
  • HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 616 stars]
  • AutoML Natural Language - Google's paid AutoML NLP service

Text Generation

License CC0



  • All linked resources belong to original authors





License: Creative Commons Zero v1.0 Universal

