lucialagenial / The-NLP-Pandect

A comprehensive reference for all topics related to Natural Language Processing

The-NLP-Pandect

This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.

Note: Quick legend on available resource types:

⭐ - open source project, usually a GitHub repository with its number of stars

📙 - resource you can read, usually a blog post or a paper

🗂️ - a collection of additional resources

🔱 - non-open source tool, framework or paid service

🎥️ - a resource you can watch

🎙️ - a resource you can listen to

The-NLP-Resources

Note: Section keywords: paper summaries, compendium, awesome list

Compendiums and awesome lists on the topic of NLP:

NLP Conferences, Paper Summaries and Paper Compendiums:

Papers and Paper Summaries
Conference Summaries

NLP Progress and NLP Tasks:

NLP Datasets:

Word and Sentence embeddings:

Notebooks, Scripts and Repositories

Non-English resources and Compendiums

Pre-trained NLP models

NLP History

General
2020 Year in Review

The-NLP-Podcasts

NLP-only podcasts

Many NLP episodes

Some NLP episodes

The-NLP-Newsletter

The-NLP-Meetups

The-NLP-Youtube

The-NLP-Benchmarks

General NLU

  • GLUE - General Language Understanding Evaluation (GLUE) benchmark
  • SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
  • decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
  • dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
  • DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking

Summarization

  • WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset
  • WikiLingua - A Multilingual Abstractive Summarization Dataset

Question Answering

  • SQuAD - Stanford Question Answering Dataset (SQuAD)
  • XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
  • GrailQA - Strongly Generalizable Question Answering (GrailQA)
  • CSQA - Complex Sequential Question Answering

Multilingual and Non-English Benchmarks

  • 📙 XTREME - Massively Multilingual Multi-task Benchmark
  • GLUECoS - A benchmark for code-switched NLP
  • IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
  • LinCE - Linguistic Code-Switching Evaluation Benchmark
  • Russian SuperGlue - Russian SuperGlue Benchmark

Bio, Law, and other scientific domains

  • BLURB - Biomedical Language Understanding and Reasoning Benchmark
  • BLUE - Biomedical Language Understanding Evaluation benchmark
  • LexGLUE - A Benchmark Dataset for Legal Language Understanding in English

Transformer Efficiency

Speech Processing

  • SUPERB - Speech processing Universal PERformance Benchmark

Other

  • CodeXGLUE - A benchmark dataset for code intelligence
  • CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
  • MultiNLI - Multi-Genre Natural Language Inference corpus
  • iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic

The-NLP-Research

General

Embeddings

Repositories

Blogs

Cross-lingual Word and Sentence Embeddings

  • vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 594 stars]
  • sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 8200 stars]

Byte Pair Encoding

  • bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1068 stars]
  • subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1932 stars]
  • python-bpe - Byte Pair Encoding for Python [GitHub, 177 stars]
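
As a rough illustration of the idea behind the libraries above, here is a minimal sketch of learning Byte Pair Encoding merges from a word list. It is not the API of bpemb, subword-nmt or python-bpe. The function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker and the tie-breaking rule are assumptions chosen for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For example, with the toy corpus `["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5`, the first learned merge is `("u", "g")`, because that pair occurs 20 times. Production implementations add details this sketch omits, such as frequency thresholds and faster pair-count updates.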

Transformer-based Architectures

General

Transformer

BERT

Other Transformer Variants

T5
BigBird
Reformer / Linformer / Longformer / Performers
Switch Transformer

GPT-family

General
GPT-3
Learning Resources
Applications
  • Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 3591 stars]
  • 🗂️ GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
  • 🗂️ GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
  • 🔱 OpenAI API - API Demo to use GPT-3 for commercial applications
Open-source Efforts

Other

Distillation, Pruning and Quantization

Reading Material
Tools
  • Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 56 stars]
  • XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 116 stars]

Automated Summarization

  • 📙 PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
  • CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 114 stars]
  • XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 165 stars]
  • SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 197 stars]
  • PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 86 stars]
  • summarus - Models for automatic abstractive summarization [GitHub, 143 stars]

Knowledge Graphs and NLP

The-NLP-Industry

Note: Section keywords: best practices, MLOps

Best Practices for NLP

MLOps for NLP

MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.

In general, MLOps for NLP includes having the following processes in place:

  • Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
  • Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
  • Model Registry - make sure any neural models you train are versioned and tracked and it is easy to roll back to any of them
  • Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
  • Model Deployment and Serving - automate model deployment, ideally also with zero-downtime deploys like Blue/Green, Canary deploys etc.
  • Data and Model Observability - track data drift, model accuracy drift etc.
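
To make the first two processes concrete, here is a minimal sketch of data versioning and experiment tracking using only the Python standard library. It is not how DVC, mlflow or any other tool listed below works internally. The helper names (`dataset_fingerprint`, `record_run`, `best_run`) and the in-memory log are hypothetical and exist only for illustration.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a stable SHA-256 hash for a list of JSON-serializable records."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the hash independent of dictionary key order.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def record_run(log, params, metrics, data_hash):
    """Append one experiment record to an in-memory log and return it."""
    entry = {
        "run_id": len(log) + 1,
        "params": dict(params),
        "metrics": dict(metrics),
        "data_hash": data_hash,  # ties the run to the exact data version
    }
    log.append(entry)
    return entry


def best_run(log, metric):
    """Return the logged run with the highest value for `metric`."""
    return max(log, key=lambda entry: entry["metrics"][metric])
```

Storing the data hash next to the parameters and metrics is what lets a run be retraced later: if the fingerprint of the current training data differs from the one in the log, the result cannot be expected to reproduce.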

Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:

  • Feature Store - centralized storage of all features developed for ML models that can be easily reused by any other ML project
  • Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.

MLOps Compilations & Awesome Lists

Reading Material

Learning Material

  • 🗂 MLOps course by Made With ML
  • 🗂 GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops with GitHub

MLOps Communities

Data Versioning

  • DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
  • 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
  • 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]

Experiment Tracking

  • mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
  • 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
  • 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
  • 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
  • 🔱 SigOpt - automate training & tuning, visualize & compare runs [Paid Service]
  • Optuna - hyperparameter optimization framework [GitHub, 6750 stars]
  • Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
  • Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5875 stars]

Model Registry

  • DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
  • mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
  • ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1481 stars]
  • 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
  • 🔱 Valohai - End-to-end ML pipelines [Paid Service]
  • 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
  • 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
  • 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]

Automated Testing and Behavioral Testing

  • CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1717 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2036 stars]
  • WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 73 stars]
  • Great Expectations - Write tests for your data [GitHub, 6965 stars]
  • Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 1785 stars]
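
A behavioral test checks a property of the model's behavior rather than its accuracy on a held-out set. The sketch below shows an invariance test in plain Python: swapping a person's name in a template should not change the prediction. It does not use the CheckList or TextAttack APIs. `toy_sentiment` is a hypothetical stand-in model and `invariance_failures` is a hypothetical helper.

```python
import re

POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "awful", "terrible"}


def toy_sentiment(text):
    """Hypothetical stand-in model: counts words from two tiny lexicons."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def invariance_failures(model, template, fillers):
    """Fill `template` with each filler and list those whose prediction
    differs from the prediction for the first filler."""
    predictions = {f: model(template.format(name=f)) for f in fillers}
    expected = predictions[fillers[0]]
    return [f for f, p in predictions.items() if p != expected]
```

A call such as `invariance_failures(toy_sentiment, "{name} had a great flight.", ["Maria", "Ahmed", "Yuki"])` returns an empty list for this toy model. A model that reacts to the name itself would show up as a non-empty list, which is the kind of bias this test is meant to surface.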

Model Deployment and Serving

  • mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
  • 🔱 Amazon SageMaker [Paid Service]
  • 🔱 Valohai - End-to-end ML pipelines [Paid Service]
  • 🔱 NLP Cloud - Production-ready NLP API [Paid Service]
  • 🔱 Saturn Cloud [Paid Service]
  • 🔱 SELDON - machine learning deployment for enterprise [Paid Service]
  • 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
  • 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
  • TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 2761 stars]
  • 🔱 Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
  • KFServing - Serverless Inferencing on Kubernetes [GitHub, 1655 stars]
  • 🔱 TFX - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Paid Service]
  • 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
  • 🔱 Cortex - containers as a service on AWS [Paid Service]
  • 🔱 Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
  • End2End Serverless Transformers On AWS Lambda [GitHub, 104 stars]
  • NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
  • 🔱 Dagster - data orchestrator for machine learning [Free and Open Source]
  • 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
  • Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5875 stars]
  • flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 2557 stars]
  • MLRun - Machine Learning automation and tracking [GitHub, 776 stars]
  • 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI

Model Debugging

  • imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 875 stars]
  • Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 397 stars]

Model Accuracy Prediction

  • WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 745 stars]

Data and Model Observability

General
  • whylogs - open source standard for data and ML logging [GitHub, 1718 stars]
  • Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 1219 stars]
  • MLRun - Machine Learning automation and tracking [GitHub, 776 stars]
  • 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
  • 🔱 Cortex - containers as a service on AWS [Paid Service]
Model Centric
  • 🔱 Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
  • 🔱 Dataiku - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
  • Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
  • 🔱 Fiddler - ML Model Performance Management Tool [Paid Service]
  • 🔱 Hydrosphere - open-source platform for managing ML models [Paid Service]
  • 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
  • 🔱 Domino Model Ops - Deploy and Manage Models to Drive Business Impact [Paid Service]
  • 🔱 iguazio - deployment and management of your AI applications with MLOps and end-to-end automation of machine learning pipelines [Paid Service]
Data Centric
  • 🔱 Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
  • 🔱 acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
  • 🔱 Bigeye - monitoring and alerting to your datasets in minutes [Paid Service]
  • 🔱 datakin - end-to-end, real-time data lineage solution [Paid Service]
  • 🔱 Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
  • 🔱 SODA - data monitoring, testing and validation [Paid Service]
  • 🔱 whatify - data quality and action recommendation on it [Paid Service]

Feature Stores

  • 🔱 Tecton - enterprise feature store for machine learning [Paid Service]
  • FEAST - open source feature store for machine learning Website [GitHub, 3461 stars]
  • 🔱 Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]

Metadata Management

  • ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 486 stars]
  • 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]

MLOps Frameworks

  • Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5875 stars]
  • kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 7448 stars]
  • Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 3306 stars]
  • ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 2252 stars]
  • 🔱 Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
  • Diffgram - Complete training data platform for machine learning delivered as a single application [GitHub, 1432 stars]
  • 🔱 Continual.ai - build, deploy, and operationalize ML models easier and faster with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks. [Paid Service]

Transformer-based Architectures

General

Multi-GPU Transformers
Training Transformers Effectively

Embeddings as a Service

NLP Recipes Industrial Applications:

NLP Applications in Bio, Finance, Legal and other industries

  • Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 559 stars]
  • Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1197 stars]
  • FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 162 stars]
  • LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 525 stars]
  • NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
  • Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 377 stars]
  • BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 207 stars]

The-NLP-Speech

Note: Section keywords: speech recognition

General Speech Recognition

  • wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6081 stars]
  • DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 20002 stars]
  • 📙 Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
  • kaldi - Kaldi is a toolkit for speech recognition [GitHub, 11821 stars]
  • awesome-kaldi - resources for using Kaldi [GitHub, 502 stars]
  • ESPnet - End-to-End Speech Processing Toolkit [GitHub, 5339 stars]
  • 📙 HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]

Text to Speech

  • FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub, 711 stars]
  • TTS - a deep learning toolkit for Text-to-Speech [GitHub, 5601 stars]

Datasets

  • VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 364 stars]

The-NLP-Topics

Note: Section keywords: topic modeling

Blogs

Frameworks for Topic Modeling

  • gensim - framework for topic modeling [GitHub, 13426 stars]
  • Spark NLP [GitHub, 2863 stars]

Repositories

Keyword-Extraction

Note: Section keywords: keyword extraction

Text Rank

  • PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1843 stars]
  • textrank - TextRank implementation for Python 3 [GitHub, 1136 stars]
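
For readers new to the algorithm, here is a simplified sketch of TextRank keyword extraction: words become nodes, co-occurrence within a small window becomes an edge, and a PageRank-style iteration scores the nodes. It is not the PyTextRank or textrank API. The stopword list, the default parameters and the use of stopword filtering in place of part-of-speech filtering are assumptions made to keep the example dependency-free.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank the words of `text` by a PageRank-style score on a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    # Undirected graph: link each word to the next `window` words.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's current score.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    # Highest score first; alphabetical order breaks ties deterministically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

The fixed iteration count stands in for a convergence check, and single words are returned as-is. The original algorithm additionally merges adjacent top-ranked words into multi-word keyphrases.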

RAKE - Rapid Automatic Keyword Extraction

  • rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 932 stars]
  • yake - Single-document unsupervised keyword extraction [GitHub, 1128 stars]
  • RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 365 stars]

Other Approaches

  • flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5233 stars]
  • BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 218 stars]
  • keyBERT - Minimal keyword extraction with BERT [GitHub, 1717 stars]
  • KeyphraseVectorizers - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 58 stars]

Further Reading

Responsible-NLP

Note: Section keywords: ethics, responsible NLP

NLP and ML Interpretability

NLP-centric

General

  • Language Interpretability Tool (LIT) [GitHub, 2964 stars]
  • WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 394 stars]
  • Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 326 stars]
  • InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 4886 stars]
  • thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 101 stars]
  • Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 232 stars]
  • imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 875 stars]

Ethics, Bias, and Equality in NLP

Adversarial Attacks for NLP

Hate Speech Analysis

  • HateXplain - BERT for detecting abusive language [GitHub, 118 stars]

The-NLP-Frameworks

Note: Section keywords: frameworks

General Purpose

  • spaCy by Explosion AI [GitHub, 23928 stars]
  • flair by Zalando [GitHub, 11910 stars]
  • AllenNLP by AI2 [GitHub, 11138 stars]
  • stanza (formerly Stanford NLP) [GitHub, 6237 stars]
  • spaCy stanza [GitHub, 650 stars]
  • nltk [GitHub, 10955 stars]
  • gensim - framework for topic modeling [GitHub, 13426 stars]
  • pororo - Platform of neural models for natural language processing [GitHub, 1132 stars]
  • NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2854 stars]
  • FARM [GitHub, 1561 stars]
  • gobbli by RTI International [GitHub, 270 stars]
  • headliner - training and deployment of seq2seq models [GitHub, 231 stars]
  • SyferText - A privacy preserving NLP framework [GitHub, 187 stars]
  • DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1227 stars]
  • TextHero - Text preprocessing, representation and visualization [GitHub, 2536 stars]
  • textblob - TextBlob: Simplified Text Processing [GitHub, 8241 stars]
  • AdaptNLP - A high level framework and library for NLP [GitHub, 401 stars]
  • textacy - NLP, before and after spaCy [GitHub, 1956 stars]
  • texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2295 stars]
  • jiant - jiant is an NLP toolkit [GitHub, 1433 stars]

Data Augmentation

  • WildNLP - Text manipulation library to test NLP models [GitHub, 73 stars]
  • snorkel - Framework to generate training data [GitHub, 5220 stars]
  • NLPAug - Data augmentation for NLP [GitHub, 3423 stars]
  • SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 358 stars]
  • faker - Python package that generates fake data for you [GitHub, 14532 stars]
  • textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 567 stars]
  • Parrot - Practical and feature-rich paraphrasing framework [GitHub, 530 stars]
  • AugLy - data augmentations library for audio, image, text, and video [GitHub, 4515 stars]
  • TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 256 stars]

Adversarial NLP Attacks & Behavioral Testing

  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2036 stars]
  • CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 5554 stars]
  • CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1717 stars]

Transformer-oriented

  • transformers by HuggingFace [GitHub, 67950 stars]
  • Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 942 stars]
  • haystack - Transformers at scale for question answering & neural search. [GitHub, 5151 stars]

Dialog Systems and Speech

  • DeepPavlov by MIPT [GitHub, 5824 stars]
  • ParlAI by FAIR [GitHub, 9088 stars]
  • rasa - Framework for Conversational Agents [GitHub, 14655 stars]
  • wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6081 stars]
  • ChatterBot - conversational dialog engine for creating chat bots [GitHub, 12456 stars]
  • SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 4411 stars]

Word/Sentence-embeddings oriented

  • MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 2986 stars]
  • vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 594 stars]
  • sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 8200 stars]

Social Media Oriented

  • Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 569 stars]

Phonetics

  • DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 161 stars]

Morphology

  • LemmInflect - python module for English lemmatization and inflection [GitHub, 167 stars]
  • Inflect - generate plurals, ordinals, indefinite articles [GitHub, 682 stars]
  • simplemma - simple multilingual lemmatizer for Python [GitHub, 682 stars]

Multi-lingual tools

  • polyglot - Multi-lingual NLP Framework [GitHub, 2030 stars]
  • trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 621 stars]

Distributed NLP / Multi-GPU NLP

Machine Translation

  • COMET - A Neural Framework for MT Evaluation [GitHub, 156 stars]
  • marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 947 stars]
  • argos-translate - Open source neural machine translation in Python [GitHub, 1272 stars]
  • Opus-MT - Open neural machine translation models and web services [GitHub, 219 stars]
  • dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 220 stars]

Entity and String Matching

  • PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 531 stars]
  • pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 736 stars]
  • fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 8725 stars]
  • jellyfish - approximate and phonetic matching of strings [GitHub, 1697 stars]
  • textdistance - Compute distance between sequences [GitHub, 2917 stars]
  • DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 437 stars]
  • RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 329 stars]
  • Machamp - Machamp: A Generalized Entity Matching Benchmark [GitHub, 7 stars]
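
Several of the libraries above build on edit distance. As a minimal sketch, assuming plain Levenshtein distance and a simple length-normalized similarity, fuzzy matching can be written as follows. It is not the API of fuzzywuzzy, PolyFuzz or textdistance, and the helper names `similarity` and `best_match` are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string `a` into string `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def best_match(query, candidates):
    """Return the candidate most similar to `query`, ignoring case."""
    return max(candidates, key=lambda c: similarity(query.lower(), c.lower()))
```

The dynamic program keeps only two rows of the distance table, so memory grows with the length of the second string rather than with the product of both lengths.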

Discourse Analysis

  • ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 365 stars]

PII scrubbing

  • scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 293 stars]

Hashtag Segmentation

  • hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 40 stars]
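
To show what the task involves, here is a deliberately simple dictionary-based sketch that splits a hashtag into known words with dynamic programming. This is a different and much weaker technique than the language-model approach hashformers uses. The function name `segment_hashtag`, the fewest-words objective and the example vocabulary are assumptions made for illustration.

```python
def segment_hashtag(hashtag, vocabulary):
    """Split a hashtag into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if no full segmentation exists.
    """
    text = hashtag.lstrip("#").lower()
    # best[i] is the shortest known segmentation of text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1]
```

With the vocabulary `{"we", "need", "a", "national", "nation", "al", "park"}`, the input `"#weneedanationalpark"` is split into `["we", "need", "a", "national", "park"]`. A fixed word list cannot handle unseen words or rank equally short alternatives, which is the gap that model-based segmenters aim to close.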

Books Analysis / Literary Analysis

  • booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 571 stars]
  • bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 72 stars]

Non-English oriented

Japanese

  • fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 238 stars]
  • SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 309 stars]
  • Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 177 stars]
  • jProcessing - Japanese Natural Language Processing Libraries [GitHub, 142 stars]
  • Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 595 stars]
  • kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 831 stars]
  • nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 308 stars]
  • KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 185 stars]
  • Jigg - Pipeline framework for easy natural language processing [GitHub, 71 stars]
  • Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 303 stars]
  • RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 444 stars]
  • toiro - a comparison tool of Japanese tokenizers [GitHub, 103 stars]

Other

  • textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 93 stars]
  • Kashgari - Transfer Learning with focus on Chinese [GitHub, 2315 stars]
  • Underthesea - Vietnamese NLP Toolkit [GitHub, 1001 stars]
  • PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 57 stars]

Text Data Labelling

  • Small-Text - Active Learning for Text Classification in Python [GitHub, 297 stars]
  • Doccano - open source annotation tool for machine learning practitioners [GitHub, 6538 stars]
  • 🔱 Prodigy - annotation tool powered by active learning [Paid Service]

The-NLP-Learning

Note: Section keywords: learn NLP

General

Courses

Books

Tutorials

The-NLP-Communities

Other-NLP-Topics

Tokenization

  • tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 5802 stars]
  • SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 6089 stars]
  • SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 104 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks
  • WildNLP - Text manipulation library to test NLP models [GitHub, 73 stars]
  • NLPAug - Data augmentation for NLP [GitHub, 3423 stars]
  • SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 358 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2036 stars]
  • skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 810 stars]
  • NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 644 stars]
  • EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1295 stars]
  • snorkel - Framework to generate training data [GitHub, 5220 stars]
Reading Material and Tutorials

Named Entity Recognition (NER)

Relation Extraction

  • tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 329 stars]
  • tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 54 stars]
  • tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 63 stars]
  • Re-TACRED - Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 36 stars]

Coreference Resolution

Sentiment Analysis

Domain Adaptation

Low Resource NLP

Spell Correction / Error Correction

  • Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1160 stars]
  • NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 477 stars]
  • SymSpellPy - Python port of SymSpell [GitHub, 585 stars]
  • 📙 Speller100 by Microsoft [Blog, Feb 2021]
  • JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 501 stars]
  • pycorrector - spell correction for Chinese [GitHub, 3400 stars]
  • contractions - Fixes contractions such as you're to you are [GitHub, 241 stars]

Style Transfer for NLP

  • Styleformer - Neural Language Style Transfer framework [GitHub, 394 stars]
  • StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 43 stars]

Automata Theory for NLP

  • pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 736 stars]

Obscene words detection

  • LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1862 stars]

Reddit Analysis

  • Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 480 stars]

Skill Detection

  • SkillNER - rule based NLP module to extract job skills from text [GitHub, 55 stars]

Reinforcement Learning for NLP

  • nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 128 stars]

AutoML / AutoNLP

  • AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 681 stars]
  • TPOT - Python Automated Machine Learning tool [GitHub, 8682 stars]
  • Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1722 stars]
  • HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 660 stars]
  • 🔱 AutoML Natural Language - Google's paid AutoML NLP service
  • Optuna - hyperparameter optimization framework [GitHub, 6750 stars]
  • FLAML - fast and lightweight AutoML library [GitHub, 1990 stars]
  • Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 282 stars]

OCR - Optical Character Recognition

Text Generation

Title / Headlines Generation

  • TitleStylist - Learning to Generate Headlines with Controlled Styles [GitHub, 66 stars]

NLP research reproducibility

License CC0

Attributions

Resources

  • All linked resources belong to original authors

Icons

Fonts


The Pandect Series also includes

     
