yui-mhcp / nlp

NLP project


😋 Natural Language Processing (NLP) & Natural Language Understanding (NLU)

This repository is an extension of @Ananas120's Master thesis repository, which extended my base project to Q&A. I have generalized and cleaned up his code to support general NLP tasks (and not only Q&A). Thanks to him for his contribution ! 😄

Check the CHANGELOG file for a global overview of the latest modifications ! 😋

IMPORTANT NOTE : This project is currently being updated, meaning that some features may not work at the moment ! They will be restored in the near future, together with new fancy applications and models ! 😄

Project structure

├── custom_architectures
├── custom_layers
├── custom_train_objects
│   ├── losses
│   │   └── qa_retriever_loss.py    : special loss for AnswerRetriever model
│   ├── metrics
│   │   ├── f1.py           : F1 implementation as a `tf.keras.metrics.Metric` class
│   │   └── top_k_f1.py     : extension to support Beam-Search output
├── datasets
├── hparams
├── loggers
├── models
│   ├── nlu             : general NLU classes
│   │   ├── base_nlu_generator.py   : extension of `BaseNLUModel` for text-generative models
│   │   ├── base_nlu_model.py       : general interface defining data loading for text-based models
│   │   └── nlu_utils.py            : utilities for the NLU models
│   ├── qa              : directory for Q&A based models
│   │   ├── answer_generator.py     : model that generates an answer
│   │   ├── answer_retriever.py     : model that retrieves the answer within the context
│   │   ├── mag.py                  : extension of `AnswerGenerator` to support MAG-style
│   │   ├── question_generator.py   : model that generates a question based on an answer
│   │   └── web_utils.py            : utilities to search on the web and parse results (for Q&A inputs)
├── pretrained_models
├── unitest
├── utils
├── CITATIONS.thesis.bib    : citations for the master thesis
├── Dockerfile-maggie       : runs the maggie bot in a Docker container
├── Makefile                : defines commands to run / stop maggie
├── docker-compose-maggie.yml   : runs maggie in docker-compose
├── example_answer_generator.ipynb
├── example_mag.ipynb
├── experiments.py          : abstract file defining functions to run multiple experiments
├── experiments_mag.py      : defines the functions to run MAG experiments
├── maggie.py               : the code for the MAGgie bot
├── main.py                 : main file to run build / train / test / predict from the command line
└── question_answering.ipynb

Check the main project for more information about the unextended modules / structure / main classes.

Available features

  • Question Answering (module models.qa) :
Feature | Function / class | Description
------- | ---------------- | -----------
Q&A     | answer_from_web  | Performs question answering based on the top-k most relevant web pages

You can check the question_answering notebook for a concrete demonstration
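Below is a minimal usage sketch : the import path and the `model` / `n` keyword arguments are assumptions, refer to the notebook for the actual signature.

```python
# Hedged usage sketch : the import path and the keyword arguments are assumptions,
# check the question_answering notebook for the actual API.
from models.qa import answer_from_web

# The function is expected to retrieve the top-k most relevant web pages,
# parse them, and use them as context for the Q&A model.
answers = answer_from_web(
    'Who introduced the Transformer architecture ?',
    model = 'maggie',   # name of a pretrained model stored in pretrained_models/
    n     = 5           # (assumed) number of web pages used as context
)
print(answers)
```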

Available models

Model architectures

Available architectures :

  • BERT
  • BART
  • GPT-2
  • MAG : a general wrapper for text-based Transformers.

Model weights

Task | Name   | Lang | Class | Dataset          | Trainer   | Weights
---- | ------ | ---- | ----- | ---------------- | --------- | -------
Q&A  | maggie | en   | MAG   | NQ, CoQA, NewsQA | Ananas120 | Google Drive

Weights will be added in the next update

Models must be unzipped in the pretrained_models/ directory !
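For instance, assuming the downloaded archive is named `maggie.zip` (hypothetical name), it can be extracted with a few lines of Python :

```python
# Minimal sketch : extract downloaded weights into pretrained_models/
# The archive name `maggie.zip` is an assumption, use the name of the archive you downloaded.
import zipfile

with zipfile.ZipFile('maggie.zip', 'r') as archive:
    archive.extractall('pretrained_models/')
```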

Installation and usage

  1. Clone this repository : git clone https://github.com/yui-mhcp/nlp.git
  2. Go to the root of this repository : cd nlp
  3. Install requirements : pip install -r requirements.txt
  4. Open the question_answering notebook and follow the instructions !

TO-DO list :

  • Make the TO-DO list
  • Clean-up the code
  • Comment the code
  • Create general NLU classes
  • Allow modifying the default configuration in the predict method (experimental)
  • Extend for general NLU tasks (and not only Q&A)
    • Add Masked Language Modeling (MLM) support
    • Add Next Word Prediction (NWP) support
    • Add Neural Machine Translation (NMT) support
    • Add text classification tasks (such as intent / emotion / topic classification)
  • Add pretrained models (for Q&A in English) (Ananas120's master thesis models)
  • Add new languages support
  • Add document parsing to perform Q&A on documents (in progress)
  • Convert the llama2 model to tensorflow

NLP vs NLU

Natural Language Processing (NLP) and Natural Language Understanding (NLU) are general terms that group a number of language-related tasks. Both are often used interchangeably but, in theory, NLP is broader than NLU : it covers both NLU and speech-related tasks, such as Text-To-Speech (TTS) and Speech-To-Text (STT), while NLU refers more specifically to text-understanding tasks (MLM, NMT, NWP, Q&A, text classification, ...). For this reason, this repository may be duplicated into a nlu/ repository, while this one would integrate nlu as well as the existing TTS and STT repositories.

Furthermore, the term understanding is an exaggeration, as models do not really understand the language or the concepts behind words : they mimic what they have learned to do and what their training data contains.

Pipeline-based prediction

The BaseNLUModel class (and its subclasses) supports pipeline-based prediction, meaning that all the tasks shown in the graph below are multi-threaded. Check the data_processing project for a better understanding of the producer-consumer framework.

NLP pipeline (diagram)
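The snippet below is an illustrative sketch of the producer-consumer idea, not the project's actual implementation : each stage runs in its own worker thread and passes its results to the next stage through a queue.

```python
# Illustrative producer-consumer sketch (not the project's actual implementation) :
# each stage runs in its own worker thread and communicates through queues.
import queue
import threading

def worker(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:            # sentinel : forward the stop signal and exit
            q_out.put(None)
            break
        q_out.put(fn(item))

q_text, q_encoded, q_output = queue.Queue(), queue.Queue(), queue.Queue()

encode  = lambda text: text.lower().split()             # stands for data processing
predict = lambda tokens: {'answer' : ' '.join(tokens)}  # stands for model inference

threading.Thread(target = worker, args = (encode,  q_text,    q_encoded)).start()
threading.Thread(target = worker, args = (predict, q_encoded, q_output)).start()

for text in ['Who is maggie ?', 'What is MAG ?']:
    q_text.put(text)
q_text.put(None)                    # stop signal propagated through the pipeline

while (result := q_output.get()) is not None:
    print(result)
```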

Contacts and licence

You can contact me at yui-mhcp@tutanota.com or on discord at yui#0732

The objective of these projects is to facilitate the development and deployment of useful applications using Deep Learning for solving real-world problems and helping people. For this purpose, all the code is released under the Affero GPL (AGPL) v3 licence.

All my projects are "free software", meaning that you can use, modify, deploy and distribute them freely, in compliance with the Licence. They are not in the public domain and are copyrighted : there exist some conditions on their distribution, but their objective is to make sure that everyone is able to use and share any modified version of these projects.

Furthermore, if you want to use any project in a closed-source project, or in a commercial project, you will need to obtain another Licence. Please contact me for more information.

For my protection, it is important to note that all projects are provided on an "As Is" basis, without any warranties or conditions of any kind, either express or implied. However, do not hesitate to report issues on the repository or to make a Pull Request to solve them 😄

If you use this project in your work, please add this citation to give it more visibility ! 😋

@misc{yui-mhcp,
    author  = {yui},
    title   = {A Deep Learning projects centralization},
    year    = {2021},
    publisher   = {GitHub},
    howpublished    = {\url{https://github.com/yui-mhcp}}
}

Acknowledgments

Thanks to @Ananas120 for his contribution and sharing his code !

Notes and references

All the citations for the master thesis are available in the CITATIONS file, with links to the papers. They are good starting points to discover NLP and the Transformer-based models that are omnipresent in NLP / NLU nowadays !

The thesis report is not published by the university yet.
