bho9668 / zbrain

Infrastructure useful to create natural language processing systems based on transformer networks

Zamia Brain

The Zamia Brain project provides infrastructure useful to create natural language processing systems based on transformer networks (see https://arxiv.org/abs/1706.03762).

This project is still highly experimental; everything is subject to change without prior notice. The current approach is to generate training corpora for pretraining as well as for (multi-)domain refinement. The goal is to train networks whose natural language processing capabilities are very robust (i.e. free of the brittleness found in traditional rule-based systems), which is what pretraining is for, while still allowing a certain amount of control over their behavior, which is what refinement is for.

To this end, the project provides the following components:

Corpora

Twitter

You will need to provide a list of accounts as input:

./twitterscrape.py -l de -s 2019-08-01 twitter_de_201908 -U user_stats_de.json twitter_sources_de.txt

Heise

Important: adapt the hard-coded paths first!

./qa_extract_heise.py

Parole

Important: adapt the hard-coded paths first!

./qa_extract_parole.py

Wikipedia

Important: adapt the hard-coded paths first!

./qa_extract_wikipedia.py -l de

Export for pretraining

This works for GPT-2 as well as Transformer-XL:

./qa_export_transformer-lm.py -o base_de heise parole twitter_de_2010 twitter_de_201907 wikipedia_de

Next, encode the corpus using a sentencepiece tokenization model and run the pretraining.
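
The encoding step itself is not shown in this README. The following is a minimal sketch of what it could look like, assuming the exported corpus is a plain-text file with one sample per line and that a sentencepiece model has already been trained. The helper names `load_sentencepiece_encoder` and `encode_lines` are hypothetical and not part of this project.

```python
from typing import Callable, Iterable, Iterator, List


def load_sentencepiece_encoder(model_path: str) -> Callable[[str], List[int]]:
    """Return a function mapping text to token ids (needs the sentencepiece package)."""
    # Imported lazily so the rest of the sketch runs without sentencepiece installed.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return lambda text: sp.encode(text, out_type=int)


def encode_lines(lines: Iterable[str],
                 encode: Callable[[str], List[int]]) -> Iterator[List[int]]:
    """Encode every non-empty line of the corpus into a list of token ids."""
    for line in lines:
        line = line.strip()
        if line:  # assumption: blank lines carry no training signal and are skipped
            yield encode(line)
```

To use it, build the encoder once with `load_sentencepiece_encoder(...)`, pointing at your own model file, and pass an open corpus file to `encode_lines`. Any callable that maps a string to a list of ids can stand in for the encoder, which makes the loop easy to try without a trained model.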

Extract skills from Zamia AI

./qa_extract_skills.py -l de skill_personal_de personal.xml

QA finetuning

The -q option is important here: it makes the export include dialog samples.

./qa_export_transformer-lm.py -q -o qa_de twitter_de_2010 twitter_de_201907 skill_personal_de

Next, encode the corpus using a sentencepiece tokenization model and continue the training for finetuning.

TODO: KB encoding / indexing / lookup

FIXME

Architecture

This is just a sketch of what a full QA chat bot with associative memory could look like:

                        +------------------+                                                        +------------------------------+
                        | Dialog Ctx     1 |                                                        |       Knowledge Base         |
                        | Dialog Ctx     2 |           +--------------------+                       +------------------------------+
                        | ...              |---------> | DeepNN_Context_Vec |                       | 0.1 0.2 0.01 ... | KB Line 1 |
                        | Dialog Ctx     n |           +--------------------+                       | ...              |           |
                        +------------------+                     |                                  | 0.33 0.1 0.5 ... | KB Line m |
                        | Current Input    |                     |                                  +------------------------------+
                        +------------------+                     |                                                 |
                                 |                               |                                                 |
                                 |                               |        +----------------------------------------+
                                 |                               |        |
                                 |                               |        |
                                 |                               v        v
                                 |                     +---------------------------+
                                 |                     | Nearest Neighbour search  |
                                 |                     +---------------------------+
                                 |                                   |
                                 |                                   |
                                 |                                   v
                                 |                  +--------------------------------+
                                 |                  |         KB Context Lines       |
                                 |                  +--------------------------------+
                                 |                  | 0.4 0.2 0.9 ... Info    line 1 |
                                 |                  | ...                            |
                                 |                  | 0.8 0.2 0.4 ... Info    line k |
                                 |                  +--------------------------------+
                                 |                                   |
                                 |                                   |
                                 |      +----------------------------+
                                 |      |
                                 |      |
                                 v      v
                          +-----------------------+
                          | Info    Line 1        |
                          | ...                   |
                          | Info    Line k        |
                          +-----------------------+
                          | Ctx     Line 1        |
                          | ...                   |
                          | Ctx     Line n        |
                          +-----------------------+
                          | Current Input         |
                          +-----------------------+
                                     |
                                     |
                                     v
                              +-------------+
                              |  DeepNN_QA  |
                              +-------------+
                                     |
                                     |
                              +-------------+
                              |   Response  |
                              +-------------+


[ Knowledge + Dialog History + Current Input ] -> [ Response ]
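
The retrieval and input assembly steps of the sketch above can be illustrated in a few lines. This is only a toy version under stated assumptions: the context vector is taken as given (standing in for the output of DeepNN_Context_Vec), cosine similarity is assumed as the nearest-neighbour metric, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_kb_lines(ctx_vec: Vector, kb: List[Tuple[Vector, str]], k: int) -> List[str]:
    """Return the text of the k KB lines whose vectors are most similar to ctx_vec."""
    ranked = sorted(kb, key=lambda entry: cosine(ctx_vec, entry[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_qa_input(info: List[str], dialog_ctx: List[str], current: str) -> List[str]:
    """Stack info lines, dialog context lines and the current input, in diagram order."""
    return info + dialog_ctx + [current]


# Toy knowledge base: (vector, KB line) pairs with made-up two-dimensional vectors.
kb = [
    ([1.0, 0.0], "Berlin is the capital of Germany."),
    ([0.0, 1.0], "Water boils at 100 degrees Celsius."),
    ([0.9, 0.1], "Germany is in Europe."),
]
ctx_vec = [1.0, 0.0]  # stand-in for the DeepNN_Context_Vec output
info = nearest_kb_lines(ctx_vec, kb, k=2)
qa_input = build_qa_input(info, ["Hi!", "Hello, how can I help?"],
                          "What is the capital of Germany?")
```

The resulting `qa_input` list is what would be fed to DeepNN_QA. A real system would replace the linear scan with an approximate nearest-neighbour index once the knowledge base grows.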

Knowledge Base

Dialog

<pre>
DS_i → data/qa_src/DS_i/#.json   \
DS_j → data/qa_src/DS_j/.json    |
.                                 \ data/qa_enc/train/.json
.                                 / data/qa_enc/val/.json
.                                 |
DS_n → data/qa_src/DS_n/##.json  /
</pre>
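
How the per-dataset files are distributed over the train and val directories is not specified here. One possible approach, shown as a sketch only, is a deterministic hash-based split. The 10% validation share, the hashing scheme, the `*.json` file pattern and the function names are all assumptions, not the project's actual behavior.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def assign_split(name: str, val_percent: int = 10) -> str:
    """Deterministically map a sample name to 'train' or 'val' via a hash bucket."""
    bucket = int(hashlib.sha256(name.encode("utf-8")).hexdigest(), 16) % 100
    return "val" if bucket < val_percent else "train"


def plan_splits(src_root: Path, val_percent: int = 10) -> Dict[str, List[Path]]:
    """Group every <src_root>/<dataset>/*.json file under 'train' or 'val'."""
    plan: Dict[str, List[Path]] = {"train": [], "val": []}
    for path in sorted(src_root.glob("*/*.json")):
        # Key on dataset name plus file name so equal file names in
        # different datasets are hashed independently.
        key = f"{path.parent.name}/{path.name}"
        plan[assign_split(key, val_percent)].append(path)
    return plan
```

Calling `plan_splits(Path("data/qa_src"))` returns the files destined for each split. Because the assignment depends only on the file's name, re-running the export keeps each sample in the same split.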

Datasets

Dialog Corpora

Chat Corpora

  • Zamia AI

  • 74M AIML bots

  • 142M chat_corpus https://github.com/Marsan-Ma-zz/chat_corpus https://github.com/Marsan-Ma/twitter_scraper

               34M open subtitles
               21M twitter_en
    *   41M    cornell_movie_dialogs_corpus
    *   33M    cornell_movie_quotes_corpus.zip
    *    0.2M  Microsoft Research Social Media Conversation Corpus
    *    4.3M  swb1_dialogact_annot.tar.gz
    * 7800M    The Ubuntu Dialogue Corpus v1.0
    *          NPS Chat Corpus (NLTK)
    *          Internet archive Twitter stream https://archive.org/search.php?query=collection%3Atwitterstream&sort=-publicdate&page=2
    *   58M    chatterbot-logs

Knowledge

  • WikiData

  • conceptnet5

  • framenet_v15

  • HappyDB

  • linkedgeodata

  • nell

  • opencyc

  • SemLink

  • SUMO

  • UMBEL

  • weather

  • wordnet
