share424 / attention-based-question-answering


Attention Based Question Answering

This project uses the CoQA dataset. CoQA is designed for conversational question answering, but this project uses it for simple, single-turn question answering.
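CoQA is distributed as a single JSON file. A minimal sketch of loading it (the file path is an assumption; the field names follow the standard CoQA v1.0 format):

# e.g. (file path is an assumption; adjust to where you downloaded CoQA)
import json

with open("coqa-train-v1.0.json") as f:
    coqa = json.load(f)

# each entry holds a passage ("story") plus parallel lists of questions and answers
sample = coqa["data"][0]
print(sample["story"][:100])                 # the context passage
print(sample["questions"][0]["input_text"])  # first question of the conversation
print(sample["answers"][0]["input_text"])    # its free-form answer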

Model Pipeline

This project consists of three modules: Question Analysis, Passage Retriever, and Answer Finder.

The pipeline receives two inputs: the question and the context of the question (e.g. a news story or article). The question is fed to the question analysis module, while the context is split into sentences and fed to the passage retriever, which proposes the sentences most likely to contain the answer. Finally, the answer finder receives the question and a proposed sentence as input and generates a predicted answer, as sketched below.
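Glued together, the pipeline looks roughly like this (a hypothetical skeleton; the three callables stand in for the modules described in the sections below):

# e.g. (hypothetical glue code, not the exact implementation)
def answer_question(question, context, encode_question, propose_sentences, find_answer):
    question_embedding = encode_question(question)                  # Question Analysis
    top_sentences = propose_sentences(question_embedding, context)  # Passage Retriever
    # one candidate answer per proposed sentence (top-3 by default)
    return [find_answer(question, s) for s in top_sentences]        # Answer Finder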

Question Analysis

The point of question analysis is to extract information from the question. You could use NER (named entity recognition) for this, but in this project I use SBert to get an embedding of the question.

# e.g.
from sentence_transformers import SentenceTransformer

# DPR question encoder for dense retrieval
question_encoder = SentenceTransformer('facebook-dpr-question_encoder-multiset-base')
question = "Where did she live?"
question_embedding = question_encoder.encode(question)
print(question_embedding)

# example output
# [ 4.06161875e-01 -1.38373017e-01 -1.14733957e-01  2.26605639e-01 ... ]

Passage Retriever

The purpose of this module is to propose sentences from the context that contain the answer to the question. It receives the embedding of the question and the embedding of every sentence in the given context, then ranks the sentences by cosine similarity to the question. The proposed sentences are used as input to the answer finder. To get better accuracy, I use the top-3 sentences as the input, so we get the 3 best candidate answers.

# e.g.
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util

# DPR context encoder, the counterpart of the question encoder above
context_encoder = SentenceTransformer('facebook-dpr-ctx_encoder-multiset-base')
context = "this is a long article that contains the answer..."
# split the context into sentences
sentences = sent_tokenize(context)
# calculate the embedding of every sentence
context_embedding = context_encoder.encode(sentences)
# cosine similarity between the question and every sentence
similarities = util.pytorch_cos_sim(question_embedding, context_embedding).numpy()
# sort sentence indices from most to least similar
sorted_arg = np.argsort(similarities, axis=-1)[0][::-1]
# print the top-3 sentences
print([sentences[i] for i in sorted_arg[:3]])

Answer Finder

This is the main module, which generates the answer. It receives the question and a proposed sentence that may contain the answer.

This module consists of three steps: preprocessing, encoder, and decoder.

Preprocessing

This step concatenates the question and the proposed sentence into one input text.

# e.g.
question = "where did she live?"
proposed_sentence = "in a barn near a farm house, there lived a little white kitten"
# concat the question and proposed sentence with <sep> token
input_text = "<start> " + question + " <sep> " + proposed_sentence + " <end>"
print(input_text)
# output: <start> where did she live? <sep> in a barn near a farm house, there lived a little white kitten <end>

Every number contained in the question or the proposed sentence will be extracted and replaced with a <number> token:

text = "there are 5 dogs in the house"
output, numbers = preprocess_sentence(text)
print(output)
# output: <start> there are <number> dogs in the house <end>
print(numbers)
# output: ['5']

These extracted numbers will be fed to the number decoder to recover the correct number in the generated output.
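preprocess_sentence above is a helper from this project's notebook; a minimal regex-based sketch of the behavior shown in the example (an assumption, not the exact implementation) could look like:

# a minimal sketch of preprocess_sentence (assumed behavior, not the exact code)
import re

def preprocess_sentence(text):
    # pull out every number, in order of appearance
    numbers = re.findall(r'\d+', text)
    # replace each number with a placeholder token
    text = re.sub(r'\d+', '<number>', text)
    # wrap with the start/end tokens expected by the model
    return "<start> " + text.strip() + " <end>", numbers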

Encoder and Decoder

This module is responsible for capturing the context of the question and generating the answer using attention.

I use Bahdanau attention for the attention layer; you can check the paper for the detailed formulation. Every generated <number> token is passed to the number decoder to get the correct number: the number decoder receives the decoder hidden state and the numbers extracted in the preprocessing step, and outputs the correct number.
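For reference, a minimal sketch of Bahdanau (additive) attention in PyTorch (an illustration of the mechanism, not this project's exact layer):

# a minimal Bahdanau (additive) attention layer (illustrative, not the exact code)
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W_query = nn.Linear(hidden_size, hidden_size)
        self.W_keys = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        # query: (batch, hidden) decoder state; keys: (batch, seq_len, hidden) encoder outputs
        scores = self.v(torch.tanh(self.W_query(query).unsqueeze(1) + self.W_keys(keys)))
        weights = torch.softmax(scores, dim=1)   # attention over the input tokens
        context = (weights * keys).sum(dim=1)    # weighted sum of encoder outputs
        return context, weights.squeeze(-1)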

Note: You can implement the number decoder mechanism for PERSON, LOCATION, and ORGANIZATION entities as well.
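In the same spirit, the number decoder could score each extracted number against the decoder's hidden state and copy the best match. A hypothetical pointer-style sketch (the actual design in the notebook may differ):

# a hypothetical pointer-style number decoder (assumed design, not the exact code)
import torch
import torch.nn as nn

class NumberDecoder(nn.Module):
    def __init__(self, hidden_size, embed_size):
        super().__init__()
        self.W_hidden = nn.Linear(hidden_size, hidden_size)
        self.W_number = nn.Linear(embed_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, hidden, number_embeddings):
        # hidden: (batch, hidden) decoder state at the <number> step
        # number_embeddings: (batch, n, embed) embeddings of the extracted numbers
        scores = self.v(torch.tanh(
            self.W_hidden(hidden).unsqueeze(1) + self.W_number(number_embeddings)))
        # index of the extracted number to copy into the output
        return scores.squeeze(-1).argmax(dim=-1)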

Results

The model achieves a BLEU-4 score of 74.16.
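For context, BLEU-4 can be computed with NLTK (a generic sketch of the metric; the project's exact evaluation script may differ):

# a generic BLEU-4 computation with NLTK (not necessarily the project's eval script)
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["in", "a", "barn"]]]   # one list of reference answers per example
predictions = [["in", "a", "barn"]]    # tokenized model outputs
score = corpus_bleu(references, predictions,
                    weights=(0.25, 0.25, 0.25, 0.25),  # uniform 1- to 4-gram weights
                    smoothing_function=SmoothingFunction().method1)
print(score * 100)  # 100.0 for an exact match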

Pretrained Models

You can download the pretrained models here.

Reference

  1. Machine Translation
  2. SBert for the sentence embedding
  3. CoQA
