pjlintw / tf-transformer

A TensorFlow 1.12 implementation of the Transformer


Transformer - Attention Is All You Need

A TensorFlow 1.12 implementation of the Transformer. The core functions of the Transformer, such as scaled dot-product attention, multi-head attention, and the feed-forward network, are implemented in nn.py.

For more details, see the paper: Ashish Vaswani et al., "Attention Is All You Need."

Note: TF 1.x scripts will not work with TF 2.0. Therefore, new variants of the Transformer will move to TensorFlow 2.0.
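
As a reference for the formula attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, here is a minimal TF 1.x sketch of scaled dot-product attention. The function name, signature, and mask convention (1 = blocked position) are illustrative and may differ from the actual code in nn.py.

    import tensorflow as tf

    def scaled_dot_product_attention(q, k, v, mask=None):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = tf.cast(tf.shape(k)[-1], tf.float32)
        scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)  # (..., len_q, len_k)
        if mask is not None:
            # Masked positions (mask == 1) get a large negative score,
            # so their softmax weight is effectively zero.
            scores += mask * -1e9
        weights = tf.nn.softmax(scores, axis=-1)
        return tf.matmul(weights, v), weights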

Example for Multi-Head Attention
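
A minimal sketch of multi-head attention in TF 1.x, reusing scaled_dot_product_attention from the sketch above. Again, names and signatures are illustrative and may not match nn.py.

    def multi_head_attention(q, k, v, d_model=128, num_heads=8, mask=None):
        """Project Q/K/V into num_heads subspaces, attend in parallel, recombine."""
        batch_size = tf.shape(q)[0]
        depth = d_model // num_heads

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
            x = tf.reshape(x, (batch_size, -1, num_heads, depth))
            return tf.transpose(x, perm=[0, 2, 1, 3])

        q = split_heads(tf.layers.dense(q, d_model))
        k = split_heads(tf.layers.dense(k, d_model))
        v = split_heads(tf.layers.dense(v, d_model))

        attended, _ = scaled_dot_product_attention(q, k, v, mask)

        # (batch, num_heads, len_q, depth) -> (batch, len_q, d_model)
        attended = tf.transpose(attended, perm=[0, 2, 1, 3])
        concat = tf.reshape(attended, (batch_size, -1, d_model))
        return tf.layers.dense(concat, d_model)

    # Self-attention over a (batch, seq_len, d_model) tensor:
    x = tf.random_normal((32, 100, 128))
    out = multi_head_attention(x, x, x)  # shape (32, 100, 128)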

Prerequisites

Dependencies

  • TensorFlow >= 1.12
  • Python >= 3.6

Dataset

Note: 3,000 examples were used for my experiment. The dataset is not provided.

For training the model, source and target examples should be provided in data/src.train.example.txt and data/tgt.train.example.txt. Each source example corresponds to the target example at the same line index in the target file.

In the reconstruction task, the encoder takes a source example and produces a low-dimensional representation. The decoder tries to reconstruct the original sentence from this low-dimensional code and the target example.

You can replace the dataset with a parallel corpus for a machine translation task. Concretely, the source file src.txt contains sentences of language A, and tgt.txt contains sentences of language B. Note that you then need to provide two vocabulary files and modify the vocabulary-creation code.
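
For illustration, a small sketch of how a character-level vocabulary file could be built from such a text file. The helper below and the special tokens are assumptions for this example, not the repository's actual preprocessing code.

    from collections import Counter

    def build_vocab(corpus_path, vocab_path,
                    specials=('<pad>', '<unk>', '<sos>', '<eos>')):
        """Write one token per line: special tokens first, then by frequency."""
        counter = Counter()
        with open(corpus_path, encoding='utf-8') as f:
            for line in f:
                counter.update(list(line.strip()))  # character-level tokens
        with open(vocab_path, 'w', encoding='utf-8') as f:
            for token in specials:
                f.write(token + '\n')
            for token, _ in counter.most_common():
                f.write(token + '\n')

    # One vocabulary per language for translation:
    build_vocab('data/src.txt', 'data/src.vocab.txt')
    build_vocab('data/tgt.txt', 'data/tgt.vocab.txt')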

Train

Parameters

To make sure this code is correctly implemented and trainable, I trained sentence reconstruction on a tiny Classical Chinese dataset with this repository. The parameters were therefore set to overfit the dataset.

Parameter                  Value
EPOCH                      2000
BATCH SIZE                 32
DROPOUT                    0.1
NUM LAYERS                 1
D MODEL                    128
NUM HEADS                  8
ENCODER SEQUENCE LENGTH    100
DECODER SEQUENCE LENGTH    100
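
The same settings written as a plain Python dict for readability (a hypothetical snippet; the actual values are set in the training script):

    hparams = {
        'epochs': 2000,
        'batch_size': 32,
        'dropout': 0.1,
        'num_layers': 1,
        'd_model': 128,
        'num_heads': 8,
        'encoder_seq_len': 100,
        'decoder_seq_len': 100,
    }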

Training

python train_transformer.py

Results

The vanilla Transformer consists of two attention-based networks: an encoder and a decoder. This architecture is similar to an autoencoder (Hinton & Salakhutdinov, 2006). The experiment suggests that the Transformer can be trained on the reconstruction task, for both short and long sequences.

short sentence (17 tokens)

source         > 秋天,吳國攻伐楚國,楚軍擊敗吳軍。

reconstruction > 秋天,吳國攻伐楚國,楚軍擊敗吳軍。<eos>

long sentence (62 tokens)

source         > 樊穆仲說:魯懿公之弟稱,莊重恭敬地事奉神靈,敬重老人;處事執法,必定諮詢先王遺訓及前朝故事;不牴觸先王遺訓,不違背前朝故事。

reconstruction > 樊穆仲說:魯懿公之弟稱,莊重恭敬地事奉神靈,敬重老人;處事執法,必定諮詢先王遺訓及前朝故事;不牴觸先王遺訓,不違背前朝故事。<eos>

Notes

  • Construct the masks correctly (both the padding mask and the decoder's look-ahead mask); see the sketch below.
  • A learning-rate schedule is a must; the warmup schedule from the paper is sketched below.
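
For reference, a minimal sketch of the two masks and of the warmup learning-rate schedule from the paper, lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). Function names and the mask convention (1 = blocked position) are assumptions and may differ from this repository's code.

    import tensorflow as tf

    def padding_mask(seq, pad_id=0):
        # 1.0 at padded positions; shaped (batch, 1, 1, seq_len) so it
        # broadcasts over (batch, num_heads, len_q, len_k) attention scores.
        mask = tf.cast(tf.equal(seq, pad_id), tf.float32)
        return mask[:, tf.newaxis, tf.newaxis, :]

    def look_ahead_mask(size):
        # Strictly upper-triangular ones: position i may not attend to j > i.
        return 1.0 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

    def noam_learning_rate(step, d_model=128, warmup_steps=4000):
        # Linear warmup, then decay proportional to 1/sqrt(step).
        step = tf.cast(step, tf.float32)
        return tf.rsqrt(tf.cast(d_model, tf.float32)) * tf.minimum(
            tf.rsqrt(step), step * warmup_steps ** -1.5)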

Implementation Reference
