YATO: Yet Another deep learning based Text analysis Open toolkit

Quick Links

Introduction

Introduction

YATO, an open-source Python library for text analysis. In particular, YATO focuses on sequence labeling and sequence classification tasks, including extensive fundamental NLP tasks such as part-of-speech tagging, chunking, NER, CCG supertagging, sentiment analysis, and sentence classification. YATO can design both specific RNN-based and Transformer-based through user-friendly configuration and integrating the SOTA pre-trained language models, such as BERT.

YATO is a PyTorch-based framework with flexible choices of input features and output structures. The design of neural sequence models with YATO is fully configurable through a configuration file, which does not require any code work.

Its previous version called NCRF++ has been accepted as a demo paper by ACL 2018. The in-depth experimental report based on NCRF++ was accepted as the best paper by COLING 2018.

Compared with NCRF++, the highlight of YATO is the support for Pre-trained Language Model and sentence classification tasks.

Welcome to star this repository!

Getting Started

We provide an easy way to use the toolkit YATO from PyPI

pip install ylab-yato

Or directly install it from the source code

git clone https://github.com/jiesutd/YATO.git

The code to train a Model

from yato import YATO
model = YATO(configuration file)
model.train()

The code to decode prediction files:

from yato import YATO
decode_model = YATO(configuration file)
result_dict = decode_model.decode()

return dictionary contents following value:

speed: decoding speed
accuracy: If the decoded file contains annotation results, accuracy means verifying the accuracy
precision: If the decoded file contains annotation results, precision means verifying the precision
recall: If the decoded file contains annotation results, recall means verifying the recall
predict_result: predicted result
nbest_predict_score: nbest scores of decoded prediction
label: Mapping between labels and indexes

Data Format

Refer sample_data for the detailed data format.
YATO supports both BIO and BIOES(BMES) tag schemes.
Notice that IOB format (different from BIO) is currently not supported, because this tag scheme is old and works worse than other schemes Reimers and Gurevych, 2017.
The differences among these three tag schemes are explained in this paper.
We provided a script which supports convertation of tag scheme among IOB/BIO/BIOES. Welcome to have a try.

Configuration Preparation

You can specify the model, optimizer, and decoding through the configuration file:

Training Configuration

Dataloader

train_dir=the path of the train file
dev_dir=the path of the validation file
test_dir=the path of the test file
model_dir=the path to save model weights
dset_dir=the path of configuration encode file

Model

use_crf=True/False
use_char=True/False
char_seq_feature=GRU/LSTM/CNN/False
use_word_seq=True/False
use_word_emb=True/False
word_emb_dir=The path of word embedding file
word_seq_feature=GRU/LSTM/CNN/FeedFowrd/False
low_level_transformer=pretrain language model from huggingface
low_level_transformer_finetune=True/False
high_level_transformer=pretrain language model from huggingface
high_level_transformer_finetune=True/False
cnn_layer=layer number
char_hidden_dim=dimension number
hidden_dim=dimension number
lstm_layer=layer number
bilstm=True/False

Hyperparameters

sentence_classification=True/False
status=train/decode
dropout=Dropout Rate
optimizer=SGD/Adagrad/adadelta/rmsprop/adam/adamw
iteration=epoch number
batch_size=batch size
learning_rate=learning rate
gpu=True/False
device=cuda:0
scheduler=get_linear_schedule_with_warmup/get_cosine_schedule_with_warmup
warmup_step_rate=warmup steo rate

Decode Configuration

status=decode
raw_dir=The path of decode file
nbest=0 (NER)/1 (sentence classification)
decode_dir=The path of decode result file
load_model_dir=The path of model weights
sentence_classification=True/False

Performance

For multiple sequence labeling and sequence classification tasks, YATO has reproduced or outperformed the reported SOTA results on majority datasets.

By default, the LSTM is a bidirectional LSTM. The BERT-base is huggingface's bert-base-uncased. The RoBERTa-base is huggingface's roberta-base. The ELECTRA-base is huggingface's google/electra-base-discriminator.

Results for sequence labeling tasks.

ID	Model	CoNLL2003	OntoNotes 5.0	MSRA	Ontonotes 4.0	CCG
1	CCNN+WLSTM+CRF	91.00	81.53	92.83	74.55	93.80
2	BERT-base	91.61	84.68	95.81	80.57	96.14
3	RoBERTa-base	90.23	86.28	96.02	80.94	96.16
4	ELECTRA-base	91.59	85.25	96.03	90.47	96.29

Results for sequence classification tasks.

ID	Model	SST2	SST5	ChnSentiCorp
1	CCNN+WLSTM	87.61	43.48	88.22
2	BERT-base	93.00	53.48	95.86
3	RoBERTa-base	92.55	51.99	96.04
4	ELECTRA-base	94.72	55.11	95.96

For more details, you can refer to our papers mentioned below.

The results based on Pretrain Language Model were recorded in YATO: Yet Another deep learning based Text analysis Open toolkit

Add Handcrafted Features

YATO has integrated several SOTA neural character sequence feature extractors: CNN (Ma .etc, ACL16), LSTM (Lample .etc, NAACL16) and GRU (Yang .etc, ICLR17). In addition, hand-crafted features have been proven to be important in sequence labeling tasks. YATO supports users designing their features such as Capitalization, POS tag, or any other features (green circles in the above figure). Users can configure the self-defined features through a configuration file (feature embedding size, pretrained feature embeddings .etc). The sample of input format is given at train.cappos.bmes, which includes two hand-crafted features [POS] and [Cap]. ([POS] and [Cap] are two examples, you can set your feature any name you want, just follow the format [xx] and configure the feature with the same name in the configuration file.) Users can configure each feature in configuration file by using.

feature=[POS] emb_size=20 emb_dir=%your_pretrained_POS_embedding
feature=[Cap] emb_size=20 emb_dir=%your_pretrained_Cap_embedding

The feature without pretrained embedding will be randomly initialized.

Speed

YATO is implemented using a fully batch computing approach, making it quite efficient in both model training and decoding.

With the help of GPU (Nvidia RTX 2080ti) and large batches, models built with YATO can be decoded efficiently.

N best Decoding

The traditional CRF structure decodes only one label sequence with the largest probabilities (i.e. 1-best output). In contrast, YATO can decode n label sequences with the top n probabilities (i.e. n-best output). The nbest decoding has been supported by several popular statistical CRF frameworks.

Text Attention Heatmap Visualization

YATO takes the list of words and the corresponding weights as input to generate Latex code for visualizing the attention-based result.

The Latex code will generate a separate .pdf visualization file.

For example,

from yato import YATO
from utils import text_attention
model = YATO(decode configuration file)

sample = ["a fairly by-the-books blend of action and romance with sprinklings of intentional and unintentional comedy . ||| 1"]
probsutils, weights_ls = model.attention(input_text=sample)
sentece = "a fairly by-the-books blend of action and romance with sprinklings of intentional and unintentional comedy . "
atten = weights_ls[0].tolist()

text_attention.visualization(sentece, atten[0], tex = 'sample.tex', color='red')

Reproduce Paper Results and Hyperparameter Tuning

To reproduce the results of CoNLL2003 and SST2 in our paper, you only need to use the configuration file bert_base_conll2003.config, bert_base_gelu_sst2.config and then configure your file directory.

The default configuration file adopts the Pure PLM model, and you can develop your model by modifying the configuration accordingly.

If you want to use this framework in new tasks or datasets, here are some tuning tips by @Victor0118.

Report Issue or Problem

If you want to report an issue or ask a problem, please attach the following materials if necessary. With these information, we can provide a fast and accurate discussion and the corrsponding suggestions.

log file
config file
sample data

Cite

If you use YATO++ in your paper, please cite our paper:

@inproceedings{yang2022yato,
title={YATO: Yet Another deep learning based Text analysis Open toolkit},
author={Wang, Zeqiang and Wang, Yile and Wu, Jiageng and Teng, Zhiyang and Yang, Jie},
year={2022}
}

Future Plan

Support API usage
Release trained model on various sequence labeling and classification tasks

Updates

2022-May-14 YATO, init version
2020-Mar-06, dev version, sentence classification, framework change, model saved in one file.
2018-Dec-17, NCRF++ v0.2, support PyTorch 1.0
2018-Mar-30, NCRF++ v0.1, initial version
2018-Jan-06, add result comparison.
2018-Jan-02, support character feature selection.
2017-Dec-06, init version

jiesutd / YATO