# A Fast Multi-processing BERT Inference System
Code: https://github.com/qsyao/cudaBERT. If you find it useful, please leave a star!
- The BERT encoder backend is implemented in CUDA and has been optimized (using kernel fusion, etc.).
- The frontend is implemented in Python; it prunes useless sequence length at the end of strings (positions disabled by the mask; a concept sketch follows this list).
- The tokenizer and the additional layer on top of the BERT encoder are implemented in PyTorch, so users can define their own additional layers.
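To illustrate the pruning idea (a concept sketch only, not the project's implementation; it assumes right-padded batches):

```python
import numpy as np

# Concept sketch only: trailing columns where every sequence's mask is 0
# contribute nothing, so the batch can be truncated to the longest real
# sequence before the encoder runs.
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 0, 0, 0]])
effective_len = int(mask.sum(axis=1).max())   # longest real sequence: 3
input_ids = np.array([[101, 2054, 2003, 0, 0],
                      [101, 7592, 0, 0, 0]])
pruned_ids = input_ids[:, :effective_len]     # shape (2, 3) instead of (2, 5)
```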
## 4x Faster than PyTorch

Benchmark: a 100,000-line dataset on a GTX 1080 Ti (Large model, seq_length = 200):

PyTorch | cudaBERT
--- | ---
2201 ms | 506 ms
## Constraints

- NVIDIA GPU and NVIDIA drivers
- CUDA 9.0
- CMake > 3.0
- BERT weights must be named correctly (the expected names are listed in name.txt); correctly named .npy weight files can be generated from TensorFlow or PyTorch BERT checkpoints (see Step 2).
## How to Use

### Step 1

Build libcudaBERT.so:

- Go to `${Project}/cuda_bert/cuda_bert`
- Run `cmake . && make -j8`
### Step 2

- Prepare vocab.txt (needed by the tokenizer) in `${Project}/model_dir` (or pass the path manually).
- Prepare the checkpoint and bert_config file from TensorFlow or PyTorch in `${Project}/model_dir` (or pass the paths manually).
- Prepare the weights and biases in `${Project}/model_npy`:

```shell
python convert_pytorch_model_to_npys.py --bert_config_file model_dir/bert_config.json --init_checkpoint model_dir/pytorch_model_v5.bin --output_dir model_npy
```

(or use convert_tf_ckpt_to_npys.py)
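After conversion, a quick sanity check can confirm the .npy files were written. This is an illustrative sketch, not part of the project; it only assumes the `model_npy` output directory from the command above:

```python
import os
import numpy as np

# Illustrative sanity check: load every converted .npy file and print
# its name, shape, and dtype.
npy_dir = "model_npy"
for fname in sorted(os.listdir(npy_dir)):
    if fname.endswith(".npy"):
        arr = np.load(os.path.join(npy_dir, fname))
        print(fname, arr.shape, arr.dtype)
```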
### Step 3

Define your own functions.

- Custom finetune layer (in apps/finetune.py): it takes the output numpy.array from BERT, of shape [batchsize, hidden_size]:

```python
import torch.nn as nn

class torch_classify(nn.Module):
    def __init__(self, num_classes, hidden_size):
        super(torch_classify, self).__init__()
        self.linear = nn.Linear(hidden_size, num_classes)
        self.softmax = nn.Softmax(-1)

    def forward(self, pooler_out):
        # pooler_out: pooled BERT output of shape [batchsize, hidden_size]
        return self.softmax(self.linear(pooler_out))
```
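A quick usage sketch (shapes and sizes here are illustrative only):

```python
import torch

# Illustrative: run the custom layer on a fake pooled output.
layer = torch_classify(num_classes=2, hidden_size=1024)  # 1024 = Large hidden size
pooler_out = torch.randn(128, 1024)                      # [batchsize, hidden_size]
probs = layer(pooler_out)                                # [128, 2]; rows sum to 1
```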
- Your own tokenizer function (defined in tokenlizer.py) processes each line of your input_file into a tuple (as noted in tokenlizer.py): prepare line_index, line_data (the raw string), segment_ids, input_ids, and the input mask:

```python
def tokenlizer_line(max_seq_length, line, index):
    # Fill in your own tokenization here.
    return (id_line,
            line_raw_data,
            input_ids,
            input_mask,
            segment_ids)
```
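For reference only, here is a minimal whitespace-based sketch of such a function. It is not the project's tokenizer: it assumes a module-level `vocab` dict (token to id) built from vocab.txt with `[CLS]`, `[SEP]`, `[PAD]`, and `[UNK]` entries, right-padded single-sentence inputs, and one example per line:

```python
# Minimal illustrative sketch, NOT the project's tokenizer.
# Assumes a module-level `vocab` dict (token -> id) built from vocab.txt.
def tokenlizer_line(max_seq_length, line, index):
    text = line.rstrip("\n")
    tokens = ["[CLS]"] + text.lower().split()[:max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    input_mask = [1] * len(input_ids)       # 1 = real token, 0 = padding
    segment_ids = [0] * len(input_ids)      # single-sentence input
    pad = max_seq_length - len(input_ids)   # right-pad to the fixed length
    input_ids += [vocab["[PAD]"]] * pad
    input_mask += [0] * pad
    segment_ids += [0] * pad
    return (index, text, input_ids, input_mask, segment_ids)
```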
- Your own function to write a line to the output_file (defined in example.py): it takes the raw line and your output as input and returns a string:

```python
def output_line(line_data, output):
    '''
    Defined by users to write results to the output file.
    line_data (string): the raw input line
    output (string): computation result of BERT + the custom layer
    '''
    return line_data + '\t' + str(output)
```
### Step 4

In example.py, create an Engine, then set its cuda_model, custom layer, tokenizer function, output function, and config (noted in config.py). The default values and meanings of the config options are set in config.py.

```python
from cuda_bert.engine import Engine
from cuda_bert.cuda_model import Cuda_BERT
# Engin_Config and `args` (parsed command-line arguments) are set up
# elsewhere in example.py; the config defaults are documented in config.py.

if __name__ == "__main__":
    '''Set Config'''
    config = Engin_Config()
    config.batchsize = 128
    config.model_npy_pth = args.model_npy_pth

    runtime = Engine(config)
    runtime.set_cuda_model(Cuda_BERT)
    runtime.set_finetune_layer(Finetune_Layer)          # custom layer from Step 3
    runtime.set_tokenlizer_function(tokenlizer_line)    # tokenizer from Step 3
    runtime.set_output_function(output_line)            # output writer from Step 3
    runtime.run(args.input_file, args.output_file)
```
Run example.py and pass your GPU id(s) with `--gpu 0 1 2 3`.
## Example

After Step 1 and Step 2, you can run the released example, which processes ./apps/data/example.tsv (Step 3 is already set up to handle this input file). The additional layer is Linear + Softmax.

```shell
cd apps
python example.py --input_file ./data/small_v6_label_data.tsv --output_file ./data/test.tsv --gpu 0
```
## Name.txt

The expected weight names are described in name.txt, and the names of your weights must not differ from those in name.txt. The names of the remaining encoder layers follow the same pattern as layer_0.
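As a hedged illustration (it assumes name.txt lists one weight name per line and that the converted files are named `<name>.npy`, which may not match the repository exactly), a short script can verify the naming:

```python
import os

# Illustrative check, not part of the project: verify that every weight
# name listed in name.txt has a corresponding .npy file in model_npy.
with open("name.txt") as f:
    expected = [line.strip() for line in f if line.strip()]

missing = [name for name in expected
           if not os.path.exists(os.path.join("model_npy", name + ".npy"))]
print("missing weights:", missing if missing else "none")
```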
## Retraining

We also release a branch for retraining (implemented in CUDA), but it is hard to use on real datasets; it is mainly intended for benchmarking run time. Our retraining code runs about 30% faster than PyTorch and TensorFlow.