[Benchmark] Improve NLP Backbone Benchmark
Xingjian Shi commented
Description
In GluonNLP, we introduced the benchmarking script in https://github.com/dmlc/gluon-nlp/tree/master/scripts/benchmarks.
The goal is to track the training + inference latency of common NLP backbones so that we can choose the appropriate one for a given task. This will help users train + deploy models on AWS.
Currently, we cover:
- HuggingFace/Transformers-based backbones with FP32 + FP16 training / inference. For FP16 training, we are not profiling the AMP-based solution, which gives PyTorch an edge; we need to fix this.
- MXNet 2.0 nightly (for community use only) + GluonNLP 1.0 with FP32 + FP16 (AMP) training / inference.
- TVM FP32 inference. This is currently broken due to recent upgrades of the codebase.
Here are the action items that I feel are worth doing:
Short-term Bug-fix + Improvement
- Fix the FP16 training benchmark in HuggingFace/Transformers to use AMP in PyTorch (see the AMP sketch after this list)
- Fix the TVM benchmark. This is also tracked in #1425
- Add FP16 inference to the TVM benchmark (see the mixed-precision sketch below)
- Turn on einsum acceleration in the MXNet-based benchmark. This was added in apache/mxnet#18921 (see the einsum sketch below)
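
For the first item, here is a minimal AMP training-step sketch, assuming a generic HuggingFace `model`, an `optimizer`, and `input_ids`/`labels` tensors, and a transformers version whose model output exposes `.loss` (all names illustrative, not from the current script):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, input_ids, labels):
    optimizer.zero_grad()
    # Run the forward pass in mixed precision instead of casting the whole
    # model to half precision.
    with torch.cuda.amp.autocast():
        outputs = model(input_ids=input_ids, labels=labels)
        loss = outputs.loss
    # Scale the loss to avoid FP16 gradient underflow, then step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```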
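For FP16 inference in TVM, newer TVM releases ship a `ToMixedPrecision` relay pass; a sketch, assuming `mod` and `params` come from the benchmark's existing relay import step:

```python
import tvm
from tvm import relay

def build_fp16(mod, params, target="cuda"):
    mod = relay.transform.InferType()(mod)
    # Rewrite eligible ops to float16; ops the pass registers as
    # numerically sensitive stay in float32.
    mod = relay.transform.ToMixedPrecision("float16")(mod)
    with tvm.transform.PassContext(opt_level=3):
        return relay.build(mod, target=target, params=params)
```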
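For the einsum item, the mechanism is presumably the `optimize` flag on MXNet's NumPy-style einsum (an assumption on my part; see apache/mxnet#18921 for the actual change):

```python
import mxnet as mx

a = mx.np.random.uniform(size=(128, 768))
b = mx.np.random.uniform(size=(768, 768))
c = mx.np.random.uniform(size=(768, 64))
# optimize=True lets MXNet pick a cheaper contraction order for chained
# products instead of evaluating the operands strictly left to right.
out = mx.np.einsum("ij,jk,kl->il", a, b, c, optimize=True)
```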
Automation + Visualization
- Support launching benchmark jobs with AWS Batch. Currently tracked in #1471 (see the boto3 sketch after this list)
- Automate the benchmarking process via GitHub Actions
- Support visualization of benchmark results (see the plotting sketch below)
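
For the AWS Batch item, a minimal boto3 sketch; the queue name, job definition, and command are hypothetical placeholders, not values from this repo or #1471:

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")
response = batch.submit_job(
    jobName="gluonnlp-backbone-benchmark",
    jobQueue="gluonnlp-benchmark-queue",     # hypothetical queue
    jobDefinition="gluonnlp-benchmark-gpu",  # hypothetical job definition
    containerOverrides={
        "command": ["python", "benchmark_gluonnlp.py",
                    "--mode", "inference", "--dtype", "float16"],
    },
)
print("Submitted job:", response["jobId"])
```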
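For visualization, a sketch assuming the runs are collected into a CSV with `backbone`, `workload`, and `latency_ms` columns (the schema and file name are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("benchmark_results.csv")
# One bar group per backbone, one bar per workload (e.g. fp32/fp16 inference).
pivot = df.pivot_table(index="backbone", columns="workload", values="latency_ms")
pivot.plot.bar(figsize=(10, 4))
plt.ylabel("Latency (ms)")
plt.tight_layout()
plt.savefig("benchmark_latency.png")
```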
Longer-term Backbone Benchmarking Effort
- Add a JAX/Flax-based solution, which internally uses XLA (see the sketch after this list)
- Support the auto-scheduler in the TVM benchmark (see the tuning sketch below)
- Enable ONNX + TensorRT, widely considered among the fastest options for NLP inference (see the export sketch below)
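
To illustrate the JAX/Flax direction, a toy jitted forward pass; the two-layer MLP below stands in for a real backbone and is purely illustrative:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class ToyBackbone(nn.Module):
    hidden: int = 768

    @nn.compact
    def __call__(self, x):
        x = nn.relu(nn.Dense(self.hidden)(x))
        return nn.Dense(self.hidden)(x)

model = ToyBackbone()
x = jnp.ones((8, 128, 768))
variables = model.init(jax.random.PRNGKey(0), x)
# jax.jit compiles the forward pass with XLA: the first call traces and
# compiles, later calls reuse the compiled executable.
forward = jax.jit(lambda v, x: model.apply(v, x))
out = forward(variables, x)
```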
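For the auto-scheduler item, a sketch of the TVM tuning flow, assuming `mod`, `params`, and `target` come from the existing benchmark setup and using an arbitrary trial budget:

```python
import tvm
from tvm import relay, auto_scheduler

tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(auto_scheduler.TuningOptions(
    num_measure_trials=2000,  # arbitrary example budget
    measure_callbacks=[auto_scheduler.RecordToFile("autoschedule.json")],
))
# Rebuild with the tuned schedules applied from the log file.
with auto_scheduler.ApplyHistoryBest("autoschedule.json"):
    with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(mod, target=target, params=params)
```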
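For ONNX + TensorRT, one route is ONNX export plus ONNX Runtime's TensorRT execution provider (assuming an onnxruntime build that ships it and a PyTorch `model`; shapes and names are illustrative):

```python
import torch
import onnxruntime as ort

dummy = torch.ones(1, 128, dtype=torch.long)
torch.onnx.export(
    model, (dummy,), "backbone.onnx",
    input_names=["input_ids"], output_names=["output"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"}},
    opset_version=12,
)
# Fall back to plain CUDA if TensorRT is not available in this build.
sess = ort.InferenceSession(
    "backbone.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
out = sess.run(None, {"input_ids": dummy.numpy()})
```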
Other longer-term efforts
- Support benchmarks for data loaders (see the throughput sketch after this list)
- Support common end-to-end training benchmarks like SQuAD 2.0 finetuning. We may focus on single-instance benchmarks.
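
For the data-loader benchmarks, a minimal throughput-timing sketch; `make_dataloader` is a hypothetical factory standing in for the real GluonNLP loaders:

```python
import time

def benchmark_dataloader(make_dataloader, num_epochs=3):
    loader = make_dataloader()
    # Warm-up epoch so file caches and worker startup do not skew timing.
    for _ in loader:
        pass
    samples, start = 0, time.perf_counter()
    for _ in range(num_epochs):
        for batch in loader:
            samples += len(batch)
    elapsed = time.perf_counter() - start
    return samples / elapsed  # samples per second
```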
@dmlc/gluon-nlp-committers