Efficient PyTorch Operator Inventory (EPOI)

This inventory includes efficient PyTorch custom operators for training. It also includes a benchmark suite for evaluating their latency and memory usage.

Requirements

The covered operators may depend on the libraries listed below. Installing all of them is recommended to obtain a complete benchmark, but you can install only a subset if you just want to benchmark the operators from certain libraries.

HuggingFace transformers (https://github.com/huggingface/transformers): Any installation works.
NVIDIA Apex (https://github.com/NVIDIA/apex): Clone and use setup.py to build from source.
Megatron-LM (https://github.com/NVIDIA/Megatron-LM): Clone and add the path to PYTHONPATH.
xFormers (https://github.com/facebookresearch/xformers): Clone and use setup.py to build from source. Verified commit: 48a77cc
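Before running the benchmark, it can help to check which of these optional libraries are importable in the current environment. The following is a minimal sketch using only the standard library; `find_available` is a hypothetical helper (not part of EPOI), and the import names (`transformers`, `apex`, `megatron`, `xformers`) are the top-level package names these projects are assumed to install under.

```python
import importlib.util

# Display label -> assumed top-level import name.
OPTIONAL_PACKAGES = {
    "HuggingFace transformers": "transformers",
    "NVIDIA Apex": "apex",
    "Megatron-LM": "megatron",
    "xFormers": "xformers",
}


def find_available(packages):
    """Map each label to True if its module can be located, without importing it."""
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in packages.items()
    }


availability = find_available(OPTIONAL_PACKAGES)
for label, ok in availability.items():
    print(f"{label}: {'found' if ok else 'missing'}")
```

Operators whose library is reported as missing would simply be unavailable for benchmarking.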

Inventory

You can easily use the covered operators in your PyTorch models:

import torch
from epoi.ops.xformers_attn import BertSelfAttention as EpoiBertSelfAttention

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = EpoiBertSelfAttention(...)

    def forward(self, hidden_states):
        out = self.attn(hidden_states)
        ...

Benchmarking

Note that you need to install the corresponding packages (e.g., apex) to import/benchmark certain operators.

python -m epoi.benchmark

This will benchmark all included operators on your local GPUs. The full benchmark results can be found here.

In addition, the following flags may also be useful:

--only-run op1,op2: Only benchmark the ops whose names contain op1 OR op2. Separate multiple names with commas.

--forward-only: The default benchmark includes a forward and a backward pass. Specify this flag to benchmark only the forward pass.

--verbose: Some ops may fail to benchmark, for example because of out-of-memory errors or missing dependencies (e.g., apex, triton, xformers). Use this flag to see the complete error message for debugging.
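The name matching performed by --only-run can be pictured with a small sketch. This only illustrates the substring rule described above and is not EPOI's actual code; `select_ops` is a hypothetical helper, and the op names are the seven cases that appear in the example log in this README.

```python
def select_ops(all_ops, only_run=None):
    """Return the ops whose name contains any of the comma-separated patterns.

    Hypothetical helper that mirrors the documented "op1 OR op2" rule.
    """
    if not only_run:
        return list(all_ops)  # no filter given: keep every op
    patterns = [p.strip() for p in only_run.split(",") if p.strip()]
    return [op for op in all_ops if any(p in op for p in patterns)]


ALL_OPS = [
    "bias_gelu", "dropout_add_ln", "bert_attention", "gpt_attention",
    "qkv_self_attn", "layer_norm", "softmax",
]

print(select_ops(ALL_OPS, "gpt_attention"))      # one case selected
print(select_ops(ALL_OPS, "attention,softmax"))  # substring match on either pattern
```

Because matching is by substring, a pattern such as `attention` selects both `bert_attention` and `gpt_attention` in this sketch.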

Example (on NVIDIA V100):

python -m epoi.benchmark --only-run gpt_attention

===== Environment =====

GPU: Tesla V100-SXM2-16GB

PyTorch Configuration
   Config         Value
-------------  ------------
   Version     1.12.1+cu116
Built w. CUDA      11.6


Other Libraries Configuration
  Package       Version                   Commit SHA
------------  -----------  ----------------------------------------
    epoi        0.1.dev    094608d0759392516d5c6b4e00e00e72b3156c1c
transformers  4.24.0.dev0  12ce2941c7b67c0dedac0f0468b3ed854fa940ab
  xformers    0.0.14.dev   ba93c5012d00bd1b010514a7bc9bd938c1ad6149
   triton        2.0.0                       N/A
    apex          0.1                        N/A
===== Environment =====

[2022-10-28 00:35:18] INFO main: Skipped bias_gelu
[2022-10-28 00:35:18] INFO main: Skipped dropout_add_ln
[2022-10-28 00:35:18] INFO main: Skipped bert_attention
[2022-10-28 00:35:18] INFO main: Selected gpt_attention
[2022-10-28 00:35:18] INFO main: Skipped qkv_self_attn
[2022-10-28 00:35:18] INFO main: Skipped layer_norm
[2022-10-28 00:35:18] INFO main: Skipped softmax
[2022-10-28 00:35:18] INFO main: Running selected 1/7 cases
[2022-10-28 00:35:18] INFO main: [1/1] Benchmarking gpt_attention
[2022-10-28 00:35:23] INFO bencher: Correctness checking for xFormers FlashAttn (cutlass) is passed
[2022-10-28 00:35:23] WARNING bencher: Skip correctness checking for xFormers FlashAttn (triton): Forward failed
[----- GPT Attention (Attn) and FlashAttention (FA) without mask ------]
                                  |  HF (Attn)  |  xFormers cutlass (FA)
1 threads: -------------------------------------------------------------
      (8, 1024, 1024, 16, 50257)  |     14.9    |            6.0
      (16, 512, 8192, 64, 50264)  |    184.4    |          164.8
      (4, 2048, 8192, 64, 50264)  |    261.3    |          197.9

Times are in milliseconds (ms).

          Shape              HF (Attn)    xFormers cutlass (FA)
--------------------------  -----------  -----------------------
(8, 1024, 1024, 16, 50257)     1091              178.502
(16, 512, 8192, 64, 50264)    2688.27            1284.02
(4, 2048, 8192, 64, 50264)    8836.02            1284.02

Memory is in MBs and excludes inputs/outputs.
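To read the two tables at a glance, the ratios between the HF baseline and the xFormers cutlass kernel can be computed directly. The snippet below is just arithmetic on the numbers printed in the example output above; `ratios` is a hypothetical helper name.

```python
# (HF, xFormers cutlass) pairs copied from the example output above.
latency_ms = {
    (8, 1024, 1024, 16, 50257): (14.9, 6.0),
    (16, 512, 8192, 64, 50264): (184.4, 164.8),
    (4, 2048, 8192, 64, 50264): (261.3, 197.9),
}
memory_mb = {
    (8, 1024, 1024, 16, 50257): (1091.0, 178.502),
    (16, 512, 8192, 64, 50264): (2688.27, 1284.02),
    (4, 2048, 8192, 64, 50264): (8836.02, 1284.02),
}


def ratios(results):
    """Return baseline / optimized for every shape (values above 1 favor xFormers)."""
    return {shape: hf / fa for shape, (hf, fa) in results.items()}


speedup = ratios(latency_ms)
memory_saving = ratios(memory_mb)
for shape in latency_ms:
    print(f"{shape}: {speedup[shape]:.2f}x latency, {memory_saving[shape]:.2f}x memory")
```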

Module Injection

EPOI also provides two approaches for injecting the covered modules into your model, as long as your model has a corresponding policy that specifies how to inject them.

If your model doesn't have a builtin injection policy, you can write a custom one and register it:

from epoi.inject import register_policy, ModuleInjectPolicy

@register_policy
class MyPolicy(ModuleInjectPolicy):
  # Implement your policy (tutorial TBA)
  ...
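Until the tutorial is available, the general shape of a decorator-based registry can be sketched in plain Python. Everything below is hypothetical and independent of EPOI: `POLICY_REGISTRY`, `register_policy_sketch`, `InjectPolicySketch`, and the `target_name` and `build` members are made-up names for illustration and are not the real `ModuleInjectPolicy` interface.

```python
POLICY_REGISTRY = {}  # hypothetical: target name -> policy class


def register_policy_sketch(cls):
    """Class decorator: record the policy under its target name, return it unchanged."""
    POLICY_REGISTRY[cls.target_name] = cls
    return cls


class InjectPolicySketch:
    """Hypothetical base class; subclasses say what to replace and how."""

    target_name = None

    @staticmethod
    def build(original):
        raise NotImplementedError


@register_policy_sketch
class MyAttentionPolicy(InjectPolicySketch):
    target_name = "MyAttention"

    @staticmethod
    def build(original):
        # A real policy would construct an efficient module from `original`.
        return {"replaces": original, "impl": "efficient"}


policy = POLICY_REGISTRY["MyAttention"]
print(policy.__name__)
print(policy.build("layer0.attn"))
```

The point of the pattern is that defining the class is enough to make it discoverable; no separate registration call is needed.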

Module Injection after Initialization

If you prefer to initialize the model first, you could inject modules as follows:

from epoi.inject import inject_module

model = init_model()
inject_module(model)

You can refer to this Jupyter notebook that uses this approach to inject modules into GPT2-medium.
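Conceptually, injecting after initialization means walking the module tree and swapping matching submodules in place. The sketch below shows that general technique with plain PyTorch; it is not how `inject_module` is implemented, and `SlowBlock`, `FastBlock`, and `swap_modules` are hypothetical names used only for illustration.

```python
import torch


class SlowBlock(torch.nn.Module):
    """Stand-in for a module that should be replaced."""

    def __init__(self, dim):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)


class FastBlock(torch.nn.Module):
    """Stand-in for the efficient replacement; reuses the original parameters."""

    def __init__(self, slow):
        super().__init__()
        self.proj = slow.proj

    def forward(self, x):
        return self.proj(x)


def swap_modules(root, target_cls, make_replacement):
    """Recursively replace every `target_cls` child of `root`; return the count."""
    count = 0
    for name, child in list(root.named_children()):
        if isinstance(child, target_cls):
            setattr(root, name, make_replacement(child))
            count += 1
        else:
            count += swap_modules(child, target_cls, make_replacement)
    return count


model = torch.nn.Sequential(SlowBlock(8), torch.nn.ReLU(), SlowBlock(8))
x = torch.randn(2, 8)
before = model(x)
num_swapped = swap_modules(model, SlowBlock, FastBlock)
after = model(x)
print(num_swapped, torch.allclose(before, after))
```

Reusing the original parameters in the replacement keeps the outputs unchanged, which is why this approach works on an already initialized model.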

Module Injection during Initialization

Note that this approach doesn't support model loading from a checkpoint yet.

If you have to inject modules during model initialization (e.g., when training the model with ZeRO-3), you could inject modules as follows:

from epoi.inject import InjectModuleContext

with InjectModuleContext():
  ...
  model = init_model()

You can refer to this Jupyter notebook that uses this approach to inject modules into GPT2-xl and trains with DeepSpeed ZeRO-3.
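One general way to make replacements take effect while a model is being constructed is to temporarily patch the class that the constructor looks up, then restore it on exit. The sketch below illustrates that idea in plain Python under this assumption; `InjectModuleContext` may work differently, and `layers`, `SlowAttention`, `FastAttention`, and `patch_class` are hypothetical names.

```python
import contextlib
import types

# Stand-in for a library namespace that the model constructor reads from.
layers = types.SimpleNamespace()


class SlowAttention:
    kind = "slow"


class FastAttention:
    kind = "fast"


layers.Attention = SlowAttention


def init_model():
    """Build a toy 'model' by looking up layers.Attention at construction time."""
    return [layers.Attention() for _ in range(2)]


@contextlib.contextmanager
def patch_class(namespace, name, replacement):
    """Temporarily bind `namespace.name` to `replacement`, restoring it on exit."""
    original = getattr(namespace, name)
    setattr(namespace, name, replacement)
    try:
        yield
    finally:
        setattr(namespace, name, original)


with patch_class(layers, "Attention", FastAttention):
    patched_model = init_model()
plain_model = init_model()

print([m.kind for m in patched_model], [m.kind for m in plain_model])
```

Only models built inside the `with` block pick up the replacement; construction outside it is unaffected.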
