NOTE: stable-fast is currently in beta and may be buggy; feel free to try it out and give suggestions!

stable-fast is an ultra-lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs. It provides super fast inference optimization by utilizing several key techniques and features:
- CUDNN Convolution Fusion: stable-fast implements a series of fully functional and fully compatible CUDNN convolution fusion operators for all kinds of combinations of `Conv + Bias + Add + Act` computation patterns.
- Low Precision & Fused GEMM: stable-fast implements a series of fused GEMM operators that compute in `fp16` precision, which is faster than PyTorch's defaults (read & write in `fp16`, compute in `fp32`).
- NHWC & Fused GroupNorm: stable-fast implements a highly optimized fused NHWC `GroupNorm + GELU` operator with OpenAI's Triton, which eliminates the need for memory-format permutation operators.
- Fully Traced Model: stable-fast improves the `torch.jit.trace` interface to make it better suited for tracing complex models. Nearly every part of `StableDiffusionPipeline` can be traced and converted to TorchScript. It is more stable than `torch.compile`, has significantly lower CPU overhead than `torch.compile`, and supports ControlNet and LoRA.
- CUDA Graph: stable-fast can capture the UNet structure into CUDA Graph format, which reduces CPU overhead when the batch size is small (see the sketch after this list).
- Fused Multihead Attention: stable-fast simply uses xformers and makes it compatible with TorchScript.
- Fast: stable-fast is specially optimized for HuggingFace Diffusers and achieves high performance across many libraries.
- Minimal: stable-fast works as a plugin framework for PyTorch. It utilizes existing PyTorch functionality and infrastructure, and is compatible with other acceleration techniques as well as popular fine-tuning techniques and deployment solutions.
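To give a feel for what the CUDA Graph feature buys you, the sketch below captures a small stand-in module with plain PyTorch's CUDA Graph API and replays it. This is only an illustration of the general technique with a made-up module and shapes, not stable-fast's internal implementation.

```python
import torch

# A tiny stand-in for the UNet; the real feature captures the traced UNet instead.
model = torch.nn.Sequential(
    torch.nn.Conv2d(4, 64, 3, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 4, 3, padding=1),
).cuda().eval()

static_input = torch.randn(1, 4, 64, 64, device='cuda')

# Warm up on a side stream first, as recommended by the PyTorch CUDA Graph docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a CUDA Graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replaying the graph relaunches all captured kernels with almost no CPU overhead.
# New inputs are fed by copying into the captured (static) input tensor in place.
static_input.copy_(torch.randn(1, 4, 64, 64, device='cuda'))
graph.replay()
result = static_output.clone()
```

stable-fast applies the same idea to the UNet, which is why the gain is largest when per-step CPU launch overhead dominates, i.e. at small batch sizes and resolutions.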
Performance varies greatly across different hardware/software/platform/driver configurations, which makes accurate benchmarking hard, and preparing the environment for benchmarking is also a lot of work. I have tested on some platforms before, but the results may still be inaccurate.
This is my personal gaming PC😄. It has a more powerful CPU than those from cloud server providers.
Framework | SD 1.5 | SD 2.1 | SD XL (1024x1024) |
---|---|---|---|
Vanilla PyTorch (2.1.0+cu118) | 29.5 it/s | 32.4 it/s | 4.6 it/s |
torch.compile (2.1.0+cu118, NHWC UNet) | 40.0 it/s | 44.0 it/s | 6.1 it/s |
AITemplate | 44.2 it/s | untested | untested |
OneFlow | 50.3 it/s | untested | untested |
AUTO1111 WebUI | 17.2 it/s | 15.2 it/s | 3.6 it/s |
AUTO1111 WebUI (with SDPA) | 24.5 it/s | 26.1 it/s | 4.3 it/s |
TensorRT (AUTO1111 WebUI) | 40.8 it/s | untested | untested |
Stable Fast (with xformers & Triton) | 49.7 it/s | 52.5 it/s | 8.1 it/s |
Framework | SD 1.5 | SD 2.1 | SD 1.5 ControlNet |
---|---|---|---|
Vanilla PyTorch (2.1.0+cu118) | 24.9 it/s | 27.1 it/s | 18.9 it/s |
torch.compile (2.1.0+cu118, NHWC UNet) | 33.5 it/s | 38.2 it/s | 22.7 it/s |
AITemplate | 65.7 it/s | 71.6 it/s | untested |
OneFlow | 60.1 it/s | 12.9 it/s (??) | untested |
TensorRT | untested | untested | untested |
Stable Fast (with xformers & Triton) | 61.8 it/s | 61.6 it/s | 42.3 it/s |
(??): OneFlow does not seem to work well with SD 2.1
Framework | SD 1.5 | SD 2.1 | SD 1.5 ControlNet |
---|---|---|---|
Vanilla PyTorch (2.1.0+cu118) | 19.3 it/s | 20.4 it/s | 13.8 it/s |
torch.compile (2.1.0+cu118, NHWC UNet) | 24.4 it/s | 26.9 it/s | 17.7 it/s |
AITemplate | untested | untested | untested |
OneFlow | 32.8 it/s | 8.82 it/s (??) | untested |
TensorRT | untested | untested | untested |
Stable Fast (with xformers & Triton) | 28.1 it/s | 30.2 it/s | 20.0 it/s |
(??): OneFlow does not seem to work well with SD 2.1
Framework | SD 1.5 |
---|---|
Vanilla PyTorch (2.1.0+cu118) | 22.5 it/s |
torch.compile (2.1.0+cu118, NHWC UNet) | 25.3 it/s |
AITemplate | 34.6 it/s |
OneFlow | 38.8 it/s |
TensorRT | untested |
Stable Fast (with xformers & Triton) | 31.5 it/s |
Sorry, A100s are currently hard and expensive to rent from cloud server providers in my region. A few months ago I tested this framework on an A100 and the speed was around 61 it/s for SD 1.5. Detailed benchmark results will be available when I have access to an A100 again.
Model | Supported |
---|---|
Hugging Face Diffusers (1.5/2.1/XL) | Yes |
With ControlNet | Yes |
With LoRA | Yes |
Dynamic Shape | Yes |
UI Framework | Supported | Link |
---|---|---|
AUTOMATIC1111 | WIP | |
SD Next | WIP | |
ComfyUI | Yes | ComfyUI_stable_fast |
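Since ControlNet pipelines are listed as supported above, compiling one presumably uses the same entry point as the plain pipeline shown in the usage example below. The following is only a hedged sketch; the ControlNet checkpoint ID is just an example.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)

# Example ControlNet checkpoint; substitute whichever ControlNet you actually use.
controlnet = ControlNetModel.from_pretrained(
    'lllyasviel/sd-controlnet-canny', torch_dtype=torch.float16)
model = StableDiffusionControlNetPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5',
    controlnet=controlnet,
    torch_dtype=torch.float16)
model.to(torch.device('cuda'))

# Same compile() call as for the plain StableDiffusionPipeline (see the full example below).
config = CompilationConfig.Default()
compiled_model = compile(model, config)
```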
```python
import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)


def load_model():
    # NOTE:
    # You could change to StableDiffusionXLPipeline to load an SDXL model.
    # If the resolution is high (1024x1024),
    # ensure your VRAM is sufficient, especially when you are on Windows or WSL,
    # where the GPU driver may choose to allocate from "shared VRAM" when an OOM
    # would otherwise occur, and the performance might regress.
    #
    # from diffusers import StableDiffusionXLPipeline
    #
    # model = StableDiffusionXLPipeline.from_pretrained(
    #     'stabilityai/stable-diffusion-xl-base-1.0', torch_dtype=torch.float16)
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16)
    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model


model = load_model()

config = CompilationConfig.Default()

# xformers and Triton are suggested for achieving the best performance.
# It might be slow for Triton to generate, compile and fine-tune kernels.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
# NOTE:
# When GPU VRAM is insufficient or the architecture is too old, Triton might be slow.
# Disable Triton if you encounter this problem.
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# NOTE:
# CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
# My implementation can handle dynamic shapes, at the cost of extra GPU memory.
# But when your GPU VRAM is insufficient or the image resolution is high,
# CUDA Graph could cause less efficient VRAM utilization and slow down the inference,
# especially on Windows or WSL, which have the "shared VRAM" mechanism.
# If you run into problems related to it, you should disable it.
config.enable_cuda_graph = True

compiled_model = compile(model, config)

kwarg_inputs = dict(
    prompt=
    '(masterpiece:1,2), best quality, masterpiece, best detail face, lineart, monochrome, a beautiful girl',
    # NOTE: If you use SDXL, you should use a higher resolution to improve the generation quality.
    height=512,
    width=512,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The first call will trigger compilation and might be very slow.
# After the first call, it should be very fast.
output_image = compiled_model(**kwarg_inputs).images[0]

# Let's see the second call!
output_image = compiled_model(**kwarg_inputs).images[0]
```
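Once warmed up, the compiled pipeline behaves like a regular Diffusers pipeline. For example (the file names and the alternative resolution below are just illustrative), you can save the result and, because dynamic shape is supported, reuse the same compiled model at another resolution, at the cost of some extra GPU memory:

```python
# The outputs are regular PIL images.
output_image.save('output.png')

# Dynamic shape is supported, so the same compiled model can also be called
# at a different resolution (this may use additional GPU memory).
portrait_image = compiled_model(
    prompt=kwarg_inputs['prompt'],
    height=768,
    width=512,
    num_inference_steps=30,
    num_images_per_prompt=1,
).images[0]
portrait_image.save('output_768x512.png')
```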
NOTE: stable-fast is currently only tested on Linux and on WSL2 under Windows.

You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).

I have only tested stable-fast with `torch==2.1.0`, `xformers==0.0.22` and `triton==2.1.0` on CUDA 12.1. Other versions might build and run successfully, but that is not guaranteed.
```bash
# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and the other packages first
pip3 install 'torch>=1.12.0' 'diffusers>=0.19.3' 'xformers>=0.0.20' 'triton>=2.1.0'

# (Optional) Makes the build much faster
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# (this can take dozens of minutes)
```
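A quick, optional sanity check after installation (assuming the package installs under the `sfast` module name used by the imports above):

```python
import torch
import sfast

# Confirm that PyTorch sees CUDA and that stable-fast is importable.
print('torch:', torch.__version__, 'CUDA:', torch.version.cuda)
print('CUDA available:', torch.cuda.is_available())
print('stable-fast imported from:', sfast.__file__)
```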
NOTE: Any usage outside `sfast.compilers` is not guaranteed to be backward compatible.

NOTE: To get the best performance, `xformers` and OpenAI's `triton>=2.1.0` need to be installed and enabled. You might need to build `xformers` from source to make it compatible with your PyTorch.
```bash
# TCMalloc is highly suggested to reduce CPU overhead
# https://github.com/google/tcmalloc
LD_PRELOAD=/path/to/libtcmalloc.so python3 ...
```
```python
import packaging.version
import torch

if packaging.version.parse(torch.__version__) >= packaging.version.parse('1.12.0'):
    torch.backends.cuda.matmul.allow_tf32 = True
```
Dynamic code generation is usually the cause of slow compilation. You can disable the features related to it to speed up compilation, but this might slow down your inference.
```python
# Wrap your code in this context manager
with torch.jit.optimized_execution(False):
    ...  # do your things

# Or disable Triton kernel generation
config.enable_triton = False
```
When your GPU VRAM is insufficient or the image resolution is high, CUDA Graph could cause less efficient VRAM utilization and slow down the inference.
```python
config.enable_cuda_graph = False
```