huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Home Page: https://huggingface.co/docs/timm

[FEATURE] CPU inference benchmarks

RuRo opened this issue · comments

commented

Would you be interested in adding reference benchmark results for inference on the CPU (similar to the GPU benchmarks currently provided in results/benchmark-*.csv)?

I know that PyTorch inference on the CPU is slow, but it can still be useful to quickly compare different models (under the assumption that the PyTorch CPU inference overhead is roughly model-independent). I think you can get a "decent" relative approximation of real-world CPU performance by jit.trace-ing the models and then measuring single-threaded CPU throughput. Alternatively, you could export the models to ONNX and measure the CPU inference times using something like ONNXRuntime.
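
For what it's worth, the ONNX Runtime path could look roughly like the sketch below. The model name, opset, and loop counts are placeholders I picked for illustration, not settings from this repo.

```python
import time

import numpy as np
import torch
import onnxruntime as ort
import timm

# Export a timm model to ONNX and time single-threaded CPU inference with
# ONNX Runtime. "resnet50" and the warmup/timing counts are arbitrary.
model = timm.create_model("resnet50", pretrained=False).eval()
example = torch.randn(1, 3, 224, 224)

torch.onnx.export(model, example, "model.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=17)

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1  # single-threaded, as suggested above
session = ort.InferenceSession("model.onnx", opts,
                               providers=["CPUExecutionProvider"])

x = example.numpy().astype(np.float32)
for _ in range(10):                      # warmup
    session.run(None, {"input": x})

start = time.perf_counter()
runs = 50
for _ in range(runs):
    session.run(None, {"input": x})
print(f"{(time.perf_counter() - start) / runs * 1000:.2f} ms / image (bs=1)")
```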

Native eager PyTorch CPU performance is generally quite poor compared to optimized runtimes; ONNX + graph optimizations, XLA, or traced models are all better. Performance can vary significantly between those options, and it also varies by model. Even with optimization, though, it's slow enough that if all of the large models were included it would take forever to run. So it would probably need to be limited to batch size 1 or something, and maybe a reasonable subset of models.

In any case, I did think about this, but it was starting to get pretty involved just figuring out how to approach it, so tackling it ended up quite low on the priority list.

commented

Actually, I'd say that using batch_size = 1 for CPU benchmarking might even be desirable, and I would generally be okay with excluding models above a certain size. It seems to me that in most cases CPU performance characteristics are mainly relevant for on-line/on-device processing, where ultra-low latency matters more than throughput, so batching and large models are rarely used there.

I think it would be really useful to have a publicly available .csv that you could use as a starting point for model selection. IMHO, simply benchmarking jit.traced native PyTorch inference with batch_size = 1 and torch.set_num_threads(1) on some generic x86-64 CPU would be sufficient as a proxy for "latency on modern x86-64 CPUs".
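
A minimal sketch of that setup, just to make the proposal concrete (the model name and loop counts are arbitrary examples):

```python
import time

import torch
import timm

# Proposed setup: a jit.traced model, batch_size = 1, single CPU thread.
torch.set_num_threads(1)

model = timm.create_model("resnet50", pretrained=False).eval()
example = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    traced = torch.jit.trace(model, example)
    traced = torch.jit.freeze(traced)   # optional: fold constants, fuse ops

    for _ in range(10):                 # warmup so JIT fusion has kicked in
        traced(example)

    start = time.perf_counter()
    runs = 50
    for _ in range(runs):
        traced(example)
    elapsed = time.perf_counter() - start

print(f"{elapsed / runs * 1000:.2f} ms / image (bs=1, 1 thread)")
```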

I don't think we need to get 100% "optimized" timing results here. Our benchmarks just need to rank the different models in "roughly" the same order in terms of speed/latency as what you'd get with ONNXRuntime or OpenVINO or whatever. Different CPU architectures, different CPU models, and different execution engines all have different performance characteristics, so for accurate timing information the end users would still need to benchmark their chosen models on their production hardware, using their production execution engines. AFAIK, this is already the case even with the provided GPU benchmarks (PyTorch vs TensorRT, Hopper vs Ampere vs Turing, batching vs no batching, available VRAM, etc.).

IMHO, the benchmarks provided here should primarily be used for:

  1. getting (very) rough estimates of each model's performance (Fermi-style "order of magnitude" back-of-the-envelope estimates)
  2. roughly comparing different models (so that the user can save time by ignoring obviously non-Pareto-optimal models during their "proper" benchmarking)

Having only GPU benchmarks is a bit problematic, since (AFAIK) modern GPU and CPU architectures are different enough that you can't expect GPU-optimized models to also be fast on the CPU (and vice versa). On the other hand, although different x86-64 CPUs have their quirks, they are significantly more similar to each other than to GPUs.

@RuRo I kicked off a benchmark.py --device cpu --bench infer -b 1 --torchcompile run across all models on a few-generations-old Intel workstation CPU (Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz). Will see how that goes; on recent PyTorch the inductor torch.compile backend is a pretty good improvement for CPU.
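
For reference, the --torchcompile path corresponds roughly to the following (a sketch of the idea, not the exact benchmark.py code; the model name is just an example):

```python
import torch
import timm

# Compile a timm model with the inductor backend and run bs=1 CPU inference.
model = timm.create_model("resnet50", pretrained=False).eval()
compiled = torch.compile(model, backend="inductor")

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    compiled(x)          # first call triggers compilation
    out = compiled(x)    # subsequent calls run the optimized code
print(out.shape)
```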

Could do a bfloat16 run w/ ipex at some point, but would need to find a suitable machine on the cloud as only recent processors support that.
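
Roughly what that would look like in code, assuming intel_extension_for_pytorch is installed and the CPU has bfloat16 support (AVX-512 BF16 / AMX); the model name is arbitrary:

```python
import torch
import timm
import intel_extension_for_pytorch as ipex  # assumes IPEX is installed

# Optimize the model for bfloat16 CPU inference with IPEX, then run under
# CPU autocast so compute actually happens in bfloat16.
model = timm.create_model("resnet50", pretrained=False).eval()
model = ipex.optimize(model, dtype=torch.bfloat16)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = model(x)
print(out.dtype)
```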

commented

@rwightman Hi, I was looking through the timing results and found that some of the models perform much worse than I'd expect. Did you run the inference benchmarks without --reparam? That would probably be unfair to models that expect to be reparameterized before inference.
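
For context, a rough sketch of what reparameterization before inference looks like, assuming timm's reparameterize_model helper (which I believe is what --reparam invokes) and using repvgg_a2 as an example of a reparameterizable architecture:

```python
import torch
import timm
from timm.utils.model import reparameterize_model  # helper assumed to back --reparam

# Fuse train-time branches (e.g. RepVGG / MobileOne style blocks) into their
# simpler inference-time form before benchmarking.
model = timm.create_model("repvgg_a2", pretrained=False).eval()
model = reparameterize_model(model)

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))
print(out.shape)
```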