SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022.

throughput of nat_tiny vs resnet50

jiaojile opened this issue

Hi! I've noticed that two models of roughly the same size, nat_tiny and resnet50, have very different throughput on an NVIDIA GeForce 2080 Ti. How do they compare on your machine?
[screenshot: nat_tiny timing]
[screenshot: resnet50 timing]
(Please ignore the accuracy in the screenshots; the input is not the ImageNet test set.)

Hello and thank you for your interest.
First off, most of our benchmarking was done on A100s, not 2080 Tis, so I wouldn't expect outstanding performance there; we basically debugged and developed the kernel on a different architecture.

Secondly, I noticed you're not using mixed precision. Enabling it can push both models further, but it doesn't affect the two models equally.
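
For reference, turning it on for inference is just a matter of wrapping the forward pass in autocast. A minimal sketch (not our benchmarking script; resnet50 stands in for either model):

```python
# Minimal AMP inference sketch; resnet50 is a stand-in, NAT-Tiny wraps the same way.
import torch
from torchvision.models import resnet50

model = resnet50().cuda().eval()
x = torch.randn(128, 3, 224, 224, device="cuda")

with torch.no_grad(), torch.cuda.amp.autocast():
    out = model(x)  # matmuls/convs run in fp16 where safe, the rest stays fp32
```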

Thirdly, I can confirm I get roughly the same NAT-Tiny throughput on a 2080: around 340 imgs/sec at the default batch size. However, I only got 666 imgs/sec with ResNet50, so I'm not sure what's going on there. It's not too surprising, though, since that measurement typically includes I/O overhead, so it can vary with your setup.
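
For what it's worth, a rough way to take I/O out of the picture is to time forward passes on synthetic batches. A sketch along these lines (not our exact script; batch size and rep counts are placeholders):

```python
# Rough throughput sketch on synthetic data, so dataloader I/O is excluded.
import time
import torch
from torchvision.models import resnet50  # swap in nat_tiny the same way

def throughput(model, batch_size=128, reps=30, warmup=10, amp=False):
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    with torch.no_grad():
        for _ in range(warmup):  # let cuDNN pick algorithms, warm caches
            with torch.cuda.amp.autocast(enabled=amp):
                model(x)
        torch.cuda.synchronize()  # kernel launches are async; sync before timing
        start = time.time()
        for _ in range(reps):
            with torch.cuda.amp.autocast(enabled=amp):
                model(x)
        torch.cuda.synchronize()  # and again before reading the clock
    return reps * batch_size / (time.time() - start)

print(f"resnet50: {throughput(resnet50()):.0f} imgs/sec")
```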

[screenshots: NAT-Tiny and ResNet50 throughput at full precision]

Also, just FYI, this is how they run with AMP:

[screenshots: NAT-Tiny and ResNet50 throughput with AMP]

Notice how little AMP affected ResNet50, and how much closer NAT-Tiny now is to ResNet50 in throughput.
It's not really AMP's shortcoming either; it's mostly that the modules ResNet50 primarily relies on have such well-optimized full-precision kernels that half precision just doesn't end up making that big of a difference.

If your question was why it's slower in general, let me know and I'll get into details.

Thank you for the quick and detailed answer.
The throughput improvements from mixed precision on my 2080 Ti are as follows. Despite the smaller gain on ResNet50, it's still nearly twice as fast as NAT-Tiny. What is the throughput comparison on A100s?
NAT-Tiny: from 340 imgs/sec to 550 imgs/sec (about 1.6x)
ResNet50: from 890 imgs/sec to 1050 imgs/sec (about 1.2x)
[screenshot: nat_tiny with AMP]
[screenshot: resnet50 with AMP]

BTW, have you tried INT8 quantization with NVIDIA TensorRT for NAT?
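
In case it helps frame the question, the flow I had in mind is just the usual ONNX-to-TensorRT route sketched below; it's hypothetical for NAT, since its custom neighborhood-attention CUDA kernels would presumably need an ONNX/TensorRT plugin to export at all:

```python
# Sketch of the usual ONNX -> TensorRT INT8 route (hypothetical for NAT; its
# custom neighborhood-attention kernels would likely need a plugin first).
import torch
from torchvision.models import resnet50  # stand-in for an exportable model

model = resnet50().eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx", opset_version=13)

# Then build an INT8 engine with trtexec (ships with TensorRT):
#   trtexec --onnx=model.onnx --int8 --saveEngine=model_int8.engine
```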

Again, NAT runs similarly on my end, but I'm not sure why I can't run ResNet as fast. It could be any of a wide range of reasons: CUDA version, torch version, and so on; even hardware, to be honest.

Either way, the fact that ResNet50 is ahead is not too surprising. ResNet mainly uses convolutions, which go through either PyTorch's kernels or cuDNN's, both of which are much faster and more optimized than the current version of NAT's kernel.
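
If you want to see this on your end, profiling one forward pass shows which CUDA kernels dominate each model. A quick sketch (not from the repo):

```python
# Quick profiling sketch: see where each model spends its GPU time.
import torch
from torch.profiler import profile, ProfilerActivity
from torchvision.models import resnet50  # run again with nat_tiny to compare

model = resnet50().cuda().eval()
x = torch.randn(128, 3, 224, 224, device="cuda")

with torch.no_grad(), profile(activities=[ProfilerActivity.CUDA]) as prof:
    model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```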

Closing this due to inactivity. If you still have questions feel free to open it back up.