bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Slower inference results for BLOOM fp16 on identical hardware

sarthaklangde opened this issue

Hey,

Thank you for the scripts for loading checkpoints and running benchmarks. I have a strange issue: ds_inference fp16 throughput is considerably slower than the published results, while the int8 benchmark results are almost identical to them.

Environment:
GCP a2-ultragpu-8g with A100 8x80GB, 1.3 TB Memory, 96 vCPUs
Debian 11

For fp16 and batch size 1, I measure 67 msecs/token, whereas the published benchmark shows 44 msecs/token. The same gap shows up at higher batch sizes too.

But for int8, the results match the ones mentioned in the benchmarks exactly (both for 8x80GB and 4x80GB).
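For reference, a minimal sketch of the two paths I'm comparing, assuming the standard `deepspeed.init_inference()` call; model loading is simplified here (the real benchmark script streams sharded checkpoints rather than materializing the full model like this):

```python
# Minimal ds_inference sketch (launch with: deepspeed --num_gpus 8 run.py).
# Model/tokenizer loading is simplified compared to the benchmark script.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "8"))

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom",
                                             torch_dtype=torch.float16)

engine = deepspeed.init_inference(
    model,
    mp_size=world_size,               # tensor-parallel degree across the 8 GPUs
    dtype=torch.float16,              # the slow path here; torch.int8 is the path that matches
    replace_with_kernel_inject=True,  # DeepSpeed fused inference kernels
)
model = engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(**inputs, max_new_tokens=100)
```

With everything else identical, swapping only the `dtype` argument is what separates the 67 msecs/token fp16 runs from the int8 runs that match the benchmarks.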

What have I tried so far?

  1. Different CUDA versions (11.0, 11.4, 11.6, 11.7), PyTorch versions, DeepSpeed versions (0.7.0, 0.7.2, 0.7.3)
  2. Reinstalling environment from scratch on a new server
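For comparing setups across machines, I'm checking the software stack with a small hypothetical helper (not part of the repo):

```python
# Hypothetical helper: print the parts of the stack that tend to differ
# between machines, so runs can be compared like-for-like.
import torch
import deepspeed

print("torch      :", torch.__version__)
print("torch cuda :", torch.version.cuda)
print("cudnn      :", torch.backends.cudnn.version())
print("nccl       :", torch.cuda.nccl.version())
print("deepspeed  :", deepspeed.__version__)
```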

Any idea what I might be doing wrong? Or is everybody else experiencing similar throughput?

@sarthaklangde I have the same issue.
I believe this might be due to the internal PCIe tree topology being different.
@stas00 FYI

I don't believe it has anything to do with your environment.

My tests were run on JeanZay HPC so it's possible their servers are somehow beefier hardware-wise?

It is interesting that you both report the same speed with int8.

@RezaYazdaniAminabadi, do you by chance have any insight into why this might be the case? What throughput do you get on your Azure nodes for bs=1, so that we have another point of comparison?

There are two hardware versions of the A100: SXM and PCIe. Do we know by chance whether these A100s are all SXM and not PCIe? The latter are slower.
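One quick way to check (a sketch, not something from the repo): the CUDA device name includes the form factor, e.g. "A100-SXM4-80GB" vs "A100 80GB PCIe".

```python
# Sketch: report whether each A100 is an SXM or a PCIe part, based on the
# form factor embedded in the CUDA device name.
import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    kind = "SXM (NVLink)" if "SXM" in name else "PCIe" if "PCIe" in name else "unknown"
    print(f"GPU {i}: {name} -> {kind}")
```

`nvidia-smi topo -m` additionally shows whether GPU pairs are connected via NV# links or only through PCIe/host bridges.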

Could this be due to slow communication between GPUs?

After profiling, it turns out that communication takes up 66% of the time, and ncclKernel_AllReduce_RING_LL_Sum_half(ncclWorkElem) is the kernel used for it. Is it expected that with 8 GPUs the communication involves all of them (all-to-all)?
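For anyone who wants to reproduce this on their own node, a rough sketch of the kind of profiling that surfaces the all-reduce share (`model` and `inputs` are the placeholders from the ds_inference sketch above, standing in for the actual benchmark loop):

```python
# Sketch: profile one short generation and report how much CUDA time is
# spent in NCCL all-reduce kernels vs everything else.
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    _ = model.generate(**inputs, max_new_tokens=10)

events = prof.key_averages()
total = sum(e.cuda_time_total for e in events)
nccl = sum(e.cuda_time_total for e in events if "AllReduce" in e.key)
print(f"all-reduce share of CUDA time: {100 * nccl / total:.1f}%")
```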

Could you please tell us what environment variables you are using? The variables from here do not help.
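In the meantime, what I can do on my side is let NCCL log which transport it picks (a sketch using standard NCCL variables, not anything from the linked doc):

```python
# Sketch: enable NCCL's own logging before torch.distributed / DeepSpeed
# creates its communicators; the INFO/GRAPH output shows the detected
# topology and whether ranks connect via P2P (NVLink) or fall back to
# SHM / PCIe paths.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # topology + transport selection
# ... then initialize deepspeed.init_inference() / torch.distributed as usual
```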

Unfortunately I no longer have access to JeanZay so I can't retrieve any more data at the moment.

> Could this be due to slow communication between GPUs?

That's very possible. Come to think of it, this could also be an issue of PCIe vs NVLink (or even NVSwitch), and of their generations.
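A rough way to quantify this on a given node (just a sketch, not an official tool): time repeated device-to-device copies between GPU pairs. NVLink-connected A100s should come out far above what a PCIe Gen4 x16 link can deliver (roughly 25-30 GB/s).

```python
# Sketch: crude GPU-to-GPU copy bandwidth check between device pairs.
import time
import torch

def p2p_copy_gbps(src: int, dst: int, size_mb: int = 1024, iters: int = 10) -> float:
    """Time repeated device-to-device copies and return GB/s (rough estimate)."""
    a = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=f"cuda:{src}")
    b = torch.empty_like(a, device=f"cuda:{dst}")
    b.copy_(a)                      # warm-up: sets up the peer mapping
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    t0 = time.perf_counter()
    for _ in range(iters):
        b.copy_(a)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    elapsed = time.perf_counter() - t0
    return (size_mb / 1024) * iters / elapsed

for dst in range(1, torch.cuda.device_count()):
    print(f"GPU 0 -> GPU {dst}: {p2p_copy_gbps(0, dst):.1f} GB/s")
```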