bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Slower inference results for BLOOM fp16 on identical hardware

sarthaklangde opened this issue

Hey,

Thank you for the scripts for loading checkpoints and running benchmarks. I have a strange issue: ds_inference fp16 throughput is considerably slower than the published results, while the int8 benchmark results are almost identical to them.

Environment:
GCP a2-ultragpu-8g with A100 8x80GB, 1.3 TB Memory, 96 vCPUs
Debian 11

For fp16 and batch size 1, I measure 67 msecs/token, whereas the published benchmark shows 44 msecs/token. The same gap shows up at higher batch sizes too.

But for int8, the results match the ones mentioned in the benchmarks exactly (both for 8x80GB and 4x80GB).
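For reference, a minimal sketch of the two paths I'm comparing, assuming the standard `deepspeed.init_inference()` call; model loading is simplified here (the real benchmark script streams sharded checkpoints rather than materializing the full model like this):

```python
# Minimal ds_inference sketch (launch with: deepspeed --num_gpus 8 run.py).
# Model/tokenizer loading is simplified compared to the benchmark script.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "8"))

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom",
                                             torch_dtype=torch.float16)

engine = deepspeed.init_inference(
    model,
    mp_size=world_size,               # tensor-parallel degree across the 8 GPUs
    dtype=torch.float16,              # the slow path here; torch.int8 is the path that matches
    replace_with_kernel_inject=True,  # DeepSpeed fused inference kernels
)
model = engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(**inputs, max_new_tokens=100)
```

With everything else identical, swapping only the `dtype` argument is what separates the 67 msecs/token fp16 runs from the int8 runs that match the benchmarks.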

What have I tried so far?

  1. Different CUDA versions (11.0, 11.4, 11.6, 11.7), PyTorch versions, DeepSpeed versions (0.7.0, 0.7.2, 0.7.3)
  2. Reinstalling environment from scratch on a new server
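For comparing setups across machines, I'm checking the software stack with a small hypothetical helper (not part of the repo):

```python
# Hypothetical helper: print the parts of the stack that tend to differ
# between machines, so runs can be compared like-for-like.
import torch
import deepspeed

print("torch      :", torch.__version__)
print("torch cuda :", torch.version.cuda)
print("cudnn      :", torch.backends.cudnn.version())
print("nccl       :", torch.cuda.nccl.version())
print("deepspeed  :", deepspeed.__version__)
```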

Any idea what I might be doing wrong? Or is everybody else experiencing similar throughput?

@sarthaklangde I have the same issue.
I believe this might be due to the internal PCIe tree topology being different.
@stas00 FYI

I don't believe it has anything to do with your environment.

My tests were run on JeanZay HPC so it's possible their servers are somehow beefier hardware-wise?

It is interesting that you both report the same speed with int8.

@RezaYazdaniAminabadi, do you by chance have any insight into why this might be the case? What throughput do you get on your Azure nodes for bs=1, so that we have another point of comparison?

There are two hardware versions of the A100: SXM and PCIe. Do we know by chance whether these A100s are all SXM and not PCIe? The latter are slower.
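One quick way to check (a sketch, not something from the repo): the CUDA device name includes the form factor, e.g. "A100-SXM4-80GB" vs "A100 80GB PCIe".

```python
# Sketch: report whether each A100 is an SXM or a PCIe part, based on the
# form factor embedded in the CUDA device name.
import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    kind = "SXM (NVLink)" if "SXM" in name else "PCIe" if "PCIe" in name else "unknown"
    print(f"GPU {i}: {name} -> {kind}")
```

`nvidia-smi topo -m` additionally shows whether GPU pairs are connected via NV# links or only through PCIe/host bridges.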

Could this be due to slow communication between GPUs?

After profiling, it turns out that communication takes up 66% of the time, and ncclKernel_AllReduce_RING_LL_Sum_half(ncclWorkElem) is the kernel used for it. Is it expected that with 8 GPUs the communication involves all of them (all-to-all)?
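For anyone who wants to reproduce this on their own node, a rough sketch of the kind of profiling that surfaces the all-reduce share (`model` and `inputs` are the placeholders from the ds_inference sketch above, standing in for the actual benchmark loop):

```python
# Sketch: profile one short generation and report how much CUDA time is
# spent in NCCL all-reduce kernels vs everything else.
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    _ = model.generate(**inputs, max_new_tokens=10)

events = prof.key_averages()
total = sum(e.cuda_time_total for e in events)
nccl = sum(e.cuda_time_total for e in events if "AllReduce" in e.key)
print(f"all-reduce share of CUDA time: {100 * nccl / total:.1f}%")
```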

Could you please tell us what environment variables you are using? The variables from here do not help.
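In the meantime, what I can do on my side is let NCCL log which transport it picks (a sketch using standard NCCL variables, not anything from the linked doc):

```python
# Sketch: enable NCCL's own logging before torch.distributed / DeepSpeed
# creates its communicators; the INFO/GRAPH output shows the detected
# topology and whether ranks connect via P2P (NVLink) or fall back to
# SHM / PCIe paths.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # topology + transport selection
# ... then initialize deepspeed.init_inference() / torch.distributed as usual
```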

Unfortunately I no longer have access to JeanZay so I can't retrieve any more data at the moment.

> Could this be due to slow communication between GPUs?

That's very possible. Come to think of it, this could also be an issue of PCIe vs NVLink (or even NVSwitch), and of their generations.
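A rough way to quantify this on a given node (just a sketch, not an official tool): time repeated device-to-device copies between GPU pairs. NVLink-connected A100s should come out far above what a PCIe Gen4 x16 link can deliver (roughly 25-30 GB/s).

```python
# Sketch: crude GPU-to-GPU copy bandwidth check between device pairs.
import time
import torch

def p2p_copy_gbps(src: int, dst: int, size_mb: int = 1024, iters: int = 10) -> float:
    """Time repeated device-to-device copies and return GB/s (rough estimate)."""
    a = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=f"cuda:{src}")
    b = torch.empty_like(a, device=f"cuda:{dst}")
    b.copy_(a)                      # warm-up: sets up the peer mapping
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    t0 = time.perf_counter()
    for _ in range(iters):
        b.copy_(a)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    elapsed = time.perf_counter() - t0
    return (size_mb / 1024) * iters / elapsed

for dst in range(1, torch.cuda.device_count()):
    print(f"GPU 0 -> GPU {dst}: {p2p_copy_gbps(0, dst):.1f} GB/s")
```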