yandex / YaLM-100B

Pretrained language model with 100B parameters

ZeRO 3 NVMe Offload?

Vbansal21 opened this issue

Firstly, thank you for making the weights open to all!

Now, how can one do inference/fine-tuning with ZeRO-3 NVMe/CPU offload? Since this project is based on the Megatron project, that shouldn't be too difficult to implement, right?

You can use ZeRO Offload while fine-tuning your model.
You can extend our version of Megatron by adding CPU/NVMe offload for inference; it is not very hard. However, the GPU will have to download the entire model (200 GB) for each inferred token. That adds about 8 seconds per inferred token even in the best configuration =(

@MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward with a batch of 4 sentences. That way one can remove the bottleneck of streaming weights to GPU memory, if their use case permits batching.

@MichaelEk Thanks for replying.

Actually, I wasn't able to implement the offload for inference; that's why I created this issue. It would be nice if an updated script could be provided. 😅

@MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward with a batch of 4 sentences. That way one can remove the bottleneck of streaming weights to GPU memory, if their use case permits batching.

Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights. However, the total time for a single generation will still be too long to use it in, for example, a chatbot.
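To make the trade-off concrete, a back-of-envelope calculation (the ~8 s weight-streaming cost is from the comment above; the per-sequence compute cost is an assumed, purely illustrative number):

```python
# The full model is streamed to the GPU once per decoding step regardless of
# batch size, so batching amortizes that cost across sequences. Numbers are
# illustrative, not measured.
weight_stream_s = 8.0  # ~8 s to stream the 200 GB of weights per step
compute_s = 0.1        # assumed per-sequence compute per token (hypothetical)

for batch in (1, 4, 16, 64):
    per_token = (weight_stream_s + batch * compute_s) / batch
    print(f"batch={batch:3d}: ~{per_token:.2f} s per token per sequence")

# Throughput improves a lot with batch size, but each individual generation
# still waits ~8 s per step, so latency (e.g. for a chatbot) is unchanged.
```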

You can use ZeRO Offload while fine-tuning your model. You can extend our version of Megatron by adding CPU/NVMe offload for inference; it is not very hard.

Thanks for the hint. I was able to run it on an RTX 3070 Ti 😃
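(For context, the general shape of DeepSpeed's ZeRO-Inference pattern: construct the model under `deepspeed.zero.Init` so the parameters land on the offload device, then run forward passes through the engine. This is a sketch with placeholder names, not the exact code used here.)

```python
# Sketch of the ZeRO-Inference pattern: parameters are partitioned and kept
# on the offload device at construction time, then fetched layer by layer
# during each forward pass. The model constructor is a placeholder.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required key, unused for inference
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

with deepspeed.zero.Init(remote_device="cpu", config_dict_or_path=ds_config):
    model = build_yalm_model()  # placeholder: weights created already offloaded

engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.eval()  # then call engine(...) once per decoding step
```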

can't you generate multiple tokens in parallel for different string prefixes?
Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights.

Can you tell me a little more about this? I feel that I need to do this now, but I don't know where to start.

@pskucherov do you have a fork that's ready to pick up and start from for a single GPU?

@pskucherov do you have a fork that's ready to pick up and start from for a single GPU?

No, I don't have working code yet, but it should work soon.
Running from NVMe in this configuration is useless. From CPU RAM it is more realistic, although it is still very slow.
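Rough numbers on why (the bandwidths below are typical ballpark assumptions, not measurements from this thread):

```python
# Every decoding step streams the full ~200 GB of weights through the offload
# device, so the device's bandwidth sets a hard floor on per-token latency.
# Bandwidth figures are typical ballpark assumptions.
model_gb = 200
nvme_gb_s = 3       # consumer NVMe sequential read, roughly 3 GB/s
ram_pcie_gb_s = 25  # CPU RAM to GPU over PCIe 4.0 x16, roughly 25 GB/s

print(f"NVMe:    ~{model_gb / nvme_gb_s:.0f} s per token")      # ~67 s
print(f"CPU RAM: ~{model_gb / ram_pcie_gb_s:.0f} s per token")  # ~8 s
```

The CPU RAM estimate lines up with the ~8 s per token figure quoted earlier in the thread.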

In any case, it's all in a private repository.

@pskucherov Being able to run this on an RTX 3070 Ti is really great.
If you could share some code showing how to do the ZeRO Offload, that would be nice!

Hugging Face Accelerate seems to be the solution. Closing the issue.
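(For anyone landing here later: Accelerate's big-model inference pattern looks roughly like this. `build_yalm_model` and the checkpoint path are placeholders; `init_empty_weights` and `load_checkpoint_and_dispatch` are real Accelerate functions.)

```python
# Sketch of Hugging Face Accelerate big-model inference: instantiate the
# architecture without allocating weights, then load the checkpoint and
# dispatch layers across GPU, CPU RAM, and disk automatically.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

with init_empty_weights():
    model = build_yalm_model()  # placeholder: builds the architecture only

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/path/to/yalm-100b-checkpoint",  # placeholder path
    device_map="auto",         # split across available GPU(s), CPU RAM, disk
    offload_folder="offload",  # spill-over location for weights that don't fit
)
```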