yandex / YaLM-100B

Pretrained language model with 100B parameters

ZeRO 3 NVMe Offload?

Vbansal21 opened this issue

Firstly, thank you for making the weights open to all!

Now, how can one do inference/fine-tuning with ZeRO-3 NVMe/CPU offload? Since this project is based on the Megatron project, that shouldn't be too difficult to implement, right?

You can use ZeRO Offload while fine-tuning your model.
You can extend our version of Megatron by adding CPU/NVMe offload for inference; it is not very hard. However, the GPU will have to download the entire model (200 GB) for each inferred token. That adds about 8 seconds per inferred token even in the best configuration =(

@MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward with a batch of 4 sentences. That way one can remove the bottleneck of streaming weights to GPU memory, if their use case permits batching.

@MichaelEk Thanks for replying.

Actually, I wasn't able to implement the offload for inference; that's why I created this issue. It would be nice if an updated script could be provided. 😅

@MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward with a batch of 4 sentences. That way one can remove the bottleneck of streaming weights to GPU memory, if their use case permits batching.

Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights. However, the total time for a single generation will still be too long to use it in, for example, a chatbot.
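To make the trade-off concrete, a back-of-envelope calculation (the ~8 s weight-streaming cost is from the comment above; the per-sequence compute cost is an assumed, purely illustrative number):

```python
# The full model is streamed to the GPU once per decoding step regardless of
# batch size, so batching amortizes that cost across sequences. Numbers are
# illustrative, not measured.
weight_stream_s = 8.0  # ~8 s to stream the 200 GB of weights per step
compute_s = 0.1        # assumed per-sequence compute per token (hypothetical)

for batch in (1, 4, 16, 64):
    per_token = (weight_stream_s + batch * compute_s) / batch
    print(f"batch={batch:3d}: ~{per_token:.2f} s per token per sequence")

# Throughput improves a lot with batch size, but each individual generation
# still waits ~8 s per step, so latency (e.g. for a chatbot) is unchanged.
```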

You can use ZeRO Offload while fine-tuning your model. You can extend our version of Megatron by adding CPU/NVMe offload for inference; it is not very hard.

Thanks for the hint. I was able to run it on an RTX 3070 Ti 😃
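(For context, the general shape of DeepSpeed's ZeRO-Inference pattern: construct the model under `deepspeed.zero.Init` so the parameters land on the offload device, then run forward passes through the engine. This is a sketch with placeholder names, not the exact code used here.)

```python
# Sketch of the ZeRO-Inference pattern: parameters are partitioned and kept
# on the offload device at construction time, then fetched layer by layer
# during each forward pass. The model constructor is a placeholder.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required key, unused for inference
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

with deepspeed.zero.Init(remote_device="cpu", config_dict_or_path=ds_config):
    model = build_yalm_model()  # placeholder: weights created already offloaded

engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.eval()  # then call engine(...) once per decoding step
```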

can't you generate multiple tokens in parallel for different string prefixes?
Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights.

Can you tell me a little more about this? I feel that I need to do this now, but I don't know where to start.

@pskucherov do you have a fork that's ready to pick up and start from for a single GPU?

@pskucherov do you have a fork that's ready to pick up and start from for a single GPU?

No, I don't have working code yet, but it should work soon.
Running from NVMe in this configuration is useless. From CPU RAM it is more realistic, although it is still very slow.
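Rough numbers on why (the bandwidths below are typical ballpark assumptions, not measurements from this thread):

```python
# Every decoding step streams the full ~200 GB of weights through the offload
# device, so the device's bandwidth sets a hard floor on per-token latency.
# Bandwidth figures are typical ballpark assumptions.
model_gb = 200
nvme_gb_s = 3       # consumer NVMe sequential read, roughly 3 GB/s
ram_pcie_gb_s = 25  # CPU RAM to GPU over PCIe 4.0 x16, roughly 25 GB/s

print(f"NVMe:    ~{model_gb / nvme_gb_s:.0f} s per token")      # ~67 s
print(f"CPU RAM: ~{model_gb / ram_pcie_gb_s:.0f} s per token")  # ~8 s
```

The CPU RAM estimate lines up with the ~8 s per token figure quoted earlier in the thread.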

In any case, it's all in a private repository.

@pskucherov Being able to run this on an RTX 3070 Ti is really great.
If you could share some code showing how to do the ZeRO Offload, that would be nice!

Hugging Face Accelerate seems to be the solution. Closing the issue.
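(For anyone landing here later: Accelerate's big-model inference pattern looks roughly like this. `build_yalm_model` and the checkpoint path are placeholders; `init_empty_weights` and `load_checkpoint_and_dispatch` are real Accelerate functions.)

```python
# Sketch of Hugging Face Accelerate big-model inference: instantiate the
# architecture without allocating weights, then load the checkpoint and
# dispatch layers across GPU, CPU RAM, and disk automatically.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

with init_empty_weights():
    model = build_yalm_model()  # placeholder: builds the architecture only

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/path/to/yalm-100b-checkpoint",  # placeholder path
    device_map="auto",         # split across available GPU(s), CPU RAM, disk
    offload_folder="offload",  # spill-over location for weights that don't fit
)
```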