Issues using FP8 for MPT baselines on H100
prigoyal opened this issue
Hello,
I am trying to train MPT models using FP8 and am currently hitting issues similar to what was reported in #271.
The changes I made are: installing flash-attn and TransformerEngine:
pip install flash-attn==1.0.7 --no-build-isolation
pip install git+https://github.com/NVIDIA/TransformerEngine.git@v0.10
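For reference, the installed versions can be confirmed with a quick check like this (a minimal sketch; the __version__ attribute may be missing in some builds, hence the getattr fallback):
# Quick sanity check of the installed packages (sketch only).
import flash_attn
import transformer_engine

print("flash-attn:", getattr(flash_attn, "__version__", "unknown"))
print("TransformerEngine:", getattr(transformer_engine, "__version__", "unknown"))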
and then making the following changes to our config files (following the tutorials):
precision: amp_fp8
model:
  fc_type: te
  ffn_config_defaults:
    ffn_type: te_ln_mlp
The llm-foundry version I am using is 0.3.0.
I would appreciate it if you could share any insights into what might be missing from our setup to use FP8 successfully. cc @growlix
It seems that your TransformerEngine version is outdated.
Try:
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
If that doesn't work, try building from source:
git clone https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
pip install -e .
cd ..
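Once TransformerEngine is reinstalled, it may also be worth confirming that the environment actually sees FP8-capable hardware. A minimal sketch (the check_fp8_support helper may not exist in older TE releases, hence the try/except):
import torch

major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")  # H100 reports 9.0
print("FP8-capable hardware:", (major, minor) >= (8, 9))  # FP8 needs compute capability >= 8.9 (Ada/Hopper)

try:
    # Newer TE releases expose a helper that also reports why FP8 is unavailable.
    from transformer_engine.pytorch.fp8 import check_fp8_support
    print(check_fp8_support())
except ImportError:
    pass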
Thanks @j316chuck, will give that a shot. It might be very helpful to update the README.md as well, in case others run into the same issue.