mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm


FP8 not working

prigoyal opened this issue · comments

Hello, this is a follow-up to an earlier issue we reported. We are unable to run the simple mpt-1b FP8 baseline.

We have done due diligence in identifying compatible dependency versions and are sharing below everything we have tried. We also use the llm-foundry Docker images, and we are sharing every detail: our Dockerfiles, error logs, and config files. We would greatly appreciate any insight into what we are missing.

cc @growlix

Please scroll the table horizontally to see the Build and Runtime status.

| llm-foundry | composer | pytorch | cuda | TransformerEngine | Flash-attn required | Flash-attn version used | Build | Runtime |
|---|---|---|---|---|---|---|---|---|
| 0.3.0 | >=0.16.3, <0.17 | 2.0.1 | 11.8 | v0.10 | >=1.0.6, <=1.0.7 | 1.0.7 | | |
| | | | | v0.12 | >=1.0.6, <=2.0.4 | 1.0.7 | | ❌ (same error as job935 log) |
| | | | | stable | >=1.0.6, <=2.3.3, !=2.0.9, !=2.1.0 | 1.0.7 | | Different API error |
| 0.4.0 | >=0.17, <0.18 | 2.0.1 | 11.8 | main | >=2.0.6, <=2.4.2, !=2.0.9, !=2.1.0 | 2.4.2 | docker | Different API error |
| 0.4.0 | >=0.17, <0.18 | 2.0.1 | 11.8 | v0.10 | >=1.0.6, <=1.0.7 | 1.0.7 | | |
| | | | | v0.12 | >=1.0.6, <=2.0.4 | 1.0.7 | docker, yaml config | ❌ job935 log (same issue we reported in #885) |
| 0.4.0 | >=0.17, <0.18 | 2.1.0 | 12.1 | v0.10 | >=1.0.6, <=1.0.7 | | | |
| | | | | v0.12 | >=1.0.6, <=2.0.4 | | | |
| | | | | main | >=2.0.6, <=2.4.2, !=2.0.9, !=2.1.0 | 2.4.2 | docker | ❌ initial error log, resolved with `init_device: cpu`, but then hit the same error as the job935 log |
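For reference, the `init_device: cpu` workaround mentioned in the last row is a model-config setting. A minimal sketch of where it goes in the train YAML (the surrounding keys and values here are illustrative, not our full config):

```yaml
# Sketch of the init_device workaround: initialize model weights on CPU
# instead of the meta device. `init_device` lives under the `model`
# section of the llm-foundry train config; other keys are illustrative.
model:
  name: mpt_causal_lm
  init_device: cpu   # workaround for the initial error on pytorch 2.1.0 / cuda 12.1
```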

Hi @j316chuck, just flagging this follow-up issue in case you can help!

Update: we removed the following from our config:

```yaml
model:
  fc_type: te
  ffn_config_defaults:
    ffn_type: te_ln_mlp
```

which solved it for us.
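For anyone hitting the same error, here is a sketch of what the model section looks like with the TransformerEngine layers disabled. The replacement values `torch` and `mptmlp` are our assumption about the non-TE defaults in llm-foundry's MPT config; verify them against the version you have installed.

```yaml
# Sketch: the same keys as above with the TransformerEngine-specific
# values replaced by what we believe are the defaults (assumption:
# `torch` and `mptmlp` -- check your llm-foundry version).
model:
  fc_type: torch            # instead of `te` (TransformerEngine linear layers)
  ffn_config_defaults:
    ffn_type: mptmlp        # instead of `te_ln_mlp`
```

Note that dropping the `te` layer types means the model no longer routes matmuls through TransformerEngine, so this sidesteps the FP8 path rather than fixing it.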