mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm


FP8 not working

prigoyal opened this issue · comments

Hello, this is a follow-up to an earlier issue we reported. We are unable to run the simple mpt-1b FP8 baseline.

We have done due diligence in identifying compatible dependency versions and are sharing below everything we have tried. We also use the llm-foundry Docker images, and we are sharing every detail: our Dockerfiles, error logs, and config files. We would greatly appreciate any insight into what we are missing.

cc @growlix

Please scroll the table horizontally to see the Build and Runtime status.

| llm-foundry | composer | pytorch | cuda | TransformerEngine | Flash-attn required | Flash-attn version used | Build | Runtime |
|---|---|---|---|---|---|---|---|---|
| 0.3.0 | >=0.16.3, <0.17 | 2.0.1 | 11.8 | v0.10 | >=1.0.6, <=1.0.7 | 1.0.7 | | |
| | | | | v0.12 | >=1.0.6, <=2.0.4 | 1.0.7 | | ❌ (same error as job935 log) |
| | | | | stable | >=1.0.6, <=2.3.3, !=2.0.9, !=2.1.0 | 1.0.7 | | Different API error |
| 0.4.0 | >=0.17, <0.18 | 2.0.1 | 11.8 | main | >=2.0.6, <=2.4.2, !=2.0.9, !=2.1.0 | 2.4.2 | docker | Different API error |
| 0.4.0 | >=0.17, <0.18 | 2.0.1 | 11.8 | v0.10 | >=1.0.6, <=1.0.7 | 1.0.7 | | |
| | | | | v0.12 | >=1.0.6, <=2.0.4 | 1.0.7 | docker, yaml config | ❌ job935 log (same issue we reported in #885) |
| 0.4.0 | >=0.17, <0.18 | 2.1.0 | 12.1 | v0.10 | >=1.0.6, <=1.0.7 | | | |
| | | | | v0.12 | >=1.0.6, <=2.0.4 | | | |
| | | | | main | >=2.0.6, <=2.4.2, !=2.0.9, !=2.1.0 | 2.4.2 | docker | ❌ initial error log, resolved with `init_device: cpu`, but then hit the same error as the job935 log |
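For reference, the `init_device: cpu` workaround mentioned in the last row is a model-config setting. A minimal sketch of where it goes in the train YAML (the surrounding keys and values here are illustrative, not our full config):

```yaml
# Sketch of the init_device workaround: initialize model weights on CPU
# instead of the meta device. `init_device` lives under the `model`
# section of the llm-foundry train config; other keys are illustrative.
model:
  name: mpt_causal_lm
  init_device: cpu   # workaround for the initial error on pytorch 2.1.0 / cuda 12.1
```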

Hi @j316chuck, just flagging this follow-up issue in case you can help!

Update: we removed the following from our config:

```yaml
model:
  fc_type: te
  ffn_config_defaults:
    ffn_type: te_ln_mlp
```

which solved it for us.
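For anyone hitting the same error, here is a sketch of what the model section looks like with the TransformerEngine layers disabled. The replacement values `torch` and `mptmlp` are our assumption about the non-TE defaults in llm-foundry's MPT config; verify them against the version you have installed.

```yaml
# Sketch: the same keys as above with the TransformerEngine-specific
# values replaced by what we believe are the defaults (assumption:
# `torch` and `mptmlp` -- check your llm-foundry version).
model:
  fc_type: torch            # instead of `te` (TransformerEngine linear layers)
  ffn_config_defaults:
    ffn_type: mptmlp        # instead of `te_ln_mlp`
```

Note that dropping the `te` layer types means the model no longer routes matmuls through TransformerEngine, so this sidesteps the FP8 path rather than fixing it.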