mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.

Home Page: https://torchsparse.mit.edu

Any plan to support bfloat16?

pycoco opened this issue · comments

@ys-2020, could you please take a look at this issue when you have time? Thanks!

Hi @pycoco, thanks for your interest. bfloat16 is typically used for training jobs. However, we have launched many training jobs and found that float16 does not affect accuracy. That's why we do not support bfloat16 at the moment.

If you find any job where bfloat16 gives better training results, please let us know, and we will plan to implement it.
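For reference, the fp16 training we refer to follows the standard PyTorch mixed-precision recipe with loss scaling, which is usually what keeps fp16 stable. A minimal sketch in plain PyTorch (the `Linear` layer is only a stand-in for a real sparse-conv network; this is not necessarily the exact setup of our training jobs):

```python
import torch

model = torch.nn.Linear(16, 4).cuda()      # stand-in for an actual sparse-conv model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()       # loss scaling mitigates fp16 underflow in gradients

x = torch.randn(8, 16, device="cuda")
target = torch.randn(8, 4, device="cuda")

# Forward pass runs in fp16 under autocast; the loss is scaled before backward.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```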

@ys-2020 Thanks for your quick reply and great work. I found that a model trained with float16 will encounter NaN loss in certain scenarios, probably caused by underflow/overflow. So supporting bfloat16 in training would be a good choice.
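For illustration, the dynamic-range difference between the two formats can be reproduced with plain PyTorch (nothing torchsparse-specific; the values are just examples):

```python
import torch

small = torch.tensor(1e-8)
large = torch.tensor(1e5)

print(small.to(torch.float16))   # 0.0   -> underflow: smallest fp16 subnormal is ~6e-8
print(small.to(torch.bfloat16))  # ~1e-8 -> bf16 keeps the fp32 exponent range
print(large.to(torch.float16))   # inf   -> overflow: fp16 max is 65504
print(large.to(torch.bfloat16))  # ~1e5  -> still finite in bf16
```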

@pycoco Hi! Thank you for the feedback. Can you provide more details about the 'certain scenarios'? We have launched a lot of training jobs on segmentation/detection tasks across many different datasets, and we did not encounter the NaN loss. (Also, you can switch to fp32 as a backup plan for now.)
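In case it helps, the fp32 fallback just means running without fp16 autocast. A minimal sketch, assuming a typical autocast-based loop (the `Linear` layer is only a placeholder for the detection network):

```python
import torch

model = torch.nn.Linear(16, 4).cuda()   # placeholder for the detection network
x = torch.randn(8, 16, device="cuda")

# fp16 path (the one hitting NaN): autocast to float16
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out_fp16 = model(x)

# fp32 fallback: disable autocast so everything stays in float32 (and drop the GradScaler)
with torch.autocast(device_type="cuda", enabled=False):
    out_fp32 = model(x)

print(out_fp16.dtype, out_fp32.dtype)   # torch.float16 torch.float32
```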

@ys-2020 In my scenario, I use VoxelNeXt with voxel size [0.05, 0.05, 0.15], point cloud range [-100.0, -100.0, -1.5, 100.0, 100.0, 4.5], and our own dataset. FP32 training is stable, but the training time is too long. I actually use spconv right now; maybe I should adapt the model to your library and give it a try (though I don't think the library is the cause of the problem).
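For reference, those settings imply roughly the following grid size (plain arithmetic, independent of spconv/torchsparse), which may partly explain the long FP32 training time:

```python
# Grid size implied by the voxel size and point cloud range above.
voxel_size = [0.05, 0.05, 0.15]
pc_range = [-100.0, -100.0, -1.5, 100.0, 100.0, 4.5]  # [x_min, y_min, z_min, x_max, y_max, z_max]

grid = [round((pc_range[i + 3] - pc_range[i]) / voxel_size[i]) for i in range(3)]
print(grid)  # [4000, 4000, 40] -> a fairly large spatial resolution
```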