mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.

Home Page: https://torchsparse.mit.edu

Any plan to support bfloat16?

pycoco opened this issue · comments

@ys-2020, could you please take a look at this issue when you have time? Thanks!

Hi @pycoco, thanks for your interest. bfloat16 is typically used for training jobs. However, we have launched many training jobs and found that float16 does not affect accuracy. That's why we do not support bfloat16 at the moment.

If you find any job where bfloat16 gives better training results, please let us know, and we will plan to implement it.
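For reference, the fp16 training we refer to follows the standard PyTorch mixed-precision recipe with loss scaling, which is usually what keeps fp16 stable. A minimal sketch in plain PyTorch (the `Linear` layer is only a stand-in for a real sparse-conv network; this is not necessarily the exact setup of our training jobs):

```python
import torch

model = torch.nn.Linear(16, 4).cuda()      # stand-in for an actual sparse-conv model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()       # loss scaling mitigates fp16 underflow in gradients

x = torch.randn(8, 16, device="cuda")
target = torch.randn(8, 4, device="cuda")

# Forward pass runs in fp16 under autocast; the loss is scaled before backward.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```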

@ys-2020 Thanks for your quick reply and great work. I found that a model trained with float16 will encounter NaN loss in certain scenarios, probably caused by underflow/overflow. So supporting bfloat16 in training would be a good choice.
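For illustration, the dynamic-range difference between the two formats can be reproduced with plain PyTorch (nothing torchsparse-specific; the values are just examples):

```python
import torch

small = torch.tensor(1e-8)
large = torch.tensor(1e5)

print(small.to(torch.float16))   # 0.0   -> underflow: smallest fp16 subnormal is ~6e-8
print(small.to(torch.bfloat16))  # ~1e-8 -> bf16 keeps the fp32 exponent range
print(large.to(torch.float16))   # inf   -> overflow: fp16 max is 65504
print(large.to(torch.bfloat16))  # ~1e5  -> still finite in bf16
```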

@pycoco Hi! Thank you for the feedback. Can you provide more details about the 'certain scenarios'? We have launched a lot of training jobs on segmentation/detection tasks across many different datasets, and we did not encounter the NaN loss. (Also, you can switch to fp32 as a backup plan for now.)
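In case it helps, the fp32 fallback just means running without fp16 autocast. A minimal sketch, assuming a typical autocast-based loop (the `Linear` layer is only a placeholder for the detection network):

```python
import torch

model = torch.nn.Linear(16, 4).cuda()   # placeholder for the detection network
x = torch.randn(8, 16, device="cuda")

# fp16 path (the one hitting NaN): autocast to float16
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out_fp16 = model(x)

# fp32 fallback: disable autocast so everything stays in float32 (and drop the GradScaler)
with torch.autocast(device_type="cuda", enabled=False):
    out_fp32 = model(x)

print(out_fp16.dtype, out_fp32.dtype)   # torch.float16 torch.float32
```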

@ys-2020 In my scenario, I use VoxelNeXt with voxel size [0.05, 0.05, 0.15], point cloud range [-100.0, -100.0, -1.5, 100.0, 100.0, 4.5], and our own dataset. FP32 training is stable, but the training time is too long. I actually use spconv right now; maybe I should adapt the model to your library and give it a try (though I don't think the library is the cause of the problem).
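For reference, those settings imply roughly the following grid size (plain arithmetic, independent of spconv/torchsparse), which may partly explain the long FP32 training time:

```python
# Grid size implied by the voxel size and point cloud range above.
voxel_size = [0.05, 0.05, 0.15]
pc_range = [-100.0, -100.0, -1.5, 100.0, 100.0, 4.5]  # [x_min, y_min, z_min, x_max, y_max, z_max]

grid = [round((pc_range[i + 3] - pc_range[i]) / voxel_size[i]) for i in range(3)]
print(grid)  # [4000, 4000, 40] -> a fairly large spatial resolution
```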