GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


Is the model training using Tensor Cores or CUDA Cores when running on a GPU?

onnkeat opened this issue · comments

Modern Nvidia GPUs provide a few types of processing cores, such as CUDA Cores, Tensor Cores, RT Cores, and others. For example, Tensor Cores are optimised for neural network/AI workloads that require heavy matrix computation.

How can we check which processing core is used when training the model?

Can we select which processing core to use?

The current training loop does not leverage torch.compile or similar features. There is a stale PR to integrate it; I am waiting for a more stable version of Torch to validate that it works properly, and my preliminary tests did not show an interesting speed improvement.

You can test the branch to see if torch.compile suits your needs, but it is stale and not up to date with the dev branch.
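For reference, here is a minimal sketch of what wrapping a model with torch.compile looks like in PyTorch 2.x. The model below is a placeholder, not Deepparse's actual architecture or training code:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for an address-tagging model;
# this is NOT Deepparse's internal model.
model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 9))

# torch.compile (PyTorch >= 2.0) traces and optimizes the forward pass.
# Which hardware units run the kernels (CUDA Cores vs. Tensor Cores) is
# still decided by PyTorch/cuDNN based on the GPU and the data types used.
compiled_model = torch.compile(model)

x = torch.randn(32, 300)
out = compiled_model(x)  # first call triggers compilation; later calls reuse it
```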

Also, training is typically not that long, so to me it is not an urgent matter. We are currently working on an out-of-the-box API/Docker feature to use Deepparse as an API service.

No worries, I am not asking for any enhancement or new features.

Just trying to understand which processing core of GPU is used for model training.

It seems that Nvidia GPUs will use Tensor Cores for certain data types, while the rest use CUDA Cores. link

It would depend on the GPU, the operations, and the data types being used. For Volta, fp16 should use Tensor Cores by default for common ops like matmul and conv. For Ampere and newer, fp16 and bf16 should use Tensor Cores for common ops, and so should fp32 for convs (via TF32).
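As a rough way to check whether a given GPU even has Tensor Cores, and to let fp32 work route to them on Ampere and newer via TF32, something like the following works in recent PyTorch versions. This is a standalone sketch, not part of Deepparse:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")
    # Tensor Cores first appeared with compute capability 7.0 (Volta);
    # TF32 Tensor Core math for fp32 requires 8.0+ (Ampere).
    print("Has Tensor Cores:", major >= 7)

# On Ampere and newer, these settings allow fp32 matmuls and convolutions
# to run on Tensor Cores via TF32 instead of plain CUDA-core fp32 math.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Equivalent newer-style switch for matmul precision (PyTorch >= 1.12).
torch.set_float32_matmul_precision("high")
```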

There is no urgency to implement an enhancement if the model is not using Tensor Cores.

I have never dug into such specific details about Tensor Cores/CUDA Cores. From my understanding, Torch handles this implementation distinction itself, whether you use plain Torch or torch.compile. So, it should not be a concern.

However, we could definitely look into using 16-bit floating point (fp16) instead of 32-bit with the new Torch Apex integration.

However, as I said, training is less of a concern than inference. I will add this to our backlog.

We have added the feature "support fp16" to our backlog.
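For context, mixed-precision training in PyTorch is nowadays usually done through torch.cuda.amp (the Apex AMP functionality that was upstreamed into Torch), which runs eligible ops in fp16 and therefore on Tensor Cores. Below is a minimal sketch with a placeholder model and optimizer, not Deepparse's actual training loop:

```python
import torch
import torch.nn as nn

device = "cuda"
# Placeholder model/optimizer standing in for Deepparse's tagger.
model = nn.Linear(300, 9).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid fp16 underflow

for _ in range(10):  # dummy training steps with random data
    x = torch.randn(32, 300, device=device)
    y = torch.randint(0, 9, (32,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # eligible ops run in fp16 on Tensor Cores
        loss = nn.functional.cross_entropy(model(x), y)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```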

Is there anything else you'd like to discuss about your issue?

Nope for me. Thanks for your efforts and explanations.

Anyway, I don't really need this feature at the moment. I was just curious which processing core is used for model training after reading an article about using Tensor Cores in Nvidia GPUs to speed up deep learning model training.

I am not sure how it relates or applies here. You may consider adding "support fp16" if you find it useful.

Thanks