SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022.

NAT Tiny performance on ImageNet 1k

jamesben6688 opened this issue

jamesben6688 commented

Hi Ali,

Thank you for sharing your impressive work.

I ran the NAT Tiny model on ImageNet 1K for 310 epochs, but I could only achieve a Top-1 accuracy of 82.45%. I used 4 A100 GPUs and changed the batch size to 832.

Could you share the specific configuration you used to train the model (hardware and other hyperparameters), and roughly how long the training took?

Hello and thank you for your interest,

I'd refer you to our classification training configs here. We trained on 8xA100s with a total batch size of 1024 (128 per GPU).

When we were working on the original NAT paper and trained NAT Tiny, we only had very early versions of our naive kernels at hand, so training took a few days. But if you install NATTEN right now on an A100 machine and train with our config, it should take less than 24 hours with mixed precision.
If you build NATTEN from source on your machine, you'll be running our GEMM kernels, which should be even faster than the public version.
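Purely as an illustration (not our training code), here is a minimal sketch for sanity-checking that mixed precision works with NATTEN on your machine; the NeighborhoodAttention2D keyword arguments and the channels-last (batch, height, width, dim) input layout below follow NATTEN's documented module interface and may vary slightly between versions:

import torch
from natten import NeighborhoodAttention2D

# Toy neighborhood attention layer; the hyperparameters here are arbitrary.
na2d = NeighborhoodAttention2D(dim=64, kernel_size=7, num_heads=4).cuda()

# Dummy channels-last input: (batch, height, width, dim).
x = torch.randn(2, 56, 56, 64, device="cuda")

# Forward pass under autocast to exercise the half-precision kernels.
with torch.autocast("cuda", dtype=torch.float16):
    out = na2d(x)

print(out.shape, out.dtype)  # expect (2, 56, 56, 64) in half precision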

jamesben6688 commented

I installed NATTEN with

pip3 install natten -f https://shi-labs.com/natten/wheels/{cu_version}/torch{torch_version}/index.html

I tested the functions:

import natten

# Check if NATTEN was built with CUDA
print(natten.has_cuda())

# Check if NATTEN with CUDA was built with support for float16
print(natten.has_half())

# Check if NATTEN with CUDA was built with support for bfloat16
print(natten.has_bfloat())

# Check if NATTEN with CUDA was built with the new GEMM kernels
print(natten.has_gemm())

However, calling them raises AttributeError: module 'natten' has no attribute 'has_cuda' (and likewise for has_half, has_bfloat, and has_gemm).

By the way, I found that a batch size of 128 does not fully utilize the 80 GB of memory on each A100, so I increased the batch size to 832. With that setting, one epoch takes about 8 minutes. Will this affect performance? May I ask why you set the batch size to 128 per GPU?

This is because you installed a NATTEN release; to get the GEMM kernels, you need to build from source.
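In the meantime, here is a minimal sketch (just a defensive variant of your snippet) that probes those helpers with getattr, so a build that predates them reports that they are missing instead of raising AttributeError:

import natten

# Probe the feature-flag helpers defensively: builds that predate these
# functions report "not available in this build" instead of raising
# AttributeError.
for name in ("has_cuda", "has_half", "has_bfloat", "has_gemm"):
    fn = getattr(natten, name, None)
    print(f"natten.{name}():", fn() if callable(fn) else "not available in this build")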

And re: batch size, there is absolutely nothing wrong with leaving some GPU memory free. Just because the GPU has free memory doesn't mean it has the compute to make use of a larger batch. Without a specific reason, increasing your batch size just to fill up GPU memory is generally not a good idea.

And given that batch statistics heavily impact training, if you're looking to reproduce a number, you should follow the exact settings.
A total batch size of 1024 is very common for models of similar architecture and size trained on ImageNet-1K from scratch.

jamesben6688 commented

Thank you so much for your response.

Does this mean I need to change the batch size from 128 to 256 for 4 GPUs?

I'll install NATTEN from source and try it.

Yes, that's correct: 256 on 4 GPUs is still 1024 in total. Although we don't use batch norm anywhere, it might still converge to a slightly different accuracy with that change, but the difference should be minimal.
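Just as a plain-arithmetic sanity check (nothing NAT-specific, purely illustrative):

# Effective (total) batch size = per-GPU batch size x number of GPUs.
def total_batch_size(per_gpu: int, num_gpus: int) -> int:
    return per_gpu * num_gpus

print(total_batch_size(128, 8))  # 1024 -- the original 8-GPU setup
print(total_batch_size(256, 4))  # 1024 -- the equivalent 4-GPU setup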

jamesben6688 commented

Ok. Thank you so much.

Closing this due to inactivity.