arthurdouillard / dytox

Dynamic Token Expansion with Continual Transformers, accepted at CVPR 2022

Home Page: https://arxiv.org/abs/2111.11326

Accuracy variation depending on the number of GPUs used

zhl98 opened this issue

commented

Hello, thank you very much for your code!
I used the DyTox settings in the code for 10 steps of training, but I failed to reach the accuracy reported in the paper.
bash train.sh 0 --options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path MY_PATH_TO_DATASET --output-basedir PATH_TO_SAVE_CHECKPOINTS
Here are the reproduction results:
[screenshot of reproduction results]
The average accuracy is 69.54%.
Can you give me some advice? Thank you very much!

After cleaning the code I had only tested CIFAR100 with 50 steps, where the results were exactly reproduced. I'm re-launching 10 steps to check that.

commented

OK, thank you very much!

Hey, so I haven't had time to fully reproduce 10 steps with a single GPU, but the first 5 steps are indeed similar to yours.
However, when run with 2 GPUs, I got exactly the results from my paper (even a little better).

I think the discrepancy comes from the fact that with two GPUs I'm actually using a batch size twice as large (PyTorch's DDP uses batch_size on each GPU). So my effective batch size is bigger than yours, which can explain the difference in results.
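
For reference, a minimal sketch (assumed setup, not the repo's exact code) of why the per-GPU batch_size multiplies under PyTorch's DistributedDataParallel: each process builds its own DataLoader with the same batch_size, so one optimizer step effectively sees batch_size * world_size samples.

import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_loader(dataset, batch_size=128):
    # Assumes the default process group is already initialized, as it is
    # during DDP training; DistributedSampler splits the data across processes.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    effective_batch = batch_size * dist.get_world_size()
    print(f"per-GPU batch: {batch_size}, effective batch: {effective_batch}")
    return loader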

So what you can do is modify cifar_dytox.yaml and increase the batch size to 256 (128 * 2).
This option file should work:

#######################
# DyTox, for CIFAR100 #
#######################

# Model definition
model: convit
embed_dim: 384
depth: 6
num_heads: 12
patch_size: 4
input_size: 32
local_up_to_layer: 5
class_attention: true

# Training setting
no_amp: true
eval_every: 50

# Base hyperparameter
weight_decay: 0.000001
batch_size: 128
incremental_lr: 0.0005
incremental_batch_size: 256  # UPDATE VALUE
rehearsal: icarl_all

# Knowledge Distillation
auto_kd: true

# Finetuning
finetuning: balanced
finetuning_epochs: 20

# Dytox model
dytox: true
freeze_task: [old_task_tokens, old_heads]
freeze_ft: [sab]

# Divergence head to get diversity
head_div: 0.1
head_div_mode: tr

# Independent Classifiers
ind_clf: 1-1
bce_loss: true


# Advanced Augmentations, here disabled

## Erasing
reprob: 0.0
remode: pixel
recount: 1
resplit: false

## MixUp & CutMix
mixup: 0.0
cutmix: 0.0

If you have time to tell me whether it works better, great; otherwise I'll check it in the coming weeks.

Since I'm 100% sure the results are reproducible with two GPUs, that must be the problem.

commented

Hey, after updating incremental_batch_size to 256 and running with 1 GPU, the result is still only 69.50%.
[screenshot of reproduction results]

But it does seem that two GPUs give better results.
I tested dytox_plus with 2 GPUs and got an average of 76.17% (even a little better than your paper).

Hmm... I'm launching experiments with a batch size of 256 (the yaml I gave you only changed it for steps t > 1, not t = 0, my bad), with an LR of 0.0005 (the default one) and an LR of 0.001 (twice as big, as it would have been with two GPUs).
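
For reference, the LR adjustment described here is the usual linear scaling rule; a minimal sketch with illustrative numbers taken from the yaml above (not the repo's code):

base_lr = 0.0005            # default incremental_lr for batch size 128 on one GPU
base_batch_size = 128       # per-GPU batch_size in cifar_dytox.yaml
effective_batch_size = 256  # 128 per GPU x 2 GPUs, or a single GPU run at 256

# Linear scaling rule: scale the LR by the same factor as the effective batch size.
scaled_lr = base_lr * effective_batch_size / base_batch_size
print(scaled_lr)  # 0.001, the "twice bigger" LR mentioned above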

I'm also enabling mixed-precision (no_amp: false) to go faster.
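
For context, a minimal sketch (generic PyTorch AMP, not the repo's exact training loop) of what setting no_amp: false amounts to:

import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, criterion, optimizer, images, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run forward in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()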

I'll keep you updated.

Hi,

Posting it here because I'm having the same issue. I ran the DyTox model on CIFAR-100 with the same settings as in the first comment here, on a single GPU, and I'm getting the following log:

{"task": 0, "epoch": 499, "acc": 92.5, "avg_acc": 92.5, "forgetting": 0.0, "acc_per_task": [92.5], "train_lr": 1.0004539958280581e-05, "bwt": 0.0, "fwt": 0.0, "test_acc1": 92.5, "test_acc5": 99.4, "mean_acc5": 99.4, "train_loss": 0.05053, "test_loss": 0.36721, "token_mean_dist": 0.0, "token_min_dist": 0.0, "token_max_dist": 0.0}
{"task": 1, "epoch": 19, "acc": 85.55, "avg_acc": 89.02, "forgetting": 0.0, "acc_per_task": [87.7, 83.4], "train_lr": 1.2500000000000004e-05, "bwt": 0.0, "fwt": 87.7, "test_acc1": 85.55, "test_acc5": 96.95, "mean_acc5": 98.18, "train_loss": 0.03499, "test_loss": 0.80777, "token_mean_dist": 0.54355, "token_min_dist": 0.54355, "token_max_dist": 0.54355}
{"task": 2, "epoch": 19, "acc": 78.67, "avg_acc": 85.57, "forgetting": 6.25, "acc_per_task": [80.0, 74.0, 82.0], "train_lr": 1.2500000000000004e-05, "bwt": -4.17, "fwt": 80.57, "test_acc1": 78.67, "test_acc5": 94.9, "mean_acc5": 97.08, "train_loss": 0.0259, "test_loss": 1.07032, "token_mean_dist": 0.58243, "token_min_dist": 0.53487, "token_max_dist": 0.61953}
{"task": 3, "epoch": 19, "acc": 73.32, "avg_acc": 82.51, "forgetting": 11.6, "acc_per_task": [71.3, 69.8, 70.6, 81.6], "train_lr": 1.2500000000000004e-05, "bwt": -7.88, "fwt": 75.57, "test_acc1": 73.33, "test_acc5": 93.1, "mean_acc5": 96.09, "train_loss": 0.02083, "test_loss": 1.37981, "token_mean_dist": 0.58081, "token_min_dist": 0.52581, "token_max_dist": 0.61908}
{"task": 4, "epoch": 19, "acc": 69.46, "avg_acc": 79.9, "forgetting": 16.5, "acc_per_task": [65.3, 65.9, 60.7, 71.7, 83.7], "train_lr": 1.2500000000000004e-05, "bwt": -11.33, "fwt": 71.7, "test_acc1": 69.46, "test_acc5": 92.04, "mean_acc5": 95.28, "train_loss": 0.0163, "test_loss": 1.65585, "token_mean_dist": 0.58517, "token_min_dist": 0.51872, "token_max_dist": 0.62832}
{"task": 5, "epoch": 19, "acc": 68.23, "avg_acc": 77.96, "forgetting": 19.32, "acc_per_task": [64.1, 59.3, 54.6, 64.9, 79.3, 87.2], "train_lr": 1.2500000000000004e-05, "bwt": -13.99, "fwt": 69.28, "test_acc1": 68.23, "test_acc5": 91.15, "mean_acc5": 94.59, "train_loss": 0.01265, "test_loss": 1.64966, "token_mean_dist": 0.6064, "token_min_dist": 0.5128, "token_max_dist": 0.70423}
{"task": 6, "epoch": 19, "acc": 64.01, "avg_acc": 75.96, "forgetting": 22.3, "acc_per_task": [60.5, 52.0, 48.8, 56.2, 71.9, 80.3, 78.4], "train_lr": 1.2500000000000004e-05, "bwt": -16.37, "fwt": 67.09, "test_acc1": 64.01, "test_acc5": 89.11, "mean_acc5": 93.81, "train_loss": 0.01232, "test_loss": 1.96759, "token_mean_dist": 0.60002, "token_min_dist": 0.50834, "token_max_dist": 0.7036}
{"task": 7, "epoch": 19, "acc": 60.25, "avg_acc": 74.0, "forgetting": 25.642857, "acc_per_task": [55.3, 46.9, 43.2, 50.9, 60.3, 74.3, 65.3, 85.8], "train_lr": 1.2500000000000004e-05, "bwt": -18.69, "fwt": 64.47, "test_acc1": 60.25, "test_acc5": 87.64, "mean_acc5": 93.04, "train_loss": 0.00952, "test_loss": 2.14214, "token_mean_dist": 0.59949, "token_min_dist": 0.50265, "token_max_dist": 0.70439}
{"task": 8, "epoch": 19, "acc": 58.38, "avg_acc": 72.26, "forgetting": 28.075, "acc_per_task": [53.6, 42.7, 41.5, 48.0, 53.9, 67.2, 57.3, 77.7, 83.5], "train_lr": 1.2500000000000004e-05, "bwt": -20.77, "fwt": 62.42, "test_acc1": 58.38, "test_acc5": 85.98, "mean_acc5": 92.25, "train_loss": 0.00978, "test_loss": 2.24582, "token_mean_dist": 0.59777, "token_min_dist": 0.49842, "token_max_dist": 0.70554}
{"task": 9, "epoch": 19, "acc": 54.61, "avg_acc": 70.5, "forgetting": 31.277778, "acc_per_task": [50.0, 39.4, 32.4, 44.1, 47.7, 63.2, 49.8, 66.5, 74.0, 79.0], "train_lr": 1.2500000000000004e-05, "bwt": -22.87, "fwt": 60.31, "test_acc1": 54.61, "test_acc5": 83.76, "mean_acc5": 91.4, "train_loss": 0.00789, "test_loss": 2.54448, "token_mean_dist": 0.59817, "token_min_dist": 0.49496, "token_max_dist": 0.70778}
{"avg": 70.49870843967983}

Is this accuracy expected? The final accuracy (54.61) is lower than the number I see in the paper for CIFAR-100, 10 steps. I'm trying to understand how multi-GPU training alone can bring such a big improvement. Any help would be much appreciated.

Hello, I'm still trying to improve performance on a single GPU. I'll keep this issue updated if I find a way to do it.

In the meantime, try running on two GPUs, as the results have been reproduced by multiple people (including @zhl98, who opened this issue).

Hi,

Just a short update. I thought repeated augmentation (RA) could be the reason behind the improved multi-GPU results, so I ran it without RA, but I was still getting around 59% accuracy, so that cannot be the reason. Please let us know if you figure out how to make it work in the single-GPU setting.

Yeah, I chatted with Hugo Touvron (the main DeiT author) and he also suggested RA as a possible cause. I've tried multi-GPU without RA and single-GPU with RA, and nothing changed significantly.

I'll keep you updated.

Accuracy variation is for the most part explained in the following erratum.
We are trying to see how we could emulate our distributed memory (see the erratum) in the single-GPU setting.