arthurdouillard / dytox

Dynamic Token Expansion with Continual Transformers, accepted at CVPR 2022

Home Page: https://arxiv.org/abs/2111.11326

Accuracy variation depending on the number of GPUs used

zhl98 opened this issue

commented

Hello, thank you very much for your code!
I used the DyTox settings in the code for 10 steps of training, but I failed to reach the accuracy reported in the paper.
bash train.sh 0 --options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path MY_PATH_TO_DATASET --output-basedir PATH_TO_SAVE_CHECKPOINTS
Here are the reproduction results:
[screenshot of reproduction results]
The average accuracy is 69.54%.
Can you give me some advice? Thank you very much!

After cleaning the code I had only tested CIFAR100 with 50 steps, where the results were exactly reproduced. I'm re-launching 10 steps to check that.

commented

OK, thank you very much!

Hey, so I haven't had time to fully reproduce 10 steps with a single GPU, but the first 5 steps are indeed similar to yours.
However, when run with 2 GPUs, I got exactly the results from my paper (even a little better).

I think the discrepancy comes from the fact that with two GPUs I'm actually using a batch size twice as large (PyTorch's DDP uses batch_size on each GPU). So my effective batch size is bigger than yours, which can explain the difference in results.
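
For reference, a minimal sketch (assumed setup, not the repo's exact code) of why the per-GPU batch_size multiplies under PyTorch's DistributedDataParallel: each process builds its own DataLoader with the same batch_size, so one optimizer step effectively sees batch_size * world_size samples.

import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_loader(dataset, batch_size=128):
    # Assumes the default process group is already initialized, as it is
    # during DDP training; DistributedSampler splits the data across processes.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    effective_batch = batch_size * dist.get_world_size()
    print(f"per-GPU batch: {batch_size}, effective batch: {effective_batch}")
    return loader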

So what you can do is modify cifar_dytox.yaml and increase the batch size to 256 (128 * 2).
This option file should work:

#######################
# DyTox, for CIFAR100 #
#######################

# Model definition
model: convit
embed_dim: 384
depth: 6
num_heads: 12
patch_size: 4
input_size: 32
local_up_to_layer: 5
class_attention: true

# Training setting
no_amp: true
eval_every: 50

# Base hyperparameter
weight_decay: 0.000001
batch_size: 128
incremental_lr: 0.0005
incremental_batch_size: 256  # UPDATE VALUE
rehearsal: icarl_all

# Knowledge Distillation
auto_kd: true

# Finetuning
finetuning: balanced
finetuning_epochs: 20

# Dytox model
dytox: true
freeze_task: [old_task_tokens, old_heads]
freeze_ft: [sab]

# Divergence head to get diversity
head_div: 0.1
head_div_mode: tr

# Independent Classifiers
ind_clf: 1-1
bce_loss: true


# Advanced Augmentations, here disabled

## Erasing
reprob: 0.0
remode: pixel
recount: 1
resplit: false

## MixUp & CutMix
mixup: 0.0
cutmix: 0.0

If you have time to tell me whether it works better, great; otherwise I'll check it in the coming weeks.

Since I'm 100% sure the results are reproducible with two GPUs, that must be the problem.

commented

Hey, after updating incremental_batch_size to 256 and running with 1 GPU, the result is still only 69.50%.
[screenshot of reproduction results]

But it does seem that two GPUs give better results.
I tested dytox_plus with 2 GPUs and got an average of 76.17% (even a little better than your paper).

Hmm... I'm launching experiments with a batch size of 256 (the yaml I gave you only changed it for steps t > 1, not t = 0, my bad), with an LR of 0.0005 (the default one) and an LR of 0.001 (twice as big, as it would have been with two GPUs).
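
For reference, the LR adjustment described here is the usual linear scaling rule; a minimal sketch with illustrative numbers taken from the yaml above (not the repo's code):

base_lr = 0.0005            # default incremental_lr for batch size 128 on one GPU
base_batch_size = 128       # per-GPU batch_size in cifar_dytox.yaml
effective_batch_size = 256  # 128 per GPU x 2 GPUs, or a single GPU run at 256

# Linear scaling rule: scale the LR by the same factor as the effective batch size.
scaled_lr = base_lr * effective_batch_size / base_batch_size
print(scaled_lr)  # 0.001, the "twice bigger" LR mentioned above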

I'm also enabling mixed-precision (no_amp: false) to go faster.
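
For context, a minimal sketch (generic PyTorch AMP, not the repo's exact training loop) of what setting no_amp: false amounts to:

import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, criterion, optimizer, images, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run forward in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()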

I'll keep you updated.

Hi,

Posting it here because I'm having the same issue. I ran the DyTox model on CIFAR-100 with the same settings as in the first comment here, on a single GPU, and I'm getting the following log:

{"task": 0, "epoch": 499, "acc": 92.5, "avg_acc": 92.5, "forgetting": 0.0, "acc_per_task": [92.5], "train_lr": 1.0004539958280581e-05, "bwt": 0.0, "fwt": 0.0, "test_acc1": 92.5, "test_acc5": 99.4, "mean_acc5": 99.4, "train_loss": 0.05053, "test_loss": 0.36721, "token_mean_dist": 0.0, "token_min_dist": 0.0, "token_max_dist": 0.0}
{"task": 1, "epoch": 19, "acc": 85.55, "avg_acc": 89.02, "forgetting": 0.0, "acc_per_task": [87.7, 83.4], "train_lr": 1.2500000000000004e-05, "bwt": 0.0, "fwt": 87.7, "test_acc1": 85.55, "test_acc5": 96.95, "mean_acc5": 98.18, "train_loss": 0.03499, "test_loss": 0.80777, "token_mean_dist": 0.54355, "token_min_dist": 0.54355, "token_max_dist": 0.54355}
{"task": 2, "epoch": 19, "acc": 78.67, "avg_acc": 85.57, "forgetting": 6.25, "acc_per_task": [80.0, 74.0, 82.0], "train_lr": 1.2500000000000004e-05, "bwt": -4.17, "fwt": 80.57, "test_acc1": 78.67, "test_acc5": 94.9, "mean_acc5": 97.08, "train_loss": 0.0259, "test_loss": 1.07032, "token_mean_dist": 0.58243, "token_min_dist": 0.53487, "token_max_dist": 0.61953}
{"task": 3, "epoch": 19, "acc": 73.32, "avg_acc": 82.51, "forgetting": 11.6, "acc_per_task": [71.3, 69.8, 70.6, 81.6], "train_lr": 1.2500000000000004e-05, "bwt": -7.88, "fwt": 75.57, "test_acc1": 73.33, "test_acc5": 93.1, "mean_acc5": 96.09, "train_loss": 0.02083, "test_loss": 1.37981, "token_mean_dist": 0.58081, "token_min_dist": 0.52581, "token_max_dist": 0.61908}
{"task": 4, "epoch": 19, "acc": 69.46, "avg_acc": 79.9, "forgetting": 16.5, "acc_per_task": [65.3, 65.9, 60.7, 71.7, 83.7], "train_lr": 1.2500000000000004e-05, "bwt": -11.33, "fwt": 71.7, "test_acc1": 69.46, "test_acc5": 92.04, "mean_acc5": 95.28, "train_loss": 0.0163, "test_loss": 1.65585, "token_mean_dist": 0.58517, "token_min_dist": 0.51872, "token_max_dist": 0.62832}
{"task": 5, "epoch": 19, "acc": 68.23, "avg_acc": 77.96, "forgetting": 19.32, "acc_per_task": [64.1, 59.3, 54.6, 64.9, 79.3, 87.2], "train_lr": 1.2500000000000004e-05, "bwt": -13.99, "fwt": 69.28, "test_acc1": 68.23, "test_acc5": 91.15, "mean_acc5": 94.59, "train_loss": 0.01265, "test_loss": 1.64966, "token_mean_dist": 0.6064, "token_min_dist": 0.5128, "token_max_dist": 0.70423}
{"task": 6, "epoch": 19, "acc": 64.01, "avg_acc": 75.96, "forgetting": 22.3, "acc_per_task": [60.5, 52.0, 48.8, 56.2, 71.9, 80.3, 78.4], "train_lr": 1.2500000000000004e-05, "bwt": -16.37, "fwt": 67.09, "test_acc1": 64.01, "test_acc5": 89.11, "mean_acc5": 93.81, "train_loss": 0.01232, "test_loss": 1.96759, "token_mean_dist": 0.60002, "token_min_dist": 0.50834, "token_max_dist": 0.7036}
{"task": 7, "epoch": 19, "acc": 60.25, "avg_acc": 74.0, "forgetting": 25.642857, "acc_per_task": [55.3, 46.9, 43.2, 50.9, 60.3, 74.3, 65.3, 85.8], "train_lr": 1.2500000000000004e-05, "bwt": -18.69, "fwt": 64.47, "test_acc1": 60.25, "test_acc5": 87.64, "mean_acc5": 93.04, "train_loss": 0.00952, "test_loss": 2.14214, "token_mean_dist": 0.59949, "token_min_dist": 0.50265, "token_max_dist": 0.70439}
{"task": 8, "epoch": 19, "acc": 58.38, "avg_acc": 72.26, "forgetting": 28.075, "acc_per_task": [53.6, 42.7, 41.5, 48.0, 53.9, 67.2, 57.3, 77.7, 83.5], "train_lr": 1.2500000000000004e-05, "bwt": -20.77, "fwt": 62.42, "test_acc1": 58.38, "test_acc5": 85.98, "mean_acc5": 92.25, "train_loss": 0.00978, "test_loss": 2.24582, "token_mean_dist": 0.59777, "token_min_dist": 0.49842, "token_max_dist": 0.70554}
{"task": 9, "epoch": 19, "acc": 54.61, "avg_acc": 70.5, "forgetting": 31.277778, "acc_per_task": [50.0, 39.4, 32.4, 44.1, 47.7, 63.2, 49.8, 66.5, 74.0, 79.0], "train_lr": 1.2500000000000004e-05, "bwt": -22.87, "fwt": 60.31, "test_acc1": 54.61, "test_acc5": 83.76, "mean_acc5": 91.4, "train_loss": 0.00789, "test_loss": 2.54448, "token_mean_dist": 0.59817, "token_min_dist": 0.49496, "token_max_dist": 0.70778}
{"avg": 70.49870843967983}

Is this accuracy expected? The final accuracy (54.61) is lower than the number I see in the paper for CIFAR-100, 10 steps. I'm trying to understand how multi-GPU training alone can bring such a big improvement. Any help would be much appreciated.

Hello, I'm still trying to improve performance on a single GPU. I'll keep this issue updated if I find a way to do it.

In the meantime, try running on two GPUs, as the results have been reproduced by multiple people (including @zhl98, who opened this issue).

Hi,

Just a short update. I thought repeated augmentation (RA) could be the reason behind the improved multi-GPU results, so I ran it without RA, but I was still getting around 59% accuracy, so that cannot be the reason. Please let us know if you figure out how to make it work in the single-GPU setting.

Yeah, I chatted with Hugo Touvron (the main DeiT author) and he also suggested RA as a possible cause. I've tried multi-GPU without RA and single-GPU with RA, and nothing changed significantly.

I'll keep you updated.

Accuracy variation is for the most part explained in the following erratum.
We are trying to see how we could emulate our distributed memory (see the erratum) in the single-GPU setting.