andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Home Page: http://andrewowens.com/multisensory/


Why doesn't acc change when training the shift model?

ruizewang opened this issue · comments

commented

Hello,
When I train the shift_lowfps model, the loss decreases slowly, but acc doesn't change (it stays at 0.500).
Could you give me some advice?

  • Is it because the training time is too short?
  • Should acc increase when the loss decreases?
  • By the way, what does the total loss mean?

[grad norm:][0.0125109516]
Iteration 5500, lr = 1e-03, total:loss: 1.246 reg: 0.041 loss:label: 0.705 acc:label: 0.500, time: 2.978
Iteration 5510, lr = 1e-03, total:loss: 1.244 reg: 0.040 loss:label: 0.704 acc:label: 0.500, time: 2.974
Iteration 5520, lr = 1e-03, total:loss: 1.241 reg: 0.039 loss:label: 0.703 acc:label: 0.500, time: 2.953
Iteration 5530, lr = 1e-03, total:loss: 1.239 reg: 0.037 loss:label: 0.702 acc:label: 0.500, time: 2.960
Iteration 5540, lr = 1e-03, total:loss: 1.238 reg: 0.036 loss:label: 0.701 acc:label: 0.500, time: 2.971
Iteration 5550, lr = 1e-03, total:loss: 1.236 reg: 0.035 loss:label: 0.700 acc:label: 0.500, time: 2.965
Iteration 5560, lr = 1e-03, total:loss: 1.234 reg: 0.034 loss:label: 0.700 acc:label: 0.500, time: 2.961
Iteration 5570, lr = 1e-03, total:loss: 1.232 reg: 0.033 loss:label: 0.699 acc:label: 0.500, time: 2.957
Iteration 5580, lr = 1e-03, total:loss: 1.231 reg: 0.032 loss:label: 0.699 acc:label: 0.500, time: 2.952
Iteration 5590, lr = 1e-03, total:loss: 1.229 reg: 0.031 loss:label: 0.698 acc:label: 0.500, time: 2.967
[grad norm:][0.00501754601]
Iteration 5600, lr = 1e-03, total:loss: 1.228 reg: 0.030 loss:label: 0.698 acc:label: 0.500, time: 2.968
Iteration 5610, lr = 1e-03, total:loss: 1.227 reg: 0.030 loss:label: 0.697 acc:label: 0.500, time: 2.960
Iteration 5620, lr = 1e-03, total:loss: 1.225 reg: 0.029 loss:label: 0.697 acc:label: 0.500, time: 2.951
Iteration 5630, lr = 1e-03, total:loss: 1.224 reg: 0.028 loss:label: 0.696 acc:label: 0.500, time: 2.977
Iteration 5640, lr = 1e-03, total:loss: 1.223 reg: 0.027 loss:label: 0.696 acc:label: 0.500, time: 2.973
Iteration 5650, lr = 1e-03, total:loss: 1.222 reg: 0.026 loss:label: 0.696 acc:label: 0.500, time: 2.981

commented

Yes, this is a common failure mode! The model also takes a long time to get better-than-chance performance, which can make it look like it's stuck.

  • What batch size are you using? Are you training on AudioSet? Note that I trained that model with 3 GPUs, so the effective batch size was 45.
  • The loss values you should probably be looking at are "loss:label", which is the cross-entropy loss, and "acc", which is the overall accuracy. Here, chance performance would be acc = 0.5 and loss:label = -ln(0.5) ≈ 0.693 (see the quick check after this list). So, it looks like the model has not yet reached chance performance.
  • In my experiments, the model took something like 2K iterations to reach chance performance (loss:label = 0.693), and 11K iterations to do better than chance (loss:label = 0.692). So, for a long time it looked like the model was stuck at chance.
  • Did you decrease the learning rate? I trained with lr = 1e-2 at the beginning. This might explain why your model is still doing worse than chance at 5K iterations.
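As a quick sanity check on that chance-level number (an illustrative snippet, not part of the repo): a two-way classifier that always outputs probability 0.5 incurs a cross-entropy of -ln(0.5) = ln(2) ≈ 0.693 on every example, which is exactly the "loss:label" plateau in the log above.

import math

# Cross-entropy of a binary classifier that always predicts p = 0.5:
# -(y*ln(p) + (1-y)*ln(1-p)) = -ln(0.5) = ln(2), regardless of the label y.
chance_loss = -math.log(0.5)
print(round(chance_loss, 3))  # 0.693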
commented


Thank you very much for your explanation. Now it suddenly makes sense.

  • I use one GPU (a 1080 Ti), so the batch size is 15. Yes, I am training on AudioSet, but my dataset is not as big as yours: you used 750,000 videos, while I use 600,000.
  • Haha, I saw that my training was not going well, so I tried Adam with lr = 1e-3 (the results above). I also tried your default setting, the 'momentum' optimizer with lr = 1e-2. The results are below, but the model still seems worse than chance. Maybe I should train for longer.
Iteration 15080, lr = 1e-02, total:loss: 1.257 reg: 0.068 loss:label: 0.693 acc:label: 0.496, time: 7.318
Iteration 15090, lr = 1e-02, total:loss: 1.257 reg: 0.068 loss:label: 0.693 acc:label: 0.496, time: 7.203
[grad norm:][0.0545679964]
Iteration 15100, lr = 1e-02, total:loss: 1.256 reg: 0.068 loss:label: 0.693 acc:label: 0.495, time: 7.167
Iteration 15110, lr = 1e-02, total:loss: 1.255 reg: 0.068 loss:label: 0.693 acc:label: 0.495, time: 7.123
Iteration 15120, lr = 1e-02, total:loss: 1.257 reg: 0.068 loss:label: 0.693 acc:label: 0.497, time: 7.079
Iteration 15130, lr = 1e-02, total:loss: 1.259 reg: 0.068 loss:label: 0.693 acc:label: 0.498, time: 7.032
Iteration 15140, lr = 1e-02, total:loss: 1.260 reg: 0.068 loss:label: 0.693 acc:label: 0.499, time: 6.990
Iteration 15150, lr = 1e-02, total:loss: 1.259 reg: 0.068 loss:label: 0.693 acc:label: 0.498, time: 7.013
  • Is there anything I can do to speed up training? I only have two 1080 Ti GPUs. o(╥﹏╥)o
commented

  • I think that two-GPU training might be enough. You could also try to compensate by averaging gradients over multiple minibatches to simulate having more GPUs, e.g. by using this helper optimizer: https://github.com/renmengye/revnet-public/blob/master/resnet/models/multi_pass_optimizer.py. There's partial support for this in the code already (set multipass = True, and set the number of batches with multipass_count). See the sketch after this list for the general idea.
  • I think it is helpful to pay attention to the "loss:label" value rather than the accuracy. In this case, a loss of 0.693 (which equals -ln(0.5)) means that you have chance accuracy. It looks like this model has lower loss than your other model.
  • I had trouble training with such a large learning rate. I think that you should probably be able to train it with Adam and a 1e-4 learning rate, though.
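For reference, here is a minimal sketch of that gradient-accumulation idea in TensorFlow 1.x. It is not the repo's multi_pass_optimizer or its multipass/multipass_count options, just an illustration with assumed names: per-variable buffers accumulate gradients over num_passes minibatches, and the averaged result is applied in a single update, roughly simulating a batch num_passes times larger.

import tensorflow as tf

def make_accumulating_train_ops(loss, optimizer, num_passes):
    # Illustrative helper (not part of the repo): accumulate gradients over
    # several minibatches, then apply their average as one update.
    tvars = tf.trainable_variables()
    accums = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
    grads = tf.gradients(loss, tvars)

    zero_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accums])
    accum_op = tf.group(*[a.assign_add(g) for a, g in zip(accums, grads)])
    apply_op = optimizer.apply_gradients(
        [(a / float(num_passes), v) for a, v in zip(accums, tvars)])
    return zero_op, accum_op, apply_op

# Per "large" step: run zero_op once, run accum_op on num_passes different
# minibatches, then run apply_op.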
commented

Thanks a lot, Andrew. It is really helpful. 😃

commented

Sorry to bother you, I am here again. Once a shift model (e.g., 'net.tf-30000') has been trained, how do I use it for testing?
Is it enough to set "is_training" to False and run shift_net.train, or is there something else I should do?

class Model:
    def __init__(self, pr, sess, gpus, is_training=False, pr_test=None):
commented

Hello Andrew.

  • If I use the pre-trained model you provided to initialize shift-model training, the model should do better than chance, i.e., acc > 0.5 and loss:label < 0.693, right?
  • But when I resume training from the pre-trained model (net.tf-65000), the behavior is strange. At the beginning the performance looks normal, but then "acc" decreases until it reaches 0.5 and "loss:label" gets stuck at 0.693.
[grad norm:][4.99999952]
Iteration 650000, lr = 1e-02, total:loss: 1.339 reg: 0.047 loss:label: 0.692 acc:label: 0.600, time: 42.686
Iteration 650010, lr = 1e-02, total:loss: 1.339 reg: 0.047 loss:label: 0.700 acc:label: 0.592, time: 38.813
Iteration 650020, lr = 1e-02, total:loss: 1.339 reg: 0.047 loss:label: 0.710 acc:label: 0.582, time: 35.310
Iteration 650030, lr = 1e-02, total:loss: 1.332 reg: 0.047 loss:label: 0.712 acc:label: 0.574, time: 32.127
Iteration 650040, lr = 1e-02, total:loss: 1.328 reg: 0.047 loss:label: 0.711 acc:label: 0.570, time: 29.258
Iteration 650050, lr = 1e-02, total:loss: 1.327 reg: 0.047 loss:label: 0.713 acc:label: 0.568, time: 26.660
Iteration 650060, lr = 1e-02, total:loss: 1.321 reg: 0.047 loss:label: 0.714 acc:label: 0.560, time: 24.311
Iteration 650070, lr = 1e-02, total:loss: 1.316 reg: 0.047 loss:label: 0.713 acc:label: 0.556, time: 22.189
Iteration 650080, lr = 1e-02, total:loss: 1.307 reg: 0.047 loss:label: 0.715 acc:label: 0.545, time: 20.270
Iteration 650090, lr = 1e-02, total:loss: 1.303 reg: 0.047 loss:label: 0.714 acc:label: 0.542, time: 18.539

......

Iteration 652000, lr = 1e-02, total:loss: 1.238 reg: 0.047 loss:label: 0.694 acc:label: 0.498, time: 1.845
Iteration 652010, lr = 1e-02, total:loss: 1.239 reg: 0.047 loss:label: 0.694 acc:label: 0.499, time: 1.855
Iteration 652020, lr = 1e-02, total:loss: 1.240 reg: 0.047 loss:label: 0.694 acc:label: 0.500, time: 1.853
Iteration 652030, lr = 1e-02, total:loss: 1.243 reg: 0.047 loss:label: 0.694 acc:label: 0.503, time: 1.857
Iteration 652040, lr = 1e-02, total:loss: 1.245 reg: 0.047 loss:label: 0.694 acc:label: 0.504, time: 1.857
Iteration 652050, lr = 1e-02, total:loss: 1.244 reg: 0.047 loss:label: 0.694 acc:label: 0.504, time: 1.857
Iteration 652060, lr = 1e-02, total:loss: 1.246 reg: 0.047 loss:label: 0.693 acc:label: 0.505, time: 1.856
Iteration 652070, lr = 1e-02, total:loss: 1.245 reg: 0.047 loss:label: 0.693 acc:label: 0.505, time: 1.859
Iteration 652080, lr = 1e-02, total:loss: 1.245 reg: 0.047 loss:label: 0.694 acc:label: 0.504, time: 1.861
Iteration 652090, lr = 1e-02, total:loss: 1.243 reg: 0.047 loss:label: 0.694 acc:label: 0.503, time: 1.859
[grad norm:][0.266301781]
Iteration 652100, lr = 1e-02, total:loss: 1.242 reg: 0.047 loss:label: 0.694 acc:label: 0.502, time: 1.857
Iteration 652110, lr = 1e-02, total:loss: 1.242 reg: 0.047 loss:label: 0.694 acc:label: 0.501, time: 1.862
Iteration 652120, lr = 1e-02, total:loss: 1.244 reg: 0.047 loss:label: 0.694 acc:label: 0.503, time: 1.861
Iteration 652130, lr = 1e-02, total:loss: 1.241 reg: 0.047 loss:label: 0.694 acc:label: 0.501, time: 1.860
Iteration 652140, lr = 1e-02, total:loss: 1.241 reg: 0.047 loss:label: 0.694 acc:label: 0.501, time: 1.856
Iteration 652150, lr = 1e-02, total:loss: 1.242 reg: 0.047 loss:label: 0.694 acc:label: 0.502, time: 1.854
Iteration 652160, lr = 1e-02, total:loss: 1.242 reg: 0.047 loss:label: 0.694 acc:label: 0.502, time: 1.850
Iteration 652170, lr = 1e-02, total:loss: 1.243 reg: 0.047 loss:label: 0.694 acc:label: 0.502, time: 1.848
Iteration 652180, lr = 1e-02, total:loss: 1.244 reg: 0.047 loss:label: 0.694 acc:label: 0.504, time: 1.846
Iteration 652190, lr = 1e-02, total:loss: 1.243 reg: 0.047 loss:label: 0.693 acc:label: 0.503, time: 1.850
[grad norm:][0.122649804]
Iteration 652200, lr = 1e-02, total:loss: 1.246 reg: 0.047 loss:label: 0.694 acc:label: 0.506, time: 1.848
Iteration 652210, lr = 1e-02, total:loss: 1.244 reg: 0.047 loss:label: 0.694 acc:label: 0.504, time: 1.846
Iteration 652220, lr = 1e-02, total:loss: 1.247 reg: 0.047 loss:label: 0.694 acc:label: 0.507, time: 1.847
Iteration 652230, lr = 1e-02, total:loss: 1.246 reg: 0.047 loss:label: 0.693 acc:label: 0.506, time: 1.849
Iteration 652240, lr = 1e-02, total:loss: 1.246 reg: 0.047 loss:label: 0.694 acc:label: 0.506, time: 1.852
Iteration 652250, lr = 1e-02, total:loss: 1.246 reg: 0.047 loss:label: 0.694 acc:label: 0.506, time: 1.850
Iteration 652260, lr = 1e-02, total:loss: 1.245 reg: 0.047 loss:label: 0.694 acc:label: 0.505, time: 1.852
Iteration 652270, lr = 1e-02, total:loss: 1.245 reg: 0.047 loss:label: 0.693 acc:label: 0.505, time: 1.850
Iteration 652280, lr = 1e-02, total:loss: 1.248 reg: 0.047 loss:label: 0.693 acc:label: 0.508, time: 1.849
Iteration 652290, lr = 1e-02, total:loss: 1.244 reg: 0.047 loss:label: 0.693 acc:label: 0.504, time: 1.846
[grad norm:][0.0564598292]
Iteration 652300, lr = 1e-02, total:loss: 1.246 reg: 0.047 loss:label: 0.693 acc:label: 0.506, time: 1.845
commented

  • Please refer to shift_example.py for an example of testing a trained network.
  • I think the loss is going up when you fine-tune because you are using a higher learning rate and (especially) a smaller batch size. The model starts out better than chance, but the parameters become worse because it's taking large steps (high learning rate) in not-so-great gradient directions (low batch size).
commented
  • Thank you, Andrew. 🤗 Yes, I agree with you. I will try a smaller learning rate and a bigger batch size.
  • There is an example of generating a CAM in shift_example.py, but I saw you report accuracy in the paper: "Task performance. We found that the model obtained 59.9% accuracy on held-out videos for its alignment task (chance = 50%)." I actually want to evaluate the model and get its accuracy on the test dataset. Do I need to rewrite that part of the code?
commented


This problem is solved. As you suggested, I added a "test_accuracy" function to "class NetClf". Thanks again, Andrew. 😀
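For anyone reading along, a rough sketch of what such an evaluation loop could look like is below. It is hypothetical: the actual NetClf interface in shift_example.py may differ, so treat clf.predict and the (frames, audio, label) tuple format as placeholder assumptions.

def test_accuracy(clf, test_examples):
    # test_examples: iterable of (frames, audio, label) tuples, where label is
    # 1 for an aligned audio/video pair and 0 for a shifted one. clf.predict
    # is a placeholder for whatever method returns P(aligned).
    correct, total = 0, 0
    for frames, audio, label in test_examples:
        prob_aligned = clf.predict(frames, audio)
        pred = int(prob_aligned >= 0.5)
        correct += int(pred == label)
        total += 1
    return correct / float(total)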

Hi @ruizewang, would you mind sharing the code you used to create the data files for training? I would really appreciate it.