tiandunx / loss_function_search

Loss Function Search for Face Recognition

Some questions about datasets

mzmzdcr opened this issue

Hello!
Thank you very much for releasing the code.
My question is: what should I fill in on lines 7 and 8 of train.sh?
train_lmdb='path to your training lmdb, it is comma separated'
train_files='path to your training kv text file, each line has 2 field. lmdb_key and label, there are only 1 space between them'
I have already created data.mdb and lock.mdb for the CASIA-WebFace dataset.

@mzmzdcr Assume that your lmdb has the following structure:
casia_lmdb
-- data.mdb
-- lock.mdb
Then you need another text file, say casia_kv.txt, that lists each key (used to index casia_lmdb) with its label, one pair per line:
casia_0_0 0
casia_1_0 1
casia_0_1 0
Note that casia_0_0 is a key into your lmdb, since lmdb is a key-value database. Line 51 in lmdb_utils/lmdb_dataset.py shows how the key-value pairs are used. Then you can fill in lines 7 and 8 of train.sh as follows:

train_lmdb=casia_lmdb
train_files=casia_kv.txt
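
Before training, the key/label file can be sanity-checked with a few lines of Python. The sketch below is not part of the repository: the function name `parse_kv_lines` is hypothetical, and it assumes only the format described in this thread (one lmdb key and one integer label per line, separated by a single space).

```python
from typing import Iterable, List, Tuple


def parse_kv_lines(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse 'lmdb_key label' lines into (key, label) pairs.

    Raises ValueError on any line that is not exactly two fields
    separated by a single space, or whose label is not an integer.
    """
    pairs = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        fields = line.split(" ")
        if len(fields) != 2 or not fields[0] or not fields[1]:
            raise ValueError(f"line {line_no}: expected 'key label', got {line!r}")
        key, label = fields
        pairs.append((key, int(label)))  # int() raises ValueError on a bad label
    return pairs


# The three example lines from the reply above.
example = ["casia_0_0 0\n", "casia_1_0 1\n", "casia_0_1 0\n"]
pairs = parse_kv_lines(example)
num_classes = len({label for _, label in pairs})
```

With a real file, something like `parse_kv_lines(open("casia_kv.txt"))` should work the same way. It is also worth confirming that every returned key actually exists in casia_lmdb.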

mzmzdcr commented

Thanks for your reply. I think I have solved the dataset problem.
The experiment now runs without errors, but I have run into some problems and would appreciate your help.
The training log for process 1 is as follows:

INFO 2021-06-10 13:33:50 search.py: 62] rank 0, epoch 0, iter 0, lr 0.100000, loss 21.082306
INFO 2021-06-10 13:33:50 search.py: 67] save checkpoint epoch_0_batch_0.pt to disk...
INFO 2021-06-10 13:37:09 search.py: 62] rank 0, epoch 0, iter 200, lr 0.100000, loss 19.041899
INFO 2021-06-10 13:41:21 search.py: 62] rank 0, epoch 0, iter 400, lr 0.100000, loss 18.097427
INFO 2021-06-10 13:48:47 search.py: 62] rank 0, epoch 0, iter 600, lr 0.100000, loss 17.036356
INFO 2021-06-10 13:56:22 search.py: 62] rank 0, epoch 0, iter 800, lr 0.100000, loss 16.676279
INFO 2021-06-10 14:01:37 search.py: 62] rank 0, epoch 0, iter 1000, lr 0.100000, loss 15.947204
INFO 2021-06-10 14:04:59 search.py: 71] save checkpoint epoch_0.pt to disk...
INFO 2021-06-10 14:05:22 search.py: 62] rank 0, epoch 1, iter 13, lr 0.100000, loss 15.455211
INFO 2021-06-10 14:11:47 search.py: 62] rank 0, epoch 1, iter 213, lr 0.100000, loss 14.118228
INFO 2021-06-10 14:20:54 search.py: 62] rank 0, epoch 1, iter 413, lr 0.100000, loss 13.590545
INFO 2021-06-10 14:25:37 search.py: 62] rank 0, epoch 1, iter 613, lr 0.100000, loss 13.473572
INFO 2021-06-10 14:30:19 search.py: 62] rank 0, epoch 1, iter 813, lr 0.100000, loss 12.233972
INFO 2021-06-10 14:38:23 search.py: 62] rank 0, epoch 1, iter 1013, lr 0.100000, loss 12.394306
INFO 2021-06-10 14:46:07 search.py: 71] save checkpoint epoch_1.pt to disk...
INFO 2021-06-10 14:46:51 search.py: 62] rank 0, epoch 2, iter 26, lr 0.100000, loss 9.513170
INFO 2021-06-10 14:52:00 search.py: 62] rank 0, epoch 2, iter 226, lr 0.100000, loss 8.766737
INFO 2021-06-10 14:59:39 search.py: 62] rank 0, epoch 2, iter 426, lr 0.100000, loss 9.048544
INFO 2021-06-10 15:07:31 search.py: 62] rank 0, epoch 2, iter 626, lr 0.100000, loss 9.172847
INFO 2021-06-10 15:12:22 search.py: 62] rank 0, epoch 2, iter 826, lr 0.100000, loss 8.642149
INFO 2021-06-10 15:17:30 search.py: 62] rank 0, epoch 2, iter 1026, lr 0.100000, loss 8.481071
INFO 2021-06-10 15:24:13 search.py: 71] save checkpoint epoch_2.pt to disk...
INFO 2021-06-10 15:25:04 search.py: 99] rank: 0, acc = 0.820333
0
INFO 2021-06-10 15:26:51 search.py: 62] rank 0, epoch 3, iter 39, lr 0.010000, loss 8.150521
INFO 2021-06-10 15:34:53 search.py: 62] rank 0, epoch 3, iter 239, lr 0.010000, loss 7.723393
INFO 2021-06-10 15:39:28 search.py: 62] rank 0, epoch 3, iter 439, lr 0.010000, loss 7.831100
INFO 2021-06-10 15:43:23 search.py: 62] rank 0, epoch 3, iter 639, lr 0.010000, loss 7.431253
INFO 2021-06-10 15:49:47 search.py: 62] rank 0, epoch 3, iter 839, lr 0.010000, loss 7.331719
INFO 2021-06-10 15:58:51 search.py: 62] rank 0, epoch 3, iter 1039, lr 0.010000, loss 7.512514
INFO 2021-06-10 16:04:03 search.py: 71] save checkpoint epoch_3.pt to disk...
INFO 2021-06-10 16:04:41 search.py: 99] rank: 0, acc = 0.831833
0
INFO 2021-06-10 16:06:11 search.py: 62] rank 0, epoch 4, iter 52, lr 0.010000, loss 8.300632
INFO 2021-06-10 16:11:06 search.py: 62] rank 0, epoch 4, iter 252, lr 0.010000, loss 8.055165
INFO 2021-06-10 16:11:06 search.py: 67] save checkpoint epoch_4_batch_252.pt to disk...
INFO 2021-06-10 16:17:33 search.py: 62] rank 0, epoch 4, iter 452, lr 0.010000, loss 7.857666
INFO 2021-06-10 16:25:31 search.py: 62] rank 0, epoch 4, iter 652, lr 0.010000, loss 7.554319
INFO 2021-06-10 16:31:39 search.py: 62] rank 0, epoch 4, iter 852, lr 0.010000, loss 7.578086
INFO 2021-06-10 16:40:21 search.py: 62] rank 0, epoch 4, iter 1052, lr 0.010000, loss 8.165893
INFO 2021-06-10 16:45:46 search.py: 71] save checkpoint epoch_4.pt to disk...
INFO 2021-06-10 16:46:23 search.py: 99] rank: 0, acc = 0.829167
1
INFO 2021-06-10 16:48:15 search.py: 62] rank 0, epoch 5, iter 65, lr 0.010000, loss 5.935589
INFO 2021-06-10 16:52:23 search.py: 62] rank 0, epoch 5, iter 265, lr 0.010000, loss 5.810400
INFO 2021-06-10 16:57:34 search.py: 62] rank 0, epoch 5, iter 465, lr 0.010000, loss 5.632618
INFO 2021-06-10 17:05:35 search.py: 62] rank 0, epoch 5, iter 665, lr 0.010000, loss 5.919451
INFO 2021-06-10 17:14:10 search.py: 62] rank 0, epoch 5, iter 865, lr 0.010000, loss 6.124039
INFO 2021-06-10 17:18:46 search.py: 62] rank 0, epoch 5, iter 1065, lr 0.010000, loss 5.422663
INFO 2021-06-10 17:21:34 search.py: 71] save checkpoint epoch_5.pt to disk...
INFO 2021-06-10 17:22:16 search.py: 99] rank: 0, acc = 0.835833
0
INFO 2021-06-10 17:25:49 search.py: 62] rank 0, epoch 6, iter 78, lr 0.010000, loss 5.789560
INFO 2021-06-10 17:34:43 search.py: 62] rank 0, epoch 6, iter 278, lr 0.010000, loss 5.733936
INFO 2021-06-10 17:43:00 search.py: 62] rank 0, epoch 6, iter 478, lr 0.010000, loss 5.452862
INFO 2021-06-10 17:46:26 search.py: 62] rank 0, epoch 6, iter 678, lr 0.010000, loss 5.205093
INFO 2021-06-10 17:51:05 search.py: 62] rank 0, epoch 6, iter 878, lr 0.010000, loss 5.725134
INFO 2021-06-10 18:00:45 search.py: 62] rank 0, epoch 6, iter 1078, lr 0.010000, loss 5.539200
INFO 2021-06-10 18:05:30 search.py: 71] save checkpoint epoch_6.pt to disk...
INFO 2021-06-10 18:06:16 search.py: 99] rank: 0, acc = 0.836667
0
INFO 2021-06-10 18:08:53 search.py: 62] rank 0, epoch 7, iter 91, lr 0.001000, loss 6.274435
INFO 2021-06-10 18:15:05 search.py: 62] rank 0, epoch 7, iter 291, lr 0.001000, loss 6.578427
INFO 2021-06-10 18:21:55 search.py: 62] rank 0, epoch 7, iter 491, lr 0.001000, loss 6.701453
INFO 2021-06-10 18:30:02 search.py: 62] rank 0, epoch 7, iter 691, lr 0.001000, loss 6.760352
INFO 2021-06-10 18:37:07 search.py: 62] rank 0, epoch 7, iter 891, lr 0.001000, loss 6.446653
INFO 2021-06-10 18:43:09 search.py: 62] rank 0, epoch 7, iter 1091, lr 0.001000, loss 6.263428
INFO 2021-06-10 18:46:14 search.py: 71] save checkpoint epoch_7.pt to disk...
INFO 2021-06-10 18:46:58 search.py: 99] rank: 0, acc = 0.834000
2
INFO 2021-06-10 18:51:11 search.py: 62] rank 0, epoch 8, iter 104, lr 0.001000, loss 5.722060
INFO 2021-06-10 18:58:19 search.py: 62] rank 0, epoch 8, iter 304, lr 0.001000, loss 5.849984
INFO 2021-06-10 19:05:32 search.py: 62] rank 0, epoch 8, iter 504, lr 0.001000, loss 6.244881
INFO 2021-06-10 19:05:32 search.py: 67] save checkpoint epoch_8_batch_504.pt to disk...
INFO 2021-06-10 19:14:48 search.py: 62] rank 0, epoch 8, iter 704, lr 0.001000, loss 6.063056
INFO 2021-06-10 19:23:23 search.py: 62] rank 0, epoch 8, iter 904, lr 0.001000, loss 5.650560
INFO 2021-06-10 19:28:30 search.py: 62] rank 0, epoch 8, iter 1104, lr 0.001000, loss 6.255665
INFO 2021-06-10 19:31:06 search.py: 71] save checkpoint epoch_8.pt to disk...
INFO 2021-06-10 19:31:48 search.py: 99] rank: 0, acc = 0.831333
1
INFO 2021-06-10 19:35:42 search.py: 62] rank 0, epoch 9, iter 117, lr 0.000100, loss 6.794458
INFO 2021-06-10 19:41:53 search.py: 62] rank 0, epoch 9, iter 317, lr 0.000100, loss 6.788669
INFO 2021-06-10 19:48:44 search.py: 62] rank 0, epoch 9, iter 517, lr 0.000100, loss 6.870445
INFO 2021-06-10 19:56:39 search.py: 62] rank 0, epoch 9, iter 717, lr 0.000100, loss 6.480278
INFO 2021-06-10 20:06:33 search.py: 62] rank 0, epoch 9, iter 917, lr 0.000100, loss 6.669956
INFO 2021-06-10 20:16:38 search.py: 62] rank 0, epoch 9, iter 1117, lr 0.000100, loss 6.382141
INFO 2021-06-10 20:19:01 search.py: 71] save checkpoint epoch_9.pt to disk...
INFO 2021-06-10 20:19:41 search.py: 99] rank: 0, acc = 0.833500
2
INFO 2021-06-10 20:23:20 search.py: 62] rank 0, epoch 10, iter 130, lr 0.000100, loss 6.010046
INFO 2021-06-10 20:28:25 search.py: 62] rank 0, epoch 10, iter 330, lr 0.000100, loss 5.925373
INFO 2021-06-10 20:33:31 search.py: 62] rank 0, epoch 10, iter 530, lr 0.000100, loss 5.933987
INFO 2021-06-10 20:39:41 search.py: 62] rank 0, epoch 10, iter 730, lr 0.000100, loss 6.227385
INFO 2021-06-10 20:48:51 search.py: 62] rank 0, epoch 10, iter 930, lr 0.000100, loss 6.199211
INFO 2021-06-10 20:59:23 search.py: 62] rank 0, epoch 10, iter 1130, lr 0.000100, loss 5.879917
INFO 2021-06-10 21:02:02 search.py: 71] save checkpoint epoch_10.pt to disk...
INFO 2021-06-10 21:03:03 search.py: 99] rank: 0, acc = 0.833833
2
INFO 2021-06-10 21:08:38 search.py: 62] rank 0, epoch 11, iter 143, lr 0.000100, loss 5.863217
INFO 2021-06-10 21:14:26 search.py: 62] rank 0, epoch 11, iter 343, lr 0.000100, loss 5.919258
INFO 2021-06-10 21:21:23 search.py: 62] rank 0, epoch 11, iter 543, lr 0.000100, loss 5.570979
INFO 2021-06-10 21:25:08 search.py: 62] rank 0, epoch 11, iter 743, lr 0.000100, loss 5.663806
INFO 2021-06-10 21:29:22 search.py: 62] rank 0, epoch 11, iter 943, lr 0.000100, loss 6.057225
INFO 2021-06-10 21:38:51 search.py: 62] rank 0, epoch 11, iter 1143, lr 0.000100, loss 5.705087
INFO 2021-06-10 21:41:03 search.py: 71] save checkpoint epoch_11.pt to disk...
INFO 2021-06-10 21:41:59 search.py: 99] rank: 0, acc = 0.834833
0
INFO 2021-06-10 21:49:29 search.py: 62] rank 0, epoch 12, iter 156, lr 0.000100, loss 4.883880
INFO 2021-06-10 21:58:17 search.py: 62] rank 0, epoch 12, iter 356, lr 0.000100, loss 4.925846
INFO 2021-06-10 22:04:32 search.py: 62] rank 0, epoch 12, iter 556, lr 0.000100, loss 4.751471
INFO 2021-06-10 22:10:20 search.py: 62] rank 0, epoch 12, iter 756, lr 0.000100, loss 5.083681
INFO 2021-06-10 22:10:20 search.py: 67] save checkpoint epoch_12_batch_756.pt to disk...
INFO 2021-06-10 22:15:51 search.py: 62] rank 0, epoch 12, iter 956, lr 0.000100, loss 5.311618
INFO 2021-06-10 22:23:25 search.py: 62] rank 0, epoch 12, iter 1156, lr 0.000100, loss 4.515859
INFO 2021-06-10 22:24:22 search.py: 71] save checkpoint epoch_12.pt to disk...
INFO 2021-06-10 22:25:06 search.py: 99] rank: 0, acc = 0.833833
0
INFO 2021-06-10 22:30:47 search.py: 62] rank 0, epoch 13, iter 169, lr 0.000100, loss 5.749384
INFO 2021-06-10 22:39:34 search.py: 62] rank 0, epoch 13, iter 369, lr 0.000100, loss 5.599656
INFO 2021-06-10 22:48:16 search.py: 62] rank 0, epoch 13, iter 569, lr 0.000100, loss 5.967265
INFO 2021-06-10 22:53:42 search.py: 62] rank 0, epoch 13, iter 769, lr 0.000100, loss 6.018396
INFO 2021-06-10 23:00:59 search.py: 62] rank 0, epoch 13, iter 969, lr 0.000100, loss 5.945544
INFO 2021-06-10 23:08:21 search.py: 62] rank 0, epoch 13, iter 1169, lr 0.000100, loss 5.319615
INFO 2021-06-10 23:08:54 search.py: 71] save checkpoint epoch_13.pt to disk...
INFO 2021-06-10 23:09:48 search.py: 99] rank: 0, acc = 0.835333
2
INFO 2021-06-10 23:18:22 search.py: 62] rank 0, epoch 14, iter 182, lr 0.000100, loss 4.099031
INFO 2021-06-10 23:26:16 search.py: 62] rank 0, epoch 14, iter 382, lr 0.000100, loss 4.233051
INFO 2021-06-10 23:34:00 search.py: 62] rank 0, epoch 14, iter 582, lr 0.000100, loss 4.159917
INFO 2021-06-10 23:39:59 search.py: 62] rank 0, epoch 14, iter 782, lr 0.000100, loss 3.952560
INFO 2021-06-10 23:44:57 search.py: 62] rank 0, epoch 14, iter 982, lr 0.000100, loss 3.596472
INFO 2021-06-10 23:53:22 search.py: 62] rank 0, epoch 14, iter 1182, lr 0.000100, loss 4.139269
INFO 2021-06-10 23:53:27 search.py: 71] save checkpoint epoch_14.pt to disk...
INFO 2021-06-10 23:54:12 search.py: 99] rank: 0, acc = 0.836167
0
INFO 2021-06-11 00:02:17 search.py: 62] rank 0, epoch 15, iter 195, lr 0.000100, loss 4.626473
INFO 2021-06-11 00:12:10 search.py: 62] rank 0, epoch 15, iter 395, lr 0.000100, loss 4.800592
INFO 2021-06-11 00:21:40 search.py: 62] rank 0, epoch 15, iter 595, lr 0.000100, loss 3.892074
INFO 2021-06-11 00:26:55 search.py: 62] rank 0, epoch 15, iter 795, lr 0.000100, loss 4.925725
INFO 2021-06-11 00:30:21 search.py: 62] rank 0, epoch 15, iter 995, lr 0.000100, loss 4.754475
INFO 2021-06-11 00:38:36 search.py: 71] save checkpoint epoch_15.pt to disk...
INFO 2021-06-11 00:39:20 search.py: 99] rank: 0, acc = 0.834833
2
INFO 2021-06-11 00:39:42 search.py: 62] rank 0, epoch 16, iter 8, lr 0.000100, loss 4.961492
INFO 2021-06-11 00:46:00 search.py: 62] rank 0, epoch 16, iter 208, lr 0.000100, loss 4.810739
INFO 2021-06-11 00:55:38 search.py: 62] rank 0, epoch 16, iter 408, lr 0.000100, loss 5.026876
INFO 2021-06-11 01:05:38 search.py: 62] rank 0, epoch 16, iter 608, lr 0.000100, loss 5.032778
INFO 2021-06-11 01:15:05 search.py: 62] rank 0, epoch 16, iter 808, lr 0.000100, loss 5.491823
INFO 2021-06-11 01:18:10 search.py: 62] rank 0, epoch 16, iter 1008, lr 0.000100, loss 5.198100
INFO 2021-06-11 01:18:10 search.py: 67] save checkpoint epoch_16_batch_1008.pt to disk...
INFO 2021-06-11 01:21:19 search.py: 71] save checkpoint epoch_16.pt to disk...
INFO 2021-06-11 01:22:06 search.py: 99] rank: 0, acc = 0.834000
2
INFO 2021-06-11 01:22:58 search.py: 62] rank 0, epoch 17, iter 21, lr 0.000100, loss 6.530173
INFO 2021-06-11 01:30:31 search.py: 62] rank 0, epoch 17, iter 221, lr 0.000100, loss 6.425526
INFO 2021-06-11 01:38:17 search.py: 62] rank 0, epoch 17, iter 421, lr 0.000100, loss 6.276905
INFO 2021-06-11 01:48:48 search.py: 62] rank 0, epoch 17, iter 621, lr 0.000100, loss 6.188421
INFO 2021-06-11 01:59:58 search.py: 62] rank 0, epoch 17, iter 821, lr 0.000100, loss 6.297678

As the log shows, the model's accuracy has barely improved: the loss falls only in the early epochs and then just fluctuates. I think something is wrong. Note that I changed the distributed-training part of the code. Could you give me some advice on how to get to the root of the problem? Thank you!
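
One way to make this observation concrete is to reduce the log to one mean loss and one accuracy value per epoch. The following is a minimal sketch that assumes the line format shown in the log above. The helper name `summarize` and the two regular expressions are illustrative and not part of search.py.

```python
import re
from collections import defaultdict

# Patterns matching the loss and accuracy lines as they appear in the log above.
LOSS_RE = re.compile(r"epoch (\d+), iter \d+, lr [\d.]+, loss ([\d.]+)")
ACC_RE = re.compile(r"acc = ([\d.]+)")


def summarize(log_lines):
    """Return ({epoch: mean logged loss}, {epoch: accuracy})."""
    losses = defaultdict(list)
    accs = {}
    current_epoch = None
    for line in log_lines:
        m = LOSS_RE.search(line)
        if m:
            current_epoch = int(m.group(1))
            losses[current_epoch].append(float(m.group(2)))
            continue
        m = ACC_RE.search(line)
        if m and current_epoch is not None:
            # Attribute an accuracy line to the epoch of the last loss line seen.
            accs[current_epoch] = float(m.group(1))
    mean_loss = {epoch: sum(v) / len(v) for epoch, v in losses.items()}
    return mean_loss, accs


sample = [
    "INFO 2021-06-10 15:17:30 search.py: 62] rank 0, epoch 2, iter 1026, lr 0.100000, loss 8.481071",
    "INFO 2021-06-10 15:24:13 search.py: 71] save checkpoint epoch_2.pt to disk...",
    "INFO 2021-06-10 15:25:04 search.py: 99] rank: 0, acc = 0.820333",
    "INFO 2021-06-10 15:26:51 search.py: 62] rank 0, epoch 3, iter 39, lr 0.010000, loss 8.150521",
]
mean_loss, accs = summarize(sample)
```

In the full log, the reported accuracy stays between 0.820333 and 0.836667 from epoch 2 through epoch 16, while the logged loss moves between roughly 3.6 and 9.5 over the same epochs.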

@mzmzdcr Seems that the network falls into a trivial solution.

mzmzdcr commented

@mzmzdcr Seems that the network falls into a trivial solution.

That may be the case, but there is something strange: the loss decreases significantly early in training, while accuracy barely improves. I suspect the following cause: at the end of each epoch the network parameters are not actually broadcast, so the network starts each epoch from a random initialization and only the loss function changes.
I will check my code.
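
One possible way to test this hypothesis is to log a short fingerprint of the weights on every rank at the end of one epoch and again at the start of the next. The sketch below is an assumption-laden illustration, not code from the repository: `fingerprint` is a hypothetical helper, and the parameters are passed as plain lists of floats. For a PyTorch model they could be obtained with something like `{k: v.flatten().tolist() for k, v in model.state_dict().items()}`, which is slow for large networks but sufficient for a one-off check.

```python
import hashlib
import struct
from typing import Mapping, Sequence


def fingerprint(params: Mapping[str, Sequence[float]]) -> str:
    """Hash parameter names and values into a short hex digest."""
    h = hashlib.sha256()
    for name in sorted(params):  # fixed order so the digest is reproducible
        values = params[name]
        h.update(name.encode("utf-8"))
        h.update(struct.pack(f"<{len(values)}d", *values))
    return h.hexdigest()[:16]


# Toy stand-ins for weights at the end of one epoch and the start of the next.
end_of_epoch = {"fc.weight": [0.1, -0.2, 0.3], "fc.bias": [0.0]}
start_of_next = {"fc.weight": [0.1, -0.2, 0.3], "fc.bias": [0.0]}
reinitialized = {"fc.weight": [0.5, 0.4, -0.1], "fc.bias": [0.0]}

same = fingerprint(end_of_epoch) == fingerprint(start_of_next)
reset = fingerprint(end_of_epoch) != fingerprint(reinitialized)
```

If the fingerprint at the start of an epoch differs from the one logged at the end of the previous epoch, or differs between ranks, the weights are not being carried over or synchronized as intended.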