ZENGXH / DMM_Net

When I run python train.py, the following error occurs:

2019-10-16 09:48:22,325-{train.py:384}-INFO-[model_name] ytb_r50_w11
2019-10-16 09:48:22,325-{train.py:385}-INFO-get number of gpu: 1
2019-10-16 09:48:23,989-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-16 09:48:23,996-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2019-10-16 09:48:24,041-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-16 09:48:24,047-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2019-10-16 09:48:24,050-{train.py:152}-INFO-{'sort_max_num': 50, 'matching_score_thre': 0.0, 'score_weight': 0.3, 'relax': 1, 'relax_max_iter': 10, 'relax_proj_iter': 5, 'relax_topk': 0, 'relax_learning_rate': 0.1, 'matching': {'match_max_score': 1, 'algo': 'relax', 'cost': 'cosine'}, 'encoder': {'nms_thresh': 0.4}}
2019-10-16 09:48:26,367-{trainer.py:63}-INFO-load from json_data; num vid 3000
2019-10-16 09:48:26,367-{train.py:154}-INFO-init model 4.042
2019-10-16 09:48:26,369-{train.py:161}-INFO-optimizer 0.001
2019-10-16 09:48:26,369-{train.py:163}-INFO-[enc_opt] len: 2; len for each param group: [48, 161]
2019-10-16 09:48:26,369-{train.py:165}-INFO-[dec_opt] len: 1; len for each param group: [10]
2019-10-16 09:48:26,371-{train.py:213}-INFO-save args in experiments/ytb_r50_w11/10-16-09-48args.pkl
2019-10-16 09:48:26,371-{train.py:214}-INFO-Namespace(augment=False, base_model='resnet50', batch_size=4, best_val_loss=0, cache_data=1, config_train='dmm/configs/train.yaml', dataset='youtube', davis_eval_folder='', device=device(type='cuda', index=0), distributed=0, distributed_manully=0, distributed_manully_Nrep=0, distributed_manully_rank=0, dropout=0.0, epoch_resume=0, eval_flag='pred', eval_split='trainval', finetune_after=0, gpu_id=0, gt_maxseqlen=5, hidden_size=128, imsize=480, iou_weight=1.0, kernel_size=3, length_clip=3, load_proposals=1, load_proposals_dataset=1, local_rank=0, log_file='train.log', log_term=False, loss_weight_iouraw=18.0, loss_weight_match=1.0, lr=0.001, lr_cnn=0.0001, lr_decoder=0.001, mask_th=0.5, max_dets=100, max_epoch=100, max_eval_iter=800, maxseqlen=5, min_delta=0.0, min_size=0.001, model_dir='experiments/ytb_r50_w11', model_name='ytb_r50_w11', models_root='experiments/', momentum=0.9, my_augment=False, ngpus=1, num_classes=21, num_workers=4, only_spatial=False, only_temporal=False, optim='adam', optim_cnn='adam', overwrite_loadargs=1, pad_video=0, patience=15, patience_stop=60, pred_offline_meta='data/ytb_vos/splits_813_3k_trainvaltest/meta_vid_frame_2_predid.json', pred_offline_path=['experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth'], pred_offline_path_eval=None, prev_mask_d=1, print_every=2, random_select_frames=0, resize=False, resume=False, resume_path='epoxx_iterxxxx', rotation=10, sample_inference_mask=0, save_every=3000, seed=123, shear=0.1, single_object=False, skip_empty_starting_frame=0, skip_mode='concat', test=0, test_image_h=256, test_image_w=448, test_model_path='', threshold_mask=0.4, train_h=255, train_split='train', train_w=448, translation=0.1, update_encoder=1, use_gpu=True, use_refmask=0, weight_decay=1e-06, weight_decay_cnn=1e-06, year='2017', youtube_dir='../../databases/YouTubeVOS/', zoom=0.7)
2019-10-16 09:48:26,372-{train.py:223}-INFO-init_dataloaders
2019-10-16 09:48:26,412-{dataset.py:119}-INFO-[train] loading offline from experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth; Nf ['experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth']
2019-10-16 09:48:27,298-{dataset.py:125}-INFO-+new_parts 200: 0.8864507675170898
2019-10-16 09:48:27,303-{dataset.py:133}-INFO-load offline use 0.89 | len 200
2019-10-16 09:48:27,320-{youtubeVOS.py:84}-INFO-[dataset] phase read train; len of db seq 3000
2019-10-16 09:48:27,320-{youtubeVOS.py:103}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.
2019-10-16 09:48:27,321-{youtubeVOS.py:115}-INFO-no cache data found at data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl; it will take a while to cache the data
2019-10-16 10:17:41,177-{youtubeVOS.py:121}-INFO-try to dump in data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl
2019-10-16 10:18:03,726-{youtubeVOS.py:125}-INFO-load lmdb 1776.42
Traceback (most recent call last):
  File "/home/zhanglin/Research/codes/2020/DMM_Net/train.py", line 403, in <module>
    trainIters(args)
  File "/home/zhanglin/Research/codes/2020/DMM_Net/train.py", line 225, in trainIters
    loaders = init_dataloaders(args)
  File "/home/zhanglin/Research/codes/2020/DMM_Net/train.py", line 86, in init_dataloaders
    use_prev_mask = False)
  File "/home/zhanglin/Research/codes/2020/DMM_Net/dmm/dataloader/dataset_utils.py", line 17, in get_dataset
    use_prev_mask = use_prev_mask)
  File "/home/zhanglin/Research/codes/2020/DMM_Net/dmm/dataloader/youtubeVOS.py", line 157, in __init__
    images_valid = [fname for img, fname in zip(images, seq.files) if self.countobj[dbname][img] > 0 ]
  File "/home/zhanglin/Research/codes/2020/DMM_Net/dmm/dataloader/youtubeVOS.py", line 157, in <listcomp>
    images_valid = [fname for img, fname in zip(images, seq.files) if self.countobj[dbname][img] > 0 ]
KeyError: '003234408d'

Process finished with exit code 1

I don't know why. Looking forward to your reply.

It looks like the training dataloader is loading the wrong proposal file: the file it loaded (experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth) is for eval rather than training.

Could you check the flag pred_offline_path? It should point to something like experiments/proposals/coco81/inference/youtubevos_train3k_meta/asdict_50/pred_DICT.pth.

You may also want to check the scripts/train/*sh file.
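
A quick way to confirm this kind of mismatch is to compare the video ids the training split expects against the keys of the loaded proposal file. The sketch below is only an illustration: `find_missing_videos` is a hypothetical helper, not part of the repo, and it assumes the proposals load as a dict keyed by video id, which the KeyError suggests but which is not verified here. All ids except the one from the traceback are made-up placeholders.

```python
def find_missing_videos(expected_ids, proposals):
    """Return the expected video ids that have no entry in `proposals`.

    Assumption: `proposals` is a dict keyed by video id (hypothetical layout).
    """
    available = set(proposals)
    return sorted(vid for vid in expected_ids if vid not in available)


# Toy data standing in for the training split and a val-only proposal file.
# "003234408d" comes from the traceback; the other ids are placeholders.
train_ids = ["003234408d", "train_video_b"]
val_proposals = {"val_video_a": []}

missing = find_missing_videos(train_ids, val_proposals)
print("missing {} of {} videos: {}".format(len(missing), len(train_ids), missing[:5]))
```

If nearly every training video is reported missing, the proposal file most likely belongs to a different split.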

Thank you! I solved the problem.
However, I ran into another one:

2019-10-16 11:07:26,158-{train.py:384}-INFO-[model_name] ytb_r50_w11
2019-10-16 11:07:26,159-{train.py:385}-INFO-get number of gpu: 1
2019-10-16 11:07:28,154-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-16 11:07:28,160-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2019-10-16 11:07:28,188-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-16 11:07:28,191-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2019-10-16 11:07:28,193-{train.py:152}-INFO-{'sort_max_num': 50, 'matching_score_thre': 0.0, 'score_weight': 0.3, 'relax': 1, 'relax_max_iter': 10, 'relax_proj_iter': 5, 'relax_topk': 0, 'relax_learning_rate': 0.1, 'matching': {'match_max_score': 1, 'algo': 'relax', 'cost': 'cosine'}, 'encoder': {'nms_thresh': 0.4}}
2019-10-16 11:07:30,232-{trainer.py:63}-INFO-load from json_data; num vid 3000
2019-10-16 11:07:30,232-{train.py:154}-INFO-init model 4.073
2019-10-16 11:07:30,234-{train.py:161}-INFO-optimizer 0.001
2019-10-16 11:07:30,234-{train.py:163}-INFO-[enc_opt] len: 2; len for each param group: [48, 161]
2019-10-16 11:07:30,234-{train.py:165}-INFO-[dec_opt] len: 1; len for each param group: [10]
2019-10-16 11:07:30,235-{train.py:213}-INFO-save args in experiments/ytb_r50_w11/10-16-11-07args.pkl
2019-10-16 11:07:30,236-{train.py:214}-INFO-Namespace(augment=False, base_model='resnet50', batch_size=4, best_val_loss=0, cache_data=1, config_train='dmm/configs/train.yaml', dataset='youtube', davis_eval_folder='', device=device(type='cuda', index=0), distributed=0, distributed_manully=0, distributed_manully_Nrep=0, distributed_manully_rank=0, dropout=0.0, epoch_resume=0, eval_flag='pred', eval_split='trainval', finetune_after=0, gpu_id=0, gt_maxseqlen=5, hidden_size=128, imsize=480, iou_weight=1.0, kernel_size=3, length_clip=3, load_proposals=1, load_proposals_dataset=1, local_rank=0, log_file='train.log', log_term=False, loss_weight_iouraw=18.0, loss_weight_match=1.0, lr=0.001, lr_cnn=0.0001, lr_decoder=0.001, mask_th=0.5, max_dets=100, max_epoch=100, max_eval_iter=800, maxseqlen=5, min_delta=0.0, min_size=0.001, model_dir='experiments/ytb_r50_w11', model_name='ytb_r50_w11', models_root='experiments/', momentum=0.9, my_augment=False, ngpus=1, num_classes=21, num_workers=4, only_spatial=False, only_temporal=False, optim='adam', optim_cnn='adam', overwrite_loadargs=1, pad_video=0, patience=15, patience_stop=60, pred_offline_meta='data/ytb_vos/splits_813_3k_trainvaltest/meta_vid_frame_2_predid.json', pred_offline_path=['experiments/proposals/coco81/inference/youtubevos_train3k_meta/asdict_50/videos/'], pred_offline_path_eval=['experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth'], prev_mask_d=1, print_every=2, random_select_frames=0, resize=False, resume=False, resume_path='epoxx_iterxxxx', rotation=10, sample_inference_mask=0, save_every=3000, seed=123, shear=0.1, single_object=False, skip_empty_starting_frame=0, skip_mode='concat', test=0, test_image_h=256, test_image_w=448, test_model_path='', threshold_mask=0.4, train_h=255, train_split='train', train_w=448, translation=0.1, update_encoder=1, use_gpu=True, use_refmask=0, weight_decay=1e-06, weight_decay_cnn=1e-06, year='2017', youtube_dir='../../databases/YouTubeVOS/', zoom=0.7)
2019-10-16 11:07:30,236-{train.py:223}-INFO-init_dataloaders
2019-10-16 11:07:30,301-{youtubeVOS.py:84}-INFO-[dataset] phase read train; len of db seq 3000
2019-10-16 11:07:30,301-{youtubeVOS.py:103}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.
2019-10-16 11:07:30,302-{youtubeVOS.py:107}-INFO-try to load in data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl
2019-10-16 11:07:31,790-{youtubeVOS.py:125}-INFO-load lmdb 1.51
2019-10-16 11:07:32,086-{youtubeVOS.py:161}-INFO-filtered images out -> 444 for #vid 3000
2019-10-16 11:07:32,333-{youtubeVOS.py:253}-INFO-[init][data][youtube][load clips] load anno 0.54; cliplen 3| annotation clip 26320(skip 0)| videos 3000
2019-10-16 11:07:32,368-{youtubeVOS.py:265}-INFO-load keys 0.04
2019-10-16 11:07:32,369-{train.py:104}-INFO-INPUT shape: 255 448
2019-10-16 11:07:32,445-{dataset.py:119}-INFO-[trainval] loading offline from experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth; Nf ['experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth']
2019-10-16 11:07:37,958-{dataset.py:125}-INFO-+new_parts 200: 5.512281894683838
2019-10-16 11:07:37,965-{dataset.py:133}-INFO-load offline use 5.52 | len 200
2019-10-16 11:07:37,967-{youtubeVOS.py:84}-INFO-[dataset] phase read trainval; len of db seq 200
2019-10-16 11:07:37,968-{youtubeVOS.py:103}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.
2019-10-16 11:07:37,968-{youtubeVOS.py:115}-INFO-no cache data found at data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl; it will take a while to cache the data
2019-10-16 11:08:51,875-{youtubeVOS.py:121}-INFO-try to dump in data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl
2019-10-16 11:08:53,080-{youtubeVOS.py:125}-INFO-load lmdb 75.11
2019-10-16 11:08:53,101-{youtubeVOS.py:161}-INFO-filtered images out -> 0 for #vid 200
2019-10-16 11:08:53,116-{youtubeVOS.py:253}-INFO-[init][data][youtube][load clips] load anno 0.04; cliplen 3| annotation clip 800(skip 0)| videos 200
2019-10-16 11:08:53,120-{youtubeVOS.py:265}-INFO-load keys 0.00
2019-10-16 11:08:53,121-{train.py:104}-INFO-INPUT shape: 255 448
2019-10-16 11:08:53,121-{train.py:228}-INFO-dataloader 82.885
2019-10-16 11:08:53,121-{train.py:234}-INFO-==========> start sample_inference_mask
2019-10-16 11:08:53,122-{train.py:249}-INFO-epoch 0 - trainval;
2019-10-16 11:08:53,122-{train.py:251}-INFO--- loss weight loss_weight_match: 1.0 loss_weight_iouraw 18.0;
Traceback (most recent call last):
  File "/home/zhanglin/Research/codes/2020/DMM_Net/train.py", line 403, in <module>
    trainIters(args)
  File "/home/zhanglin/Research/codes/2020/DMM_Net/train.py", line 276, in trainIters
    loss, losses = trainer(batch_idx, inputs, imgs_names, targets, seq_name, starting_frame, split, args, proposals)
  File "/home/zhanglin/anaconda3/envs/pytorch1.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhanglin/Research/codes/2020/DMM_Net/dmm/modules/trainer.py", line 122, in forward
    prev_thid_list=prev_thid_list, prev_mask=prev_mask, predid_cur_frames=predid_cur_frames, proposal_cur=proposal_cur)
  File "/home/zhanglin/Research/codes/2020/DMM_Net/dmm/modules/trainer.py", line 182, in forward_timestep
    init_pred_inst, tplt_dict, match_loss, mask_last_occur = self.DMM(args, proposals=proposals, backbone_feature=features['backbone_feature'], mask_last_occurence=mask_last_occur, tplt_dict=tplt_dict, tplt_valid_batch=tplt_valid_batch, targets=dmm_target)
  File "/home/zhanglin/anaconda3/envs/pytorch1.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhanglin/Research/codes/2020/DMM_Net/dmm/modules/dmm_model.py", line 123, in forward
    CHECKEQ(prop_m[bid].shape[-2:], mask_last_occurence[bid].shape[-2:])
  File "/home/zhanglin/Research/codes/2020/DMM_Net/dmm/utils/checker.py", line 27, in CHECKEQ
    assert(a == b), 'get {} {}'.format(a, b)
AssertionError: get torch.Size([255, 448]) torch.Size([448, 255])

Process finished with exit code 1

I have checked the args: train_h is 255 and train_w is 448...

It looks like the predicted mask is 448x255. I suspect there may be a problem with the scipy resize in dataset.py. Could you print out the size of the image right after this line (DMM_Net/dmm/dataloader/dataset.py, line 208 in fa4a222)?

img = imresize(img, self.inputRes)

The last two dimensions should be 255x448.

If you get 448x255, could you pull the latest code and try again? I removed the scipy resize and use PIL resize instead, which will hopefully fix the problem.
If you get 255x448, please let me know and I will look into it ;)
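
For the shape check itself, something like the following could be dropped in while debugging. It is a sketch under assumptions, not code from the repository: `check_hw` and `pil_size` are hypothetical helpers. One thing worth keeping in mind is that array and tensor shapes are ordered (height, width), while PIL's `Image.resize` takes its size argument as (width, height), so the two are easy to swap.

```python
def check_hw(shape, train_h, train_w):
    """Check that the last two dims of `shape` are (train_h, train_w)."""
    got = tuple(shape[-2:])
    expected = (train_h, train_w)
    assert got == expected, "expected (H, W) = {}, got {}".format(expected, got)
    return got


def pil_size(train_h, train_w):
    """Convert an (H, W) pair to the (width, height) order PIL's resize takes."""
    return (train_w, train_h)


# Example with the values from the log: train_h=255, train_w=448.
print(check_hw((3, 255, 448), 255, 448))  # shape of a channels-first tensor
print(pil_size(255, 448))                 # size tuple to hand to PIL
```

Calling `check_hw` with a shape ending in (448, 255) raises an AssertionError, which flags the same swap that CHECKEQ reported.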

Using your latest dataset.py, I can now train the model! Thank you!

cool :)