Error during Evaluation
tansangxtt opened this issue · comments
Sang Ha commented
Hi, I managed to run the 2 phases of training without any problems, but evaluation does not work. Please check the following log. Thank you.
(DiffusionRet) hai@user:~/sang$ CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 2502 --nnodes=1 --nproc_per_node=1 eval.py --workers 8 --batch_size_val 128 --anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/MSRVTT_Videos --datatype msrvtt --max_words 32 --max_frames 12 --video_framerate 1 --diffusion_steps 50 --noise_schedule cosine --init_model best.pth --output_dir output_eval
/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2023-10-03 11:06:03,359 tvr 110 INFO]: local_rank: 0 world_size: 1
[2023-10-03 11:06:03,359 tvr 117 INFO]: Effective parameters:
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< agg_module: seqTransf
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< anno_path: data/MSR-VTT/anns
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< base_encoder: ViT-B/32
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< batch_size: 128
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< batch_size_val: 128
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< d_temp: 100
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< datatype: msrvtt
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< device: cuda:0
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< diffusion_steps: 50
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< distributed: 0
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< epochs: 5
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< init_model: best.pth
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< interaction: wti
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< local_rank: 0
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< max_frames: 12
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< max_words: 32
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< neg: 0
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< noise_schedule: cosine
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< num: 127
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< num_hidden_layers: 4
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< output_dir: output_eval
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< seed: 42
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< sigma_small: True
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< t2v_alpha: 1
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< t2v_num: 32
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< t2v_temp: 1
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< temp: 1
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< v2t_alpha: 1
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< v2t_num: 32
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< v2t_temp: 1
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< video_framerate: 1
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< video_path: data/MSR-VTT/MSRVTT_Videos
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< workers: 8
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< world_size: 1
[val] Unique sentence is 995 , all num is 1000
Video number: 1000
Total Pairs: 1000
[2023-10-03 11:06:10,770 tvr 159 INFO]: ***** Running test *****
[2023-10-03 11:06:10,770 tvr 160 INFO]: Num examples = 1000
[2023-10-03 11:06:10,770 tvr 161 INFO]: Batch size = 128
[2023-10-03 11:06:10,770 tvr 162 INFO]: Num steps = 8
[2023-10-03 11:06:10,770 tvr 163 INFO]: ***** Running val *****
[2023-10-03 11:06:10,770 tvr 164 INFO]: Num examples = 1000
[2023-10-03 11:06:10,773 tvr 375 INFO]: [start] extract text+video feature
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:10<00:00, 8.86s/it]
[2023-10-03 11:07:21,813 tvr 403 INFO]: [finish] extract text+video feature
[2023-10-03 11:07:21,813 tvr 407 INFO]: 1000 1000 1000 1000
[2023-10-03 11:07:21,813 tvr 411 INFO]: [start] calculate the similarity
[2023-10-03 11:07:21,813 tvr 205 INFO]: [finish] map to main gpu
[2023-10-03 11:07:21,814 tvr 214 INFO]: [finish] map to main gpu
[2023-10-03 11:07:22,397 tvr 227 INFO]: diffusion
Traceback (most recent call last):
  File "/home/hai/sang/eval.py", line 493, in <module>
    main()
  File "/home/hai/sang/eval.py", line 490, in main
    eval_epoch(args, model, test_dataloader, args.device, diffusion)
  File "/home/hai/sang/eval.py", line 413, in eval_epoch
    new_t2v_matrix, new_v2t_matrix = _run_on_single_gpu(args, model, batch_mask_t,
  File "/home/hai/sang/eval.py", line 255, in _run_on_single_gpu
    model.diffusion_model,
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DiffusionRet' object has no attribute 'diffusion_model'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 259728) of binary: /home/hai/anaconda3/envs/DiffusionRet/bin/python
Traceback (most recent call last):
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Peng Jin commented
Sorry for the late reply. I have fixed this bug; please update the eval.py file and run it again.
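For anyone who hits a similar `AttributeError` before pulling the updated eval.py, the sketch below shows one generic way to find out which attribute names an object actually exposes. It is not the fix applied in the repository. `find_similar_attributes` and `StandInModel` are hypothetical names introduced only for this illustration, and the attribute name `diffusion` on the stand-in is made up; the real model's attribute names are not shown in this thread.

```python
def find_similar_attributes(obj, keyword):
    """Return the sorted public attribute names of `obj` that contain `keyword`."""
    keyword = keyword.lower()
    return sorted(
        name
        for name in dir(obj)
        if not name.startswith("_") and keyword in name.lower()
    )


class StandInModel:
    """Hypothetical stand-in for the real model, used only to make this runnable."""

    def __init__(self):
        # Made-up attribute name; NOT taken from the DiffusionRet code.
        self.diffusion = object()


model = StandInModel()

# The name the script asked for is missing on this stand-in object.
print(hasattr(model, "diffusion_model"))

# List the attribute names that look related, to see what the object really has.
print(find_similar_attributes(model, "diffusion"))
```

On a real `torch.nn.Module`, the same helper can be called on the loaded model object, and `model.named_children()` gives the registered submodules directly.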