Error during Evaluation

Question

tansangxtt opened this issue a year ago

Hi, I managed to run both phases of training without any problems, but evaluation does not work. Please check the following log. Thank you.

(DiffusionRet) hai@user:~/sang$ CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 2502 --nnodes=1 --nproc_per_node=1 eval.py --workers 8 --batch_size_val 128 --anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/MSRVTT_Videos --datatype msrvtt --max_words 32 --max_frames 12 --video_framerate 1 --diffusion_steps 50 --noise_schedule cosine --init_model best.pth --output_dir output_eval
/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated                                                                                       
and will be removed in future. Use torchrun.                                                                                                                
Note that --use-env is set by default in torchrun.                                                                                                          
If your script expects `--local-rank` argument to be set, please                                                                                            
change it to read from `os.environ['LOCAL_RANK']` instead. See                                                                                              
https://pytorch.org/docs/stable/distributed.html#launch-utility for                                                                                         
further instructions                                                          
                                                                              
  warnings.warn(                                                              
[2023-10-03 11:06:03,359 tvr 110 INFO]: local_rank: 0 world_size: 1
[2023-10-03 11:06:03,359 tvr 117 INFO]: Effective parameters:                                                                                               
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< agg_module: seqTransf
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< anno_path: data/MSR-VTT/anns
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< base_encoder: ViT-B/32
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< batch_size: 128                            
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< batch_size_val: 128                        
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< d_temp: 100            
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< datatype: msrvtt
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< device: cuda:0   
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< diffusion_steps: 50
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< distributed: 0    
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< epochs: 5                                                                                                                           
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< init_model: best.pth        
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< interaction: wti                                                                                              
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< local_rank: 0                
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< max_frames: 12                  
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< max_words: 32                   
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< neg: 0                                                                                                                              
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< noise_schedule: cosine
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< num: 127     
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< num_hidden_layers: 4                       
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< output_dir: output_eval         
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< seed: 42                                   
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< sigma_small: True           
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< t2v_alpha: 1                                                                                                                        
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< t2v_num: 32                  
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< t2v_temp: 1                                                                                                   
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< temp: 1                   
[2023-10-03 11:06:03,359 tvr 119 INFO]:   <<< v2t_alpha: 1                               
[2023-10-03 11:06:03,360 tvr 119 INFO]:   <<< v2t_num: 32                                
[2023-10-03 11:06:03,360 tvr 119 INFO]:   <<< v2t_temp: 1                                                                                                                         
[2023-10-03 11:06:03,360 tvr 119 INFO]:   <<< video_framerate: 1                         
[2023-10-03 11:06:03,360 tvr 119 INFO]:   <<< video_path: data/MSR-VTT/MSRVTT_Videos                                                                                                                                                                      
[2023-10-03 11:06:03,360 tvr 119 INFO]:   <<< workers: 8                                 
[2023-10-03 11:06:03,360 tvr 119 INFO]:   <<< world_size: 1                              
[val] Unique sentence is 995 , all num is 1000                                           
Video number: 1000                                                                       
Total Pairs: 1000                                                                                                                                                                 
[2023-10-03 11:06:10,770 tvr 159 INFO]: ***** Running test *****                         
[2023-10-03 11:06:10,770 tvr 160 INFO]:   Num examples = 1000                                                                
[2023-10-03 11:06:10,770 tvr 161 INFO]:   Batch size = 128                                                                                                                        
[2023-10-03 11:06:10,770 tvr 162 INFO]:   Num steps = 8                                  
[2023-10-03 11:06:10,770 tvr 163 INFO]: ***** Running val *****                          
[2023-10-03 11:06:10,770 tvr 164 INFO]:   Num examples = 1000                                                                                                                     
[2023-10-03 11:06:10,773 tvr 375 INFO]: [start] extract text+video feature               
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:10<00:00,  8.86s/it]                      
[2023-10-03 11:07:21,813 tvr 403 INFO]: [finish] extract text+video feature                                                                                                       
[2023-10-03 11:07:21,813 tvr 407 INFO]: 1000 1000 1000 1000                                                                                                                       
[2023-10-03 11:07:21,813 tvr 411 INFO]: [start] calculate the similarity                 
[2023-10-03 11:07:21,813 tvr 205 INFO]: [finish] map to main gpu                                                                                                                  
[2023-10-03 11:07:21,814 tvr 214 INFO]: [finish] map to main gpu                                                             
[2023-10-03 11:07:22,397 tvr 227 INFO]: diffusion                                                                                                                                 
Traceback (most recent call last):                                                       
  File "/home/hai/sang/eval.py", line 493, in <module>                                                                                                                            
    main()                                  
  File "/home/hai/sang/eval.py", line 490, in main                                                                                                                                
    eval_epoch(args, model, test_dataloader, args.device, diffusion)                                                         
  File "/home/hai/sang/eval.py", line 413, in eval_epoch                                                                                                                          
    new_t2v_matrix, new_v2t_matrix = _run_on_single_gpu(args, model, batch_mask_t,                                                                                                                                                                        
  File "/home/hai/sang/eval.py", line 255, in _run_on_single_gpu                                                             
    model.diffusion_model,                                    
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__                                                                                                                          
    raise AttributeError("'{}' object has no attribute '{}'".format(                                                         
AttributeError: 'DiffusionRet' object has no attribute 'diffusion_model'                                                     
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 259728) of binary: /home/hai/anaconda3/envs/DiffusionRet/bin/python                                                                                          
Traceback (most recent call last):                            
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/runpy.py", line 197, in _run_module_as_main                                                                                                                                                   
    return _run_code(code, main_globals, None,                                                                               
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/runpy.py", line 87, in _run_code                                                                                                                                                              
    exec(code, run_globals)                                   
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>                                                                                                                             
    main()                                                    
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main                                                                                                                                 
    launch(args)                                              
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch                                                                                                                               
    run(args)                                                 
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run                                                                                                                                     
    elastic_launch(                                           
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__                                                                                                                       
    return launch_agent(self._config, self._entrypoint, list(args))                                                          
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent                                                                                                                   
    raise ChildFailedError(                                   
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
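
The AttributeError in the traceback above means that eval.py asks the model object for an attribute named diffusion_model that the loaded DiffusionRet class does not define. One way to narrow down this kind of mismatch is to list the attribute names the object actually exposes. The following is only a minimal sketch: ToyModel and its attribute name are hypothetical stand-ins, since the real attribute name in the repository is not shown in this thread, and both helper functions are invented for illustration.

```python
def attribute_candidates(obj, keyword):
    """Return the sorted attribute names of `obj` that contain `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(name for name in dir(obj) if keyword in name.lower())


def first_existing_attr(obj, names):
    """Return the first attribute listed in `names` that exists on `obj`."""
    for name in names:
        if hasattr(obj, name):
            return getattr(obj, name)
    raise AttributeError(
        f"{type(obj).__name__!r} object has none of the attributes {list(names)}"
    )


class ToyModel:
    """Hypothetical stand-in for the real model; the attribute name is an assumption."""

    def __init__(self):
        self.diffusion = "denoiser"


model = ToyModel()
# List every attribute whose name mentions "diffusion".
print(attribute_candidates(model, "diffusion"))
# Try the name eval.py expects first, then the name the listing revealed.
print(first_existing_attr(model, ["diffusion_model", "diffusion"]))
```

Because dir() also reports registered submodules on a torch.nn.Module, the same listing should work on the real model object. It only helps with diagnosis; the actual fix is the updated eval.py.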

Peng Jin · Answer 1 · Sat Oct 07 2023 12:39:21 GMT+0800 (China Standard Time)

Sorry for the late reply. I have fixed this bug; please update the eval.py file and run it again.
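
When rerunning, note that the FutureWarning in the log recommends torchrun over torch.distributed.launch and says to read the local rank from os.environ['LOCAL_RANK'] rather than from a `--local-rank` argument. Below is a minimal sketch of a helper that accepts either source. The function name resolve_local_rank is hypothetical, and how eval.py actually parses its arguments is an assumption, since that code is not shown here.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from the command line if given, else from LOCAL_RANK, else 0."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Accept both spellings that different launcher versions may pass.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    # Ignore all other script arguments; only the rank matters here.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    # torchrun exports LOCAL_RANK instead of passing an argument.
    return int(environ.get("LOCAL_RANK", 0))


# Launched torchrun-style: no rank argument, rank comes from the environment.
print(resolve_local_rank(["--workers", "8"], {"LOCAL_RANK": "0"}))
# Launched with an explicit argument: the argument takes precedence.
print(resolve_local_rank(["--local_rank", "2"], {}))
```

Giving the explicit argument precedence keeps the old launch command working, while the environment fallback covers torchrun.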