Gengzigang / PCT

This is an official implementation of our CVPR 2023 paper "Human Pose as Compositional Tokens" (https://arxiv.org/pdf/2303.11638.pdf)


Distributed training

cndivecat opened this issue · comments

Can this model only be trained with distributed training? How can I use the model without distributed training?

commented

Well, you have to set the parameter distributed = False. In your train and test files you have these lines:

 """""""init distributed env first, since logger depends on the dist info.
 if args.launcher == 'none':
     distributed = False
else:
     distributed = True
     init_dist(args.launcher, **cfg.dist_params)
     re-set gpu_ids with distributed training mode
     _, world_size = get_dist_info()
     cfg.gpu_ids = range(world_size)""""""""

Replace them with just distributed = False.

Also set workers_per_gpu = 0 in your config file (the data loader worker count), and don't forget to use a single GPU by replacing the training command './tools/dist_train.sh configs/pct_[base/large/huge]_tokenizer.py 8' with './tools/dist_train.sh configs/pct_[base/large/huge]_tokenizer.py 1'.
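For reference, here is a minimal sketch of what the edited block in the train script could look like after forcing non-distributed training (this assumes an mmpose-style tools/train.py with the argument parsing and config loading left unchanged; cfg is the loaded config object):

    # force non-distributed training regardless of the --launcher argument
    distributed = False
    # run on a single GPU
    cfg.gpu_ids = range(1)

and, in the config file, the data loader workers set to zero (the batch size below is only a placeholder):

    # assumption: mmpose-style data settings in the config
    data = dict(
        samples_per_gpu=32,   # keep whatever batch size you already use
        workers_per_gpu=0,    # no worker processes for a simple single-GPU run
    )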

it worked for me


Hello, could I get in touch with you about the PCT project?

commented

@qiushanjun yes if you want

@gmk11 I followed your steps but ran into another error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_clamp_Tensor)

commented

@pydd123 Can you show me the exact line where the error occurs? From what I can see, your model and your images are not on the same device: one is on the GPU and the other on the CPU. Make sure both are either on the GPU (recommended) or on the CPU.
To move a tensor to the GPU, for example, use tensor.to('cuda:0').
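A minimal, self-contained sketch of the general fix (the tiny model and random tensor below are placeholders, not the PCT model or data):

    import torch
    import torch.nn as nn

    # placeholder model and input, standing in for the pose model and an image batch
    model = nn.Linear(10, 2)
    images = torch.randn(4, 10)

    # pick one device and move both the model and the inputs onto it
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)      # parameters now live on `device`
    images = images.to(device)    # inputs on the same device
    outputs = model(images)       # no cross-device mismatch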


Thank you, I have solved the problem. I think it comes from the Swin code.
pct_swin_v2.py line 322:
logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01).cuda())).exp()
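If you would rather avoid hard-coding .cuda() there (so a CPU run still works), a device-agnostic variant of that line could look like the sketch below; it assumes self.logit_scale is the tensor being clamped, as in the line above:

    # build the clamp bound on whatever device self.logit_scale already lives on
    logit_scale = torch.clamp(
        self.logit_scale,
        max=torch.log(torch.tensor(1. / 0.01, device=self.logit_scale.device))
    ).exp()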

commented


Cool, I hope everything works fine now.


For me, it seems that only replacing the above code with distributed = False gives the error RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. After initializing the distributed setting by adding init_dist(args.launcher, **cfg.dist_params) it worked.
But I feel like it's not really non-distributed training 😭
I would appreciate it if you could give me some ideas. Thanks!
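For what it's worth, when the script is launched with a single process (nproc_per_node=1), init_dist just creates a process group of world size 1, so this is effectively single-GPU training inside a distributed wrapper. If you want to drop the launcher entirely but some code path still expects an initialized group, a one-process group can be created manually; the backend and port below are arbitrary choices, not taken from the PCT code:

    import torch.distributed as dist

    # create a one-process group so torch.distributed calls succeed,
    # while training still runs in a single process on one GPU
    if not dist.is_initialized():
        dist.init_process_group(
            backend='nccl',                       # or 'gloo' for a CPU-only run
            init_method='tcp://127.0.0.1:29500',  # any free local port
            rank=0,
            world_size=1)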

commented


Did you modify your dist_train.sh file? Maybe the problem is there.
Here is mine, try it:

CONFIG=$1
GPUS=$2
PORT=${PORT:-29500}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT
\ $(dirname "$0")/train.py $CONFIG ${@:3}
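For completeness, with this script a single-GPU run is then launched as './tools/dist_train.sh configs/pct_base_tokenizer.py 1' (substituting the large or huge config as needed), which starts one process via torch.distributed.launch.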