Gengzigang / PCT

This is an official implementation of our CVPR 2023 paper "Human Pose as Compositional Tokens" (https://arxiv.org/pdf/2303.11638.pdf)


Distributed training

cndivecat opened this issue · comments

Can this model only be trained with distributed training? How can I use the model without distributed training?

commented

Well, you have to set the parameter distributed = False. In your train and test files you have these lines:

 """""""init distributed env first, since logger depends on the dist info.
 if args.launcher == 'none':
     distributed = False
else:
     distributed = True
     init_dist(args.launcher, **cfg.dist_params)
     re-set gpu_ids with distributed training mode
     _, world_size = get_dist_info()
     cfg.gpu_ids = range(world_size)""""""""

Replace them with just distributed = False.

Also set workers_per_gpu = 0 in your config file (the data loader worker count), and don't forget to use a single GPU by replacing the training command './tools/dist_train.sh configs/pct_[base/large/huge]_tokenizer.py 8' with './tools/dist_train.sh configs/pct_[base/large/huge]_tokenizer.py 1'.
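For reference, here is a minimal sketch of what the edited block in the train script could look like after forcing non-distributed training (this assumes an mmpose-style tools/train.py with the argument parsing and config loading left unchanged; cfg is the loaded config object):

    # force non-distributed training regardless of the --launcher argument
    distributed = False
    # run on a single GPU
    cfg.gpu_ids = range(1)

and, in the config file, the data loader workers set to zero (the batch size below is only a placeholder):

    # assumption: mmpose-style data settings in the config
    data = dict(
        samples_per_gpu=32,   # keep whatever batch size you already use
        workers_per_gpu=0,    # no worker processes for a simple single-GPU run
    )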

it worked for me


Hello, could I get in touch with you about the PCT project?

commented

@qiushanjun yes if you want

@gmk11 I followed your steps but ran into another error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_clamp_Tensor)

commented

@pydd123 Can you show me the exact line where the error occurs? From what I can see, your model and your images are not on the same device: one is on the GPU and the other on the CPU. Make sure both are either on the GPU (recommended) or on the CPU.
To move a tensor to the GPU, for example, use tensor.to('cuda:0').
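A minimal, self-contained sketch of the general fix (the tiny model and random tensor below are placeholders, not the PCT model or data):

    import torch
    import torch.nn as nn

    # placeholder model and input, standing in for the pose model and an image batch
    model = nn.Linear(10, 2)
    images = torch.randn(4, 10)

    # pick one device and move both the model and the inputs onto it
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)      # parameters now live on `device`
    images = images.to(device)    # inputs on the same device
    outputs = model(images)       # no cross-device mismatch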


Thank you, I have solved the problem. I think it comes from the Swin code.
pct_swin_v2.py line 322:
logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01).cuda())).exp()
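If you would rather avoid hard-coding .cuda() there (so a CPU run still works), a device-agnostic variant of that line could look like the sketch below; it assumes self.logit_scale is the tensor being clamped, as in the line above:

    # build the clamp bound on whatever device self.logit_scale already lives on
    logit_scale = torch.clamp(
        self.logit_scale,
        max=torch.log(torch.tensor(1. / 0.01, device=self.logit_scale.device))
    ).exp()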

commented


Cool, I hope everything works fine now.


For me, it seems that only replacing the above code with distributed = False gives the error RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. After initializing the distributed setting by adding init_dist(args.launcher, **cfg.dist_params) it worked.
But I feel like it's not really non-distributed training 😭
I would appreciate it if you could give me some ideas. Thanks!
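For what it's worth, when the script is launched with a single process (nproc_per_node=1), init_dist just creates a process group of world size 1, so this is effectively single-GPU training inside a distributed wrapper. If you want to drop the launcher entirely but some code path still expects an initialized group, a one-process group can be created manually; the backend and port below are arbitrary choices, not taken from the PCT code:

    import torch.distributed as dist

    # create a one-process group so torch.distributed calls succeed,
    # while training still runs in a single process on one GPU
    if not dist.is_initialized():
        dist.init_process_group(
            backend='nccl',                       # or 'gloo' for a CPU-only run
            init_method='tcp://127.0.0.1:29500',  # any free local port
            rank=0,
            world_size=1)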

commented


Did you modify your dist_train.sh file? Maybe the problem is there.
Here is mine, try it:

CONFIG=$1
GPUS=$2
PORT=${PORT:-29500}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT
\ $(dirname "$0")/train.py $CONFIG ${@:3}
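For completeness, with this script a single-GPU run is then launched as './tools/dist_train.sh configs/pct_base_tokenizer.py 1' (substituting the large or huge config as needed), which starts one process via torch.distributed.launch.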