I used the distributed CUDA backend, but no GPU is occupied
Zhangang1999 opened this issue
I followed this page (https://mmeval.readthedocs.io/zh_CN/latest/tutorials/dist_evaluation.html).
The only change I made was replacing
accuracy = Accuracy(topk=(1, 3), dist_backend='torch_cpu')
with
accuracy = Accuracy(topk=(1, 3), dist_backend='torch_cuda')
But no GPU is occupied.
My machine has 4 × 1080 GPUs and CUDA 10.2.
Hi @Zhangang1999, thanks for your attention to MMEval!
The dist_backend determines which process synchronization method is used, not the device on which the metrics are computed. ^_^
If you want to compute Accuracy on a CUDA device, you just need to make sure the input tensors are on that device. The code in the tutorial should be changed to:
import torch
import tqdm
from mmeval import Accuracy

# get_eval_dataloader and get_model are the helpers defined in the tutorial.

def eval_fn(rank, process_num):
    # Initialize the distributed environment
    torch.distributed.init_process_group(
        backend='gloo',
        init_method='tcp://127.0.0.1:2345',
        world_size=process_num,
        rank=rank)
    # Bind this process to its own GPU
    torch.cuda.set_device(f'cuda:{rank}')

    eval_dataloader, total_num_samples = get_eval_dataloader(rank, process_num)
    model = get_model().cuda()
    # Instantiate Accuracy and set the distributed communication backend
    accuracy = Accuracy(topk=(1, 3), dist_backend='torch_cuda')

    with torch.no_grad():
        for images, labels in tqdm.tqdm(eval_dataloader, disable=(rank != 0)):
            # Inputs on the GPU mean the metric is computed on the GPU
            predicted_score = model(images.cuda())
            accuracy.add(predictions=predicted_score, labels=labels.cuda())

    # Pass the dataset size so that the duplicate samples padded by
    # DistributedSampler are dropped.
    print(accuracy.compute(size=total_num_samples))
    accuracy.reset()
Feel free to give feedback if you run into any problems~
@ice-tong
OK, it works now. Thanks for your reply.
I still have two questions.
Can I use designated devices? Many of us share this machine.
Can you introduce the pipeline of the DistBackend, or is there a course on it from MMLab?
Anyway, thanks for your answer.
Can I use designated devices?
For the first question, you can specify which GPUs to use via the CUDA_VISIBLE_DEVICES environment variable.
NOTE: Running multiple ranks on one GPU is not allowed since NCCL 2.5.
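As a rough illustration of how CUDA_VISIBLE_DEVICES interacts with the rank-based device selection, here is a minimal sketch. The GPU indices "2,3" and the helper physical_gpu_for_rank are hypothetical and only for illustration. The sketch also assumes the variable holds plain integer indices rather than GPU UUIDs.

```python
import os

# Assumption: physical GPUs 2 and 3 are the ones free on the shared machine.
# Set the variable before CUDA is initialized (safest: before importing torch),
# or export it in the shell that launches the script.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"


def physical_gpu_for_rank(rank: int) -> int:
    """Return the physical GPU index that logical device cuda:{rank} maps to.

    CUDA renumbers the visible devices starting from 0, so inside the process
    cuda:0 is the first entry of CUDA_VISIBLE_DEVICES, cuda:1 the second, etc.
    """
    visible = [int(i) for i in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
    return visible[rank]


for rank in range(2):
    print(f"rank {rank} -> cuda:{rank} -> physical GPU {physical_gpu_for_rank(rank)}")
```

With this setting the process sees only two devices, so process_num should not exceed 2, and a call such as torch.cuda.set_device(f'cuda:{rank}') needs no change.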
Can you introduce the pipeline of the DistBackend?
The BaseDistBackend is a base class that provides the all_gather_object and broadcast_object interfaces used by BaseMetric.compute. Maybe the following code snippet will be helpful ~
mmeval/mmeval/core/base_metric.py
Lines 109 to 154 in de1e4eb
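For readers who do not have the linked file open, the general gather-then-broadcast pattern can be sketched in plain Python. This is a single-process toy simulation, not MMEval's actual implementation: toy_all_gather_object, toy_broadcast_object and simulate_compute are hypothetical stand-ins. The interleaving step assumes samples were split round-robin across ranks, as an unshuffled DistributedSampler does.

```python
import itertools
from typing import Any, Callable, List, Optional


def toy_all_gather_object(per_rank: List[Any]) -> List[Any]:
    """Stand-in for all_gather_object: every rank ends up with all ranks' objects."""
    return list(per_rank)


def toy_broadcast_object(obj_on_src: Any, world_size: int) -> List[Any]:
    """Stand-in for broadcast_object: every rank receives the source rank's object."""
    return [obj_on_src for _ in range(world_size)]


def simulate_compute(per_rank_results: List[List[Any]],
                     metric_fn: Callable[[List[Any]], Any],
                     size: Optional[int] = None) -> List[Any]:
    """Simulate a distributed compute() and return what each rank would get back."""
    world_size = len(per_rank_results)
    # 1. Gather the intermediate results of all ranks.
    gathered = toy_all_gather_object(per_rank_results)
    # 2. Interleave them to restore the original sample order
    #    (rank r held samples r, r + world_size, r + 2 * world_size, ...).
    collected = list(itertools.chain.from_iterable(zip(*gathered)))
    # 3. Drop the samples that were padded to make the shards equal in length.
    if size is not None:
        collected = collected[:size]
    # 4. One rank computes the metric, then the result is broadcast to all ranks.
    result_on_rank0 = metric_fn(collected)
    return toy_broadcast_object(result_on_rank0, world_size)


def accuracy(flags: List[int]) -> float:
    return sum(flags) / len(flags)


# 5 samples with per-sample correctness [1, 0, 1, 1, 0], split over 2 ranks.
# Rank 1 is padded with a repeat of sample 0 so both shards have 3 items.
rank0 = [1, 1, 0]  # samples 0, 2, 4
rank1 = [0, 1, 1]  # samples 1, 3, and the padded repeat of sample 0
print(simulate_compute([rank0, rank1], accuracy, size=5))
```

Passing size=5 is what removes the padded duplicate. Without it, the repeated sample 0 would be counted twice and the accuracy would be computed over 6 items instead of 5.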