[BUG] Cannot train object detection on CPU
chjinche opened this issue
Description
DetectionLearner.fit failed on CPU, which is inconsistent with the doc description "a GPU is technically not required". See the highlighted part in the attached picture.
In which platform does it happen?
Linux. CPU
How do we replicate the issue?
Run the following code on CPU, with no CUDA device available.
from utils_cv.common.data import unzip_url
from utils_cv.detection.data import Urls as od_urls
from utils_cv.detection.dataset import DetectionDataset
from utils_cv.detection.model import (DetectionLearner, get_pretrained_fasterrcnn)

def od_detection_dataset():
    """ returns a basic detection dataset. """
    tmp_session = 'data'
    tiny_od_data_path = unzip_url(
        od_urls.fridge_objects_tiny_path,
        fpath=tmp_session,
        dest=tmp_session,
        exist_ok=True,
    )
    return DetectionDataset(tiny_od_data_path)

data = od_detection_dataset()
model = get_pretrained_fasterrcnn(
    num_classes=len(data.labels) + 1,
    min_size=100,
    max_size=200,
    rpn_pre_nms_top_n_train=500,
    rpn_pre_nms_top_n_test=250,
    rpn_post_nms_top_n_train=500,
    rpn_post_nms_top_n_test=250,
)
learner = DetectionLearner(data, model=model)
learner.fit(epochs=1)
The run fails with the following error:
Epoch: [0] [ 0/10] eta: 0:04:00 lr: 0.000560 loss: 1.9363 (1.9363) loss_classifier: 1.6700 (1.6700) loss_box_reg: 0.0109 (0.0109) loss_objectness: 0.2313 (0.2313) loss_rpn_box_reg: 0.0241 (0.0241) time: 24.0076 data: 0.1438
Epoch: [0] [ 9/10] eta: 0:00:05 lr: 0.005000 loss: 0.1795 (0.6579) loss_classifier: 0.0431 (0.5310) loss_box_reg: 0.0002 (0.0013) loss_objectness: 0.0809 (0.1081) loss_rpn_box_reg: 0.0135 (0.0175) time: 5.3490 data: 0.0173
Epoch: [0] Total time: 0:00:53 (5.3514 s / it)
creating index...
index created!
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=100 : no CUDA-capable device is detected
Traceback (most recent call last):
File "bug_cpu_case.py", line 30, in <module>
learner.fit(epochs=1)
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/utils_cv/detection/model.py", line 543, in fit
e = self.evaluate(dl=self.dataset.test_dl)
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/utils_cv/detection/model.py", line 584, in evaluate
self.results = evaluate(self.model, dl, device=self.device)
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
return func(*args, **kwargs)
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/utils_cv/detection/references/engine.py", line 88, in evaluate
torch.cuda.synchronize()
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 398, in synchronize
_lazy_init()
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 193, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:50
Expected behavior (i.e. solution)
DetectionLearner.fit on CPU should run successfully.
Other Comments
Evaluating accuracy on the test set requires a GPU; training does not. Evaluation can be switched off using detector.fit(epochs=1, skip_evaluation=True). See also this notebook:
https://github.com/microsoft/computervision-recipes/blob/master/scenarios/detection/01_training_introduction.ipynb
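For reference, a minimal sketch of this workaround applied to the repro script above, assuming DetectionLearner.fit accepts the skip_evaluation flag as described:

# Sketch of the suggested workaround: train on CPU, skip the GPU-bound evaluation step.
# Assumes the same `data` and `model` objects as in the repro script above.
learner = DetectionLearner(data, model=model)
learner.fit(epochs=1, skip_evaluation=True)  # avoids the evaluate() call that triggers torch.cuda.synchronize()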
@PatrickBue Thanks for the quick reply! However, skip_evaluation makes it impossible to select the best model or to stop training early, both of which rely on model performance on the validation dataset. Any suggestions for this problem?
Unfortunately the library pycocotools (which this repo and torchvision use) requires a GPU. One way around that could be to find a library that works CPU-only and then manually call that library to compute mAP numbers.
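For illustration, a hedged sketch of that idea using torchmetrics' MeanAveragePrecision as one possible CPU-friendly metric (not part of this repo). It assumes a recent torchmetrics is installed and that the test dataloader yields (images, targets) batches in the torchvision detection format, as the references/engine.py code does:

import torch
# Assumption: a recent torchmetrics provides a COCO-style mAP metric that runs on CPU.
from torchmetrics.detection.mean_ap import MeanAveragePrecision

@torch.no_grad()
def evaluate_map_cpu(model, data_loader):
    """Compute mAP on CPU, bypassing the evaluate() in references/engine.py
    that calls torch.cuda.synchronize()."""
    device = torch.device("cpu")
    model.eval()
    model.to(device)
    metric = MeanAveragePrecision()
    for images, targets in data_loader:  # assumes torchvision-style collate: lists of images and target dicts
        images = [img.to(device) for img in images]
        preds = model(images)  # list of dicts with "boxes", "labels", "scores"
        gts = [
            {"boxes": t["boxes"].to(device), "labels": t["labels"].to(device)}
            for t in targets
        ]
        metric.update(preds, gts)
    return metric.compute()  # dict with "map", "map_50", "map_75", ...

# Hypothetical usage with the repro objects above:
# stats = evaluate_map_cpu(learner.model, data.test_dl)
# print(stats["map"])

The per-epoch mAP values from such a helper could then be used for best-model selection and early stopping while keeping skip_evaluation=True.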
I see. Thanks again for your help @PatrickBue
Ok.