[BUG] Cannot train object detection on CPU
chjinche opened this issue
Description
DetectionLearner.fit failed on CPU, which is inconsistent with the doc description "a GPU is technically not required". See the highlighted part in the attached picture.
In which platform does it happen?
Linux. CPU
How do we replicate the issue?
Run the following code on CPU, with no CUDA device available.
from utils_cv.common.data import unzip_url
from utils_cv.detection.data import Urls as od_urls
from utils_cv.detection.dataset import DetectionDataset
from utils_cv.detection.model import (DetectionLearner, get_pretrained_fasterrcnn)

def od_detection_dataset():
    """ returns a basic detection dataset. """
    tmp_session = 'data'
    tiny_od_data_path = unzip_url(
        od_urls.fridge_objects_tiny_path,
        fpath=tmp_session,
        dest=tmp_session,
        exist_ok=True,
    )
    return DetectionDataset(tiny_od_data_path)

data = od_detection_dataset()
model = get_pretrained_fasterrcnn(
    num_classes=len(data.labels) + 1,
    min_size=100,
    max_size=200,
    rpn_pre_nms_top_n_train=500,
    rpn_pre_nms_top_n_test=250,
    rpn_post_nms_top_n_train=500,
    rpn_post_nms_top_n_test=250,
)
learner = DetectionLearner(data, model=model)
learner.fit(epochs=1)
The run fails with the following error:
Epoch: [0] [ 0/10] eta: 0:04:00 lr: 0.000560 loss: 1.9363 (1.9363) loss_classifier: 1.6700 (1.6700) loss_box_reg: 0.0109 (0.0109) loss_objectness: 0.2313 (0.2313) loss_rpn_box_reg: 0.0241 (0.0241) time: 24.0076 data: 0.1438
Epoch: [0] [ 9/10] eta: 0:00:05 lr: 0.005000 loss: 0.1795 (0.6579) loss_classifier: 0.0431 (0.5310) loss_box_reg: 0.0002 (0.0013) loss_objectness: 0.0809 (0.1081) loss_rpn_box_reg: 0.0135 (0.0175) time: 5.3490 data: 0.0173
Epoch: [0] Total time: 0:00:53 (5.3514 s / it)
creating index...
index created!
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=100 : no CUDA-capable device is detected
Traceback (most recent call last):
File "bug_cpu_case.py", line 30, in <module>
learner.fit(epochs=1)
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/utils_cv/detection/model.py", line 543, in fit
e = self.evaluate(dl=self.dataset.test_dl)
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/utils_cv/detection/model.py", line 584, in evaluate
self.results = evaluate(self.model, dl, device=self.device)
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
return func(*args, **kwargs)
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/utils_cv/detection/references/engine.py", line 88, in evaluate
torch.cuda.synchronize()
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 398, in synchronize
_lazy_init()
File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 193, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:50
Expected behavior (i.e. solution)
DetectionLearner.fit on CPU should run successfully.
Other Comments
Evaluating accuracy on the test set requires a GPU; training does not. Evaluation can be switched off using detector.fit(epochs=1, skip_evaluation=True). See also this notebook:
https://github.com/microsoft/computervision-recipes/blob/master/scenarios/detection/01_training_introduction.ipynb
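For reference, a minimal sketch of this workaround applied to the repro script above, assuming DetectionLearner.fit accepts the skip_evaluation flag as described:

# Sketch of the suggested workaround: train on CPU, skip the GPU-bound evaluation step.
# Assumes the same `data` and `model` objects as in the repro script above.
learner = DetectionLearner(data, model=model)
learner.fit(epochs=1, skip_evaluation=True)  # avoids the evaluate() call that triggers torch.cuda.synchronize()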
@PatrickBue Thanks for the quick reply! However, skip_evaluation makes it impossible to select the best model or to stop training early, both of which rely on model performance on the validation dataset. Any suggestions for this problem?
Unfortunately the library pycocotools (which this repo and torchvision use) requires a GPU. One way around that could be to find a library that works CPU-only and then manually call that library to compute mAP numbers.
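For illustration, a hedged sketch of that idea using torchmetrics' MeanAveragePrecision as one possible CPU-friendly metric (not part of this repo). It assumes a recent torchmetrics is installed and that the test dataloader yields (images, targets) batches in the torchvision detection format, as the references/engine.py code does:

import torch
# Assumption: a recent torchmetrics provides a COCO-style mAP metric that runs on CPU.
from torchmetrics.detection.mean_ap import MeanAveragePrecision

@torch.no_grad()
def evaluate_map_cpu(model, data_loader):
    """Compute mAP on CPU, bypassing the evaluate() in references/engine.py
    that calls torch.cuda.synchronize()."""
    device = torch.device("cpu")
    model.eval()
    model.to(device)
    metric = MeanAveragePrecision()
    for images, targets in data_loader:  # assumes torchvision-style collate: lists of images and target dicts
        images = [img.to(device) for img in images]
        preds = model(images)  # list of dicts with "boxes", "labels", "scores"
        gts = [
            {"boxes": t["boxes"].to(device), "labels": t["labels"].to(device)}
            for t in targets
        ]
        metric.update(preds, gts)
    return metric.compute()  # dict with "map", "map_50", "map_75", ...

# Hypothetical usage with the repro objects above:
# stats = evaluate_map_cpu(learner.model, data.test_dl)
# print(stats["map"])

The per-epoch mAP values from such a helper could then be used for best-model selection and early stopping while keeping skip_evaluation=True.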
I see. Thanks again for your help @PatrickBue
Ok.