microsoft / computervision-recipes

Best Practices, code samples, and documentation for Computer Vision.

[BUG] - jupyter-notebook/exported python fails in scenarios/detection/01_training_introduction.ipynb

ericscottmarquez opened this issue · comments

Description

Training fails, possibly due to a CUDA/PyTorch version incompatibility or a memory limitation.

In which platform does it happen?

Kubuntu 20.04 (focal)
GTX 1050 Ti Max-Q (4 GB GDDR5)
Core i7-8750H (6 cores / 12 threads)
16 GB RAM

How do we replicate the issue?

  • output from nvcc --version (see also the version-check snippet after this list):
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243

  • output from nvidia-smi:

Tue Oct 27 16:35:58 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P0    N/A /  N/A |    775MiB /  4042MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       993      G   /usr/lib/xorg/Xorg                491MiB |
|    0   N/A  N/A      1273      G   /usr/bin/plasmashell              103MiB |
|    0   N/A  N/A     41430      G   ...AAAAAAAAA= --shared-files      117MiB |
|    0   N/A  N/A     45817      G   /usr/bin/systemsettings5           57MiB |
+-----------------------------------------------------------------------------+
  • The error occurs while running /computervision-recipes/scenarios/detection/01_training_introduction.ipynb

  • The same error also occurs when the notebook is exported as a .py script and run outside of Jupyter

  • I am using a custom dataset with my own annotations in the proper format. I don't think this is the issue: in the Jupyter notebook, the image transformations and annotation overlays are rendered correctly.
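
  • For reference, a minimal Python check of the PyTorch side of the version question (standard torch calls, not part of the notebook); torch.version.cuda reports the CUDA toolkit PyTorch was built against, and as far as I understand the 455.32 driver shown by nvidia-smi can run binaries built for older CUDA releases such as 10.1:

    # Illustrative environment check using standard PyTorch APIs.
    import torch

    print("PyTorch version:    ", torch.__version__)
    print("Built against CUDA: ", torch.version.cuda)        # e.g. "10.1"
    print("CUDA available:     ", torch.cuda.is_available())
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print("GPU:                ", props.name)
        print("Total GPU memory:   ", props.total_memory // 2**20, "MiB")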

ERROR:

  • in /notebooks/scenarios/detection/01_training_introduction.ipynb, at:
    detector.fit(EPOCHS, lr=LEARNING_RATE, print_freq=30, skip_evaluation=skip_evaluation)

---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
    723         try:
--> 724             data = self.data_queue.get(timeout=timeout)
    725             return (True, data)

~/anaconda3/envs/cv/lib/python3.7/queue.py in get(self, block, timeout)
    178                         raise Empty
--> 179                     self.not_empty.wait(remaining)
    180             item = self._get()

~/anaconda3/envs/cv/lib/python3.7/threading.py in wait(self, timeout)
    299                 if timeout > 0:
--> 300                     gotit = waiter.acquire(True, timeout)
    301                 else:

~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py in handler(signum, frame)
     65         # Python can still get and update the process status successfully.
---> 66         _error_if_any_worker_fails()
     67         if previous_handler is not None:

RuntimeError: DataLoader worker (pid 43073) is killed by signal: Killed. 

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last)
<ipython-input-12-0c6531329277> in <module>
----> 1 detector.fit(EPOCHS, lr=LEARNING_RATE, print_freq=30, skip_evaluation=skip_evaluation)

~/Desktop/projects/computervision-recipes/utils_cv/detection/model.py in fit(self, epochs, lr, momentum, weight_decay, print_freq, step_size, gamma, skip_evaluation)
    532                 self.device,
    533                 epoch,
--> 534                 print_freq=print_freq,
    535             )
    536             self.losses.append(logger.meters["loss"].median)

~/Desktop/projects/computervision-recipes/utils_cv/detection/references/engine.py in train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
     24         lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)
     25 
---> 26     for images, targets in metric_logger.log_every(data_loader, print_freq, header):
     27         images = list(image.to(device) for image in images)
     28         targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

~/Desktop/projects/computervision-recipes/utils_cv/detection/references/utils.py in log_every(self, iterable, print_freq, header)
    209             ])
    210         MB = 1024.0 * 1024.0
--> 211         for obj in iterable:
    212             data_time.update(time.time() - end)
    213             yield obj

~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    802 
    803             assert not self.shutdown and self.tasks_outstanding > 0
--> 804             idx, data = self._get_data()
    805             self.tasks_outstanding -= 1
    806 

~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _get_data(self)
    759         elif self.pin_memory:
    760             while self.pin_memory_thread.is_alive():
--> 761                 success, data = self._try_get_data()
    762                 if success:
    763                     return data

~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
    735             if len(failed_workers) > 0:
    736                 pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 737                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
    738             if isinstance(e, queue.Empty):
    739                 return (False, None)

RuntimeError: DataLoader worker (pid(s) 43073) exited unexpectedly
  • I've already done some research, and several reports point to insufficient memory or some kind of memory restriction. I tried setting num_workers to 0 and reducing the batch size, but I may not be setting these values in the right place (a sketch of where I would expect to set them is below). If that is the solution, please point me in the right direction; if I'm missing something else entirely, please educate me on what I might be doing wrong. Thanks!
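
  • A sketch of where these two settings live in plain PyTorch (my assumption is that the repo's DetectionDataset builds equivalent DataLoaders internally, so the exact place to override them in the notebook may differ). The "killed by signal: Killed" message usually means the operating system's out-of-memory killer terminated a worker process, which points at host RAM rather than GPU memory:

    # Plain-PyTorch illustration of the two knobs I tried to change;
    # the dummy dataset only stands in for the real detection dataset.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dummy = TensorDataset(torch.zeros(8, 3, 224, 224), torch.zeros(8, dtype=torch.long))

    loader = DataLoader(
        dummy,
        batch_size=1,      # smaller batches -> less memory per step
        num_workers=0,     # 0 = load data in the main process, no extra worker
                           # processes for the OOM killer to terminate
        pin_memory=False,  # pinned memory also consumes host RAM
    )

    for images, labels in loader:
        pass  # the training step would run here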

Expected behavior (i.e. solution)

  • Expected training to complete in the "Finetune a Pretrained Model" step of the scenarios/detection/01_training_introduction.ipynb notebook

Other Comments

Thank you all in advance for the help. I hope I can go back to the other issues and link to this one to help those who face a similar problem!