[BUG] - jupyter-notebook/exported python fails in scenarios/detection/01_training_introduction.ipynb
ericscottmarquez opened this issue · comments
Description
Issue with possibly cuda version or torch version compatibility or memory constriction
In which platform does it happen?
Kubuntu using 20.04 focal kernel
GTX 1050ti max-q (4GB GDDR5 memory)
Core i7-8750h 6-core/12-threads
16gb RAM
How do we replicate the issue?
-
output from nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2019 NVIDIA Corporation Built on Sun_Jul_28_19:07:16_PDT_2019 Cuda compilation tools, release 10.1, V10.1.243
-
output from nvidia-smi:
Tue Oct 27 16:35:58 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00 Driver Version: 455.32.00 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... On | 00000000:01:00.0 Off | N/A |
| N/A 59C P0 N/A / N/A | 775MiB / 4042MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 993 G /usr/lib/xorg/Xorg 491MiB |
| 0 N/A N/A 1273 G /usr/bin/plasmashell 103MiB |
| 0 N/A N/A 41430 G ...AAAAAAAAA= --shared-files 117MiB |
| 0 N/A N/A 45817 G /usr/bin/systemsettings5 57MiB |
+-----------------------------------------------------------------------------+
-
Error is while running:
/computervision-recipes/scenarios/detection/01_training_introduction.ipynb/
-
The Same error also occurs when exporting the notebook as .py and running it outside of jupyter
-
I am using a custom data set with my own annotations in the proper format. I don't think this is the issue, in the jupyter notebook, it returns the image transformations and the annotation overlays correctly.
ERROR:
- in
/notebooks/scenarios/detection/01_training_introduction.ipynb
:
atdetector.fit(EPOCHS, lr=LEARNING_RATE, print_freq=30, skip_evaluation=skip_evaluation)
detector.fit(EPOCHS, lr=LEARNING_RATE, print_freq=30, skip_evaluation=skip_evaluation)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
723 try:
--> 724 data = self.data_queue.get(timeout=timeout)
725 return (True, data)
~/anaconda3/envs/cv/lib/python3.7/queue.py in get(self, block, timeout)
178 raise Empty
--> 179 self.not_empty.wait(remaining)
180 item = self._get()
~/anaconda3/envs/cv/lib/python3.7/threading.py in wait(self, timeout)
299 if timeout > 0:
--> 300 gotit = waiter.acquire(True, timeout)
301 else:
~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py in handler(signum, frame)
65 # Python can still get and update the process status successfully.
---> 66 _error_if_any_worker_fails()
67 if previous_handler is not None:
RuntimeError: DataLoader worker (pid 43073) is killed by signal: Killed.
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-12-0c6531329277> in <module>
----> 1 detector.fit(EPOCHS, lr=LEARNING_RATE, print_freq=30, skip_evaluation=skip_evaluation)
~/Desktop/projects/computervision-recipes/utils_cv/detection/model.py in fit(self, epochs, lr, momentum, weight_decay, print_freq, step_size, gamma, skip_evaluation)
532 self.device,
533 epoch,
--> 534 print_freq=print_freq,
535 )
536 self.losses.append(logger.meters["loss"].median)
~/Desktop/projects/computervision-recipes/utils_cv/detection/references/engine.py in train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
24 lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)
25
---> 26 for images, targets in metric_logger.log_every(data_loader, print_freq, header):
27 images = list(image.to(device) for image in images)
28 targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
~/Desktop/projects/computervision-recipes/utils_cv/detection/references/utils.py in log_every(self, iterable, print_freq, header)
209 ])
210 MB = 1024.0 * 1024.0
--> 211 for obj in iterable:
212 data_time.update(time.time() - end)
213 yield obj
~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
802
803 assert not self.shutdown and self.tasks_outstanding > 0
--> 804 idx, data = self._get_data()
805 self.tasks_outstanding -= 1
806
~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _get_data(self)
759 elif self.pin_memory:
760 while self.pin_memory_thread.is_alive():
--> 761 success, data = self._try_get_data()
762 if success:
763 return data
~/anaconda3/envs/cv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
735 if len(failed_workers) > 0:
736 pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 737 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
738 if isinstance(e, queue.Empty):
739 return (False, None)
RuntimeError: DataLoader worker (pid(s) 43073) exited unexpectedly
- I've already done some research and it seems some point to not enough memory or some sort of memory restriction, I tried changing the value for
num_workers
to 0, and reducing the batches, but I may not be setting it in the right place... So, if that is the solution, please help in pointing me to the right direction! - If I'm totally missing something, please educate me on what I am possibly doing wrong. Thanks!
Expected behavior (i.e. solution)
- Expected to train the dataset in the
Finetune a Pretrained Model
step in thescenarios/detection/01_training_introduction.ipynb
notebook
Other Comments
Thank you all in advance for the help, I hope I can go back to the other issues and link to this one to help those that face a similar problem!