Cuda device shows not available on EC2 instance
Singh-sid930 opened this issue · comments
I am trying to run the training on an EC2 instance which has Cuda capabilities.
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 26C P0 25W / 70W | 2MiB / 15360MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
```
and
nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2022 NVIDIA Corporation Built on Wed_Sep_21_10:33:58_PDT_2022 Cuda compilation tools, release 11.8, V11.8.89 Build cuda_11.8.r11.8/compiler.31833905_0
However, I keep getting the following error when I try to run the training. Note that I ran COLMAP on the same instance and it seemed to run fine using the GPU:
```
Optimizing ../output
Output folder: ../output [04/04 06:07:14]
Tensorboard not available: not logging progress [04/04 06:07:14]
Reading camera 1006/1006 [04/04 06:07:18]
Loading Training Cameras [04/04 06:07:19]
Traceback (most recent call last):
File "train.py", line 219, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
File "train.py", line 35, in training
scene = Scene(dataset, gaussians)
File "/home/ubuntu/workspace/gaussian-splatting/scene/__init__.py", line 73, in __init__
self.train_cameras[resolution_scale] = cameraList_from_camInfos(scene_info.train_cameras, resolution_scale, args)
File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 58, in cameraList_from_camInfos
camera_list.append(loadCam(args, id, c, resolution_scale))
File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 52, in loadCam
image_name=cam_info.image_name, uid=id, data_device=args.data_device)
File "/home/ubuntu/workspace/gaussian-splatting/scene/cameras.py", line 39, in __init__
self.original_image = image.clamp(0.0, 1.0).to(self.data_device)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
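A quick way to narrow down whether PyTorch itself can see the device, independent of train.py, is a minimal check along these lines (the script name and prints are just illustrative, not part of the repo):

```python
# cuda_check.py -- hypothetical standalone sanity check, not part of the repo
import torch

print("torch version:  ", torch.__version__)
print("built with CUDA:", torch.version.cuda)          # CUDA version the wheel was built against
print("cuda available: ", torch.cuda.is_available())   # False -> driver/runtime mismatch or no visible device
if torch.cuda.is_available():
    print("device count:   ", torch.cuda.device_count())
    print("device name:    ", torch.cuda.get_device_name(0))
    x = torch.rand((3, 3)).cuda()                       # same call that fails inside train.py
    print(x)
```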
What is even stranger is that if I run a Python console in the terminal within the same conda environment, the same line of code works on the CUDA device, but it does not when run through the train.py script:
```
(gaussian_splatting) ubuntu@ip-172-31-5-223:~/workspace/gaussian-splatting$ python train.py -s ../data/images/images_1 --data_device cpu
Optimizing
Output folder: ./output/fc4ede38-7 [05/04 06:17:32]
Tensorboard not available: not logging progress [05/04 06:17:32]
Reading camera 1006/1006 [05/04 06:17:36]
Loading Training Cameras [05/04 06:17:37]
Traceback (most recent call last):
File "train.py", line 219, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
File "train.py", line 35, in training
scene = Scene(dataset, gaussians)
File "/home/ubuntu/workspace/gaussian-splatting/scene/__init__.py", line 73, in __init__
self.train_cameras[resolution_scale] = cameraList_from_camInfos(scene_info.train_cameras, resolution_scale, args)
File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 58, in cameraList_from_camInfos
camera_list.append(loadCam(args, id, c, resolution_scale))
File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 52, in loadCam
image_name=cam_info.image_name, uid=id, data_device=args.data_device)
File "/home/ubuntu/workspace/gaussian-splatting/scene/cameras.py", line 53, in __init__
rand_a = torch.rand((3,3)).cuda()
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(gaussian_splatting) ubuntu@ip-172-31-5-223:~/workspace/gaussian-splatting$ python
Python 3.7.13 (default, Oct 18 2022, 18:57:03)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch as torch
>>> rand_a = torch.rand((3,3)).cuda()
>>> rand_a
tensor([[0.3751, 0.8623, 0.5603],
[0.7451, 0.6077, 0.7982],
[0.9916, 0.0623, 0.5862]], device='cuda:0')
```
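Since the interactive console and train.py behave differently, one thing worth ruling out is that they resolve to different interpreters or CUDA builds. A minimal sketch, assuming a few temporary debug lines are added near the top of train.py:

```python
# Temporary debug lines (a sketch, not part of the original train.py):
import sys
import torch

print("interpreter:", sys.executable)               # confirms which python/conda env the script runs under
print("torch CUDA build:", torch.version.cuda)      # should match what the interactive console session used
print("CUDA available:", torch.cuda.is_available())
```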
Strangely, what fixed the error was raising the open-file limit with:
`ulimit -n 2048`
Which led to the realization that my images were both too large and too numerous (1000 images at 1024×1960 resolution), and CUDA would crash with out-of-memory errors.
After decreasing the images to roughly an eighth of their original size, things have gotten much better with both COLMAP and training.
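For reference, the downscaling can be done with a short script; a minimal sketch using Pillow, where the directory paths, file extension, and the 1/8 factor are assumptions about my setup rather than anything from the repo:

```python
# downscale_images.py -- sketch; paths, extension, and factor are assumptions
from pathlib import Path
from PIL import Image

SRC = Path("../data/images/images_1")        # original ~1000 images at 1024x1960
DST = Path("../data/images/images_1_small")  # downscaled copies fed to COLMAP / training
FACTOR = 8                                   # shrink each side by this factor

DST.mkdir(parents=True, exist_ok=True)
for img_path in sorted(SRC.glob("*.jpg")):
    with Image.open(img_path) as im:
        small = im.resize((im.width // FACTOR, im.height // FACTOR), Image.LANCZOS)
        small.save(DST / img_path.name)
```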