humanpose1 / MS-SVConv

Compute descriptors for 3D point cloud registration using a multi scale sparse voxel architecture

pycuda._driver.MemoryError: cuMemAlloc failed: out of memory

ramdrop opened this issue

  • System: Ubuntu 18.04
  • PyTorch 1.9.0 + CUDA 11.1, A100 with 40GB memory.
  • Hydra 1.0.5

Hello, I ran the command and got the output as follows:

command:

poetry run python train.py task=registration models=registration/ms_svconv_base model_name=MS_SVCONV_B2cm_X2_3head data=registration/fragment3dmatch training=sparse_fragment_reg tracker_options.make_submission=True training.epochs=200 eval_frequency=10

output:

Error executing job with overrides: ['task=registration', 'models=registration/ms_svconv_base', 'model_name=MS_SVCONV_B2cm_X2_3head', 'data=registration/fragment3dmatch_sparse', 'training=sparse_fragment_reg', 'tracker_options.make_submission=True', 'training.epochs=200', 'eval_frequency=10']
Traceback (most recent call last):
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/train.py", line 13, in main
    trainer = Trainer(cfg)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/trainer.py", line 49, in __init__
    self._initialize_trainer()
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/trainer.py", line 96, in _initialize_trainer
    self._dataset: BaseDataset = instantiate_dataset(self._cfg.data)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/dataset_factory.py", line 46, in instantiate_dataset
    dataset = dataset_cls(dataset_config)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/general3dmatch.py", line 355, in __init__
    self.train_dataset = Fragment3DMatch(
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/general3dmatch.py", line 260, in __init__
    Base3DMatch.__init__(
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/base3dmatch.py", line 122, in __init__
    super(Base3DMatch, self).__init__(root,
  File "/LOCAL2/ramdrop/apps/poetry/cache/virtualenvs/torch-points3d-s_H0q_C5-py3.9/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 87, in __init__
    self._process()
  File "/LOCAL2/ramdrop/apps/poetry/cache/virtualenvs/torch-points3d-s_H0q_C5-py3.9/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 170, in _process
    self.process()
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/general3dmatch.py", line 300, in process
    super().process()
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/base3dmatch.py", line 329, in process
    self._create_fragment(self.mode)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/base3dmatch.py", line 202, in _create_fragment
    rgbd2fragment_fine(list_path_frames,
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/utils.py", line 271, in rgbd2fragment_fine
    tsdf_vol = fusion.TSDFVolume(vol_bnds, voxel_size=voxel_size)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/fusion.py", line 61, in __init__
    self._weight_vol_gpu = cuda.mem_alloc(self._weight_vol_cpu.nbytes)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
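For context, the failing call is a plain pycuda allocation of a dense TSDF volume. Below is a rough sketch of that allocation pattern (not the repository's fusion.py; the vol_bnds and voxel_size values are made up) showing how quickly a dense float32 volume grows with the fragment's bounding box and voxel size:

import numpy as np
import pycuda.autoinit          # creates a CUDA context on the default device
import pycuda.driver as cuda

def dense_volume_nbytes(vol_bnds, voxel_size):
    # vol_bnds: (3, 2) array of per-axis [min, max] bounds in metres
    vol_dim = np.ceil((vol_bnds[:, 1] - vol_bnds[:, 0]) / voxel_size).astype(int)
    return int(np.prod(vol_dim)) * np.dtype(np.float32).itemsize

# hypothetical bounds: an 8 m x 8 m x 4 m fragment fused at 1 cm voxels
vol_bnds = np.array([[-4.0, 4.0], [-4.0, 4.0], [0.0, 4.0]])
nbytes = dense_volume_nbytes(vol_bnds, voxel_size=0.01)
print(f"{nbytes / 1e9:.2f} GB per float32 volume")   # ~1.02 GB in this example

# TSDF fusion keeps several such buffers (TSDF values, weights, ...), so a
# large bounding box or a small voxel size multiplies the footprint, and
# cuda.mem_alloc raises MemoryError once the device is exhausted.
weight_vol_gpu = cuda.mem_alloc(nbytes)
weight_vol_gpu.free()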

I checked the GPU allocated memory recorded by wandb (I tried two different versions of pycuda, and both resulted in the same error shown above):
[wandb screenshot: GPU allocated memory over time]

Is it normal that the GPU allocated memory keeps increasing during data preprocessing? I thought an A100 with 40 GB of memory would be sufficient for this job. If it isn't, do you know the minimum memory requirement for preprocessing the 3DMatch dataset?
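For reference, whether the TSDF buffers are actually released between fragments can be checked with plain pycuda calls around the rgbd2fragment_fine call; a minimal sketch (not something provided by torch-points3d):

import pycuda.autoinit
import pycuda.driver as cuda

def log_free_mem(tag=""):
    free, total = cuda.mem_get_info()           # both values in bytes
    print(f"[{tag}] free {free / 1e9:.2f} GB / total {total / 1e9:.2f} GB")

# e.g. around the call in utils.py:
#   log_free_mem("before fragment")
#   rgbd2fragment_fine(list_path_frames, ...)
#   log_free_mem("after fragment")
# If the "after" value never recovers, the DeviceAllocation buffers created in
# fusion.TSDFVolume are probably still referenced; they are only returned to the
# driver when the Python objects are garbage-collected or freed with alloc.free().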

It is weird, because every experiment was run on a 2080 Ti or a 1080 Ti.
You can find the training set I generated here: https://cloud.mines-paristech.fr/index.php/s/mXN2RuebKjVMhLz

Thanks for the generated dataset. I have not managed to solve this issue, but I found a workaround: split the raw directory list and run the preprocessing multiple times, once per split (see the sketch below).
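A rough sketch of how such a split-and-rerun loop could be scripted (preprocess_chunk.py here is a hypothetical wrapper around the same dataset process() path, not a script in the repo):

import subprocess
from pathlib import Path

RAW_ROOT = Path("data/3dmatch/raw")             # hypothetical location of the raw scenes
scenes = sorted(p.name for p in RAW_ROOT.iterdir() if p.is_dir())
chunk_size = 5                                  # small enough to fit in GPU memory

for i in range(0, len(scenes), chunk_size):
    chunk = scenes[i:i + chunk_size]
    print("preprocessing scenes:", chunk)
    # Each chunk runs in its own process, so all pycuda allocations are
    # released back to the driver when the process exits.
    subprocess.run(
        ["poetry", "run", "python", "preprocess_chunk.py", *chunk],
        check=True,
    )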

Sorry to bother you again, but I found my training results extremely weird: almost zero feature_matching_recall on both the val and test sets after 50 epochs. I suspect this could come from the data preprocessing. So, other than the training set you provided, would you mind sharing your full preprocessed 3DMatch dataset, as listed below?
[screenshot: directory listing of the full preprocessed 3DMatch dataset]