bad_alloc in run_cls.py
christinazavou opened this issue · comments
Thanks for your interest in our project!
Did you strictly follow our instructions to build/run the code? If so, could you please provide your runtime environment (your OS, TensorFlow version, CUDA version, etc.)?
I started over, carefully following your instructions, both on my laptop and on my PC. This time it worked on my laptop (though I still get the same error on the PC; it might be memory-related, but feel free to close the issue since it now runs on my laptop).
While redoing it I recorded all my steps, and since they could be useful for someone else, I list them below.
- I cloned the repo on Ubuntu 16 (PC) and on Ubuntu 18 (laptop).
- I removed the preinstalled cmake and installed CMake 3.18.0.
- Under the O-CNN repo I ran:

      cd octree/external && git clone --recursive https://github.com/wang-ps/octree-ext.git
      cd .. && mkdir build && cd build
- I installed CUDA 10.1 with:

      sudo sh cuda_10.1.105_418.39_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-10.1

- Then I ran:

      sh make_env_for_tf.sh OCNN /home/graphicslab/miniconda3/envs 10.1 3.7
      source activate OCNN && conda install -c anaconda tensorflow-gpu==1.14.0 && conda install -c anaconda pytest
where make_env_for_tf.sh is:
#!/bin/sh
env_name=$1
envs_path=$2
cuda_version=$3
py_version=$4
conda create --name "$env_name" python="$py_version"
scripts_path=$envs_path/$env_name/etc/conda
# Scripts under this folder are run whenever the conda environment is activated
mkdir -p "$scripts_path/activate.d"
# Use ">" (not ">>") so re-running this script does not duplicate the hooks
echo "#!/bin/sh
ORIGINAL_LD_LIBRARY_PATH=\$LD_LIBRARY_PATH
ORIGINAL_PATH=\$PATH
ORIGINAL_CUDA_DIR=\$CUDA_DIR
export LD_LIBRARY_PATH=/usr/local/cuda-$cuda_version/lib64:/usr/local/cuda-$cuda_version/extras/CUPTI/lib64:\$LD_LIBRARY_PATH
export PATH=\$PATH:/usr/local/cuda-$cuda_version/bin
export CUDA_DIR=/usr/local/cuda-$cuda_version" > "$scripts_path/activate.d/activate.sh"
# Scripts under this folder are run whenever the conda environment is deactivated
mkdir -p "$scripts_path/deactivate.d"
echo "#!/bin/sh
export LD_LIBRARY_PATH=\$ORIGINAL_LD_LIBRARY_PATH
export PATH=\$ORIGINAL_PATH
export CUDA_DIR=\$ORIGINAL_CUDA_DIR
unset ORIGINAL_LD_LIBRARY_PATH
unset ORIGINAL_PATH
unset ORIGINAL_CUDA_DIR" > "$scripts_path/deactivate.d/deactivate.sh"
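For anyone who prefers Python, the hook generation above can be mirrored with a stdlib-only sketch (my own illustration, not part of the repo; `write_hooks` is a hypothetical helper, and it writes into a temporary directory rather than a real conda env):

```python
import os
import tempfile

def write_hooks(env_root, cuda_version):
    """Write conda activate/deactivate hooks that switch CUDA paths.

    Mirrors make_env_for_tf.sh: references like $PATH are left for the
    shell to expand at activation time, while cuda_version is baked in.
    """
    cuda = f"/usr/local/cuda-{cuda_version}"
    activate = f"""#!/bin/sh
ORIGINAL_LD_LIBRARY_PATH=$LD_LIBRARY_PATH
ORIGINAL_PATH=$PATH
ORIGINAL_CUDA_DIR=$CUDA_DIR
export LD_LIBRARY_PATH={cuda}/lib64:{cuda}/extras/CUPTI/lib64:$LD_LIBRARY_PATH
export PATH=$PATH:{cuda}/bin
export CUDA_DIR={cuda}
"""
    deactivate = """#!/bin/sh
export LD_LIBRARY_PATH=$ORIGINAL_LD_LIBRARY_PATH
export PATH=$ORIGINAL_PATH
export CUDA_DIR=$ORIGINAL_CUDA_DIR
unset ORIGINAL_LD_LIBRARY_PATH ORIGINAL_PATH ORIGINAL_CUDA_DIR
"""
    for sub, body in (("activate.d", activate), ("deactivate.d", deactivate)):
        d = os.path.join(env_root, "etc", "conda", sub)
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, sub.split(".")[0] + ".sh"), "w") as f:
            f.write(body)

root = tempfile.mkdtemp()
write_hooks(root, "10.1")
print(open(os.path.join(root, "etc/conda/activate.d/activate.sh")).read())
```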
- I activated the Python environment (which also sets the CUDA paths):

      source activate OCNN
      nvcc --version
  This showed on the PC:

      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2019 NVIDIA Corporation
      Built on Fri_Feb__8_19:08:17_PST_2019
      Cuda compilation tools, release 10.1, V10.1.105

  and on the laptop:

      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2019 NVIDIA Corporation
      Built on Sun_Jul_28_19:07:16_PDT_2019
      Cuda compilation tools, release 10.1, V10.1.243
- Back in the O-CNN project (under octree/build) I ran:

      cmake .. && cmake --build . --config Release
      export PATH=`pwd`:$PATH
- To use the TensorFlow code I ran the following (starting from octree/build, with the OCNN environment activated):

      conda install -c conda-forge yacs tqdm
      cmake .. -DUSE_CUDA=ON && make
      cd ../../tensorflow/libs
      python build.py

  I was getting some numpy warnings, so I ran:

      conda install gast==0.2.2 numpy==1.16.4
- Then, in tensorflow/script, I ran:

      python ../data/cls_modelnet.py --run download_m40_points
      python ../data/cls_modelnet.py --run m40_generate_octree_tfrecords
      python run_cls.py --config configs/cls_octree.yaml
  where I changed cls_octree.yaml to:
SOLVER:
  gpu: 0,
  logdir: logs/m40/0322_ocnn_octree
  run: train
  max_iter: 160000
  test_iter: 925
  test_every_iter: 100
  step_size: (400,)

DATA:
  train:
    dtype: octree
    distort: True
    depth: 5
    location: dataset/ModelNet40/m40_5_2_12_train_octree.tfrecords
    batch_size: 2
    x_alias: data
  test:
    dtype: octree
    distort: False
    depth: 5
    location: dataset/ModelNet40/m40_5_2_12_test_octree.tfrecords
    shuffle: 0
    batch_size: 2
    x_alias: data

MODEL:
  name: ocnn
  channel: 3
  nout: 40
  depth: 5

LOSS:
  num_class: 40
  weight_decay: 0.0005
I tried batch sizes 2, 4, and 8, and they all run on the laptop (on the PC I still get the bad_alloc issue).
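One side effect of lowering batch_size without touching max_iter: the number of epochs drops proportionally. A quick stdlib calculation (assuming ModelNet40's standard training split of 9843 models; `epochs` is my own hypothetical helper) shows the scale of the change:

```python
# Rough epoch count implied by the config above.
# Assumption: ModelNet40's standard split has 9843 training models.
train_size = 9843
max_iter = 160000

def epochs(batch_size):
    """Epochs covered by max_iter steps at a given batch size."""
    return max_iter * batch_size / train_size

for bs in (2, 32):
    print(bs, round(epochs(bs), 1))
```

So at batch size 2 the run sees roughly 32 epochs instead of the ~520 implied by the default batch size of 32.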
Other attributes of the PC:
- GPU card: GeForce GTX 1080 Ti
- running "free -h" gives:

                  total        used        free      shared  buff/cache   available
      Mem:          15G        4,5G        2,9G         91M        8,2G         10G
      Swap:        7,6G         33M        7,6G
Other attributes of the laptop:
- GPU card: GeForce GTX 1050
- running "free -h" gives:

                  total        used        free      shared  buff/cache   available
      Mem:          15G        6,1G        1,3G        223M        8,1G        8,9G
      Swap:         15G         51M         15G
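Since the failure looks memory-related, here is a small stdlib-only sketch for checking available memory programmatically (`mem_available_kb` is my own helper, not part of O-CNN; it parses /proc/meminfo-style text, which is what "free -h" summarizes):

```python
def mem_available_kb(meminfo_text):
    """Return MemAvailable in kB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   10485760 kB"
            return int(line.split()[1])
    return None

# On Linux you would read the real file:
# with open("/proc/meminfo") as f:
#     print(mem_available_kb(f.read()))

sample = "MemTotal: 16315392 kB\nMemAvailable: 10485760 kB\n"
print(mem_available_kb(sample))  # → 10485760
```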
A side note regarding some unit tests (test_octree_conv, test_octree_search, test_octree_gather): because self.assert.. only raises an error on failure and otherwise returns nothing, my pytest was reporting "skipped" instead of success. I therefore added to the class:

    def setUp(self):
        self.verificationErrors = []

    def tearDown(self):
        self.assertEqual([], self.verificationErrors)

and replaced every

    self.assert..

with

    try:
        self.assert..
    except AssertionError as e:
        self.verificationErrors.append(str(e))
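As a self-contained illustration of this pattern (my own toy test case, not the O-CNN tests): failures are collected during the test body and reported once, together, in tearDown:

```python
import unittest

class CollectingTest(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []

    def tearDown(self):
        # Any collected failures surface here as a single assertion
        self.assertEqual([], self.verificationErrors)

    def check_equal(self, a, b):
        # Collect instead of aborting the test at the first failure
        try:
            self.assertEqual(a, b)
        except AssertionError as e:
            self.verificationErrors.append(str(e))

    def test_values(self):
        self.check_equal(1, 1)  # passes
        self.check_equal(2, 2)  # passes

suite = unittest.defaultTestLoader.loadTestsFromTestCase(CollectingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # → True
```

If one of the `check_equal` calls had failed, the test would still run to the end, and tearDown would report all collected messages at once.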
Many thanks for your detailed feedback. I tried on my own PC just now following your instructions, and it worked without errors, so I am sorry that I cannot provide more suggestions.
By the way, I noticed that you changed the batch size from 32 to 2, 4, or 8. Since there are batch-normalization layers in the network, if the batch size is too small (such as 2 or 4), the final testing accuracy may decrease.
On this point, I suggest using a batch size of at least 16.
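To illustrate why tiny batches hurt batch normalization (a stdlib-only sketch, not O-CNN code; `batch_mean_variance` is my own helper): BN normalizes each batch with its own mean and variance, and those per-batch estimates are far noisier at batch size 2 than at 32:

```python
import random
import statistics

random.seed(0)
# Synthetic activations drawn from a unit Gaussian
data = [random.gauss(0.0, 1.0) for _ in range(4096)]

def batch_mean_variance(data, batch_size):
    """Variance of the per-batch means BN would use at this batch size."""
    means = [statistics.fmean(data[i:i + batch_size])
             for i in range(0, len(data), batch_size)]
    return statistics.pvariance(means)

for bs in (2, 8, 32):
    print(bs, batch_mean_variance(data, bs))
```

The variance of the batch means scales roughly as 1/batch_size, so the normalization statistics at batch size 2 are about 16x noisier than at 32, which is consistent with the accuracy drop mentioned above.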
You are right, thanks for mentioning it :)