bad_alloc in run_cls.py
christinazavou opened this issue · comments
Thanks for your interest in our project!
Did you strictly follow our instructions to build/run the code? If so, could you please provide your runtime environment (your OS, TensorFlow version, CUDA version, etc.)?
I started over, carefully following your instructions, both on my laptop and on my PC. This time it worked on my laptop (though I still get the same error on the PC; it might be memory-related, but feel free to close the issue since it now runs on my laptop).
While redoing it I recorded all my steps, and since they could be useful for someone else, I list them below.
- I cloned the repo on Ubuntu 16 (PC) and on Ubuntu 18 (laptop).
- I removed the preinstalled cmake and installed CMake 3.18.0.
- Under the O-CNN repo I ran:

      cd octree/external && git clone --recursive https://github.com/wang-ps/octree-ext.git
      cd .. && mkdir build && cd build
- I installed CUDA 10.1 with:

      sudo sh cuda_10.1.105_418.39_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-10.1

- Then I ran:

      sh make_env_for_tf.sh OCNN /home/graphicslab/miniconda3/envs 10.1 3.7
      source activate OCNN && conda install -c anaconda tensorflow-gpu==1.14.0 && conda install -c anaconda pytest
where make_env_for_tf.sh is:
#!/bin/sh
env_name=$1
envs_path=$2
cuda_version=$3
py_version=$4
conda create --name "$env_name" python="$py_version"
scripts_path=$envs_path/$env_name/etc/conda
# Scripts under this folder are run whenever the conda environment is activated
mkdir -p "$scripts_path/activate.d"
# Use ">" (not ">>") so re-running this script does not duplicate the hooks
echo "#!/bin/sh
ORIGINAL_LD_LIBRARY_PATH=\$LD_LIBRARY_PATH
ORIGINAL_PATH=\$PATH
ORIGINAL_CUDA_DIR=\$CUDA_DIR
export LD_LIBRARY_PATH=/usr/local/cuda-$cuda_version/lib64:/usr/local/cuda-$cuda_version/extras/CUPTI/lib64:\$LD_LIBRARY_PATH
export PATH=\$PATH:/usr/local/cuda-$cuda_version/bin
export CUDA_DIR=/usr/local/cuda-$cuda_version" > "$scripts_path/activate.d/activate.sh"
# Scripts under this folder are run whenever the conda environment is deactivated
mkdir -p "$scripts_path/deactivate.d"
echo "#!/bin/sh
export LD_LIBRARY_PATH=\$ORIGINAL_LD_LIBRARY_PATH
export PATH=\$ORIGINAL_PATH
export CUDA_DIR=\$ORIGINAL_CUDA_DIR
unset ORIGINAL_LD_LIBRARY_PATH
unset ORIGINAL_PATH
unset ORIGINAL_CUDA_DIR" > "$scripts_path/deactivate.d/deactivate.sh"
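For anyone who prefers Python, the hook generation above can be mirrored with a stdlib-only sketch (my own illustration, not part of the repo; `write_hooks` is a hypothetical helper, and it writes into a temporary directory rather than a real conda env):

```python
import os
import tempfile

def write_hooks(env_root, cuda_version):
    """Write conda activate/deactivate hooks that switch CUDA paths.

    Mirrors make_env_for_tf.sh: references like $PATH are left for the
    shell to expand at activation time, while cuda_version is baked in.
    """
    cuda = f"/usr/local/cuda-{cuda_version}"
    activate = f"""#!/bin/sh
ORIGINAL_LD_LIBRARY_PATH=$LD_LIBRARY_PATH
ORIGINAL_PATH=$PATH
ORIGINAL_CUDA_DIR=$CUDA_DIR
export LD_LIBRARY_PATH={cuda}/lib64:{cuda}/extras/CUPTI/lib64:$LD_LIBRARY_PATH
export PATH=$PATH:{cuda}/bin
export CUDA_DIR={cuda}
"""
    deactivate = """#!/bin/sh
export LD_LIBRARY_PATH=$ORIGINAL_LD_LIBRARY_PATH
export PATH=$ORIGINAL_PATH
export CUDA_DIR=$ORIGINAL_CUDA_DIR
unset ORIGINAL_LD_LIBRARY_PATH ORIGINAL_PATH ORIGINAL_CUDA_DIR
"""
    for sub, body in (("activate.d", activate), ("deactivate.d", deactivate)):
        d = os.path.join(env_root, "etc", "conda", sub)
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, sub.split(".")[0] + ".sh"), "w") as f:
            f.write(body)

root = tempfile.mkdtemp()
write_hooks(root, "10.1")
print(open(os.path.join(root, "etc/conda/activate.d/activate.sh")).read())
```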
- I activated the Python environment (which also sets the CUDA paths):

      source activate OCNN
      nvcc --version
  This showed on the PC:

      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2019 NVIDIA Corporation
      Built on Fri_Feb__8_19:08:17_PST_2019
      Cuda compilation tools, release 10.1, V10.1.105

  and on the laptop:

      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2019 NVIDIA Corporation
      Built on Sun_Jul_28_19:07:16_PDT_2019
      Cuda compilation tools, release 10.1, V10.1.243
- Back in the O-CNN project (under octree/build) I ran:

      cmake .. && cmake --build . --config Release
      export PATH=`pwd`:$PATH
- To use the TensorFlow code I ran the following (starting from octree/build, with the OCNN environment activated):

      conda install -c conda-forge yacs tqdm
      cmake .. -DUSE_CUDA=ON && make
      cd ../../tensorflow/libs
      python build.py

  I was getting some numpy warnings, so I ran:

      conda install gast==0.2.2 numpy==1.16.4
- Then, in tensorflow/script, I ran:

      python ../data/cls_modelnet.py --run download_m40_points
      python ../data/cls_modelnet.py --run m40_generate_octree_tfrecords
      python run_cls.py --config configs/cls_octree.yaml
  where I changed cls_octree.yaml to:
SOLVER:
  gpu: 0,
  logdir: logs/m40/0322_ocnn_octree
  run: train
  max_iter: 160000
  test_iter: 925
  test_every_iter: 100
  step_size: (400,)

DATA:
  train:
    dtype: octree
    distort: True
    depth: 5
    location: dataset/ModelNet40/m40_5_2_12_train_octree.tfrecords
    batch_size: 2
    x_alias: data
  test:
    dtype: octree
    distort: False
    depth: 5
    location: dataset/ModelNet40/m40_5_2_12_test_octree.tfrecords
    shuffle: 0
    batch_size: 2
    x_alias: data

MODEL:
  name: ocnn
  channel: 3
  nout: 40
  depth: 5

LOSS:
  num_class: 40
  weight_decay: 0.0005
I tried batch sizes 2, 4, and 8, and they all run on the laptop (on the PC I still get the bad_alloc issue).
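One side effect of lowering batch_size without touching max_iter: the number of epochs drops proportionally. A quick stdlib calculation (assuming ModelNet40's standard training split of 9843 models; `epochs` is my own hypothetical helper) shows the scale of the change:

```python
# Rough epoch count implied by the config above.
# Assumption: ModelNet40's standard split has 9843 training models.
train_size = 9843
max_iter = 160000

def epochs(batch_size):
    """Epochs covered by max_iter steps at a given batch size."""
    return max_iter * batch_size / train_size

for bs in (2, 32):
    print(bs, round(epochs(bs), 1))
```

So at batch size 2 the run sees roughly 32 epochs instead of the ~520 implied by the default batch size of 32.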
Other attributes of the PC:
- GPU card: GeForce GTX 1080 Ti
- running "free -h" gives:

                  total        used        free      shared  buff/cache   available
      Mem:          15G        4,5G        2,9G         91M        8,2G         10G
      Swap:        7,6G         33M        7,6G
Other attributes of the laptop:
- GPU card: GeForce GTX 1050
- running "free -h" gives:

                  total        used        free      shared  buff/cache   available
      Mem:          15G        6,1G        1,3G        223M        8,1G        8,9G
      Swap:         15G         51M         15G
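Since the failure looks memory-related, here is a small stdlib-only sketch for checking available memory programmatically (`mem_available_kb` is my own helper, not part of O-CNN; it parses /proc/meminfo-style text, which is what "free -h" summarizes):

```python
def mem_available_kb(meminfo_text):
    """Return MemAvailable in kB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   10485760 kB"
            return int(line.split()[1])
    return None

# On Linux you would read the real file:
# with open("/proc/meminfo") as f:
#     print(mem_available_kb(f.read()))

sample = "MemTotal: 16315392 kB\nMemAvailable: 10485760 kB\n"
print(mem_available_kb(sample))  # → 10485760
```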
A side note regarding some unit tests (test_octree_conv, test_octree_search, test_octree_gather): because self.assert.. only raises an error on failure and otherwise returns nothing, my pytest was reporting "skipped" instead of success. I therefore added to the class:

    def setUp(self):
        self.verificationErrors = []

    def tearDown(self):
        self.assertEqual([], self.verificationErrors)

and replaced every

    self.assert..

with

    try:
        self.assert..
    except AssertionError as e:
        self.verificationErrors.append(str(e))
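As a self-contained illustration of this pattern (my own toy test case, not the O-CNN tests): failures are collected during the test body and reported once, together, in tearDown:

```python
import unittest

class CollectingTest(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []

    def tearDown(self):
        # Any collected failures surface here as a single assertion
        self.assertEqual([], self.verificationErrors)

    def check_equal(self, a, b):
        # Collect instead of aborting the test at the first failure
        try:
            self.assertEqual(a, b)
        except AssertionError as e:
            self.verificationErrors.append(str(e))

    def test_values(self):
        self.check_equal(1, 1)  # passes
        self.check_equal(2, 2)  # passes

suite = unittest.defaultTestLoader.loadTestsFromTestCase(CollectingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # → True
```

If one of the `check_equal` calls had failed, the test would still run to the end, and tearDown would report all collected messages at once.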
Many thanks for your detailed feedback. I tried on my own PC just now following your instructions, and it worked without errors, so I am sorry that I cannot provide more suggestions.
By the way, I noticed that you changed the batch size from 32 to 2, 4, or 8. Since there are batch-normalization layers in the network, if the batch size is too small (such as 2 or 4), the final testing accuracy may decrease.
On this point, I suggest using a batch size of at least 16.
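To illustrate why tiny batches hurt batch normalization (a stdlib-only sketch, not O-CNN code; `batch_mean_variance` is my own helper): BN normalizes each batch with its own mean and variance, and those per-batch estimates are far noisier at batch size 2 than at 32:

```python
import random
import statistics

random.seed(0)
# Synthetic activations drawn from a unit Gaussian
data = [random.gauss(0.0, 1.0) for _ in range(4096)]

def batch_mean_variance(data, batch_size):
    """Variance of the per-batch means BN would use at this batch size."""
    means = [statistics.fmean(data[i:i + batch_size])
             for i in range(0, len(data), batch_size)]
    return statistics.pvariance(means)

for bs in (2, 8, 32):
    print(bs, batch_mean_variance(data, bs))
```

The variance of the batch means scales roughly as 1/batch_size, so the normalization statistics at batch size 2 are about 16x noisier than at 32, which is consistent with the accuracy drop mentioned above.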
You are right, thanks for mentioning it :)