training problem

lapetite123 opened this issue 5 years ago

I ran the following command to train:

python train.py --gpu 2 --batch_size 24 --max_epoch 100 --log_dir log5 --learning_rate 0.001 --decay_step 300000 --restore_model None --input_list /home/ASIS/data/train_hdf5_file_list_woArea5.txt

but it reported:
2019-09-30 22:36:43.451109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:02:00.0
totalMemory: 10.73GiB freeMemory: 64.56MiB
2019-09-30 22:36:43.557029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:03:00.0
totalMemory: 10.73GiB freeMemory: 54.56MiB
2019-09-30 22:36:43.696038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 2 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:82:00.0
totalMemory: 10.73GiB freeMemory: 62.56MiB
2019-09-30 22:36:43.817509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 3 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:83:00.0
totalMemory: 10.73GiB freeMemory: 64.56MiB
2019-09-30 22:36:43.817819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Device peer to peer matrix
2019-09-30 22:36:43.818075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1126] DMA: 0 1 2 3
2019-09-30 22:36:43.818086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 0: Y N N N
2019-09-30 22:36:43.818093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 1: N Y N N
2019-09-30 22:36:43.818098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 2: N N Y N
2019-09-30 22:36:43.818104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 3: N N N Y
2019-09-30 22:36:43.818116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2019-09-30 22:36:43.818124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5)
2019-09-30 22:36:43.818131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:82:00.0, compute capability: 7.5)
2019-09-30 22:36:43.818138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:83:00.0, compute capability: 7.5)
2019-09-30 22:36:44.557000: E tensorflow/core/common_runtime/direct_session.cc:168] Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
Traceback (most recent call last):
  File "train.py", line 256, in <module>
    train()
  File "train.py", line 165, in train
    sess = tf.Session(config=config)
  File "/home/anaconda3/envs/asis/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1509, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/anaconda3/envs/asis/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 628, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/anaconda3/envs/asis/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

I passed --gpu 2 to train on GPU 2, so I don't understand why the error says GPU 0 is out of memory. Could you please tell me how to solve this?
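
One thing that might be worth trying, as a minimal sketch rather than a confirmed fix: the log above shows TensorFlow creating devices for all four GPUs, so the session initializes GPU 0 even though --gpu 2 was requested. Setting the CUDA_VISIBLE_DEVICES environment variable before TensorFlow is imported hides the other cards from the process. The helper name restrict_visible_gpus is hypothetical and is not part of train.py. Note also that every GPU in the log reports only about 55 to 65 MiB of free memory, so GPU 2 itself may be occupied by another process, in which case this alone would not help.

```python
import os


def restrict_visible_gpus(gpu_index):
    """Expose a single physical GPU to CUDA (hypothetical helper).

    Must run before `import tensorflow`, because CUDA reads these
    environment variables when it is first initialized.
    """
    # Number GPUs by PCI bus ID so the index matches nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = restrict_visible_gpus(2)
print("CUDA_VISIBLE_DEVICES =", visible)

# Only after the call above (left as comments so the sketch runs
# without TensorFlow installed):
#   import tensorflow as tf
#   config = tf.ConfigProto()
#   config.gpu_options.allow_growth = True  # allocate memory on demand
#   sess = tf.Session(config=config)
```

With only one GPU visible, the process sees it as device 0, so the script would presumably need --gpu 0 instead of --gpu 2. The same effect can be had from the shell by prefixing the command with CUDA_VISIBLE_DEVICES=2.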