sogou / SogouMRCToolkit

This toolkit was designed for the fast and efficient development of modern machine comprehension models, including both published models and original prototypes.


ran out of memory

SunYanCN opened this issue

TF: tensorflow-gpu==1.12
GPU: Tesla P4, 8 GB
I tried running run_bidafplus_squad.py and it reported GPU memory allocation problems. I don't know whether this affects the training results.

2019-04-07 05:11:40.657538: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-07 05:11:41.446788: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-07 05:11:41.447151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:06.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2019-04-07 05:11:41.447178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 05:11:41.882084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 05:11:41.882132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-07 05:11:41.882141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-07 05:11:41.882363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7051 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:06.0, compute capability: 6.1)
2019-04-07 05:11:42,321 - root - INFO - Reading file at train-v1.1.json
2019-04-07 05:11:42,322 - root - INFO - Processing the dataset.
87599it [07:43, 189.13it/s]
2019-04-07 05:19:25,497 - root - INFO - Reading file at dev-v1.1.json
2019-04-07 05:19:25,497 - root - INFO - Processing the dataset.
10570it [00:53, 196.53it/s]
2019-04-07 05:20:19,349 - root - INFO - Building vocabulary.
100%|███████████████████████████████████| 98169/98169 [00:30<00:00, 3218.07it/s]
2019-04-07 05:21:05.747563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 05:21:05.747695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 05:21:05.747711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-07 05:21:05.747718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-07 05:21:05.747925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7051 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:06.0, compute capability: 6.1)
2019-04-07 05:21:06.489069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 05:21:06.489145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 05:21:06.489156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-07 05:21:06.489162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-07 05:21:06.489389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7051 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:06.0, compute capability: 6.1)
2019-04-07 05:21:07.117979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 05:21:07.118055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 05:21:07.118066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-07 05:21:07.118072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-07 05:21:07.118278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7051 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:06.0, compute capability: 6.1)
2019-04-07 05:21:13,046 - root - INFO - Epoch 1/15
2019-04-07 05:21:13,351 - root - INFO - Eposide 1/2
2019-04-07 05:21:23.422390: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 10494 of 87599
2019-04-07 05:21:33.422566: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 21931 of 87599
2019-04-07 05:21:43.422157: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 32210 of 87599
2019-04-07 05:21:53.422415: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 42018 of 87599
2019-04-07 05:22:03.422089: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 52336 of 87599
2019-04-07 05:22:13.422587: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 62125 of 87599
2019-04-07 05:22:23.422099: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 72157 of 87599
2019-04-07 05:22:33.421957: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 82242 of 87599
2019-04-07 05:22:38.605655: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:136] Shuffle buffer filled.
2019-04-07 05:22:57.952087: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.88G (3091968768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-04-07 05:23:27.134938: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.96GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:24:09.911666: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.28GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:01.375542: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.23GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:01.673176: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.94GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:33.173192: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.92GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:33.490319: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.93GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:33.502105: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.52GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:44,872 - root - INFO - - Train metrics: loss: 5.875
2019-04-07 05:28:46.141381: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:46.477394: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.64GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:47.501813: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:29:05,078 - root - INFO - - Eval metrics: loss: 3.759
2019-04-07 05:29:21,705 - root - INFO - - Eval metrics: exact_match: 51.325 ; f1: 63.040
2019-04-07 05:29:21,705 - root - INFO - - epoch 1 eposide 1: Found new best score: 63.039909
2019-04-07 05:29:21,705 - root - INFO - Eposide 2/2
2019-04-07 05:34:47,135 - root - INFO - - Train metrics: loss: 4.882
2019-04-07 05:35:02,895 - root - INFO - - Eval metrics: loss: 3.376
2019-04-07 05:35:19,210 - root - INFO - - Eval metrics: exact_match: 57.313 ; f1: 68.490
2019-04-07 05:35:19,210 - root - INFO - - epoch 1 eposide 2: Found new best score: 68.490210
2019-04-07 05:35:19,210 - root - INFO - Epoch 2/15
2019-04-07 05:35:19,213 - root - INFO - Eposide 1/2

@SunYanCN Our examples are tested on P40 and V100 GPUs, so we have not run into this problem ourselves. You could try a smaller batch_size or shuffle_ratio in BatchGenerator.
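A minimal sketch of that suggestion, assuming the BatchGenerator construction in run_bidafplus_squad.py accepts batch_size and shuffle_ratio as keyword arguments (only those two names come from the comment above; the import path, the other arguments, and the concrete values are illustrative and may differ from the script):

```python
# Hypothetical values: shrink batch_size and the shuffle fraction until the
# CUDA_ERROR_OUT_OF_MEMORY messages stop on the 8 GB P4. `vocab` and
# `train_data` stand for the objects already built earlier in the script.
from sogou_mrc.data.batch_generator import BatchGenerator

train_batch_generator = BatchGenerator(
    vocab,              # vocabulary built from the training set
    train_data,         # instances returned by the SQuAD reader
    training=True,
    batch_size=16,      # smaller than the example default
    shuffle_ratio=0.1,  # shuffle a smaller slice of the 87,599 instances at a time
)
```

Note that the bfc_allocator messages in the log explicitly say they are not failures, which is consistent with training still completing epoch 1 and reaching f1 68.49; a smaller batch_size mainly trades memory for more steps per epoch.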

Can you tell me your versions of CUDA and cuDNN? I run into trouble when I try to run it.

Possibly insufficient driver version: 415.27.0
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
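For the driver/cuDNN question, a quick sanity check from Python may help (a sketch only: it confirms the TensorFlow build and whether a GPU can be initialized, while the driver and cuDNN versions themselves still have to be read from nvidia-smi and the installed cuDNN library). tensorflow-gpu==1.12 was built against CUDA 9.0 and cuDNN 7, and the "Failed to get convolution algorithm" error above is what typically appears when the installed cuDNN or driver does not match that combination.

```python
# Sketch: verify the TensorFlow 1.12 GPU build can see and initialize the GPU.
# If is_gpu_available() returns False, the CUDA driver or cuDNN install is the
# usual suspect rather than the toolkit itself.
import tensorflow as tf

print(tf.__version__)                # expected: 1.12.x
print(tf.test.is_built_with_cuda())  # True for the tensorflow-gpu package
print(tf.test.is_gpu_available())    # exercises the driver + cuDNN stack
```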