FACEGOOD / FACEGOOD-Audio2Face

http://www.facegood.cc

GPU out-of-memory error when training the model & could a trained model.ckpt be uploaded?

pbhfcycssjlmm opened this issue · comments

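Problem overview

When I run the command
python step14_train.py --epochs 8 --dataSet dataSet1
the program eventually terminates with an error, and the console shows the following (the full error output is at the end):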
(0) Internal: Blas GEMM launch failed : a.shape=(32, 272), b.shape=(272, 150), m=32, n=150, k=272
[[node dense/MatMul (defined at C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]

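Initial troubleshooting

I searched online for this error message and found that most explanations point to insufficient GPU memory or to the GPU being occupied by another process.
After checking, this program was the only one running on the GPU. Following the advice I found online, I then added allow_growth=True to tf.GPUOptions and also tried lowering per_process_gpu_memory_fraction. Neither helped (both failed, whether tried separately or together).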
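For readers unfamiliar with the two options mentioned above, the following is a minimal sketch of how they are usually set in TensorFlow 1.x-style code. It is not the actual code from step14_train.py. The value 0.8 is taken from the fraction mentioned later in this post, and the variable names are only illustrative.

```python
import tensorflow as tf

# Illustrative settings only; the exact values tried are not all stated in the post.
gpu_options = tf.compat.v1.GPUOptions(
    allow_growth=True,                    # allocate GPU memory on demand
    per_process_gpu_memory_fraction=0.8,  # cap TensorFlow's share of GPU memory
)
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)

# The config would then be passed when the session is created, e.g.:
# with tf.compat.v1.Session(config=config) as sess:
#     sess.run(...)
```

Both options only affect how TensorFlow's own allocator reserves memory, so they would not be expected to help if the failure has a different cause.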

Machine configuration

OS: Windows 10
GPU: device:GPU:0 with 9830 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:01:00.0, compute capability: 8.6
Python version: 3.7.11
CUDA version: cuda_10.0.130_411.31_win10
cuDNN version: cudnn-10.0-windows10-x64-v7.6.5.32

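GPU memory usage while the program runs

In addition, GPU memory usage changed as follows while the program was running:
After cublas64_100.dll was loaded, GPU memory usage jumped straight from 437 MB to 10358 MB (out of 12288 MB total). Usage was then already at 84.3%, which seems to exceed the per_process_gpu_memory_fraction=0.8 limit.
After cudnn64_7.dll was loaded, GPU memory usage reached 10424 MB (out of 12288 MB total), with a peak of 10645 MB.
After that, the program crashed.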
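The percentages above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this post. One thing it shows is that 0.8 × 12288 MB is about 9830 MB, the same number TensorFlow prints in its "Created TensorFlow device ... with 9830 MB memory" log line. The 0.8 fraction therefore does appear to have been applied to TensorFlow's own pool. Device-wide usage also includes the 437 MB that was already in use, and presumably some memory outside that pool (an assumption, not something the log confirms).

```python
# Figures quoted in the post, all in MB.
TOTAL_MB = 12288
BASELINE_MB = 437         # usage before cublas64_100.dll was loaded
AFTER_CUBLAS_MB = 10358   # usage right after cublas64_100.dll was loaded
PEAK_MB = 10645           # highest usage observed
FRACTION = 0.8            # per_process_gpu_memory_fraction


def share(used_mb, total_mb=TOTAL_MB):
    """Return used memory as a fraction of total device memory."""
    return used_mb / total_mb


tf_budget_mb = FRACTION * TOTAL_MB
print(f"TensorFlow budget at fraction {FRACTION}: {tf_budget_mb:.0f} MB")
print(f"Usage after cuBLAS load: {share(AFTER_CUBLAS_MB):.1%}")
print(f"Peak usage: {share(PEAK_MB):.1%}")
print(f"Jump at cuBLAS load: {AFTER_CUBLAS_MB - BASELINE_MB} MB")
```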

Full error output

2022-03-04 09:34:16.145502: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
WARNING:tensorflow:From step14_train.py:30: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:From step14_train.py:37: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

(2200, 32, 64, 1)
(2200, 90000)
(1000, 32, 64, 1)
(1000, 90000)
WARNING:tensorflow:From step14_train.py:86: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

WARNING:tensorflow:From step14_train.py:86: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From step14_train.py:90: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From D:\Ningxin\Coding\Voice2Face-main\code\train\model_paper.py:21: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras.layers.Conv2D instead.
WARNING:tensorflow:From C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\layers\convolutional.py:424: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use layer.__call__ method instead.
WARNING:tensorflow:From D:\Ningxin\Coding\Voice2Face-main\code\train\model_paper.py:49: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From D:\Ningxin\Coding\Voice2Face-main\code\train\model_paper.py:51: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From D:\Ningxin\Coding\Voice2Face-main\code\train\model_paper.py:52: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
WARNING:tensorflow:From step14_train.py:98: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From step14_train.py:103: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

WARNING:tensorflow:From step14_train.py:105: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From step14_train.py:105: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From step14_train.py:106: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From step14_train.py:108: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From step14_train.py:111: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

WARNING:tensorflow:From step14_train.py:122: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From step14_train.py:122: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2022-03-04 09:34:25.346984: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2022-03-04 09:34:25.351173: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2022-03-04 09:34:25.389703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: NVIDIA GeForce RTX 3080 Ti major: 8 minor: 6 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2022-03-04 09:34:25.389860: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2022-03-04 09:34:25.469703: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2022-03-04 09:34:25.566662: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2022-03-04 09:34:25.589486: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2022-03-04 09:34:25.662159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2022-03-04 09:34:25.713455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2022-03-04 09:34:25.799900: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2022-03-04 09:34:25.800359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2022-03-04 09:37:34.818241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-04 09:37:34.818380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2022-03-04 09:37:34.818619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2022-03-04 09:37:34.819527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9830 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:01:00.0, compute capability: 8.6)
WARNING:tensorflow:From step14_train.py:126: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From step14_train.py:127: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

2022-03-04 09:37:35.857089: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2022-03-04 09:38:32.908660: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2022-03-04 09:49:39.479241: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2022-03-04 09:49:39.610678: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
target_list, run_metadata)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(32, 272), b.shape=(272, 150), m=32, n=150, k=272
[[{{node dense/MatMul}}]]
[[Adam/update/_38]]
(1) Internal: Blas GEMM launch failed : a.shape=(32, 272), b.shape=(272, 150), m=32, n=150, k=272
[[{{node dense/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "step14_train.py", line 190, in <module>
train()
File "step14_train.py", line 136, in train
train_op = sess.run(train_step, feed_dict={data: train_data, label: train_label, keep_pro: 0.5})
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(32, 272), b.shape=(272, 150), m=32, n=150, k=272
[[node dense/MatMul (defined at C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
[[Adam/update/_38]]
(1) Internal: Blas GEMM launch failed : a.shape=(32, 272), b.shape=(272, 150), m=32, n=150, k=272
[[node dense/MatMul (defined at C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'dense/MatMul':
File "step14_train.py", line 190, in <module>
train()
File "step14_train.py", line 95, in train
output, emotion_input = net(data,output_size,keep_pro)
File "D:\Ningxin\Coding\Voice2Face-main\code\train\model_paper.py", line 51, in net
fc1 = tf.layers.dense(inputs=flat, units=150 , activation=None) # activation=None means a linear activation is used
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\layers\core.py", line 187, in dense
return layer.apply(inputs)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 1700, in apply
return self.call(inputs, *args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\layers\base.py", line 548, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 854, in call
outputs = call_fn(cast_inputs, *args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\autograph\impl\api.py", line 234, in wrapper
return converted_call(f, options, args, kwargs)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\autograph\impl\api.py", line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\autograph\impl\api.py", line 330, in _call_unconverted
return f(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\keras\layers\core.py", line 1050, in call
outputs = gen_math_ops.mat_mul(inputs, self.kernel)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\ops\gen_math_ops.py", line 6136, in mat_mul
name=name)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "C:\ProgramData\Anaconda3\envs\py37_tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()