Getting Error when training model

Question

ShiinaMitsuki opened this issue 6 years ago · comments

Hi there, I followed the instructions in the README but got the error below:

(dcgan) [sobey123@localhost DCGAN-tensorflow]$ ./train.sh
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
{'batch_size': 64,
'beta1': 0.5,
'checkpoint_dir': 'checkpoint',
'crop': False,
'dataset': 'market',
'epoch': 100,
'input_fname_pattern': '.jpg',
'input_height': 128,
'input_width': None,
'learning_rate': 0.0002,
'options': 1,
'output_height': 256,
'output_path': 'duke_result',
'output_width': None,
'sample_dir': 'samples',
'sample_size': 1000,
'train': True,
'train_size': inf,
'unrolled_lstm': False,
'visualize': False}
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:84:00.0
Total memory: 7.93GiB
Free memory: 7.83GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:84:00.0)
WARNING:tensorflow:From /home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py:109 in build_model.: histogram_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.histogram. Note that tf.summary.histogram uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on their scope.
Traceback (most recent call last):
File "main.py", line 103, in <module>
tf.app.run()
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "main.py", line 81, in main
sample_dir=FLAGS.sample_dir)
File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 89, in __init__
self.build_model()
File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 114, in build_model
self.D_, self.D_logits_ = self.discriminator(self.G, self.y, reuse=True)
File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 324, in discriminator
h4 = linear(tf.reshape(h3, [self.batch_size, -1]), 1, 'd_h4_lin')
File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/ops.py", line 98, in linear
tf.random_normal_initializer(stddev=stddev))
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
custom_getter=custom_getter)
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
custom_getter=custom_getter)
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
validate_shape=validate_shape)
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 637, in _get_single_variable
found_var.get_shape()))
ValueError: Trying to share variable discriminator/d_h4_lin/Matrix, but specified shape (131072, 1) and found shape (32768, 1).
(dcgan) [sobey123@localhost DCGAN-tensorflow]$ vim train.sh
(dcgan) [sobey123@localhost DCGAN-tensorflow]$ ./train.sh
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
{'batch_size': 64,
'beta1': 0.5,
'checkpoint_dir': 'checkpoint',
'crop': False,
'dataset': 'market',
'epoch': 25,
'input_fname_pattern': '.jpg',
'input_height': 108,
'input_width': None,
'learning_rate': 0.0002,
'options': 1,
'output_height': 64,
'output_path': 'duke_result',
'output_width': None,
'sample_dir': 'samples',
'sample_size': 1000,
'train': False,
'train_size': inf,
'unrolled_lstm': False,
'visualize': False}
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:84:00.0
Total memory: 7.93GiB
Free memory: 7.83GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:84:00.0)
WARNING:tensorflow:From /home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py:109 in build_model.: histogram_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.histogram. Note that tf.summary.histogram uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on their scope.
Traceback (most recent call last):
File "main.py", line 103, in <module>
tf.app.run()
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "main.py", line 81, in main
sample_dir=FLAGS.sample_dir)
File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 89, in __init__
self.build_model()
File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 114, in build_model
self.D_, self.D_logits_ = self.discriminator(self.G, self.y, reuse=True)
File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 324, in discriminator
h4 = linear(tf.reshape(h3, [self.batch_size, -1]), 1, 'd_h4_lin')
File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/ops.py", line 98, in linear
tf.random_normal_initializer(stddev=stddev))
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
custom_getter=custom_getter)
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
custom_getter=custom_getter)
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
validate_shape=validate_shape)
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 637, in _get_single_variable
found_var.get_shape()))
ValueError: Trying to share variable discriminator/d_h4_lin/Matrix, but specified shape (8192, 1) and found shape (25088, 1).

I just ran conda env create -f dcgan.yml, activated the environment, and then ran python main.py --dataset market --options 1.

It seems this line of code causes the problem:
model.py line 114
self.D_, self.D_logits_ = self.discriminator(self.G, self.y, reuse=True)

Why is the discriminator built twice?
Many thanks in advance!

gq · Answer 1 · Thu Jun 07 2018 20:11:14 GMT+0800 (China Standard Time)

Hey, I altered main.py: I just changed the values of input_height and output_height to 128. Could you run the code with that change and see whether it solves the problem?
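
A likely reason this works: model.py builds the discriminator twice with shared weights (reuse=True), once on the real images and once on the generator output self.G. With crop set to False, the real images presumably enter at input_height and the generated ones at output_height, so the flattened input to d_h4_lin has a different size in each call. The sketch below is only an estimate. It assumes four stride-2, SAME-padded convolutions ending in 64 * 8 = 512 channels, which is an assumption about model.py and not something confirmed in this thread. The helper name discriminator_flat_size is made up for illustration.

```python
import math


def discriminator_flat_size(height, width=None, num_convs=4, df_dim=64):
    """Estimate the flattened feature count fed into d_h4_lin.

    Assumptions (not confirmed by the thread): num_convs stride-2
    convolutions with SAME padding, square images when width is None,
    and df_dim * 8 channels after the last convolution.
    """
    if width is None:
        width = height
    for _ in range(num_convs):
        # SAME padding with stride 2 rounds the spatial size up.
        height = math.ceil(height / 2)
        width = math.ceil(width / 2)
    return height * width * df_dim * 8


# (input_height, output_height) pairs taken from the two logged runs.
for label, input_height, output_height in [("run 1", 128, 256), ("run 2", 108, 64)]:
    real = discriminator_flat_size(input_height)
    fake = discriminator_flat_size(output_height)
    print(label, "real:", real, "generated:", fake, "match:", real == fake)
```

Under these assumptions the four sizes come out as 32768 and 131072 for the first run and 25088 and 8192 for the second, which are the shapes in the two ValueError messages. Setting input_height and output_height to the same value makes both calls agree.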

fei · Answer 2 · Fri Jun 08 2018 17:26:26 GMT+0800 (China Standard Time)

Problem solved, thanks for helping!!
One more question: how long did it take to train the DCGAN on market1501?
I'm now at epoch 300, but the sample images are still poor. My d_loss is small and g_loss tends to grow as the epochs go on.

I'm unfamiliar with GANs, but according to the loss function proposed by the paper, it seems that g_loss should be small and d_loss should be big. I suspect that 300 epochs may be far from enough.
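
The paper's formula is not reproduced in this thread. As a rough reference point, the sketch below assumes both losses are plain sigmoid cross-entropy terms, which is common for DCGAN implementations but is an assumption here. The function names bce and gan_losses are hypothetical. It compares a balanced discriminator (outputs 0.5 everywhere) with one that is confidently right, which is the pattern a small d_loss and a growing g_loss would suggest.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for a single predicted probability."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def gan_losses(d_real, d_fake):
    """Return (d_loss, g_loss) given the discriminator's probabilities.

    Assumed form (not taken from the thread):
    d_loss scores real images against label 1 and fakes against label 0;
    g_loss scores fakes against label 1 (the non-saturating variant).
    """
    d_loss = bce(d_real, 1) + bce(d_fake, 0)
    g_loss = bce(d_fake, 1)
    return d_loss, g_loss


balanced = gan_losses(d_real=0.5, d_fake=0.5)
dominant = gan_losses(d_real=0.99, d_fake=0.01)

print("balanced discriminator:  d_loss=%.3f g_loss=%.3f" % balanced)
print("dominant discriminator:  d_loss=%.3f g_loss=%.3f" % dominant)
```

With these assumed losses, the balanced case gives d_loss of about 1.39 (2 ln 2) and g_loss of about 0.69 (ln 2), while the dominant case gives a d_loss near zero and a g_loss above 4. Read that way, a small d_loss with a rising g_loss may indicate that the discriminator is overpowering the generator, which more epochs alone might not fix.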