DmitryUlyanov / texture_nets

Code for "Texture Networks: Feed-forward Synthesis of Textures and Stylized Images" paper.

Encounter error when using models "pyramid" and "skip_unpool" in training and testing

michaelhuang74 opened this issue · comments

There is no error when using the "johnson" model in training. However, when I try to use the "pyramid" model in training, an error occurs.

Following is the command:
th train.lua -style_image style/witch.jpg -style_size 600 -checkpoints_path checkpoint/ -checkpoints_name witch.sw3.p. -style_weight 3 -model pyramid -num_iterations 10000 -batch_size 2

Following is the output on the stdout:
torch.display not found. unable to plot
Using TV loss with weight 1e-06
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded data/pretrained/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
Setting up texture layer 4 : relu1_2
Setting up texture layer 9 : relu2_2
Setting up texture layer 14 : relu3_2
Setting up content layer 23 : relu4_2
Setting up texture layer 23 : relu4_2
Optimize
/home/mqhuang/torch/install/bin/luajit: /home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 1 module of nn.Concat:
In 12 module of nn.Sequential:
/home/mqhuang/torch/install/share/lua/5.1/torch/Tensor.lua:457: expecting a contiguous tensor
stack traceback:
[C]: in function 'assert'
/home/mqhuang/torch/install/share/lua/5.1/torch/Tensor.lua:457: in function 'view'
./InstanceNormalization.lua:71: in function 'updateGradInput'
/home/mqhuang/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:91: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:47>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
train.lua:150: in function 'opfunc'
/home/mqhuang/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'optim_method'
train.lua:175: in main chunk
[C]: in function 'dofile'
...uang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x004065d0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
train.lua:150: in function 'opfunc'
/home/mqhuang/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'optim_method'
train.lua:175: in main chunk
[C]: in function 'dofile'
...uang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x004065d0

The OS is ubuntu 14.04. I have one Tesla K40 and one Titan X-Pascal in the system. The error happens to both GPUs.

Any idea is appreciated.

The easiest way to fix it is to add nn.Contiguous before every instance norm module. Like that:
https://gist.github.com/DmitryUlyanov/f8c455585c1c2d8a9d14f5d914c2b57b
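For reference, the failure mode is a generic one: Torch's `:view()` requires the tensor's underlying storage to be contiguous, and the gradients flowing back through `nn.Concat` may not be, so `nn.Contiguous` makes a compact copy first. The same behavior can be illustrated with numpy (used here purely as an illustration; the actual fix is the `nn.Contiguous` insertion in the gist above):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b = a.T  # a transposed view: same storage, non-contiguous strides
assert not b.flags["C_CONTIGUOUS"]

# In-place reshape (the analogue of Torch's :view()) rejects it,
# which is the same class of error as "expecting a contiguous tensor":
try:
    b.shape = (12,)
except AttributeError:
    print("expecting a contiguous tensor")

# Making a contiguous copy first (what nn.Contiguous does) fixes it:
c = np.ascontiguousarray(b)
c.shape = (12,)
print(c[:4])
```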

@DmitryUlyanov Thanks for the quick reply.

However, I copied your code into pyramid_.lua and tried the pyramid model again. I encountered the same error, as follows.
torch.display not found. unable to plot
Using TV loss with weight 1e-06
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded data/pretrained/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
Setting up texture layer 4 : relu1_2
Setting up texture layer 9 : relu2_2
Setting up texture layer 14 : relu3_2
Setting up content layer 23 : relu4_2
Setting up texture layer 23 : relu4_2
Optimize
/home/mqhuang/torch/install/bin/luajit: /home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 1 module of nn.Concat:
In 16 module of nn.Sequential:
/home/mqhuang/torch/install/share/lua/5.1/torch/Tensor.lua:457: expecting a contiguous tensor
stack traceback:
[C]: in function 'assert'
/home/mqhuang/torch/install/share/lua/5.1/torch/Tensor.lua:457: in function 'view'
./InstanceNormalization.lua:71: in function 'updateGradInput'
/home/mqhuang/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:91: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:47>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
train.lua:150: in function 'opfunc'
/home/mqhuang/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'optim_method'
train.lua:175: in main chunk
[C]: in function 'dofile'
...uang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x004065d0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
train.lua:150: in function 'opfunc'
/home/mqhuang/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'optim_method'
train.lua:175: in main chunk
[C]: in function 'dofile'
...uang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x004065d0

There is a post on Stack Overflow discussing a similar issue: http://stackoverflow.com/questions/32188392/expecting-a-contiguous-tensor-error-with-nn-sum

I wasn't able to modify pyramid.lua following the Stack Overflow post. In addition, I noticed that you also call normalization() in johnson.lua, yet running the johnson model produces no error.

When I set the batch_size to 1, I no longer encounter the "expecting a contiguous tensor" error, with or without modifying the pyramid.lua model. I started a training run with batch_size=1. The training stopped after 12000 iterations on an Nvidia Tesla K40, although I requested 40000 iterations. GPU memory should not be the issue, because the K40 has 12 GB; the training simply terminated by itself after 12000 iterations.

The following was the command:
nohup th train.lua -style_image style/witch.jpg -style_size 600 -checkpoints_path checkpoint/ -checkpoints_name witch.sw3.mp. -model pyramid -num_iterations 40000 -batch_size 1 > train_0.out &

When batch_size is larger than 1, I encounter the "expecting a contiguous tensor" error whether or not I modify pyramid.lua.

commented

I had the same experience and ended up using batch_size=1.

I generated a couple of .t7 files using the "pyramid" model. The .t7 files from the "johnson" model are typically 20 MB, whereas the .t7 files from the "pyramid" model are around 7.5 MB.

For training with both models, the parameters are the same, as follows:
-image_size 512, -style_size 600, -style_weight 3, -content_weight 1, -style_layers relu1_2,relu2_2,relu3_2,relu4_2 -content_layers relu4_2 ...

When I use test.lua to generate the stylized image, there is no error when using .t7 files from the "johnson" model. However, I encounter the following error when using .t7 files from the "pyramid" model.

Command: th test.lua -input_image inputimage/obama.jpg -model_t7 checkpoint/witch.sw3.mp.3000.t7 -save_path outputimage/obama.witch.sw3.mp.3000.jpg

/home/mqhuang/torch/install/bin/luajit: /home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 1 module of nn.Concat:
In 1 module of nn.Sequential:
In 1 module of nn.Concat:
In 1 module of nn.Sequential:
In 1 module of nn.Concat:
In 1 module of nn.Sequential:
In 1 module of nn.Concat:
In 1 module of nn.Sequential:
/home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:27: bad argument #1 to 'copy' (sizes do not match at /tmp/luarocks_cutorch-scm-1-6069/cutorch/lib/THC/THCTensorCopy.cu:31)
stack traceback:
[C]: in function 'copy'
/home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:27: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:9>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:14: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:9>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:14: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:9>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
merge.lua:41: in main chunk
[C]: in function 'dofile'
...uang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
merge.lua:41: in main chunk
[C]: in function 'dofile'
...uang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

Has anyone had the same experience? How can it be resolved? Thanks.

I will take a look in a day

@DmitryUlyanov Thanks.

For the "skip_unpool" model, I have the same experience.

(1) I have to use batch_size=1 in training. The settings are the same as those for the "johnson" and "pyramid" models. The size of the .t7 files with the "skip_unpool" model is about 13 MB.

(2) When I try to use test.lua to generate the stylized image, I encounter the following error.

mqhuang@keplerGpu:~/texture_nets$ th merge.lua -input_image inputimage/obama.jpg -model_t7 checkpoint/witch.sw3.ms.18000.t7 -save_path outputimage/obama.witch.sw3.ms.18000.jpg
/home/mqhuang/torch/install/bin/luajit: /home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 2 module of nn.ConcatTable:
In 1 module of nn.Sequential:
In 2 module of nn.Concat:
In 8 module of nn.Sequential:
In 1 module of nn.Sequential:
In 2 module of nn.Concat:
In 8 module of nn.Sequential:
In 1 module of nn.Sequential:
In 2 module of nn.Concat:
In 8 module of nn.Sequential:
In 1 module of nn.Sequential:
In 2 module of nn.Concat:
In 8 module of nn.Sequential:
/home/mqhuang/torch/install/share/lua/5.1/nn/THNN.lua:110: bad argument #4 to 'v' (cannot convert 'struct THCudaTensor *' to 'struct THCudaLongTensor *')
stack traceback:
[C]: in function 'v'
/home/mqhuang/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'SpatialMaxUnpooling_updateOutput'
...g/torch/install/share/lua/5.1/nn/SpatialMaxUnpooling.lua:18: in function <...g/torch/install/share/lua/5.1/nn/SpatialMaxUnpooling.lua:16>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:14: in function </home/mqhuang/torch/install/share/lua/5.1/nn/Concat.lua:9>
[C]: in function 'xpcall'
...
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...e/mqhuang/torch/install/share/lua/5.1/nn/ConcatTable.lua:11: in function <...e/mqhuang/torch/install/share/lua/5.1/nn/ConcatTable.lua:9>
[C]: in function 'xpcall'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
merge.lua:41: in main chunk
[C]: in function 'dofile'
...uang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x004065d0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
merge.lua:41: in main chunk
[C]: in function 'dofile'
...uang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x004065d0

@DmitryUlyanov

In #24, you mentioned that for the "pyramid" model, the width and height of the input image used in test.lua should be multiples of 32. I resized the input image from 1200x1200 to 800x800, and test.lua was then able to generate the output image using the .t7 file based on "pyramid".
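The multiple-of-32 constraint fits the pyramid architecture's repeated 2x down/upsampling: after several halvings, odd intermediate sizes no longer recombine cleanly. A small helper for snapping a dimension to a valid size (the function name and the choice to round down are my own, not part of the repo):

```python
def snap_to_multiple(size, multiple=32):
    """Round a dimension down to the nearest multiple of `multiple`
    (32 = 2**5, i.e. five clean halvings), with a floor of one multiple."""
    return max(multiple, (size // multiple) * multiple)

# e.g. the 1200x1200 input above is not divisible by 32, but 800x800 is:
print(snap_to_multiple(1200))  # 1184
print(snap_to_multiple(800))   # 800
```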

For the .t7 files based on the "skip_unpool" model, test.lua still encounters the error above even when the width and height of the input image are multiples of 32.

@michaelhuang74 The error is

/home/mqhuang/torch/install/share/lua/5.1/nn/THNN.lua:110: bad argument #4 to 'v' (cannot convert 'struct THCudaTensor *' to 'struct THCudaLongTensor *')

Torch is currently transitioning from supporting only a float32 GPU type to supporting any type. The modules are being updated, but some are still outdated. It seems that either you are using old versions of nn and cunn, or nn.SpatialMaxUnpooling has not been updated yet. This is the kind of bug I cannot handle, since it was definitely working some time ago...

Should work now.

@DmitryUlyanov Thanks for the updated InstanceNormalization.lua.

With the updated InstanceNormalization.lua, I am now able to train the style image with the pyramid model with batch_size > 1.

However, I find that when using a .t7 file trained with batch_size > 1 (e.g., batch_size = 4), the output of test.lua is pure black, as in issue #45. With a .t7 file trained with batch_size = 1, test.lua produces normal output.

@michaelhuang74, I cannot reproduce the error. I tried both the johnson and pyramid models with batch_size=2 at train time, and both worked fine at test time too.

@DmitryUlyanov, I tested again today with batch_size = 1, 2, and 4 for the pyramid model. All generated good output at test time. The failure three days ago may have been random.

Thanks.