Help, getting error while training!

Yujiehang opened this issue 7 years ago

I followed the basic command and tried to train on the dataset, but an error occurred.
This is the command I ran:
th train.lua -data dataset -style_image wave.jpg -cpu
This is the output on stdout:
`=> Generating list of images
| finding all validation images
| finding all training images
| saving list of images to /home/jiehang/texture_nets/gen/style.t7
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded data/pretrained/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
Using TV loss with weight 0
Setting up texture layer 2 : relu1_1
Setting up texture layer 7 : relu2_1
Setting up texture layer 12 : relu3_1
Setting up texture layer 21 : relu4_1
Setting up content layer 23 : relu4_2
Optimize
/home/jiehang/torch/install/bin/luajit: /home/jiehang/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
/home/jiehang/torch/install/share/lua/5.1/cudnn/init.lua:171: assertion failed!
stack traceback:
[C]: in function 'assert'
/home/jiehang/torch/install/share/lua/5.1/cudnn/init.lua:171: in function 'toDescriptor'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:123: in function 'createIODescriptors'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:188: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
[C]: in function 'xpcall'
/home/jiehang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/jiehang/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./src/descriptor_net.lua:28: in function 'forward'
train.lua:211: in function 'opfunc'
/home/jiehang/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'optim_method'
train.lua:240: in main chunk
[C]: in function 'dofile'
...hang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/jiehang/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/jiehang/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./src/descriptor_net.lua:28: in function 'forward'
train.lua:211: in function 'opfunc'
/home/jiehang/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'optim_method'
train.lua:240: in main chunk
[C]: in function 'dofile'
...hang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
`

Dmitry Ulyanov · Answer 1 · Wed Aug 02 2017 21:36:32 GMT+0800 (China Standard Time)

For some reason cudnn is being used in the CPU version. That is a bug, but I still do not think you will want to train the models on a CPU, as it takes too long. Just find a GPU and it will work :)
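
For anyone who still wants to try the CPU path: the trace shows `cudnn/SpatialConvolution.lua` being hit inside the descriptor network even though `-cpu` was passed, so one possible workaround is to make sure the VGG layers are built with the plain `nn` backend whenever the CPU flag is set. The sketch below is only an illustration of that idea, not the actual fix in the repository. `pick_backend` and the `params` field names are hypothetical; the only real API referenced is `loadcaffe.load`, which takes the backend name as its third argument.

```lua
-- Hypothetical helper (not part of texture_nets): choose the layer backend.
-- 'nn' is Torch's CPU implementation; 'cudnn' needs a CUDA-capable GPU.
local function pick_backend(use_cpu, requested)
  if use_cpu then
    -- Force the CPU backend so no cudnn modules are created.
    return 'nn'
  end
  -- Otherwise keep what was requested, defaulting to cudnn.
  return requested or 'cudnn'
end

-- Illustrative options table; these field names are assumptions.
local params = { cpu = true, backend = 'cudnn' }
local backend = pick_backend(params.cpu, params.backend)
print(backend)

-- With loadcaffe installed, the chosen backend would be passed like this:
-- local cnn = loadcaffe.load(proto_file, model_file, backend)
```

Whether this alone is enough depends on where else the training script refers to cudnn, so treat it as a starting point for debugging rather than a confirmed fix.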