Error on running train_pose.sh

Question

jpbillzhou opened this issue 6 years ago

I followed the training steps and used get_lmdb.sh to get the data. When I ran train_pose.sh by typing ./train_pose.sh 0, the following error occurred. Could someone please help me?
I am using a single GTX 1050 GPU with 2 GB of memory, with the batch size set to 1. Here are the settings:
transform_param = dict(stride=8, crop_size_x=368, crop_size_y=368,
target_dist=0.6, scale_prob=1, scale_min=0.5, scale_max=1.1,
max_rotate_degree=40, center_perterb_max=40, do_clahe=False,
visualize=False, np_in_lmdb=17, num_parts=np)

Thanks

I0526 19:51:56.906980 11880 net.cpp:283] Network initialization done.
I0526 19:51:56.907923 11880 solver.cpp:60] Solver scaffolding done.
I0526 19:51:56.921468 11880 caffe.cpp:155] Finetuning from /home/billzhou/deeplearning/caffe_train/model/VGG_ILSVRC_19_layers.caffemodel
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
I0526 19:51:57.571835 11880 upgrade_proto.cpp:52] Attempting to upgrade input file specified using deprecated V1LayerParameter: /home/billzhou/deeplearning/caffe_train/model/VGG_ILSVRC_19_layers.caffemodel
I0526 19:51:58.441491 11880 upgrade_proto.cpp:60] Successfully upgraded file specified using deprecated V1LayerParameter
I0526 19:51:58.461448 11880 upgrade_proto.cpp:66] Attempting to upgrade input file specified using deprecated input fields: /home/billzhou/deeplearning/caffe_train/model/VGG_ILSVRC_19_layers.caffemodel
I0526 19:51:58.461498 11880 upgrade_proto.cpp:69] Successfully upgraded file specified using deprecated input fields.
W0526 19:51:58.461503 11880 upgrade_proto.cpp:71] Note that future Caffe releases will only support input layers and not input fields.
I0526 19:51:58.461835 11880 net.cpp:761] Ignoring source layer pool1
I0526 19:51:58.462285 11880 net.cpp:761] Ignoring source layer pool2
I0526 19:51:58.465679 11880 net.cpp:761] Ignoring source layer pool3
I0526 19:51:58.471774 11880 net.cpp:761] Ignoring source layer conv4_3
I0526 19:51:58.471806 11880 net.cpp:761] Ignoring source layer relu4_3
I0526 19:51:58.471812 11880 net.cpp:761] Ignoring source layer conv4_4
I0526 19:51:58.471817 11880 net.cpp:761] Ignoring source layer relu4_4
I0526 19:51:58.471822 11880 net.cpp:761] Ignoring source layer pool4
I0526 19:51:58.471827 11880 net.cpp:761] Ignoring source layer conv5_1
I0526 19:51:58.471832 11880 net.cpp:761] Ignoring source layer relu5_1
I0526 19:51:58.471837 11880 net.cpp:761] Ignoring source layer conv5_2
I0526 19:51:58.471843 11880 net.cpp:761] Ignoring source layer relu5_2
I0526 19:51:58.471848 11880 net.cpp:761] Ignoring source layer conv5_3
I0526 19:51:58.471853 11880 net.cpp:761] Ignoring source layer relu5_3
I0526 19:51:58.471858 11880 net.cpp:761] Ignoring source layer conv5_4
I0526 19:51:58.471863 11880 net.cpp:761] Ignoring source layer relu5_4
I0526 19:51:58.471868 11880 net.cpp:761] Ignoring source layer pool5
I0526 19:51:58.471873 11880 net.cpp:761] Ignoring source layer fc6
I0526 19:51:58.471877 11880 net.cpp:761] Ignoring source layer relu6
I0526 19:51:58.471881 11880 net.cpp:761] Ignoring source layer drop6
I0526 19:51:58.471886 11880 net.cpp:761] Ignoring source layer fc7
I0526 19:51:58.471891 11880 net.cpp:761] Ignoring source layer relu7
I0526 19:51:58.471896 11880 net.cpp:761] Ignoring source layer drop7
I0526 19:51:58.471900 11880 net.cpp:761] Ignoring source layer fc8
I0526 19:51:58.471905 11880 net.cpp:761] Ignoring source layer prob
I0526 19:51:58.512106 11880 caffe.cpp:251] Starting Optimization
I0526 19:51:58.512147 11880 solver.cpp:279] Solving
I0526 19:51:58.512153 11880 solver.cpp:280] Learning Rate Policy: step
F0526 19:51:58.665480 11880 eltwise_layer.cpp:34] Check failed: bottom[i]->shape() == bottom[0]->shape()
*** Check failure stack trace: ***
@ 0x7fe4503ee5cd google::LogMessage::Fail()
@ 0x7fe4503f0433 google::LogMessage::SendToLog()
@ 0x7fe4503ee15b google::LogMessage::Flush()
@ 0x7fe4503f0e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fe450b0944d caffe::EltwiseLayer<>::Reshape()
@ 0x7fe450a6bba8 caffe::Net<>::ForwardFromTo()
@ 0x7fe450a6bf57 caffe::Net<>::Forward()
@ 0x7fe450a21700 caffe::Solver<>::Step()
@ 0x7fe450a22199 caffe::Solver<>::Solve()
@ 0x40ba59 train()
@ 0x407590 main
@ 0x7fe44f343830 __libc_start_main
@ 0x407db9 _start
@ (nil) (unknown)
Aborted (core dumped)

yw155 · Answer 1 · Sun May 27 2018 17:09:07 GMT+0800 (China Standard Time)

Hi @jpbillzhou, the error occurs in the Eltwise layer (eltwise_layer.cpp). You may need to check this layer in the proto file. Thanks.
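
For context, the failing check (bottom[i]->shape() == bottom[0]->shape()) requires every input blob of an Eltwise layer to have exactly the same shape as the first one. A minimal Python sketch of that comparison is shown below. It works on plain shape tuples rather than real Caffe blobs, the function name is hypothetical, and the (N, C, H, W) values are illustrative only (46 assumes a 368 crop with stride 8).

```python
def eltwise_shapes_match(bottom_shapes):
    """Mimic the Eltwise shape check on plain tuples.

    Returns (True, None) if all shapes equal the first one,
    otherwise (False, index_of_first_mismatching_bottom).
    """
    first = tuple(bottom_shapes[0])
    for i, shape in enumerate(bottom_shapes[1:], start=1):
        if tuple(shape) != first:
            return False, i
    return True, None


# Hypothetical shapes, not taken from a real run.
print(eltwise_shapes_match([(1, 38, 46, 46), (1, 38, 46, 46)]))  # shapes agree
print(eltwise_shapes_match([(1, 19, 46, 46), (1, 18, 46, 46)]))  # channel mismatch
```

Any axis can trigger the failure, not only the channel axis, so it may be worth comparing the full shape of each bottom blob.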

jpbillzhou · Answer 2 · Mon May 28 2018 08:59:16 GMT+0800 (China Standard Time)

Hi, yw155:
Thank you for helping me. Here is the section containing the Eltwise layers in pose_train_test.prototxt, which was generated by setLayers.py. I am a beginner; could you help check the following, or let me know how to check it?
Thanks

layer {
name: "data"
type: "CPMData"
top: "data"
top: "label"
data_param {
source: "/home/billzhou/deeplearning/Realtime_Multi-Person_Pose_Estimation/training/lmdb_trainVal"
batch_size: 1
backend: LMDB
}
cpm_transform_param {
stride: 8
max_rotate_degree: 40.0
visualize: false
crop_size_x: 368
crop_size_y: 368
scale_prob: 1.0
scale_min: 0.5
scale_max: 1.10000002384
target_dist: 0.600000023842
center_perterb_max: 40.0
do_clahe: false
num_parts: 56
np_in_lmdb: 17
}
}
layer {
name: "vec_weight"
type: "Slice"
bottom: "label"
top: "vec_weight"
top: "heat_weight"
top: "vec_temp"
top: "heat_temp"
slice_param {
slice_point: 38
slice_point: 57
slice_point: 95
axis: 1
}
}
layer {
name: "label_vec"
type: "Eltwise"
bottom: "vec_weight"
bottom: "vec_temp"
top: "label_vec"
eltwise_param {
operation: PROD
}
}
layer {
name: "label_heat"
type: "Eltwise"
bottom: "heat_weight"
bottom: "heat_temp"
top: "label_heat"
eltwise_param {
operation: PROD
}
}
layer {
name: "image"
type: "Slice"
bottom: "data"
top: "image"
top: "center_map"
slice_param {
slice_point: 3
axis: 1
}
}

jpbillzhou · Answer 3 · Mon May 28 2018 09:04:30 GMT+0800 (China Standard Time)

Another thing I want to ask: why does the error come from eltwise_layer.cpp instead of eltwise_layer.cu? I have one GTX 1050 GPU with 2 GB installed.

yw155 · Answer 4 · Tue May 29 2018 17:47:37 GMT+0800 (China Standard Time)

Hi @jpbillzhou, I did not see any obvious errors in your proto file. You can delete the Eltwise layers until only one remains, then repeat this procedure to see which Eltwise layer causes the error. Thanks.
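
One way to sanity-check the Slice layer by hand is to compute how many channels each top receives from the slice_point values in the posted prototxt. The sketch below assumes the label blob has 2 * (num_parts + 1) = 114 channels for num_parts: 56. That total is an assumption about the CPMData layer and is not shown anywhere in this thread. The helper name slice_sizes is hypothetical.

```python
def slice_sizes(total_channels, slice_points):
    """Channel count of each Slice top, given the slice points along one axis."""
    bounds = [0] + list(slice_points) + [total_channels]
    return [hi - lo for lo, hi in zip(bounds[:-1], bounds[1:])]


num_parts = 56                         # from cpm_transform_param in the prototxt
label_channels = 2 * (num_parts + 1)   # ASSUMPTION: not confirmed in this thread

tops = ["vec_weight", "heat_weight", "vec_temp", "heat_temp"]
sizes = dict(zip(tops, slice_sizes(label_channels, [38, 57, 95])))
print(sizes)

# Each Eltwise layer multiplies one weight blob with one label blob,
# so the two bottoms of each pair need the same channel count.
for a, b in [("vec_weight", "vec_temp"), ("heat_weight", "heat_temp")]:
    print(a, b, sizes[a] == sizes[b])
```

Under that assumption the channel counts pair up (38 with 38, 19 with 19), which is consistent with the proto file showing no obvious error. If the real label blob has a different channel count, the last top would change size and the second pair would no longer match.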

MartFire · Answer 5 · Fri Jun 08 2018 22:30:53 GMT+0800 (China Standard Time)

Hi @jpbillzhou, I have the same issue. Did you solve it?

MartFire · Answer 6 · Mon Jun 11 2018 16:49:49 GMT+0800 (China Standard Time)

If this helps anyone, I solved the issue by setting a batch size greater than one.

jpbillzhou · Answer 7 · Tue Jun 12 2018 22:49:21 GMT+0800 (China Standard Time)

I had already set the batch size to 1, and it still did not work on the GTX 1050 with 2 GB. However, after I switched to a GTX 1060 with 6 GB, it worked fine with the batch size set to 4.

changkk · Answer 8 · Mon Dec 31 2018 11:35:04 GMT+0800 (China Standard Time)

I have exactly the same issue. Should the batch size not be 1? I am also using a GTX 1050, with batch size 1, because batch size 2 does not fit in memory. ...

weilanShi · Answer 9 · Tue Oct 27 2020 17:50:35 GMT+0800 (China Standard Time)

If this helps anyone, I solved the issue by setting a batch size superior to one.

It works for me, thanks.