chuckcho / video-caffe

Video-friendly caffe -- comes with the most recent version of Caffe (as of Jan 2019), a video reader, 3D(ND) pooling layer, and an example training script for C3D network and UCF-101 data

discrepancy in forward pass using test and deploy prototxt (or between C++ and python, so to speak)

dasabir opened this issue · comments

Issue summary

I see a discrepancy in the output of the forward pass between the following two settings.

  1. Running the forward pass using the test prototxt. Here the input data is specified inside the prototxt.
  2. Running the forward pass using the deploy prototxt. Here the input data has to be supplied from outside (i.e., not from inside the prototxt).

I'm attaching my code for this test. Please find it here:
Forward_Pass.zip
The test and deploy prototxts are also provided. The pretrained model I used is provided here, although any model pretrained on UCF-101 should work.

As input to the test, I used the famous cat image. I just made 16 copies of it and stored them inside the "Cat_images" folder. The 16 images act as a dummy for a 16-frame clip.

To run the forward pass with the test prototxt I made use of 'build/tools/predict'. The exact command I ran can be found in the file "Forward_Pass_No_Python.txt". Basically, I made use of the feature extraction code. I saved the input data and the output probabilities (cat.data and cat.prob respectively). After that, I use the matlab helper functions to show the input data and the probabilities (read from these two saved files). The matlab script for this purpose is also supplied; its name is "Check_Blobs.m".

The same forward pass is run via python too. The python script is named "video_caffe_forward_pass.py". Inside this script the function "runUsingTestPrototxt()" does this. You can see that this approach gives the same input and probability values as the command-line tool above.
The forward pass using the deploy prototxt is done entirely in python (since I'm not sure whether it can be done on the command line in the current form of video-caffe). Anyway, the forward pass using the deploy prototxt is carried out in two different ways: one uses video-caffe's preprocessing functions (the transformer class), the other does not use video-caffe's native preprocessing. The functions are named "runUsingDeployPrototxt()" and "runUsingDeployPrototxtWithoutCaffeTransformer()" respectively. I made sure that the input is the same for all these functions. If you run either of the last two functions, you will see that for the same input (as the forward pass with the test prototxt), the output probabilities are different. To confirm that the input is the same in all cases, I'm printing "net.blobs['data'].data" in every case.
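For reference, the deploy-prototxt path boils down to something like the following minimal pycaffe sketch. The file names, the blob names ('data', 'prob'), and the 1x3x16x112x112 input shape are assumptions based on my setup and the attached files, and the preprocessing is heavily abbreviated:

import numpy as np
import caffe

# Sketch only: file/blob names and shapes are assumptions, not the attached script.
net = caffe.Net('c3d_ucf101_deploy.prototxt', 'c3d_ucf101.caffemodel', caffe.TEST)

# One frame, repeated 16 times, stands in for a 16-frame clip.
frame = caffe.io.load_image('Cat_Images/image_0001.jpg')      # HxWx3, RGB, values in [0,1]
frame = caffe.io.resize_image(frame, (112, 112)) * 255.0       # crude stand-in for resize+crop
frame = frame[:, :, ::-1]                                      # RGB -> BGR (caffe convention)
clip = np.stack([frame] * 16)                                  # L x H x W x C
blob = clip.transpose(3, 0, 1, 2)[np.newaxis, ...]             # N x C x L x H x W

# NOTE: mean subtraction is omitted here for brevity; the attached scripts do it.
net.blobs['data'].reshape(*blob.shape)
net.blobs['data'].data[...] = blob
out = net.forward()
print(out['prob'].flatten()[:10])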

Steps to reproduce

  1. Download the attached folder from here - Forward_Pass.zip
  2. You might need to rename the cat images so that the file names have 4 padding 0's (a small rename sketch is given after this list). I changed the code (to make it similar to Du Tran's official C3D code) to use 6 padding 0's. In short, you might need to change "000001.jpg" to "image_0001.jpg" and so on inside the "Cat_Images" folder. At the same time, you may have to delete the trailing '/' at the end of the image directory name in "test_c3d_input_frm_cat.txt" (i.e., the line "Cat_Images/ 1 0" becomes "Cat_Images 1 0").
  3. I'm using a model trained by video-caffe on UCF-101. I'm sharing it via a dropbox link so that you can use the same model; otherwise, any model pretrained on UCF-101 will work. Download the model and keep it inside the folder (of codes) you downloaded above.
  4. Run the command-line tool to extract the probabilities. The command is given in "Forward_Pass_No_Python.txt". You need to change the path to your video-caffe installation instead of my "/scratch/workspace/video-caffe/". You can see the values generated by this operation using the matlab script "Check_Blobs.m"; here also, you have to change the video-caffe root path to yours (Line 6).
  5. Now, to check the same in python, run "runUsingTestPrototxt()". Just comment out the other two function calls inside "main" and uncomment "runUsingTestPrototxt". As always, you have to change the video-caffe root (Line 9).
  6. Similarly, the two tests with the deploy prototxt can be run from the same python script. For these, uncomment "runUsingDeployPrototxt" and "runUsingDeployPrototxtWithoutCaffeTransformer" respectively (keeping the other two commented) inside "main". You will notice that, although the input is the same, the output probabilities are different from the ones obtained using the test prototxt.
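Since step 2 is easy to get wrong, here is a tiny hypothetical rename helper (the folder name and naming pattern are taken from the description above; adjust to your layout):

import os

# Hypothetical helper for step 2: "000001.jpg" -> "image_0001.jpg", etc.
frame_dir = 'Cat_Images'
for name in sorted(os.listdir(frame_dir)):
    stem, ext = os.path.splitext(name)
    if ext.lower() == '.jpg' and stem.isdigit():
        os.rename(os.path.join(frame_dir, name),
                  os.path.join(frame_dir, 'image_%04d%s' % (int(stem), ext)))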

Your system configuration

Operating system: centos 7
CUDA version (if applicable): 8.0
CUDNN version (if applicable): 5.1
BLAS: open
Python or MATLAB version (for pycaffe and matcaffe respectively): python 2.7

@dasabir Thanks again for the detailed report of this issue. Let's start with some facts. I do get the following.

Check_Blobs.m output (I used a python version instead because I don't have Matlab):

First 10 values of output probability
[  4.94098495e-05   3.68048670e-04   3.00217594e-04   7.00256526e-01
   1.01839280e-04   1.73078121e-07   1.79325440e-03   1.16771444e-04
   6.47046763e-05   1.31071295e-06]

runUsingTestPrototxt output:

First image of the clip, First channel, 3x3 pixel values
[[ 37.  36.  28.]
 [ 32.  33.  30.]
 [ 30.  25.  23.]]
First 10 values of output probability
[  4.94098495e-05   3.68048670e-04   3.00217594e-04   7.00256526e-01
   1.01839280e-04   1.73078121e-07   1.79325440e-03   1.16771444e-04
   6.47046763e-05   1.31071295e-06]

runUsingDeployPrototxt output:

First image of the clip, First channel, 3x3 pixel values
[[ 37.  36.  28.]
 [ 32.  33.  30.]
 [ 30.  25.  23.]]
First 10 values of output probability
[  2.79693813e-06   1.21948151e-05   1.37284442e-04   4.34822682e-03
   2.69704557e-04   1.42721192e-05   6.85010687e-04   2.37378641e-03
   5.17433327e-05   8.42382815e-06]

runUsingDeployPrototxtWithoutCaffeTransformer output:

First image of the clip, First channel, 3x3 pixel values
[[ 37.  36.  28.]
 [ 32.  33.  30.]
 [ 30.  25.  23.]]
First 10 values of output probability
[  2.79693813e-06   1.21948151e-05   1.37284442e-04   4.34822682e-03
   2.69704557e-04   1.42721192e-05   6.85010687e-04   2.37378641e-03
   5.17433327e-05   8.42382815e-06]

Also, extract_features.cpp and predict.cpp assume a 3-D Datum (C, H, W) and haven't been modified to accommodate the temporal dimension. Hence, their output is not to be trusted. video-caffe uses an N-dim Blob but still a 3-dim Datum, and it seems extending Datum to 4 or N dimensions is not a trivial task per BVLC/caffe#2152.
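For context, the stock Datum message carries only a channels/height/width shape (plus data/label), so there is simply no field for a temporal axis. A quick illustration in python (just showing the available fields):

from caffe.proto import caffe_pb2

# A Datum describes a single 3-D (C, H, W) chunk of data; note there is no
# length/temporal field, so a 16-frame clip cannot be expressed directly.
datum = caffe_pb2.Datum()
datum.channels = 3
datum.height = 112
datum.width = 112
datum.label = 0
print(datum)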

@chuckcho Thanks for your insight. Yes, I can confirm that these are the exact values I got after running the python script.
Since, as you said, video-caffe uses an N-dim Blob but still a 3-dim Datum, that naturally leads to some more questions.

  1. How do we extract features from a trained model? Since predict.cpp does not work as intended, is the pythonic way the solution? By 'pythonic way' I mean completely bypassing the data loading via the caffe Datum and using the python io.py script (i.e., the transformer class), or even a purely pythonic way (e.g., using matplotlib's imread function, etc.)?
  2. The second question is: is the training then going on right? I mean, during training video-caffe uses the same caffe 3-D Datum, right? Then the forward pass (or even the backward pass) may not be issue free. This question may be stupid as I don't know much about most of the 'under the hood' activities in caffe. But I certainly feel that one way to test whether things are going on right is to evaluate the dataset both the caffe command-line way and the pythonic way (with our own data feeding mechanism bypassing the inappropriate caffe Datum) and check that we get the same test accuracy. Is there a way to test that?

I still believe the training is going on right, but we don't have a way to verify it by any other means. That also makes it very hard to extend the framework to something else.

I have done some more analysis on this issue. In particular, I have tried to reduce the dependency on caffe code to a minimum. For that purpose, I wanted to compare the outputs of the following two cases.

  1. Use (python) video-caffe to read the model and perform the convolution operation.
  2. Use manual parsing/reading of the filter weights and manual code to perform the conv operation. By 'manual', I really mean customized code which does not use "caffe.Net()" to read the model or "net.forward()" to perform the forward pass. I wrote this convolution only for the "conv1a" operation.

I have attached the resulting python script, named "conv1a.py", here:
conv1a.py.zip
There are two functions inside it.

  1. runUsingVideoCaffePython(): serves the purpose described in point 1 above.
  2. runUsingCustomCode(): serves the purpose described in point 2 above. The model is parsed using "caffe_pb2.NetParameter().ParseFromString" (Line 107) and the convolution is performed by a (roughly written) customized convolution operation, manualConvOp() (Line 129). Technically speaking, it returns the non-ReLU-ed preactivations, but that is fine as ReLU-ing is not that hard :=)

For both cases I print the input data, some conv filter parameters, and some activations/preactivations to the console. You will notice that for both video-caffe python and the customized code the outputs are the same.
The point I want to make is that both using (python) video-caffe and not using it give the same result, whereas previously we saw that using (python) video-caffe and using the C++ code do not give the same output. So which approach is correct, the C++ one or the python one?
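For anyone who wants to reproduce the comparison without downloading the script, a manual conv1a-style operation amounts to the following minimal numpy sketch. This is my own naive version, not the attached manualConvOp; it computes cross-correlation (as caffe does) and returns the non-ReLU-ed preactivations:

import numpy as np

def manual_conv3d(x, weights, bias, pad=1, stride=1):
    """Naive 3-D convolution (cross-correlation, as caffe computes it).

    x:       input volume,    shape (C_in, L, H, W)
    weights: filter bank,     shape (C_out, C_in, kL, kH, kW)
    bias:    per-filter bias, shape (C_out,)
    Returns the pre-activation volume, shape (C_out, L_out, H_out, W_out).
    """
    c_out, c_in, kl, kh, kw = weights.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad), (pad, pad)), mode='constant')
    lo = (xp.shape[1] - kl) // stride + 1
    ho = (xp.shape[2] - kh) // stride + 1
    wo = (xp.shape[3] - kw) // stride + 1
    out = np.empty((c_out, lo, ho, wo), dtype=x.dtype)
    for f in range(c_out):
        for t in range(lo):
            for i in range(ho):
                for j in range(wo):
                    patch = xp[:, t*stride:t*stride+kl,
                                  i*stride:i*stride+kh,
                                  j*stride:j*stride+kw]
                    out[f, t, i, j] = np.sum(patch * weights[f]) + bias[f]
    return out

Assuming the layer is named conv1a as in the thread, feeding it net.blobs['data'].data[0] together with net.params['conv1a'][0].data and net.params['conv1a'][1].data should match net.blobs['conv1a'].data[0] (before the ReLU) up to floating-point precision.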

Steps to reproduce

  1. Download the attached code from here - conv1a.py.zip.
  2. Copy it to the folder where you copied the previous code (provided here).
  3. As always, you have to change the video-caffe root (Line 9).
  4. The two cases can be run from the script. For the first option (i.e., the (python) video-caffe way), uncomment the line that says "runUsingVideoCaffePython(model_def_deploy_prototxt,inputFrmPath,length)" (Line 144) inside "main". You can keep the next line commented for the time being.
  5. Similarly, for the second option (i.e., the customized way), uncomment the line that says "runUsingCustomCode(inputFrmPath,length)" (Line 145) inside "main". You can keep the other line (runUsingVideoCaffePython, Line 144) commented now.

I've checked the issue. The difference in output is because the input is different. That is, the input is NOT the same in all cases, although it is the same in the [0,0,0,:3,:3] slice. You can see the difference in the input (net.blobs['data'].data[...]) by changing the data slice [0,0,0,:3,:3] to [0,0,1,:3,:3].
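For reference, the check is just printing a different temporal slice of the input blob after the forward pass, e.g.:

# assuming `net` is the caffe.Net loaded inside either function
print(net.blobs['data'].data[0, 0, 0, :3, :3])  # the slice that happened to agree
print(net.blobs['data'].data[0, 0, 1, :3, :3])  # the slice that exposes the difference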

Below is what I got:

  1. runUsingTestPrototxt output:
[[ 86.  80.  75.]
 [ 82.  77.  71.]
 [ 74.  69.  65.]]
  2. runUsingDeployPrototxt output:
[[ 37.  36.  28.]
 [ 32.  33.  30.]
 [ 30.  25.  23.]]

You can see that they are different, and obviously the former (runUsingTestPrototxt) is wrong, since the data should be the same in each frame. This could be an issue (bug) in VideoDataLayer in video-caffe.

I'll take a further look at VideoDataLayer in video-caffe.

Merry Christmas!

@chuckcho After looking at the code, I think there's a serious memory layout issue in video-caffe.

As far as I understand, video-caffe uses a number x channels x length x height x width (NCLHW) memory layout, row-major, as can be seen in https://github.com/chuckcho/video-caffe/blob/master/src/caffe/blob.cpp#L15-L19.

But Blob::offset method in https://github.com/chuckcho/video-caffe/blob/master/include/caffe/blob.hpp#L184

return (((n * length() + l) * channels() + c) * height() + h) * width() + w;

indicates a NLCHW memory layout, not NCLHW.

And this Blob::offset method is used by data transformer in https://github.com/chuckcho/video-caffe/blob/master/src/caffe/data_transformer.cpp#L252-L254

Dtype value = uni_blob.data_at(0, c, 0, h, w);
offset = transformed_blob->offset(0, c, item_id, h, w);
*(transformed_blob->mutable_cpu_data() + offset) = value;

So the data transformer, and consequently VideoDataLayer loads data into NLCHW memory layout, which violates video-caffe's NCLHW memory assumption (e.g. in cuDNN's ND convolution https://github.com/chuckcho/video-caffe/blob/master/src/caffe/layers/cudnn_ndconv_layer.cpp#L159).

So the issue is clear:

  1. VideoDataLayer (through Blob::offset) assumes an NLCHW memory layout.
  2. But all other layers, such as the ND convolution layer, as well as pycaffe, assume an NCLHW memory layout.

@dasabir @chuckcho The simple solution is to change Blob::offset in https://github.com/chuckcho/video-caffe/blob/master/include/caffe/blob.hpp#L184 from

return (((n * length() + l) * channels() + c) * height() + h) * width() + w;

to

return (((n * channels() + c) * length() + l) * height() + h) * width() + w;

so that all memory layout is NCLHW.
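To see concretely how the two formulas disagree, here is a small numpy check (the shape is arbitrary; offset_nlchw mirrors the current code and offset_nclhw the proposed fix):

import numpy as np

N, C, L, H, W = 1, 3, 16, 112, 112  # arbitrary example shape

def offset_nlchw(n, c, l, h, w):
    # current Blob::offset: (((n * L + l) * C + c) * H + h) * W + w
    return (((n * L + l) * C + c) * H + h) * W + w

def offset_nclhw(n, c, l, h, w):
    # proposed Blob::offset: (((n * C + c) * L + l) * H + h) * W + w
    return (((n * C + c) * L + l) * H + h) * W + w

# Flat index of (n=0, c=1, l=2, h=0, w=0) in a row-major NCLHW array:
expected = np.ravel_multi_index((0, 1, 2, 0, 0), (N, C, L, H, W))
print(expected, offset_nclhw(0, 1, 2, 0, 0), offset_nlchw(0, 1, 2, 0, 0))
# The first two agree (225792); the NLCHW formula lands somewhere else entirely.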

@chuckcho Given that all your prototxt and trained models rely on VideoDataLayer, I recommend you check whether they are affected by this bug or incorrectly trained.

@ronghanghu thanks a lot for your insight. i believe that's what caused the discrepancy as the original poster, @dasabir reported. i'm testing this fix. thanks all!

I did get identical results for all 4 test cases, so I believe it's good to go. I'm also training on UCF-101. Huge thanks to @dasabir and @ronghanghu.

@chuckcho and @ronghanghu, I pulled the latest commit (c1efc99) and tested the cases with which the post started, i.e., using the test prototxt and the deploy prototxt, and then the manual/custom conv operation. For all the cases, I now get the same result. I even tried ronghang's suggestion of outputting the 1st slice ([0,0,1,:3,:3]) of the input instead of the 0th slice ([0,0,0,:3,:3]), and now I'm getting the same pixel values. (Just as a side note:) for the custom conv code, I see a slight difference in the preactivation values only after the 5th decimal place, but I think this is due to floating-point precision; I have seen a similar trend in other experiments too.
So, I think this issue is pretty much resolved. I will also start training a C3D model on UCF-101 soon. @chuckcho, please update us about the training you are running.
And thanks a lot guys!

I am using the refactor branch and the Blob::offset return statement is different from the master branch (https://github.com/chuckcho/video-caffe/blob/refactor/include/caffe/blob.hpp):

return ((n * channels() + c) * height() + h) * width() + w;

(no length variable)

How do we make the above-discussed changes in the refactor branch?
Thanks

@AishaKhan The refactor branch won't be affected by this issue, as temporal data (or any non-spatial data) is handled via shape(): https://github.com/chuckcho/video-caffe/blob/refactor/include/caffe/blob.hpp#L166-L178
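In other words, the offset there is computed generically from shape() rather than from hard-coded length()/channels() accessors. A python rendering of how such an N-D, shape()-driven offset is typically computed (my own sketch, not a copy of the linked C++):

def nd_offset(indices, shape):
    """Row-major flat offset from a list of indices and a blob shape()."""
    assert len(indices) <= len(shape)
    offset = 0
    for axis, dim in enumerate(shape):
        offset *= dim
        if axis < len(indices):
            assert 0 <= indices[axis] < dim
            offset += indices[axis]
    return offset

# Example with a 5-D NCLHW blob; no axis has special treatment.
print(nd_offset([0, 1, 2, 0, 0], [1, 3, 16, 112, 112]))  # 225792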

Here is my plot of the training/test loss and test accuracy on UCF-101 (split 1). I used a batch size of 50, which makes 107258/50 ≈ 2146 iterations per epoch. The plot is sampled every 1000 iterations. I did not calculate the training accuracy over the whole (or a big chunk of the) training data during training, so the training loss is (I think) just on one batch of the training data, but it gives a reasonable trend. The test loss and test accuracy are computed on the whole test data. The test accuracy gets close to 40% at around 11K iterations (roughly epoch 5) and saturates there.

[plot: c3d_video-caffe_analysis]

@dasabir I'm seeing a very similar trend for validation top-1 accuracy and training loss. I was hoping to get ~45% by the 6th epoch as in the original paper (Fig. 2 in https://arxiv.org/pdf/1412.0767.pdf). Since @dasabir did extensive validation using the test prototxt, the deploy prototxt, and the manual/custom conv operation, I believe the issue is resolved. Thanks again for your contribution!

Just a wild guess about not reaching ~45%: is it that the crop is always central (i.e., not random) during training? The reason I'm asking is that when I tested (with the models trained in the current setting) with resized test images (from 128x171 to 112x112) instead of central cropping (cropping the 112x112 central region from the 128x171 test images), the performance is consistently lower by 3-4% (i.e., ~36.5% instead of 40%). Could anyone please check that?

During training it's a random crop (112x112) from 128x171: https://github.com/chuckcho/video-caffe/blob/master/src/caffe/data_transformer.cpp#L224-L227. I can't think of anything different from the original paper's setting.
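For reference, the difference between the two cropping modes boils down to something like this numpy sketch (not the actual data_transformer code; the 128x171 -> 112x112 numbers are from this thread):

import numpy as np

def crop_clip(clip, crop_h=112, crop_w=112, train=True):
    """clip: array of shape (C, L, H, W); returns (C, L, crop_h, crop_w)."""
    _, _, h, w = clip.shape
    if train:
        # random spatial offset, as during training
        y = np.random.randint(0, h - crop_h + 1)
        x = np.random.randint(0, w - crop_w + 1)
    else:
        # central crop, as during testing
        y = (h - crop_h) // 2
        x = (w - crop_w) // 2
    return clip[:, :, y:y + crop_h, x:x + crop_w]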

It's been a while since we last discussed this, but I believe the issue is fixed by now. I just did a fresh training run and am getting 46-47% test accuracy. As such, I will close this issue soon.