error in training

Question

wingz1 opened this issue 7 years ago · comments

I'm running with your parameters on the data indicated in the README. Training starts okay, but dies in the first epoch. Any ideas?
Thanks.

~/torch_codes/pytorch-CortexNet$ python -u main.py --mode MatchNet --size 3 32 64 128 256 --tau 0 --big-t 10 --log-interval 10 --cuda --view 2 --show-x_hat --epochs 30  --model model_02 --lr-decay 10 10 --data /work/CortexNet_Experiments/VDS35_data/preprocessed-data | tee last/train.log
CLI arguments: --mode MatchNet --size 3 32 64 128 256 --tau 0 --big-t 10 --log-interval 10 --cuda --view 2 --show-x_hat --epochs 30 --model model_02 --lr-decay 10 10 --data /work/CortexNet_Experiments/VDS35_data/preprocessed-data
Current commit hash: bc28dac4e6a1ad9abb11e2fbc48d310a85e9903a
Define image pre-processing
Define train data loader
Define validation data loader
Define model

---------------------------- Building model Model02 ----------------------------
Hidden layers: 4
Net sizing: (3, 32, 64, 128, 256, 970)
Input spatial size: 3 x (256, 256)
Layer 1 ------------------------------------------------------------------------
Bottom size: 3 x (256, 256)
Top size: 32 x (128, 128)
Layer 2 ------------------------------------------------------------------------
Bottom size: 32 x (128, 128)
Top size: 64 x (64, 64)
Layer 3 ------------------------------------------------------------------------
Bottom size: 64 x (64, 64)
Top size: 128 x (32, 32)
Layer 4 ------------------------------------------------------------------------
Bottom size: 128 x (32, 32)
Top size: 256 x (16, 16)
Classifier ---------------------------------------------------------------------
256 --> 970
--------------------------------------------------------------------------------

Create a MSE and balanced NLL criterions
Instantiate a SGD optimiser
Training epoch 1
Traceback (most recent call last):
  File "main.py", line 394, in <module>
    main()
  File "main.py", line 194, in main
    train(train_loader, model, (mse, nll_final, nll_train), optimiser, epoch)
  File "main.py", line 297, in train
    ce_loss, mse_loss, state, x_hat_data = compute_loss(x[t], x[t + 1], y[t], state)
  File "main.py", line 261, in compute_loss
    (x_hat, state_), (_, idx) = model(V(x_), state_)
  File "...anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "...torch_codes/pytorch-CortexNet/model/Model02.py", line 76, in forward
    s = state[layer - 1] or V(x.data.clone().zero_())
  File "...anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 123, in __bool__
    torch.typename(self.data) + " is ambiguous")
RuntimeError: bool value of Variable objects containing non-empty torch.cuda.FloatTensor is ambiguous

wingz1 · Answer 1 · Wed Aug 30 2017 02:46:41 GMT+0800 (China Standard Time)

I tried fixing by doing this in Model02.py.
Please advise if this is the behavior you intended.

# s = state[layer - 1] or V(x.data.clone().zero_())
# Attempted fix
if state[layer - 1] is not None:
    s = state[layer - 1]
else:
    s = V(x.data.clone().zero_())

which gets past the error message above, but now I get a new one:

File "main.py", line 395, in <module>
    main()
  File "main.py", line 194, in main
    train(train_loader, model, (mse, nll_final, nll_train), optimiser, epoch)
  File "main.py", line 287, in train
    state = repackage_state(state)
  File "main.py", line 391, in repackage_state
    return list(repackage_state(v) for v in h)
  File "main.py", line 391, in <genexpr>
    return list(repackage_state(v) for v in h)
  File "main.py", line 386, in repackage_state
    if not h:
  File "...anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 123, in __bool__
    torch.typename(self.data) + " is ambiguous")
RuntimeError: bool value of Variable objects containing non-empty torch.cuda.FloatTensor is ambiguous

Any thoughts on this or the previous error?

wingz1 · Answer 2 · Wed Aug 30 2017 03:07:13 GMT+0800 (China Standard Time)

Eh, I think I fixed that one too. After that, I came across the other minor issue already reported and fixed that (with the video closing). Now it's finally training for me... we'll see where it goes.

def repackage_state(h):
    """
    Wraps hidden states in new Variables, to detach them from their history.
    """
    #print(h)
    if h is None:   # was "if not h:"; must be careful doing bool ops on non-bools
        return None
    elif type(h) == V:
        return V(h.data)
    else:
        return list(repackage_state(v) for v in h)
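
The recursion in the patched function can be tried without PyTorch installed. The following is only a minimal sketch: `FakeVar` is a hypothetical stand-in for `Variable` (it is not part of PyTorch), and its `history` attribute is an invented placeholder for the autograd graph.

```python
class FakeVar:
    """Hypothetical stand-in for a Variable; truth testing raises, as in the traceback."""

    def __init__(self, data, history=None):
        self.data = data
        self.history = history  # invented placeholder for the autograd history

    def __bool__(self):
        raise RuntimeError("bool value of non-empty object is ambiguous")


def repackage_state(h):
    """Re-wrap every FakeVar in a nested state, dropping its history."""
    if h is None:  # identity check, never calls __bool__
        return None
    if isinstance(h, FakeVar):
        return FakeVar(h.data)  # fresh wrapper without history
    return [repackage_state(v) for v in h]  # recurse into nested lists


state = [None, FakeVar([1, 2], history="graph"), [FakeVar([3], history="graph"), None]]
new_state = repackage_state(state)
```

One side effect of the change worth knowing about: with `if not h:` an empty list was turned into `None`, whereas with `if h is None:` an empty list is returned as an empty list.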

Alfredo Canziani · Answer 3 · Wed Sep 13 2017 01:41:53 GMT+0800 (China Standard Time)

Hi, sorry for being away for a bit (I am relocating to NYU, and I haven't been online much).

Oh, I see what's going on. They updated PyTorch and broke my code.
Before, empty Tensors, lists, None, and so on were all treated as equivalent to False in a logical expression. It looks like they have now introduced a stricter policy on boolean operations on Tensors. More precisely:

RuntimeError: bool value of Variable objects containing non-empty torch.cuda.FloatTensor is ambiguous
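
To illustrate why the `state or default` idiom trips over this, here is a small sketch in plain Python. `AmbiguousBool` is a hypothetical class that only mimics the raising `__bool__` shown in the traceback, and the two helper functions are made up for the example; none of it is the project's code.

```python
class AmbiguousBool:
    """Hypothetical stand-in for a non-empty Variable: truth testing raises."""

    def __init__(self, data):
        self.data = data

    def __bool__(self):
        raise RuntimeError("bool value of non-empty object is ambiguous")


def pick_state_buggy(state, default):
    # `or` truth-tests its left operand, so this raises for AmbiguousBool
    return state or default


def pick_state_fixed(state, default):
    # an identity check against None never calls __bool__
    return default if state is None else state


state = AmbiguousBool([1.0, 2.0])
default = AmbiguousBool([0.0, 0.0])

try:
    pick_state_buggy(state, default)
    raised = False
except RuntimeError:
    raised = True

print(raised)                                     # True
print(pick_state_fixed(state, default) is state)  # True
print(pick_state_fixed(None, default) is default) # True
```

The buggy version still works while the state is `None`, which is presumably why the run starts fine and only fails once a real state has been produced.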

Would you mind providing a patch for the bug you faced?
Thank you.

wingz1 · Answer 4 · Wed Sep 13 2017 04:25:46 GMT+0800 (China Standard Time)

I think I already posted the changes above. Regardless, here is a git diff if you haven't already fixed it.

git diff
```diff
diff --git a/data/VideoFolder.py b/data/VideoFolder.py
index 6207b49..444b70f 100644
--- a/data/VideoFolder.py
+++ b/data/VideoFolder.py
@@ -143,7 +143,7 @@ class VideoFolder(data.Dataset):
         opened_video[0] = seek + 1  # update seek pointer
         frame = next(opened_video[1])  # cache output frame
         if last:
-            opened_video[2]._close()  # close video file (private method?!)
+            opened_video[2].close()  # close video file (private method?!)
             self.opened_videos[video_idx].remove(opened_video)  # remove o.v. item
 
         return frame
@@ -155,7 +155,7 @@ class VideoFolder(data.Dataset):
         for video in self.opened_videos:  # for every opened video
             for _ in range(len(video)):  # for as many times as pointers
                 opened_video = video.pop()  # pop an item
-                opened_video[2]._close()  # close the file
+                opened_video[2].close()  # close the file
 
     def _shuffle(self):
         """
diff --git a/main.py b/main.py
index 7e763b9..395f81f 100644
--- a/main.py
+++ b/main.py
@@ -382,7 +382,8 @@ def repackage_state(h):
     """
     Wraps hidden states in new Variables, to detach them from their history.
     """
-    if not h:
+    #print(h)
+    if h is None:
         return None
     elif type(h) == V:
         return V(h.data)
diff --git a/model/Model02.py b/model/Model02.py
index 1745156..0665499 100644
--- a/model/Model02.py
+++ b/model/Model02.py
@@ -73,7 +73,18 @@ class Model02(nn.Module):
         state = state or [None] * (self.hidden_layers - 1)
         for layer in range(0, self.hidden_layers):  # connect discriminative blocks
             if layer:  # concat the input with the state for D_n, n > 1
-                s = state[layer - 1] or V(x.data.clone().zero_())
+                if state[layer -1] is not None:
+                    s = state[layer - 1]
+                else:
+                    s = V(x.data.clone().zero_())
```

Alfredo Canziani · Answer 5 · Wed Sep 13 2017 04:37:33 GMT+0800 (China Standard Time)

@wingz1, please format your question in an intelligible way, using a decent Markdown syntax.
I cannot read anything here!
If you wish to provide a fix, please submit a pull request.

wingz1 · Answer 6 · Wed Sep 13 2017 04:43:14 GMT+0800 (China Standard Time)

Sorry, that was a copy/paste from the terminal for you to see the bug fixes you requested. It wasn't a question. It didn't display correctly. Not sure why it didn't display it as plain text.

Alfredo Canziani · Answer 7 · Thu Sep 14 2017 10:35:15 GMT+0800 (China Standard Time)

It didn't display correctly because GitHub is Markdown-enabled.
This means that what you write here is interpreted as Markdown commands.
If you want to paste text, use three backticks before and after such text.
Check https://guides.github.com/features/mastering-markdown/