error in training
wingz1 opened this issue · comments
I'm running using your parameters on the data you indicate in the README. Starts okay, but dies in the first epoch. Any ideas?
Thanks.
~/torch_codes/pytorch-CortexNet$ python -u main.py --mode MatchNet --size 3 32 64 128 256 --tau 0 --big-t 10 --log-interval 10 --cuda --view 2 --show-x_hat --epochs 30 --model model_02 --lr-decay 10 10 --data /work/CortexNet_Experiments/VDS35_data/preprocessed-data | tee last/train.log
CLI arguments: --mode MatchNet --size 3 32 64 128 256 --tau 0 --big-t 10 --log-interval 10 --cuda --view 2 --show-x_hat --epochs 30 --model model_02 --lr-decay 10 10 --data /work/CortexNet_Experiments/VDS35_data/preprocessed-data
Current commit hash: bc28dac4e6a1ad9abb11e2fbc48d310a85e9903a
Define image pre-processing
Define train data loader
Define validation data loader
Define model
---------------------------- Building model Model02 ----------------------------
Hidden layers: 4
Net sizing: (3, 32, 64, 128, 256, 970)
Input spatial size: 3 x (256, 256)
Layer 1 ------------------------------------------------------------------------
Bottom size: 3 x (256, 256)
Top size: 32 x (128, 128)
Layer 2 ------------------------------------------------------------------------
Bottom size: 32 x (128, 128)
Top size: 64 x (64, 64)
Layer 3 ------------------------------------------------------------------------
Bottom size: 64 x (64, 64)
Top size: 128 x (32, 32)
Layer 4 ------------------------------------------------------------------------
Bottom size: 128 x (32, 32)
Top size: 256 x (16, 16)
Classifier ---------------------------------------------------------------------
256 --> 970
--------------------------------------------------------------------------------
Create a MSE and balanced NLL criterions
Instantiate a SGD optimiser
Training epoch 1
Traceback (most recent call last):
  File "main.py", line 394, in <module>
    main()
  File "main.py", line 194, in main
    train(train_loader, model, (mse, nll_final, nll_train), optimiser, epoch)
  File "main.py", line 297, in train
    ce_loss, mse_loss, state, x_hat_data = compute_loss(x[t], x[t + 1], y[t], state)
  File "main.py", line 261, in compute_loss
    (x_hat, state_), (_, idx) = model(V(x_), state_)
  File "...anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "...torch_codes/pytorch-CortexNet/model/Model02.py", line 76, in forward
    s = state[layer - 1] or V(x.data.clone().zero_())
  File "...anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 123, in __bool__
    torch.typename(self.data) + " is ambiguous")
RuntimeError: bool value of Variable objects containing non-empty torch.cuda.FloatTensor is ambiguous
I tried fixing this by making the following change in Model02.py. Please advise if this is the behavior you intended.
# s = state[layer - 1] or V(x.data.clone().zero_())
# Attempted fix
if state[layer - 1] is not None:
    s = state[layer - 1]
else:
    s = V(x.data.clone().zero_())
which gets past the error message above, but now I get a new one:
  File "main.py", line 395, in <module>
    main()
  File "main.py", line 194, in main
    train(train_loader, model, (mse, nll_final, nll_train), optimiser, epoch)
  File "main.py", line 287, in train
    state = repackage_state(state)
  File "main.py", line 391, in repackage_state
    return list(repackage_state(v) for v in h)
  File "main.py", line 391, in <genexpr>
    return list(repackage_state(v) for v in h)
  File "main.py", line 386, in repackage_state
    if not h:
  File "...anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 123, in __bool__
    torch.typename(self.data) + " is ambiguous")
RuntimeError: bool value of Variable objects containing non-empty torch.cuda.FloatTensor is ambiguous
Any thoughts on this or the previous error?
Eh, I think I fixed that one too. After that, I came across the other minor issue already reported and fixed that (with the video closing). Now it's finally training for me... we'll see where it goes.
def repackage_state(h):
    """
    Wraps hidden states in new Variables, to detach them from their history.
    """
    # print(h)
    if h is None:  # was "if not h:"; must be careful doing bool ops on non-bools
        return None
    elif type(h) == V:
        return V(h.data)
    else:
        return list(repackage_state(v) for v in h)
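To illustrate what the patched function does with a nested state, here is a minimal sketch that runs without PyTorch. `FakeVariable` is a hypothetical stand-in that only models the `.data` attribute; it is not the real `torch.autograd.Variable`, and the sample state below is made up for the example.

```python
class FakeVariable:
    """Hypothetical stand-in for a Variable; only `.data` is modelled."""

    def __init__(self, data):
        self.data = data


def repackage_state(h):
    """Re-wrap every FakeVariable in a (possibly nested) state, keeping None as is."""
    if h is None:  # explicit check, never relies on truthiness
        return None
    elif isinstance(h, FakeVariable):
        return FakeVariable(h.data)  # new wrapper around the same data
    else:
        return [repackage_state(v) for v in h]


# Made-up nested state: None placeholders mixed with wrapped values.
old = [None, FakeVariable([1.0]), [FakeVariable([2.0]), None]]
new = repackage_state(old)
```

The `None` placeholders survive at the same positions, and each wrapped entry becomes a new object that shares the original data.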
Hi, sorry for being away for a bit (I am relocating to NYU, and I haven't been online much).
Oh, I see what's going on. They updated PyTorch and broke my code.
Before, empty Tensors, lists, None, and so on were all considered equivalent to False in a logical expression. It looks like they have now introduced a stricter policy on boolean operations on Tensors. More precisely:
RuntimeError: bool value of Variable objects containing non-empty torch.cuda.FloatTensor is ambiguous
Would you mind providing a patch for the bug you faced?
Thank you.
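The difference between the two idioms can be sketched without PyTorch. `AmbiguousBool` below is a hypothetical stand-in whose `__bool__` always raises, mimicking the stricter policy described above; the helper names are also made up for this illustration.

```python
class AmbiguousBool:
    """Hypothetical stand-in for an object that refuses to be used as a bool."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        # Mimics the stricter policy: no truth value is chosen.
        raise RuntimeError("bool value of non-empty container is ambiguous")


def pick_state_with_or(state, default):
    # Old idiom: relies on the truthiness of `state`.
    return state or default


def pick_state_explicit(state, default):
    # Explicit idiom: only the None case falls back to the default.
    return state if state is not None else default


default = AmbiguousBool([0.0, 0.0])
previous = AmbiguousBool([1.0, 2.0])

# The explicit check never calls __bool__, so it works in both cases.
first = pick_state_explicit(None, default)
second = pick_state_explicit(previous, default)

# The `or` idiom still works for None, but raises once a real state exists.
third = pick_state_with_or(None, default)
try:
    pick_state_with_or(previous, default)
    raised = False
except RuntimeError:
    raised = True
```

This matches the behaviour in the traceback: the first time step passes because the state is still `None`, and the error only appears once a non-empty state is fed back in.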
I think I already posted the changes above. Regardless, here is a git diff if you haven't already fixed it.
git diff
diff --git a/data/VideoFolder.py b/data/VideoFolder.py
index 6207b49..444b70f 100644
--- a/data/VideoFolder.py
+++ b/data/VideoFolder.py
@@ -143,7 +143,7 @@ class VideoFolder(data.Dataset):
                 opened_video[0] = seek + 1  # update seek pointer
                 frame = next(opened_video[1])  # cache output frame
                 if last:
-                    opened_video[2]._close()  # close video file (private method?!)
+                    opened_video[2].close()  # close video file (private method?!)
                     self.opened_videos[video_idx].remove(opened_video)  # remove o.v. item
                 return frame

@@ -155,7 +155,7 @@ class VideoFolder(data.Dataset):
         for video in self.opened_videos:  # for every opened video
             for _ in range(len(video)):  # for as many times as pointers
                 opened_video = video.pop()  # pop an item
-                opened_video[2]._close()  # close the file
+                opened_video[2].close()  # close the file

     def _shuffle(self):
         """
diff --git a/main.py b/main.py
index 7e763b9..395f81f 100644
--- a/main.py
+++ b/main.py
@@ -382,7 +382,8 @@ def repackage_state(h):
     """
     Wraps hidden states in new Variables, to detach them from their history.
     """
-    if not h:
+    #print(h)
+    if h is None:
         return None
     elif type(h) == V:
         return V(h.data)
diff --git a/model/Model02.py b/model/Model02.py
index 1745156..0665499 100644
--- a/model/Model02.py
+++ b/model/Model02.py
@@ -73,7 +73,18 @@ class Model02(nn.Module):
         state = state or [None] * (self.hidden_layers - 1)
         for layer in range(0, self.hidden_layers):  # connect discriminative blocks
             if layer:  # concat the input with the state for D_n, n > 1
-                s = state[layer - 1] or V(x.data.clone().zero_())
+                if state[layer -1] is not None:
+                    s = state[layer - 1]
+                else:
+                    s = V(x.data.clone().zero_())
@wingz1, please format your question in an intelligible way, using a decent Markdown syntax.
I cannot read anything here!
If you wish to provide a fix, please submit a pull request.
Sorry, that was a copy/paste from the terminal for you to see the bug fixes you requested. It wasn't a question. It didn't display correctly. Not sure why it didn't display it as plain text.
It didn't display correctly because GitHub is Markdown-enabled.
This means that whatever you write here is interpreted as Markdown.
If you want to paste text verbatim, put three backticks before and after it.
Check https://guides.github.com/features/mastering-markdown/