Pointer-generator does not work on CPU
kylebgorman opened this issue · comments
With --accelerator cpu (or without specifying it, since that's the default):
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/pointer_generator.py", line 142, in decode_step
attention_probs.scatter_add_(
RuntimeError: scatter(): Expected self.dtype to be equal to src.dtype
One of the argument tensors, I suppose, is on CPU and the other one is...IDK where. @Adamits, at your leisure: any insights here?
In what environment does this happen? So far I cannot reproduce it on my Mac (2 GHz Quad-Core Intel Core i5), either for the pointer-generator with or without features, or by explicitly setting --accelerator cpu. I also tried both 1 and 2 layers.
Ok yes I can reproduce. I can see that source_attention_weights is f16, whereas attention_probs is f32. Will try to fix.
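The mismatch can be reproduced in isolation. A minimal sketch (the tensor names mirror decode_step, but the shapes are made up):

```python
import torch

# attention_probs is f32 while the scatter source is f16, mirroring
# the dtype mismatch seen in decode_step (shapes are illustrative).
attention_probs = torch.zeros(2, 5)                 # torch.float32
indices = torch.tensor([[0, 1, 2], [1, 2, 3]])
source_attention_weights = torch.rand(2, 3).half()  # torch.float16

try:
    attention_probs.scatter_add_(1, indices, source_attention_weights)
except RuntimeError as err:
    # e.g. "scatter(): Expected self.dtype to be equal to src.dtype"
    print(err)
```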
I assume the issue is that I create attention_probs after PL has cast everything to f16 (which I assume happens on the module?).
I think I need to read up on how they implement this. I can see that (with --precision 16) self.dtype is f32 on the modules. The LSTM encoders return f32, but the linear layer before decoding (https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/pointer_generator.py#L503) returns f16.
The docs are not that helpful as to what we need to keep in mind to manage this. I think they might force some things to stay in f32 to avoid numerical instability? I don't really know anything about PyTorch's handling of mixed precision, but I did find this note in the PL docs:
In some cases, it is essential to remain in FP32 for numerical stability, so keep this in mind when using mixed precision. For example, when running scatter operations during the forward (such as torchpoint3d), computation must remain in FP32.
Another note: if, before line 503 linked above, I do source_attention_weights = source_attention_weights.type(self.dtype), everything seems to work fine. So it seems the f16 source_attention_weights is the only issue, possibly only for scatter_add_?
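A sketch of that cast, pulled out of the model into a standalone helper (the function name and shapes are illustrative, not the yoyodyne API):

```python
import torch

def pointer_mix(attention_probs, indices, source_attention_weights):
    # Cast src to self's dtype so scatter_add_'s dtype check passes
    # under mixed precision (illustrative helper, not yoyodyne code).
    source_attention_weights = source_attention_weights.type(
        attention_probs.dtype
    )
    attention_probs.scatter_add_(1, indices, source_attention_weights)
    return attention_probs

probs = torch.zeros(2, 5)                  # f32, matching self.dtype
idx = torch.tensor([[0, 1, 2], [1, 2, 3]])
weights = torch.rand(2, 3).half()          # f16 out of the linear layer
print(pointer_mix(probs, idx, weights).dtype)  # torch.float32
```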
I might see if the issue is the same on GPU when I have some time.
I think I can either 1) do this with masking instead of scatter_add_, or 2) use zeros_like in PyTorch so that attention_probs and source_attention_weights have the same dtype by default.
The above two suggestions are not as straightforward as I was thinking. I have opted to simply generate the zeros tensor for attention_probs with the dtype of source_attention_weights, which also solves the problem.
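A minimal sketch of the adopted approach, assuming attention_probs starts from a fresh zeros tensor (shapes illustrative):

```python
import torch

source_attention_weights = torch.rand(2, 3).half()  # f16 under --precision 16
indices = torch.tensor([[0, 1, 2], [1, 2, 3]])

# Allocate the zeros with the dtype (and device) of the weights, so
# both arguments to scatter_add_ agree whatever precision PL selected.
attention_probs = torch.zeros(
    2, 5,
    dtype=source_attention_weights.dtype,
    device=source_attention_weights.device,
)
attention_probs.scatter_add_(1, indices, source_attention_weights)
print(attention_probs.dtype)  # torch.float16
```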
This makes me think we should probably check whether we can reproduce the same results on CPU and GPU with 16-bit precision.
Closed in #59.