CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning

Pointer-generator does not work on CPU

kylebgorman opened this issue · comments

With --accelerator cpu (or with no accelerator specified, since CPU is the default):

  File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/pointer_generator.py", line 142, in decode_step
    attention_probs.scatter_add_(
RuntimeError: scatter(): Expected self.dtype to be equal to src.dtype

One of the argument tensors, I suppose, is on CPU and the other one is...IDK where. @Adamits, at your leisure: any insights here?

commented

In what environment does this happen? So far I cannot reproduce it on my Mac (2 GHz Quad-Core Intel Core i5), either for the pointer-generator with or without features, or with --accelerator cpu set explicitly. I also tried 1 and 2 layers.

commented

OK, yes, I can reproduce it. I can see that source_attention_weights is f16, whereas attention_probs is f32. Will try to fix.

I assume the issue is that I create attention_probs after PL has cast everything to f16 (which I assume happens on the module?).
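For reference, the mismatch is easy to reproduce in isolation. A minimal sketch with made-up shapes (not the actual yoyodyne code):

```python
import torch

# attention_probs is allocated in float32, but under --precision 16 the
# attention weights arriving from the half-precision path are float16.
attention_probs = torch.zeros(2, 10)                # float32
indices = torch.randint(0, 10, (2, 4))
source_attention_weights = torch.rand(2, 4).half()  # float16

# Raises: RuntimeError: scatter(): Expected self.dtype to be equal to src.dtype
attention_probs.scatter_add_(1, indices, source_attention_weights)
```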

commented

I think I need to read up on how they implement this. I can see that (with --precision 16) self.dtype is f32 on the modules. The LSTM encoders return f32, but the linear layer before decoding (https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/pointer_generator.py#L503) returns f16.

The docs are not that helpful about what we need to keep in mind to manage this. I think they might force some things to stay in f32 to avoid numerical instability? I don't know much about PyTorch's handling of mixed precision, but I did find this note in the PL docs:

In some cases, it is essential to remain in FP32 for numerical stability, so keep this in mind when using mixed precision. For example, when running scatter operations during the forward (such as torchpoint3d), computation must remain in FP32.
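For what it's worth, one general way to follow that advice (just an illustration of the note above, not what yoyodyne does) is to disable autocast around the op so the scatter runs in f32 even when the rest of the model runs in mixed precision:

```python
import torch

def scatter_probs_fp32(indices, source_attention_weights, vocab_size=10):
    # Illustration only: step outside autocast so the scatter runs in
    # float32 regardless of the surrounding precision setting.
    # vocab_size=10 is a made-up placeholder.
    with torch.autocast(device_type="cpu", enabled=False):
        attention_probs = torch.zeros(
            source_attention_weights.size(0), vocab_size
        )
        attention_probs.scatter_add_(
            1, indices, source_attention_weights.float()
        )
    return attention_probs
```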

commented

Another note: if, before line 503 linked above, I do source_attention_weights = source_attention_weights.type(self.dtype), everything seems to work fine. So it seems that the f16 source_attention_weights is the only issue, possibly only for scatter_add_?

I might see if the issue is the same on GPU when I have some time.

commented

I think I can either 1) do this with masking instead of scatter_add_, or 2) use zeros_like in PyTorch so that attention_probs and source_attention_weights have the same dtype by default.

commented

The above two suggestions are not as straightforward as I was thinking. I have opted to simply generate the zeros tensor for attention_probs with the dtype of source_attention_weights, which also solves the problem.
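A rough sketch of that fix, with illustrative shapes rather than the actual yoyodyne code:

```python
import torch

source_attention_weights = torch.rand(2, 4).half()  # f16 under --precision 16
indices = torch.randint(0, 10, (2, 4))

# Allocate the zeros tensor with the dtype (and device) of
# source_attention_weights, so scatter_add_ sees matching dtypes no matter
# what precision PL has put the rest of the computation in.
attention_probs = torch.zeros(
    2, 10,
    dtype=source_attention_weights.dtype,
    device=source_attention_weights.device,
)
attention_probs.scatter_add_(1, indices, source_attention_weights)
```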

This makes me think we should probably check whether we can reproduce the same results on CPU and GPU with 16-bit precision.

Closed in #59.