Pointer-generator does not work on CPU
kylebgorman opened this issue · comments
With --accelerator cpu (or without specifying it, since that's the default):
File "/home/kbg/.miniconda3/lib/python3.9/site-packages/yoyodyne/models/pointer_generator.py", line 142, in decode_step
attention_probs.scatter_add_(
RuntimeError: scatter(): Expected self.dtype to be equal to src.dtype
One of the argument tensors, I suppose, is on CPU and the other one is...IDK where. @Adamits, at your leisure: any insights here?
In what environment does this happen? So far I cannot reproduce it on my Mac (2 GHz Quad-Core Intel Core i5), either for the pointer-generator with or without features, or by explicitly setting --accelerator cpu. I also tried both 1 and 2 layers.
Ok yes I can reproduce. I can see that source_attention_weights is f16, whereas attention_probs is f32. Will try to fix.
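The mismatch can be reproduced in isolation. A minimal sketch (the tensor names mirror decode_step, but the shapes are made up):

```python
import torch

# attention_probs is f32 while the scatter source is f16, mirroring
# the dtype mismatch seen in decode_step (shapes are illustrative).
attention_probs = torch.zeros(2, 5)                 # torch.float32
indices = torch.tensor([[0, 1, 2], [1, 2, 3]])
source_attention_weights = torch.rand(2, 3).half()  # torch.float16

try:
    attention_probs.scatter_add_(1, indices, source_attention_weights)
except RuntimeError as err:
    # e.g. "scatter(): Expected self.dtype to be equal to src.dtype"
    print(err)
```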
I assume the issue is that I create attention_probs after PL has cast everything to f16 (which I assume happens on the module?).
I think I need to read up on how they implement this. I can see that (with --precision 16) self.dtype is f32 on the modules. The LSTM encoders return f32, but the linear layer before decoding (https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/pointer_generator.py#L503) returns f16.
The docs are not that helpful as to what we need to keep in mind to manage this. I think they might force some things to stay in f32 to avoid numerical instability? I don't really know anything about PyTorch's handling of mixed precision, but I did find this note in the PL docs:
In some cases, it is essential to remain in FP32 for numerical stability, so keep this in mind when using mixed precision. For example, when running scatter operations during the forward (such as torchpoint3d), computation must remain in FP32.
Another note: if, before line 503 linked above, I do source_attention_weights = source_attention_weights.type(self.dtype), everything seems to work fine. So it seems the f16 source_attention_weights is the only issue, possibly only for scatter_add_?
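A sketch of that cast, pulled out of the model into a standalone helper (the function name and shapes are illustrative, not the yoyodyne API):

```python
import torch

def pointer_mix(attention_probs, indices, source_attention_weights):
    # Cast src to self's dtype so scatter_add_'s dtype check passes
    # under mixed precision (illustrative helper, not yoyodyne code).
    source_attention_weights = source_attention_weights.type(
        attention_probs.dtype
    )
    attention_probs.scatter_add_(1, indices, source_attention_weights)
    return attention_probs

probs = torch.zeros(2, 5)                  # f32, matching self.dtype
idx = torch.tensor([[0, 1, 2], [1, 2, 3]])
weights = torch.rand(2, 3).half()          # f16 out of the linear layer
print(pointer_mix(probs, idx, weights).dtype)  # torch.float32
```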
I might see if the issue is the same on GPU when I have some time.
I think I can either 1) do this with masking instead of scatter_add_, or 2) use zeros_like in PyTorch so that attention_probs and source_attention_weights have the same dtype by default.
The above two suggestions are not as straightforward as I was thinking. I have opted to simply generate the zeros tensor for attention_probs with the dtype of source_attention_weights, which also solves the problem.
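A minimal sketch of the adopted approach, assuming attention_probs starts from a fresh zeros tensor (shapes illustrative):

```python
import torch

source_attention_weights = torch.rand(2, 3).half()  # f16 under --precision 16
indices = torch.tensor([[0, 1, 2], [1, 2, 3]])

# Allocate the zeros with the dtype (and device) of the weights, so
# both arguments to scatter_add_ agree whatever precision PL selected.
attention_probs = torch.zeros(
    2, 5,
    dtype=source_attention_weights.dtype,
    device=source_attention_weights.device,
)
attention_probs.scatter_add_(1, indices, source_attention_weights)
print(attention_probs.dtype)  # torch.float16
```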
This makes me think we should probably check whether we can reproduce the same results on CPU and GPU with 16-bit precision.
Closed in #59.