mlfoundations / open_lm

A repository for research on medium-sized language models.

Support attention masking to prevent attention across EOT tokens

achalddave opened this issue

By default, open_lm performs causal attention across all tokens in a sequence, even if the sequence contains multiple documents separated by EOT tokens. This might be related to #194. I think we can start by supporting it just for xformers, using BlockDiagonalCausalMask: https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask
cc @sagadre
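
For reference, here is a minimal sketch (not the actual implementation in #213) of how BlockDiagonalCausalMask could restrict attention to within each document of a packed sequence. The helper names (`doc_lengths_from_eot`, `attention_with_document_mask`) and the `eot_token_id` argument are hypothetical; it assumes q/k/v are packed along the sequence dimension as `[1, total_tokens, n_heads, head_dim]`, which is the layout xformers expects for block-diagonal masks.

```python
# Sketch only: per-document causal attention with xformers, assuming a single
# packed sequence whose documents are delimited by an EOT token.
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask


def doc_lengths_from_eot(tokens: torch.Tensor, eot_token_id: int) -> list[int]:
    """Split a packed 1D token sequence into per-document lengths at EOT tokens.

    Each document includes its trailing EOT token; any tokens after the last
    EOT form a final (possibly unterminated) document. The lengths sum to
    len(tokens).
    """
    eot_positions = (tokens == eot_token_id).nonzero(as_tuple=True)[0].tolist()
    lengths, prev = [], -1
    for pos in eot_positions:
        lengths.append(pos - prev)
        prev = pos
    if prev < len(tokens) - 1:
        lengths.append(len(tokens) - 1 - prev)
    return lengths


def attention_with_document_mask(q, k, v, doc_lengths):
    """q, k, v: [1, total_tokens, n_heads, head_dim], documents packed along dim 1.

    BlockDiagonalCausalMask keeps attention causal *within* each document and
    blocks it entirely *across* documents, so tokens never attend past an EOT
    boundary.
    """
    attn_bias = BlockDiagonalCausalMask.from_seqlens(doc_lengths)
    return memory_efficient_attention(q, k, v, attn_bias=attn_bias)
```

The mask is built per batch element from the document lengths, so sequences with different document boundaries need different masks, which is where the packing/bookkeeping overhead comes from.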

nice find!

With the current implementation (#213), this seems to help by only about 0.1% compared to not using it. Not sure if we can get more benefit.

@GeorgiosSmyrnis interesting! Do you know what the impact on speed is? Say, at 1B with 2-4 nodes?

Not 100% sure for 1B, but at the smaller scales it was running at about 1/3 of the speed (the current implementation relies on a different attention mask per document, so you cannot really use xformers directly, as far as I understand).

There should be a way to do this better as far as performance goes, but given the marginal benefits in downstream performance, I'm not sure it's worth it.

Closing this as wontfix for now, given that we're not seeing improvements in test accuracy and the current implementation slows down training. We can re-open if someone wants it or if there is a strong need for it in certain settings.