mlfoundations / open_lm

A repository for research on medium-sized language models.

Support attention masking to prevent attention across EOT tokens

achalddave opened this issue

By default, open_lm performs causal attention across all tokens in a sequence, even if the sequence contains multiple documents separated by EOT tokens. This might be related to #194. I think we can start by supporting it just for xformers, using BlockDiagonalCausalMask: https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask
cc @sagadre
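
For reference, here is a minimal sketch (not the actual implementation in #213) of how BlockDiagonalCausalMask could restrict attention to within each document of a packed sequence. The helper names (`doc_lengths_from_eot`, `attention_with_document_mask`) and the `eot_token_id` argument are hypothetical; it assumes q/k/v are packed along the sequence dimension as `[1, total_tokens, n_heads, head_dim]`, which is the layout xformers expects for block-diagonal masks.

```python
# Sketch only: per-document causal attention with xformers, assuming a single
# packed sequence whose documents are delimited by an EOT token.
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask


def doc_lengths_from_eot(tokens: torch.Tensor, eot_token_id: int) -> list[int]:
    """Split a packed 1D token sequence into per-document lengths at EOT tokens.

    Each document includes its trailing EOT token; any tokens after the last
    EOT form a final (possibly unterminated) document. The lengths sum to
    len(tokens).
    """
    eot_positions = (tokens == eot_token_id).nonzero(as_tuple=True)[0].tolist()
    lengths, prev = [], -1
    for pos in eot_positions:
        lengths.append(pos - prev)
        prev = pos
    if prev < len(tokens) - 1:
        lengths.append(len(tokens) - 1 - prev)
    return lengths


def attention_with_document_mask(q, k, v, doc_lengths):
    """q, k, v: [1, total_tokens, n_heads, head_dim], documents packed along dim 1.

    BlockDiagonalCausalMask keeps attention causal *within* each document and
    blocks it entirely *across* documents, so tokens never attend past an EOT
    boundary.
    """
    attn_bias = BlockDiagonalCausalMask.from_seqlens(doc_lengths)
    return memory_efficient_attention(q, k, v, attn_bias=attn_bias)
```

The mask is built per batch element from the document lengths, so sequences with different document boundaries need different masks, which is where the packing/bookkeeping overhead comes from.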

nice find!

With the current implementation (#213), this seems to help by only about 0.1% compared to not using it. Not sure if we can get more benefit.

@GeorgiosSmyrnis interesting! Do you know what the impact on speed is? Say, at 1B with 2-4 nodes?

Not 100% sure for 1B, but at the smaller scales it was running at about 1/3 of the speed (the current implementation relies on a different attention mask per document, so you cannot really use xformers directly, as far as I understand).

There should be a way to do this better as far as performance goes, but given the marginal benefits in downstream performance, I'm not sure it's worth it.

Closing this as wontfix for now, given that we're not seeing improvements in test accuracy and the current implementation slows down training. We can re-open if someone wants it or if there is a strong need for it in certain settings.