HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena

Home Page: https://arxiv.org/abs/2306.15794


Clarifying the models available on HF

yair-schiff opened this issue · comments

Hi,

On the LongSafari HF space there appear to be 2 copies of each model, one with -hf at the end of the name and one without.

I was wondering what the difference is between these models (other than one being compatible with AutoModel), because despite the names being the same and the variables in the config files looking almost identical (i.e., the same d_model and n_layer), they have very different numbers of parameters. For example,

Which version of these models corresponds to the ones used in the paper experiments? If I am not mistaken, it should be the first one (i.e., the one without -hf in the name)?

@exnx, after digging into the two versions of each model, it appears that the main difference is in how the PositionalEmbedding modules are defined:

That is, in the repo here, the PositionalEmbedding module has no learnable parameters:

        self.register("z", z, lr=lr_pos_emb)

because in the config files (e.g., in configs/experiment/hg38/hg38_hyena.yaml), lr_pos_emb = 0.0, so the code uses register_buffer (i.e., here).
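For anyone else tracing this: a minimal sketch of the register-with-learning-rate pattern (class and attribute names here are illustrative, not the repo's exact code). A learning rate of 0.0 routes the tensor to register_buffer, so it never shows up in the parameter count; any other value makes it a learnable nn.Parameter:

```python
import torch
import torch.nn as nn

class OptimModuleSketch(nn.Module):
    """Illustrative version of the register(name, tensor, lr) pattern."""

    def register(self, name, tensor, lr=None):
        if lr == 0.0:
            # lr == 0.0 -> fixed buffer: saved in the state dict but
            # excluded from .parameters() and from the parameter count
            self.register_buffer(name, tensor)
        else:
            # otherwise a learnable parameter; a per-parameter lr hint
            # can be attached for the optimizer to pick up
            self.register_parameter(name, nn.Parameter(tensor))
            if lr is not None:
                getattr(self, name)._optim = {"lr": lr}

m = OptimModuleSketch()
m.register("z_fixed", torch.zeros(8, 4), lr=0.0)    # buffer, 0 params
m.register("z_learn", torch.zeros(8, 4), lr=1e-3)   # parameter, 32 params
```

With lr_pos_emb = 0.0 in the config, the z tensor takes the buffer branch, which is why the non-hf checkpoints report fewer parameters.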

However, on HF, the version of each model that has -hf in the name uses this modeling code:

    self.z = nn.Parameter(z, requires_grad=True)

This increases the parameter count of the -hf version of each model, especially for long-sequence models.
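The size of the discrepancy can be reproduced in isolation (the shapes below are made up for illustration; the real positional embedding scales with the model's sequence length, which is why the longest-context checkpoints diverged the most):

```python
import torch
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    # Only tensors registered as nn.Parameter are counted; buffers are not.
    return sum(p.numel() for p in module.parameters())

seq_len, emb_dim = 1024, 5  # hypothetical sizes for illustration
z = torch.randn(seq_len, emb_dim)

repo_style = nn.Module()
repo_style.register_buffer("z", z)          # this repo: z is a buffer

hf_style = nn.Module()
hf_style.z = nn.Parameter(z.clone())        # -hf port: z is a parameter

print(param_count(repo_style), param_count(hf_style))  # 0 vs seq_len * emb_dim
```

Both modules hold identical weights; only the bookkeeping differs, so the reported parameter counts disagree while the models compute the same thing.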


So I guess my question is which of these would be the "correct" model to compare to and which was used in the paper's experiments?


Hi @yair-schiff, I think the version in this repo is more authoritative. This was an error in the HF port - I'll submit a fix soon, and hopefully the two versions should be equivalent after that!

@Rocketknight1, thanks for following up. I should have posted here as well after I did some digging. The two models have equivalent weights. As you mention, I think it was just a small discrepancy in the HF port that set the z parameter to "learnable". Thanks!


@yair-schiff No probs! The code for the -hf models should now be updated with z as a buffer instead of a parameter.