ml-jku / helm

Do word embeddings have gradients attached?

Jogima-cyber opened this issue · comments

I'm testing HELM in an applied project, and word embeddings have a gradient attached (here:

helm/model.py

Line 108 in b2bfb0d

word_embs = self.model.word_emb(torch.arange(n_tokens)).to(device)
you don't detach). This is strange because, locally, converting the hidden state to numpy raises an error: a gradient is attached and it is not detached before the conversion. However, your version seems to run correctly, so either gradients are indeed attached to the embeddings with your library versions (I don't see why they shouldn't be, from a code point of view) and the conversion doesn't raise an error for some obscure reason, or for some unknown reason gradients are not attached at all.
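
For what it's worth, the numpy error I mean is easy to reproduce with a toy embedding layer (this is just a stand-alone sketch, not your TrXL):

```python
import torch

emb = torch.nn.Embedding(10, 4)        # toy embedding, weights require grad
word_embs = emb(torch.arange(10))      # output carries the autograd graph

try:
    word_embs.numpy()                  # RuntimeError: Can't call numpy() on Tensor that requires grad
except RuntimeError as e:
    print(e)

print(word_embs.detach().numpy())      # fine once detached from the graph
```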

Could it be that you're actually optimizing the embeddings without knowing it (i.e. that gradients are flowing through the TrXL into the embeddings)?

Problem solved: the embeddings do have gradients attached that flow through the TrXL, but forward() is only ever run inside a torch.no_grad() context, so this is not happening.
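
A minimal sketch of why (again with a toy embedding, not your actual model): under torch.no_grad() the lookup records no graph, so nothing can be optimized and the numpy conversion also works without detaching:

```python
import torch

emb = torch.nn.Embedding(10, 4)

with torch.no_grad():
    word_embs = emb(torch.arange(10))

print(word_embs.requires_grad)   # False: no graph is recorded
print(word_embs.numpy())         # no .detach() needed here
```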

Hi,
thank you for your interest in our work.
As you already found, no gradient information is used during rollout collection.
Also, in the forward() method of HELM, the hiddens coming from the TrXL are detached before being passed back:

helm/model.py

Line 154 in b2bfb0d

return action.cpu().numpy(), values.cpu().numpy(), log_prob.cpu().numpy().squeeze(), hiddens
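
As a rough sketch of the rollout-collection pattern (helm_model, obs and hiddens are placeholder names here, not the exact signature used in the repo):

```python
import torch

@torch.no_grad()                      # no graph is built during rollout collection
def collect_step(helm_model, obs, hiddens):
    action, values, log_prob, hiddens = helm_model(obs, hiddens)
    # everything handed to the rollout buffer is either numpy or detached
    return action.cpu().numpy(), values.cpu().numpy(), log_prob.cpu().numpy().squeeze(), hiddens.detach()
```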

Even if the FrozenHopfield mechanism were used in isolation and gradients were propagated into the TrXL embedding matrix, the token embeddings would need to be re-instantiated for the changed embeddings to actually be used, which is not the case.
However, for the sake of completeness, I have added a .detach() in the instantiation of the FrozenHopfield mechanism to avoid a memory leak when FrozenHopfield is used in isolation.
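
Roughly, the change amounts to detaching the cached embedding matrix at instantiation time (a sketch only; the rest of the FrozenHopfield code is omitted and the buffer registration is illustrative):

```python
import torch

class FrozenHopfield(torch.nn.Module):
    def __init__(self, model, n_tokens, device):
        super().__init__()
        self.model = model
        # .detach() severs the cached embeddings from the language model's graph,
        # so holding on to them cannot keep the graph (and its memory) alive
        word_embs = self.model.word_emb(torch.arange(n_tokens)).detach().to(device)
        self.register_buffer("word_embs", word_embs)
```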