pzelasko / daseg

Dialog Acts SEGmentation: Tools for dialog act research


Error during inference

YoussephAhmed opened this issue · comments

Hello again, I faced this error when the input text exceeded a certain number of characters (15,987 to be exact, or roughly 3k words):

" return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self "

From my search I found that this error means I got an index outside the range of the embeddings. How could this happen? I checked the tokens and they were normal letters, so it should not be an OOV issue or anything like that. Any idea what is going on here and how to fix it? Thanks in advance.

I am not sure what went wrong. To understand the issue better, you can check the token indexes produced by the tokenizer and see which input words led to token IDs beyond the size of the embedding layer weights.
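For reference, a check along these lines (a rough sketch using the standard HuggingFace transformers API; the model name and variable names are illustrative, not something specific to daseg) would print any token IDs that fall outside the input embedding matrix:

```python
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; substitute whatever model daseg loads for inference.
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

text = "..."  # the input that triggers the IndexError
encoding = tokenizer(text)

# Number of rows in the word-embedding table; any ID >= this would overflow it.
vocab_size = model.get_input_embeddings().weight.shape[0]

for position, token_id in enumerate(encoding["input_ids"]):
    if token_id >= vocab_size:
        print(position, token_id, tokenizer.convert_ids_to_tokens(token_id))
```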

Hey again, I noticed the error is due to the absolute positional embeddings in Longformer, which are capped at 4098 positions in the allenai/longformer-base-4096 checkpoint. Is there any way to avoid this for longer sequences other than just splitting the input into smaller segments?

I mean, is there a way to make the positional embeddings relative instead of absolute, or to make this maximum a function of the input size? Thanks in advance.
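To make the limit concrete, here is a small sketch (assuming the HuggingFace transformers API; the exact printed values depend on the checkpoint) showing where the cap comes from. The position-embedding table has only `config.max_position_embeddings` learned rows, so any sequence longer than the tokenizer's usable maximum indexes past it and raises exactly this IndexError:

```python
from transformers import LongformerConfig, LongformerTokenizer

config = LongformerConfig.from_pretrained("allenai/longformer-base-4096")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

# Size of the absolute position-embedding table (4098 for this checkpoint,
# as noted above; it includes the offset reserved for padding).
print(config.max_position_embeddings)

# Usable sequence length after special tokens (typically 4096 here).
print(tokenizer.model_max_length)

# Any encoding longer than this must be truncated or split before the
# forward pass, otherwise the position-embedding lookup goes out of range.
```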

Ah, unfortunately with Longformer that would likely require re-training the model as an LM. You can try XLNet: it uses a relative positional embedding scheme, and IIRC I implemented context propagation between chunks. There may also be newer architectures available now that can handle even longer inputs (BigBird is one that comes to mind); if the Huggingface model hub has them, it might not be too much work to make this code work with them.
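If you do fall back to splitting the input, something like the sketch below would work (a hedged illustration with the plain HuggingFace tokenizer API; the window size, overlap, and function name are made up for this example and are not part of daseg):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

def split_into_windows(text, max_tokens=4094, overlap=0):
    """Slice one long text into token windows that fit the position-embedding budget.

    max_tokens leaves room for the special tokens re-added to each window;
    overlap > 0 gives the model some shared context between adjacent windows.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - overlap
    windows = [ids[start:start + max_tokens] for start in range(0, len(ids), step)]
    # Re-attach the special tokens so each window is a valid standalone input.
    return [tokenizer.build_inputs_with_special_tokens(w) for w in windows]
```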

Okay, thank you. I am closing this issue.