salesforce / ctrl

Conditional Transformer Language Model for Controllable Generation

Home Page: https://arxiv.org/abs/1909.05858


Running full model on V100 outputs last word

dimitri320 opened this issue

I'm running the full model on a V100 GPU on Google Cloud, and the only output I get is the last word copied over and over again. I've tried changing the temperature and topk parameters, but to no avail. I'm using the 512 (larger) version.
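For context on the two knobs mentioned above: temperature and top-k reshape the next-token distribution before sampling. This is a minimal sketch of that mechanism, not the repo's actual code; the function name and interface are illustrative:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, topk=0):
    """Sample a token id from logits after temperature scaling and top-k filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if topk > 0:
        # Keep only the k largest logits; mask everything else out.
        cutoff = np.sort(logits)[-topk]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())  # softmax (masked entries become 0)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# With topk=1 sampling degenerates to greedy decoding: always the argmax.
print(sample_next_token([0.1, 2.0, 0.5], temperature=0.7, topk=1))  # → 1
```

Note that if the model's logits are degenerate (one token dominating), no setting of temperature or topk will stop it from repeating that token, which is consistent with tuning these parameters not helping here.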

Any advice would be greatly appreciated.

This seems symptomatic of not providing any control code. Can you try with the first token being Links?

Yes I have. I've tried several control codes, actually, with both the 512 and 256 models, and in both cases the results were the same. This is what I got with Links just now:

Links https://cnn.com/bill-clinton-was-the-president president president president president president president president president president

What's interesting is that the lower memory version with 512 model works perfectly well!

I'm using a Google Cloud Deep Learning VM with one NVIDIA Tesla V100, 12 vCPUs, 78 GB of memory, and a 500 GB hard drive.

Also, as a note, when I do source attribution on "I lost 10 lbs! Feeling great!" with the 512 model I get:

Fitness ppl = 10753944.151308

While in your example:

Fitness ppl = 36.162671

FWIW I'm running on V100 on GCP and do not have the issues you describe.

@julien-c saw your pull request, it makes perfect sense. What I don't get is how it splits the control code from the rest of the input string, as there is no mention of control codes in the master branch in generation.py right now?

Let's maybe discuss on the PR itself, but AFAIK a control code is just a BPE token like any other token.

@julien-c is right. The control code is just the first token, and the way it's set up, it's always in the vocabulary, so it doesn't get split up. There is no special treatment of that token during inference.

@keskarnitish I patched the correct file, and I still get the last word copied over and over again... And I don't get the warning that no control code was used, since I am using control codes (Links, Books, Wikipedia).

And btw, the new commit works: when I start with a non-control word, it shows me the warning. Thanks for that @julien-c !

Any advice where else to look for an answer?

PS: I've already spent 3 days on this and really don't know what else to do...

The only other thing that comes to mind is that you might be pointing to an empty/corrupted model folder. Can you delete and re-download? Maybe also try using pytorch_generation and point to the specific .data file?

I found the solution. For TensorFlow models you need to specify the path to the model folder (not to the .data file). For PyTorch, you need to specify the path all the way down to the .data file.
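A small sanity check capturing that convention might look like this (the function and the backend labels are illustrative, not part of the repo; only the folder-vs-.data-file rule comes from the thread):

```python
import os

def resolve_checkpoint_path(path, backend):
    """Validate the path convention from this thread:
    TensorFlow wants the model *folder*, PyTorch wants the specific .data file."""
    if backend == "tensorflow":
        if not os.path.isdir(path):
            raise ValueError("TensorFlow: point at the model folder, not a file")
    elif backend == "pytorch":
        if not path.endswith(".data"):
            raise ValueError("PyTorch: point at the .data file itself")
    else:
        raise ValueError(f"unknown backend {backend!r}")
    return path
```

Failing fast with a clear message like this would have surfaced the misconfiguration immediately instead of producing degenerate repeated-word output.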

@keskarnitish I'd recommend adding this explicitly to the instructions for TensorFlow, as right now it's unclear.

Maybe you could add an assert to the code @dimitri320

Glad you found the issue after all, though.