Eric-Wallace / universal-triggers

Universal Adversarial Triggers for Attacking and Analyzing NLP (EMNLP 2019)

Loss thresholds for successful triggers on language models?

mathemakitten opened this issue

commented

Hi Eric! Thanks for sharing this work. I've implemented this in TensorFlow to use with a copy of the 124M GPT-2 model, and I was wondering if you could provide some details on the range of final "best loss" numbers you were seeing with the smallest model for the triggers that worked. I'm working under the assumption that, with a vocab size of ~50k, a cross-entropy of roughly 10.8 would be equivalent to "random". My current process isn't producing triggers that are successfully adversarial, and I'm wondering if I'm just not finding very good triggers. Thanks!
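(For context on the "random" baseline mentioned above: the cross-entropy of a uniform predictor over a vocabulary of size V is ln(V) nats. A minimal sketch checking that figure, assuming GPT-2's actual BPE vocabulary size of 50,257, which is close to the ~50k cited:)

```python
import math

# Cross-entropy (in nats) of a uniform "random" predictor over a vocabulary
# of size V is ln(V): every token gets probability 1/V, so -log(1/V) = ln(V).
vocab_size = 50257  # GPT-2's BPE vocabulary size
random_baseline = math.log(vocab_size)
print(f"uniform-predictor cross-entropy: {random_baseline:.2f} nats")  # ~10.82
```

So a trigger-search run whose best loss hovers near 10.8 is doing no better than chance, while a successful trigger should push the target sequence's loss well below that.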

Hey sorry I never responded! Did you figure out the issue?

Feel free to reopen the issue if not @mathemakitten

commented

No worries, I ended up pulling your repo to find out. I also reproduced this in TensorFlow 2.3, so I could add support for that if it's of interest!