FailSpy / abliterator

A simple Python library/structure for ablating features in LLMs supported by TransformerLens

Automatic agonistic/antagonistic token finding

FailSpy opened this issue · comments

Set up cache_activations' measure_refusals argument to build a distribution of tokens across the positive and negative prompt sets, identifying which tokens are most likely to appear in each.
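One way this distribution could be scored (a hypothetical sketch, not the library's actual API — `token_log_odds` and its arguments are illustrative names) is with smoothed log-odds of each token's frequency between the two sets:

```python
import math
from collections import Counter

def token_log_odds(pos_tokens, neg_tokens, alpha=1.0):
    """Score each token by log-odds (with add-alpha smoothing) of appearing
    in the positive set vs. the negative set. A positive score marks an
    'agonistic' token (over-represented in positive responses); a negative
    score marks an 'antagonistic' one."""
    pos, neg = Counter(pos_tokens), Counter(neg_tokens)
    vocab = set(pos) | set(neg)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    scores = {}
    for tok in vocab:
        p = (pos[tok] + alpha) / (n_pos + alpha * len(vocab))
        q = (neg[tok] + alpha) / (n_neg + alpha * len(vocab))
        scores[tok] = math.log(p / q)
    return scores

# Toy example: tokens from a compliant opener vs. a refusal opener
scores = token_log_odds(
    ["Sure", "!", "Here", "'s", "how"],
    ["I", "can", "not", "help", "with", "that"],
)
```

Ranking `scores` then surfaces the most agonistic and antagonistic tokens directly, without any manual token lists.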

One way to improve the scoring here would be to account for token position in the assistant's response,
e.g. "Sure! Here's what you do" vs. "Here's the problem with that"
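The position idea above could be sketched by exponentially down-weighting later tokens, since openers carry most of the compliance/refusal signal (the function name and `decay` parameter are assumptions for illustration, not part of the library):

```python
def position_weighted_counts(token_lists, decay=0.9):
    """Accumulate per-token weights, down-weighting by position in the
    response: a token at index i contributes decay**i instead of 1."""
    weights = {}
    for tokens in token_lists:
        for i, tok in enumerate(tokens):
            weights[tok] = weights.get(tok, 0.0) + decay ** i
    return weights

# "Sure" at position 0 gets full weight; "how" at position 4 gets 0.9**4
w = position_weighted_counts([["Sure", "!", "Here", "'s", "how"]])
```

These weighted counts could feed the same log-odds scoring as raw counts, so position awareness slots in without changing the rest of the pipeline.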

While this is specific to "refusal" analysis, a semantic classifier could be implemented to determine whether a response is a refusal. Such a model probably already exists; Safety Prompts would be a good place to start.
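Until a trained semantic classifier is wired in, a crude phrase-match stand-in (entirely hypothetical, and deliberately not semantic) could label the obvious cases:

```python
# Placeholder heuristic, NOT the semantic classifier the issue envisions:
# flag responses whose opening contains a stock refusal phrase.
REFUSAL_OPENERS = ("i can't", "i cannot", "i'm sorry", "as an ai", "i won't")

def looks_like_refusal(response: str) -> bool:
    # Refusals tend to announce themselves early, so only check the head
    head = response.strip().lower()[:60]
    return any(phrase in head for phrase in REFUSAL_OPENERS)
```

A real implementation would replace this with a classifier fine-tuned on refusal/compliance pairs; the heuristic only serves to bootstrap labels.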

Alternatively, semantic similarity metrics could do the trick, introducing minimal overhead and generalizing to outputs beyond refusals, though this approach is less robust.
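A minimal sketch of that similarity-based alternative: compare a response against a reference refusal via cosine similarity. Bag-of-words count vectors are used here purely to keep the example self-contained; in practice a sentence-embedding model would supply the vectors:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over whitespace-token count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

ref = "I cannot help with that request"
sim_refusal = cosine_sim("Sorry, I cannot help with that", ref)
sim_comply = cosine_sim("Sure, here are the steps you need", ref)
```

Thresholding the similarity against one or a few reference refusals is what keeps the overhead minimal, at the cost of the robustness noted above.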

Of course, none of these are mutually exclusive regarding implementation and use.