FailSpy / abliterator

A simple Python library/structure for ablating features in LLMs supported by TransformerLens

Automatic agonistic/antagonistic token finding

FailSpy opened this issue · comments

Set up cache_activations' measure_refusals argument to build a distribution of tokens across the positive and negative prompt sets, identifying which tokens are most likely to appear in each.
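One way this distribution could be scored (a hypothetical sketch, not the library's actual API — `token_log_odds` and its arguments are illustrative names) is with smoothed log-odds of each token's frequency between the two sets:

```python
import math
from collections import Counter

def token_log_odds(pos_tokens, neg_tokens, alpha=1.0):
    """Score each token by log-odds (with add-alpha smoothing) of appearing
    in the positive set vs. the negative set. A positive score marks an
    'agonistic' token (over-represented in positive responses); a negative
    score marks an 'antagonistic' one."""
    pos, neg = Counter(pos_tokens), Counter(neg_tokens)
    vocab = set(pos) | set(neg)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    scores = {}
    for tok in vocab:
        p = (pos[tok] + alpha) / (n_pos + alpha * len(vocab))
        q = (neg[tok] + alpha) / (n_neg + alpha * len(vocab))
        scores[tok] = math.log(p / q)
    return scores

# Toy example: tokens from a compliant opener vs. a refusal opener
scores = token_log_odds(
    ["Sure", "!", "Here", "'s", "how"],
    ["I", "can", "not", "help", "with", "that"],
)
```

Ranking `scores` then surfaces the most agonistic and antagonistic tokens directly, without any manual token lists.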

One way to improve the scoring here would be to account for token position in the assistant's response,
e.g. "Sure! Here's what you do" vs. "Here's the problem with that"
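The position idea above could be sketched by exponentially down-weighting later tokens, since openers carry most of the compliance/refusal signal (the function name and `decay` parameter are assumptions for illustration, not part of the library):

```python
def position_weighted_counts(token_lists, decay=0.9):
    """Accumulate per-token weights, down-weighting by position in the
    response: a token at index i contributes decay**i instead of 1."""
    weights = {}
    for tokens in token_lists:
        for i, tok in enumerate(tokens):
            weights[tok] = weights.get(tok, 0.0) + decay ** i
    return weights

# "Sure" at position 0 gets full weight; "how" at position 4 gets 0.9**4
w = position_weighted_counts([["Sure", "!", "Here", "'s", "how"]])
```

These weighted counts could feed the same log-odds scoring as raw counts, so position awareness slots in without changing the rest of the pipeline.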

While this is specific to "refusal" analysis, a semantic classifier could be implemented to determine whether a response is a refusal. Such a model probably already exists; Safety Prompts would be a good place to start.
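Until a trained semantic classifier is wired in, a crude phrase-match stand-in (entirely hypothetical, and deliberately not semantic) could label the obvious cases:

```python
# Placeholder heuristic, NOT the semantic classifier the issue envisions:
# flag responses whose opening contains a stock refusal phrase.
REFUSAL_OPENERS = ("i can't", "i cannot", "i'm sorry", "as an ai", "i won't")

def looks_like_refusal(response: str) -> bool:
    # Refusals tend to announce themselves early, so only check the head
    head = response.strip().lower()[:60]
    return any(phrase in head for phrase in REFUSAL_OPENERS)
```

A real implementation would replace this with a classifier fine-tuned on refusal/compliance pairs; the heuristic only serves to bootstrap labels.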

Alternatively, semantic similarity metrics could do the trick, introducing minimal overhead and generalizing to outputs beyond refusals, though this approach is less robust.
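A minimal sketch of that similarity-based alternative: compare a response against a reference refusal via cosine similarity. Bag-of-words count vectors are used here purely to keep the example self-contained; in practice a sentence-embedding model would supply the vectors:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over whitespace-token count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

ref = "I cannot help with that request"
sim_refusal = cosine_sim("Sorry, I cannot help with that", ref)
sim_comply = cosine_sim("Sure, here are the steps you need", ref)
```

Thresholding the similarity against one or a few reference refusals is what keeps the overhead minimal, at the cost of the robustness noted above.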

Of course, none of these are mutually exclusive regarding implementation and use.