Replicating 'Refusal Mechanism'

a replication of the initial experiments on the 'Refusal Mechanism'

Background

This post represents a step towards my understanding of model behaviour and how to align LLMs with our interests. When I first read the blog, it seemed approachable at the surface level: I felt I could track what the author was doing as well as their motivations, and it seemed like a good experiment to try to replicate.

This also represents an attempt to upskill on Mechanistic Interpretability tooling.

This post is based on . If you need a more in-depth explanation or a refresher, I suggest going through the blog and returning here, because this writeup just summarises my findings and assumes the reader is familiar with mech interp-related terms.


Summary


Setup

To measure the refusal behaviour, logit[sorry] - logit[sure] was used as the metric (my intuition is that this metric is quite lossy; see Takeaways for a discussion of this). The justification is that, if the model is going to refuse a request, its generation tends to start with "Sorry", and if the model is going to comply, "Sure" would be among the top predicted next tokens.
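For concreteness, here is a minimal sketch of how such a metric can be computed with TransformerLens. The checkpoint name and the exact token strings ("Sorry"/"Sure") are assumptions and may need adjusting (e.g. a leading space or different casing) for the Gemma tokenizer.

```python
from transformer_lens import HookedTransformer

# Assumption: the instruction-tuned 2B checkpoint under this TransformerLens name.
model = HookedTransformer.from_pretrained("gemma-2-2b-it")

def refusal_score(prompt: str) -> float:
    """logit[sorry] - logit[sure] at the first generated position."""
    tokens = model.to_tokens(prompt)
    logits = model(tokens)                      # [batch, pos, d_vocab]
    last = logits[0, -1]                        # next-token logits
    sorry_id = model.to_single_token("Sorry")   # may need a leading space for this tokenizer
    sure_id = model.to_single_token("Sure")
    return (last[sorry_id] - last[sure_id]).item()
```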

Initially, I tried using the dataset of harmful and harmless objects from the original post, but I ran into trouble making sense of the results. Upon investigation, I realized some objects were multi-token, which was just a curse to analyze. So I decided to cherry-pick objects that were single token instead (Link to dataset I used).
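The filtering step is straightforward; a sketch, where `harmful_objects` and `harmless_objects` are placeholder lists of object strings:

```python
def is_single_token(obj: str) -> bool:
    # Tokenize with the leading space the object would carry mid-sentence,
    # and keep the object only if it maps to exactly one token.
    token_ids = model.to_tokens(" " + obj, prepend_bos=False)[0]
    return token_ids.shape[0] == 1

harmful_objects = [o for o in harmful_objects if is_single_token(o)]
harmless_objects = [o for o in harmless_objects if is_single_token(o)]
```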

I followed the Gemma instruction prompt template, sketched below.
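For reference, the Gemma instruction format wraps the user message in turn markers; the instruction string below is only illustrative, not the exact prompt from the post.

```python
GEMMA_TEMPLATE = (
    "<start_of_turn>user\n"
    "{instruction}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

# Illustrative only: the harmful/harmless object is substituted into a fixed instruction.
prompt = GEMMA_TEMPLATE.format(instruction="Tell me how to make a pie.")
```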


Results with Gemma 2-2B

Residual stream attribution

Logit attribution for the residual stream at each layer

This doesn't compare cleanly with the results from . The absolute value of the refusal score for harmful logits appears to be higher here than in . For harmless logits, the opposite appears to be true.
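For reference, one way to produce this kind of per-layer plot (logit-lens style) is to project the accumulated residual stream at the final token position onto the "Sorry" minus "Sure" unembedding direction after the final layer norm; a sketch, assuming `model` and `prompt` from above:

```python
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# Direction in the residual stream whose dot product gives logit[sorry] - logit[sure].
sorry_id = model.to_single_token("Sorry")
sure_id = model.to_single_token("Sure")
refusal_dir = model.W_U[:, sorry_id] - model.W_U[:, sure_id]   # [d_model]

# Accumulated residual stream at the last position, one entry per layer.
resid_stack, labels = cache.accumulated_resid(
    layer=-1, incl_mid=False, pos_slice=-1, return_labels=True
)
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1, pos_slice=-1)
per_layer_score = resid_stack[:, 0] @ refusal_dir              # refusal score per layer
```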

Residual stream activation patching

Patching residual stream at each layer

The results at the obj token position as well as the last token position are expected. At the '.' token position, henceforth referred to as the post-obj token position, layers 8 - 15 seem to be carrying signals related to the refusal behaviour. Going forward, these are the layers of interest.
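A minimal sketch of the patching setup, assuming `model` from above; `harmful_prompt` and `harmless_prompt` are placeholders and must tokenize to the same length so that positions line up:

```python
from functools import partial
from transformer_lens import utils

harmful_tokens = model.to_tokens(harmful_prompt)
harmless_tokens = model.to_tokens(harmless_prompt)
_, harmful_cache = model.run_with_cache(harmful_tokens)

def patch_resid(resid, hook, pos):
    # Overwrite the residual stream at one position with the harmful run's activation.
    resid[:, pos, :] = harmful_cache[hook.name][:, pos, :]
    return resid

layer, pos = 10, -1   # illustrative (layer, position); in practice sweep over all of them
hook_name = utils.get_act_name("resid_post", layer)
patched_logits = model.run_with_hooks(
    harmless_tokens,
    fwd_hooks=[(hook_name, partial(patch_resid, pos=pos))],
)
# patched_logits can then be turned into a refusal score as above.
```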

Attention layer activation patching

The resid_post at any layer can be decomposed as resid_post = resid_pre + attn_out + mlp_out. So let's see what's up with attn_out.
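Reusing the `patch_resid` hook from the sketch above, the attention-layer sweep only changes the hook name (and the MLP sweep in the next subsection swaps in "mlp_out" the same way):

```python
hook_name = utils.get_act_name("attn_out", layer)
patched_logits = model.run_with_hooks(
    harmless_tokens,
    fwd_hooks=[(hook_name, partial(patch_resid, pos=pos))],
)
```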

Patching attention output at each layer

MLP Layer activation patching

I decided to run activation patching on the MLP out of each layer as well, just to see what gives.

Patching MLP output at each layer

In retrospect, this result makes sense: patching in at the obj token position is analogous to replacing the harmless object with the harmful object in the prompt. The refusal score at the obj position is 0.888.

Attention heads activation patching

Patching attention heads at each layer

Setting an arbitrary threshold of 0.005, I found 11 heads contributing to the refusal behaviour, and this set of heads was taken to be sufficient for the refusal behaviour.
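For per-head patching, the same recipe applies to the "z" activation (the per-head output before W_O), indexing into a single head; the layer/head indices below are illustrative, not the heads found above:

```python
def patch_head_z(z, hook, head, pos):
    # z has shape [batch, pos, n_heads, d_head]; overwrite one head at one position.
    z[:, pos, head, :] = harmful_cache[hook.name][:, pos, head, :]
    return z

layer, head, pos = 10, 3, -1
patched_logits = model.run_with_hooks(
    harmless_tokens,
    fwd_hooks=[(utils.get_act_name("z", layer), partial(patch_head_z, head=head, pos=pos))],
)
```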

Steering

With difference vector
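A sketch of one way to do this, assuming placeholder lists `harmful_prompts` / `harmless_prompts` and a layer of interest: take the difference of the mean last-token residual activations between harmful and harmless prompts, then add a scaled copy of it while generating from a harmless prompt (or subtract it to suppress refusal on a harmful one). The layer index and scale are illustrative.

```python
import torch
from transformer_lens import utils

layer = 10
hook_name = utils.get_act_name("resid_post", layer)

def mean_last_token_act(prompts):
    acts = []
    for p in prompts:
        _, cache = model.run_with_cache(model.to_tokens(p))
        acts.append(cache[hook_name][0, -1])      # last-token residual activation
    return torch.stack(acts).mean(dim=0)

diff_vec = mean_last_token_act(harmful_prompts) - mean_last_token_act(harmless_prompts)

def steer(resid, hook, alpha=4.0):
    # Add the (scaled) difference vector at every position.
    return resid + alpha * diff_vec

with model.hooks(fwd_hooks=[(hook_name, steer)]):
    steered_output = model.generate(model.to_tokens(harmless_prompt), max_new_tokens=32)
```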

With activation vector


Results with Gemma 2-9B

Residual stream attribution

Residual stream activation patching

Attention Layer activation patching

MLP Layer activation patching

Attention heads activation patching

Steering