yogesh

Can You Hide From a Natural Language Autoencoder?

TLDR: NLAs are a recent black box mech interp method for verbalizing model internals. I will be focusing on one of two components, the Activation Verbalizer (AV) which generates, in natural language, an explanation about the models internal activations. The main question I am trying to answer here is whether these NLAs can be 'fooled' easily. I ran two small stress tests on the AV. First, I prefix tuned an activation vector to make the AV output the opposite explanation, while preserving the original model's behavior when the tuned activation was patched in. By minimizing the NLL of a target explanation with penalties to keep the tuned activation from drifting too far. Second, I took this a step further and trained the Qwen model on different secret words from the taboo dataset, where it essentially learns a hidden word while mostly not saying it out loud. Then I added penalties during finetuning, and this made the word much harder for the AV to recover. But this comes with the caveat: some of this might just be the model partly forgetting how to play the taboo game.


Experiment 1: Prefix-Tuning Red Teaming for NLAs

First, I wanted to test a pretty simple thing: if we optimize against an AV, does it keep telling us what the model is "really" thinking, or does it just become a model component that can be gamed? The short answer is: yea pretty much it can be gamed.

So here I attacked the AV using a prefix tuning style red teaming attack, inspired by the idea of soft prompting attacks, on a wealth seeking multiple choice eval, which is a small part of this dataset from Anthropic.

wealth

For each sample, we have a source label after running inference with Qwen: either wealth seeking or not wealth seeking. Then I took the activation vector and optimized a new vector for that specific sample. The goal was to make the AV prefer the opposite natural language explanation: The model is seeking wealth vs. The model is not seeking wealth

This is not a universal attack vector. It is just one optimized vector per example. This is just to test if I could optimize directly against the AV objective, to see how brittle is the interface. The loss we minimize the following:

attack(v)=max(NLLAV(etargetv)m, 0)+λcos(1cos(v^,v^src))+λmseMSE(hpatched,hbase)+λklKL(pbaseppatched)

where v is the optimized activation vector and vsrc is the original source activation. etarget is the opposite 'target' explanation we want the AV to output, and m is the small NLL margin, set to 0.05 in this setup. hpatched and hbase are the base model's final-layer hidden states after patching vs before patching. ppatched and pbase are the corresponding next token distributions. The λ terms are just weights for each preservation penalty.

Outcome Rate
Before attack 26.5%
After attack 81.4%
Source model: answer label unchanged 99.6%

I used a judge model to ask whether the generated explanation actually implied wealth seeking and the baseline AV was already accidentally saying the wrong explanation on some of the samples. After the prefix attack, the AV explanation flipped to the target explanation and succeeded on most of the dataset. The source model's answer label was also almost unchanged.

source_preservation_trace_all_rows_std

Here we have the main loss i.e. target NLL that collapses quickly, and the following penalties: Cosine to the source vector which dips, then recovers. KL which ends up small, and final layer MSE is reduced.

This still feels like a real robustness problem. If an interp tool can be optimized into saying the opposite thing while the downstream behavior mostly seems unchanged, I would not treat it as a trustable monitor.

Experiment 2: Stealth Finetuning

Inspired by Neural Chameleons, the second experiment is just taking this idea to the next step. Instead of optimizing a vector, I finetuned Qwen2.5-7B-Instruct LoRAs on taboo word datasets, just like the setup from Activation Oracles (AOs). The model learns a secret word and is supposed to play a hinting game: give clues, do not just say the secret word.

The words I used were: cloud, gold, ship, smile

Each LoRA used 270 train examples and 30 held out examples. The training setup was standard LoRA, rank 32, alpha 64, dropout 0.05, 10 epochs, and learning rate 5e-5.

Then I evaluated whether the AV could recover the secret from activations. For each run, I generated responses to 24 prompts, extracted layer-20 activations at the response tokens, and decoded each activation with the AV. Recovery was scored by the same metric: case insensitive substring match for the secret word.

taboo

For the regular baseline LoRAs, the AV often recovered the secret word even when the model did not directly leak it in the response.

Cloud was the easiest: the AV explanation hit rate was ~60%. Gold, ship, and smile were lower but still very much not zero, ~40%. The base control was basically almost zero by comparison. So this does not look like the AV just randomly says these words all the time. It looks like the finetune leaves a readable activation signature.

Then I trained preservation variants. During finetuning, I compared the LoRA-enabled model to the base model and added one of:

To put it mathematically the baseline taboo LoRA minimizes cross entropy:

CE=1|𝒜|t𝒜logpθ(xtx<t)

where 𝒜 is the set of assistant token positions. θ are the trainable LoRA parameters, xt is the token at position t.

The preservation stealth runs add a penalty. The activation terms compare against the base model:

stealth=CE+preserve

For MSE preservation:

preserve=λmse·1|𝒯|t𝒯hθ,t(20)hbase,t(20)22hbase,t(20)22+ϵ

where hθ,t(20) is the layer-20 hidden state at token t with the current LoRA enabled, while hbase,t(20) is the same hidden state but on the base source model. The denominator makes this a penalty scaled by the size of the base activation. ϵ is just a tiny constant to avoid div by zero.

For cosine preservation:

preserve=λcos·1|𝒯|t𝒯(1cos(hθ,t(20),hbase,t(20)))

For the combined with KL next-token term:

Dkl-baseline=1|𝒯|t𝒯KL(pθtaboo(·xt)pθ(·xt))

Here θtaboo is the regular LoRA finetune trained without stealth. For the combined preservation runs, I used an outer strength λ and constant weights:

preservecombined=λ(0.1·Dmse+0.2·Dcos)

and:

preservecombined_kl=λ(0.1·Dmse+0.2·Dcos+0.02·Dkl-baseline)

where Dmse and Dcos are the unweighted averages.

The main result is that this preservation mechanic makes the AV much worse at recovering the secret.

recovery_comparison

That looks like a strong result at first glance. But again, the caveat is doing a lot of work.

Lower NLA recovery can mean at least three different things:

  1. The model still learned the secret, but now hides the activation features the AV knows how to read.
  2. The model partly stopped learning or expressing the taboo behavior.
  3. The model has learnt the secret, but now also tries to mimic base model fully in behavior sometimes to hide

As for the last one there is not really a proper way of telling whether this happens. But to address the other two this plot below is the sanity check for this. The bottom right is the interesting region: low NLA recovery, high behavior retention. The bottom left is much less interesting, because it's just collapse.

pareto_behavior_vs_recovery

The combined with KL variant was added in hope to regain more of the behavior from regular finetuning. The KL term keeps next token predictions close to the regular taboo LoRA, rather than close to the base model, this was so that we can keep that behavior score up.

Overall we often get low NLA recovery while keeping behavior retention above 50%, but even then the behavior is weaker than the regular taboo LoRA. The baseline LoRAs are around 88-100% behavior retention, the good preservation runs are more like ~75%. But we can see from the plot a clear trend that the added KL penalty does help keep the behavior slightly. Possibly increasing the constant for KL might improve things even more.

So I wouldn't say that we hid the secret for free. We can push the finetuning into a situation where the AV stops verbalizing the secret, but there is a tradeoff with how strongly the model retains the target behavior.

The drift plots below makes this mechanism pretty intuitive. Regular LoRAs drift a lot at layer 20 from base, whether its MSE or cosine. Some penalties drive that down to around almost 0. As the layer-20 drift disappears, NLA recovery mostly disappears too.

drift_vs_recovery

This is both unsurprising but important. The AV can only verbalize the activation features it knows how to read. If the finetune is constrained not to move those layer-20 activations much, the AV has less to grab onto.

The training curves are also worth looking at if you care about the mechanics. The CE still goes down under preservation, and the penalty term stays bounded, but CE is not enough to tell you whether the on-policy taboo behavior survived. That was the whole reason for adding the behavior retention judge.

training_curves_cloud

Limitations

The wealth attack is not a universal vector. It shows that the AV objective is easy to optimize when you are allowed to optimize it directly.

The taboo recovery score is substring based, in order to keep the same eval pipeline as in AOs. It can miss semantic recovery and can count dumb string accidents. A judge model would probably be better. The taboo experiment only covers four words, one base model, one AV, and layer 20.

What I think this means

I think the main lesson is that NLA activation verbalizers should be evaluated adversarially by default. They are very convenient. They produce natural language explanations. But the interface is still a learned model. It can have biases, blind spots, and exploitable directions. The activation space is vast, it would be really hard for a NLA to be trained on a particular dataset and provide plausible explanations for any activation.

Both experiments point at the same boring but important thing: The AV explanation is just a model's guess about the activation. This sounds obvious, but a lot of the appeal of NLAs comes from the fact that the output is so legible. You get a nice explanation and it becomes tempting to treat that explanation as the thing itself.

This is basically the pragmatic interp trap in action. If the metric is "does the AV say the right/wrong word?", then sure, we can move the metric. But if the scoreboard is "did we understand what changed inside the model?", the answer is much less satisfying.

So if we want to use activation verbalizers for monitoring or safety work, I think the next step is not just to scale them up. It is adversarial evaluation, optimize against them, train around them, test whether they remain calibrated, and distinguishing between "the model stopped having it" from "the verbalizer stopped seeing it." Otherwise we are not really interpreting the model. We are just asking another model to narrate an activation, and then hoping for the right fragile explanation.