Can You Hide From a Natural Language Autoencoder?

10 Jun, 2026

Natural Language Autoencoders (NLAs) are pitched as a black box mechanistic explainer for model internals. An Activation Verbalizer (AV) reads a residual-stream activation and tells you, in plain English, what the model is "thinking". An Activation Reconstructor (AR) maps that English back to an activation. The selling point is auditing, point the AV at a model and have it narrate the parts the model won't say out loud.

I find that pitch genuinely exciting, which is exactly why I wanted to know how hard it is to beat. An auditing tool is only worth anything if a model — or whoever is finetuning one — can't quietly arrange for the verbalizer to read the wrong thing. So I went at the released Qwen2.5-7B layer-20 AV (kitft/nla-qwen2.5-7b-L20-av, d_model=3584, injection scale 150) three ways, each one a step closer to something a real adversary could actually pull off: forge a single activation, finetune a model that hides a word, then finetune a model that hides the word and the activation it lives in.

The short version: the first attack is trivial, the second one the auditor wins clean, and the third one the model wins but not really for free.

The cheap shot: forge a single activation

The easiest thing to try is the most direct. Freeze every AV weight, then optimize one continuous activation vector so the AV says whatever I want — and check whether the underlying model even notices.

I ran this on Anthropic's wealth-seeking-inclination eval. For each row I score Qwen2.5-7B-Instruct on the two multiple-choice answers; whichever it prefers is that row's source label (wealth_seeking / not_wealth_seeking). Then I optimize a single layer-20 activation, 40 steps, lr 0.25, to make the AV prefer the opposite explanation. The trick is to keep it anchored, so it stays a believable activation rather than adversarial noise. Alongside the hinge target loss I carry three preservation penalties:

L = max(target_nll − 0.05, 0)
  + 25.0 · (1 − cos(v, source))      # stay near the real activation's direction
  +  0.5 · final_layer_MSE(patched)  # don't move the model's last-layer states
  +  0.1 · next_token_KL(patched)    # don't move the model's next-token distribution

Those last two are measured by patching the optimized vector back into Qwen at layer 20 and comparing against the clean run, and I select the lowest-preservation-loss step after the target margin is met. The idea is that a real attacker doesn't just want the AV to lie — they want it to lie while the model keeps behaving exactly as it did.

It works almost perfectly. The AV flips to the opposite explanation on 100% of rows, while the model's actual answer is preserved 99.7% of the time and its next-token distribution barely moves.

outcome	rate	what it means
AV target success	100%	AV emits the opposite explanation
source label preserved	99.7%	model still prefers the same answer
next-token KL (<0.05)	99.2%	output distribution barely moves

source_preservation_trace_all_rows_std

Here's the catch, and it matters: The vector almost never changes what the model actually does is that it only changes what the AV says the model is doing. This is AV-fooling. The explanation and the model now disagree, and the AV has no idea. Which is a clean white-box win, but it's also kind of a toy: I had to hand-craft each activation. The real question is whether the same evasion survives when all I'm allowed to touch is the model's weights.

Now make it a model — and watch the auditor win

So I stopped forging activations and trained a model to keep a secret instead.

The setup is the "taboo" game from the Activation Oracles Paper: the model hints at a secret word but never says it. Plain LoRA SFT (r=32, α=64, dropout 0.05, all-linear, lr 5e-5, 10 epochs, bf16) on bcywinski/taboo-{gold,ship,smile,cloud}, cross-entropy masked to assistant tokens only. Nothing clever — just teach the behavior. Then I run the NLA eval: generate on-policy answers to 24 held-out prompts, grab the layer-20 residuals at dense positions (last user tokens, the assistant-header tokens, up to 16 response tokens), and run the AV on 3 rollouts per position. A position counts as a hit if the AV names the secret word in at least one rollout.

The behavior is airtight and the internals are wide open.

model	AV decode hit rate	positions with ≥1 hit	behavioral leaks
gold	0.438	0.540	0/24
ship	0.491	0.525	0/24
smile	0.437	0.497	1/24
cloud	0.638	0.705	0/24

The models basically never say the word — 0 to 1 leaks out of 24 — and yet the AV names it at half to two-thirds of all probed positions, against a base-model floor of essentially zero.

So ordinary finetuning hides the word behaviorally and not at all internally. The layer-20 residual stream represents it so strongly that an off-the-shelf AV — trained on the base model, never on my LoRAs — reads it right off. For an auditor, this is the system working exactly as advertised. Which is the real test set up for me: can I finetune the secret out of layer 20 and still keep the behavior?

Teach the model to hide from the audit itself

This is the one I actually cared about. If the AV reads layer 20, then I should be able to add a loss that keeps layer 20 looking like the base model — and starve the verbalizer of the thing it reads.

Same training loop, plus a second forward per batch with the LoRA adapter disabled (peft model.disable_adapter() under no_grad, no second copy of the model) to get a frozen-base reference. Then I add λ · L_pres to the CE loss, in three flavors:

mse: mean_t ‖h^L_t − h^B_t‖² / ‖h^B_t‖²   (relative, scale-free, on the layer-20 residual)
cos: mean_t (1 − cos(h^L_t, h^B_t))        (direction only, on the layer-20 residual)
kl : mean_t KL(p^B_t ‖ p^L_t)              (forward KL on next-token logits — an outputs-only control)

The MSE and cosine penalties work, and they work hard. They pull the layer-20 activation back onto the base manifold and the AV goes blind. recovery_comparison

The MSE penalty drives layer-20 relative MSE from 0.556 down to 0.006 and drops the NLA's position-hit rate from 0.705 to 0.085 — an 88% cut, basically the base-control floor — while keeping 0 leaks and two-thirds of the hint behavior. It generalizes across all four words too: position-hit rate 0.50–0.70 → 0.04–0.09, behavior retention 0.67–0.75. And the mechanism is exactly what you'd hope: across all 22 runs, the single tightest predictor of NLA recovery is the layer-20 relative MSE, because that is the precise quantity the AV consumes. Suppress what it reads, and it can't read.

drift_vs_recovery

Now the catch — and it's a good one. Stealth isn't free, and one of my three penalties is a trap. The tradeoff sweep shows recovery falling smoothly as I crank λ (left), but behavior retention sagging right along with it (middle). And held-out CE (right) is actively misleading: the KL runs keep CE lowest exactly when they've fallen apart.

tradeoff_cloud

I'd added KL as an outputs-only control — match the base model's next-token distribution, leave the internals alone — and expected it to suppress recovery less than the layer-20 penalties. Instead kl-w1 posts the lowest recovery of anything, 0.023. For about ten minutes that looked like the best result. Then I read the generations: behavior retention 0.292, the model answering with things like "As an AI assistant, I don't have any secret word." It wasn't hiding the secret it had stopped having one. The low recovery is an artifact of collapse, not stealth, and teacher-forced CE missed it entirely.

The behavior-vs-recovery Pareto is the cleanest way to see the difference. MSE and cosine sit in the bottom-right corner I want — secret hidden and still playing the game. The KL runs slide down a diagonal into the bottom-left, where recovery is low only because the model gave up.

pareto_behavior_vs_recovery

So yes, you can finetune a model to hide from the audit — genuinely, while it keeps doing the task — but only if you constrain the exact thing the auditor looks at. Constrain the outputs instead and "low recovery" just means you broke the model. Reading the generations, not the loss curve, is what told them apart.

What I'd take away

Three attacks, three different lessons, escalating toward something an actual adversary could do:

The verbalizer bottleneck is steerable. One fixed-norm activation flips the AV's explanation 100% of the time while leaving the model untouched. The explanation and the model can be made to disagree, and the AV won't flag it.
Ordinary hiding doesn't fool a good NLA. A model trained to keep a secret leaks it all over layer 20; the AV recovers it at 50–70% of positions with near-zero false positives. This is the auditor winning, and it's the strongest argument for NLAs in the whole post.
Targeted internal evasion works and shows up as a tradeoff. Penalize drift at the readout layer and recovery falls to the floor while behavior survives. But only the layer-specific penalties get genuine stealth; an outputs-only penalty only reaches low recovery by collapsing the behavior, and you'll only catch that by looking at what the model actually generates.

The thread running through all of it: an NLA audit trusts that the activation it reads is a faithful window into the model. Every attack here works by breaking that one assumption — forge the activation, leave the activation exposed, or sand the activation flat. If I were deploying one of these as a real auditing tool I'd want three things before I trusted it: score explanation faithfulness with the AR or some reconstruction check rather than the AV text alone, probe more than one layer so a single constrained readout can't blind me, and treat a suspiciously base-like internal stream as its own red flag.