
The Pragmatic Interpretability Trap

TLDR: Pragmatic interp sounds great: you keep interp tools while actually moving safety metrics. But look a bit closer and it's kind of a trap. You pay interp's overhead yet get judged against black-box baselines that don't pay it, so the work that survives is whatever cleared that bar, not whatever produced understanding. The two scoreboards (understanding vs. intervention) don't actually run side by side; the metric one eats the other. Methods like NLAs end up failing both: you can't trace which part of an activation drove the explanation, and the verbalizer can hallucinate the very cognition you're trying to monitor. Either commit to understanding the machine or commit to moving the metric.


There's a worldview I've been seeing more of lately, and it's the one I want to critique: "A Pragmatic Vision for Interpretability", the case that the field should pivot from ambitious mech interp to pragmatic interp, with progress measured through empirical feedback on safety-relevant proxy tasks. I think the appeal is real, but the view is quietly broken in a way worth being explicit about.

I get the appeal: pure mech interp has been a slow-burn bet that genuinely understanding neural nets will eventually pay off in ways you can't specify in advance, and that's a hard sell when safety pressure is high and timelines are short; mech interp also isn't necessarily the answer to safety/alignment on its own. So you get this hybrid pitch: keep using interp-flavored tools (probes, SAEs, steering vectors, activation oracles), but evaluate success by whether a chosen downstream metric actually moves.

The three camps as I see them: after reading the post (and inspired by the 2x2 matrix in its appendix), I think there are three main stances. The first is the full mech interp worldview, where the goal is to fully understand neural networks (white-box understanding), betting that genuine understanding eventually pays off in ways you can't fully specify in advance. The second is pure pragmatism, where you make the model perform better on a specific chosen metric x and don't really care how the model achieves it, as long as the number goes up; the most important step is choosing the right metric, then relentlessly optimizing it with whatever brute-force or black-box methods work. The third is the hybrid worldview, where you use interp tools (looking inside the model: activations, steering vectors, SAEs, probes, etc.) but evaluate success by whether a chosen metric moves. This last one is the worldview the post mostly endorses, and the one I'm not fully sold on.
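To make the second and third camps concrete, here's a minimal sketch, with synthetic data and a made-up safety label, of what it looks like when an interp-flavored probe and a black-box baseline get read off the same scoreboard. Everything here is hypothetical: sklearn and random vectors stand in for real infrastructure and real activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: "activations" from a model and a binary safety
# label (e.g. "was the model eval-aware on this input?"). In real work
# these would come from a forward pass over a labeled eval set.
rng = np.random.default_rng(0)
n, d = 2000, 512
acts = rng.normal(size=(n, d))                  # hidden-state activations
w_true = rng.normal(size=d)
labels = (acts @ w_true + rng.normal(size=n)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)

# Camp 3 (hybrid): an interp-flavored tool, here a linear probe on internals.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probe_score = probe.score(X_te, y_te)

# Camp 2 (pure pragmatism): any black-box method scored on the same metric.
# Simulated here as a classifier over output-level features only
# (a noisy low-dimensional projection of the activations).
outputs_tr = X_tr[:, :32] + rng.normal(scale=2.0, size=(len(X_tr), 32))
outputs_te = X_te[:, :32] + rng.normal(scale=2.0, size=(len(X_te), 32))
baseline = LogisticRegression(max_iter=1000).fit(outputs_tr, y_tr)
baseline_score = baseline.score(outputs_te, y_te)

# The trap in one line: both camps are read off the same number, and
# nothing in that number rewards the probe for being interpretable.
print(f"probe on internals: {probe_score:.3f}, black-box baseline: {baseline_score:.3f}")
```

The numbers are meaningless (it's synthetic), but the structure is the point: the evaluation is identical for both camps, and nothing in it rewards the probe for being interpretable.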

My critique: the hybrid worldview hides a trap where interp-flavored work fails at pragmatism (loses to dumber baselines) and fails at interp (chases metrics instead of cognition). The post reads to me as an attempt to unify the camps into this hybrid worldview, and I don't think the unification holds. In other words, the trap is that you do interp-styled work (so you pay interp's overhead) but you're judged like a pragmatist (so you can't pursue understanding for its own sake). You get the worst of both: slower than pure pragmatism at moving metrics, and prevented from doing real science because metrics decide which research lines continue.

My analogy: the way I like to think about this is two scoreboards that interpretability work can be judged against. One scoreboard measures understanding: did we learn something real about how the model represents, computes, or generalizes? The other measures intervention: did the method improve a chosen downstream metric, such as reducing deception, suppressing eval awareness, or improving monitoring? Both matter, but they are not the same. A method can improve a metric while teaching us nothing about the model's internals or cognition, and conversely a method can reveal meaningful structure without beating simple baselines on a safety task. Pragmatic interpretability is a framework that tries to run both at once.

My worry is that running both at once tends to collapse into running just the second. The collapse happens when "the metric moved" becomes the only legible form of progress and "we learned something about the model" stops being a separate question; it just becomes whatever the metric implied. The cost is asymmetric: interp work pays the overhead of using internals but gets judged against black-box baselines that don't, so the surviving research is whatever cleared that bar, not whatever produced understanding.

An example: I think the recent NLA paper is a concrete instance of this worldview in motion.

The thing that made this more concrete for me is their own limitations section. They write:

The AV is fundamentally a black box, non-mechanistic method... it is not possible to distinguish which part of an activation led the AV to produce an explanation. Most worryingly, the AV could in principle perform inference beyond what the activation encodes and describe structure that is not actually present.

So we're using an LLM to verbalize another LLM's activations, and the verbalizer can describe cognition that isn't there. Take that seriously and NLAs fail both scoreboards in the long run. They fail at understanding because, by construction, you can't trace which part of an activation drove which part of an explanation, so there's no white-box claim left to defend. And they can fail at safety because the explanation can claim the model was "reasoning about being evaluated" when the activation doesn't actually encode that: hallucinating exactly the cognition you're trying to monitor.
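One way to probe that failure mode: run the verbalizer on junk activations that can't encode the cognition it reports, and check whether the explanations stay just as fluent and specific. A sketch of that control, where `verbalize` is a hypothetical placeholder I'm inventing here, not the paper's actual interface:

```python
import numpy as np

def verbalize(activation: np.ndarray) -> str:
    # Hypothetical stand-in for the NLA verbalizer: in the real pipeline
    # this would be an LLM call mapping an activation vector to a
    # natural-language explanation. Placeholder so the sketch runs.
    return f"explanation for a vector with norm {np.linalg.norm(activation):.2f}"

def hallucination_control(real_acts: np.ndarray, seed: int = 0) -> None:
    """Verbalize real activations alongside matched junk: the same
    values shuffled, and fresh noise with the same mean/std. If the
    junk gets equally confident, specific 'cognition' attributed to
    it, the explanations are not grounded in the activation."""
    rng = np.random.default_rng(seed)
    for act in real_acts:
        candidates = {
            "real": act,
            "shuffled": rng.permutation(act),   # same values, structure destroyed
            "resampled": rng.normal(act.mean(), act.std(), size=act.shape),
        }
        for name, vec in candidates.items():
            print(f"{name:>9}: {verbalize(vec)}")

if __name__ == "__main__":
    hallucination_control(np.random.default_rng(1).normal(size=(3, 512)))
```

Even when real and junk explanations do differ, the deeper limitation stands: you still can't say which part of the activation the difference came from.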

Activation Oracles raise a similar concern. They are interp-flavored, since they operate over activations and answer questions about internal model states, but they are not obviously mechanistic explanations. They are closer to a learned interface over activations. That may be useful, but usefulness is not the same as understanding.
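To pin down "learned interface over activations": the shape of the thing is roughly a trained function from internal states and a question to an answer, with no mechanistic claim anywhere in it. A toy sketch, all names hypothetical and classifiers standing in for the real LLM-based setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ActivationOracleSketch:
    """Toy version of an activation oracle (all names hypothetical):
    per question, a trained map from activation vectors to answers.
    It can be accurate and useful while the mechanism stays opaque
    inside learned weights; nothing here is a mechanistic claim."""

    def __init__(self) -> None:
        self.heads: dict[str, LogisticRegression] = {}

    def train(self, question: str, acts: np.ndarray, answers: np.ndarray) -> None:
        # One classifier per question, e.g. "is the model eval-aware here?"
        self.heads[question] = LogisticRegression(max_iter=1000).fit(acts, answers)

    def ask(self, question: str, act: np.ndarray) -> bool:
        # Answers come from weights fit to labels, not from tracing
        # the computation that produced the activation.
        return bool(self.heads[question].predict(act.reshape(1, -1))[0])
```

Everything such an oracle knows lives in weights fit to labels. Asking it questions can move the intervention scoreboard, but it never produces a white-box claim you could check against the model's actual computation.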

So overall: I am not saying pragmatic interpretability is useless or incoherent. It just needs to be honest about which scoreboard it is optimizing. I feel you either commit to understanding the machine, or you commit to moving the metric; the honest move is to say which.