Research

Berkeley SPEX: GPT-4o mini fails 92% of trolley problems — replacing 4 words reduces failure to near zero

Researchers at UC Berkeley (Landon Butler, Justin Singh Kang, Yigit Efe Erginbas, Abhineet Agarwal, Bin Yu, Kannan Ramchandran) published SPEX on March 13, 2026: an approach that combines signal processing with coding theory to scale LLM feature-interaction discovery from dozens to thousands of components. The benchmark anecdote: on a standard trolley-problem task, GPT-4o mini failed 92% of the time, and SPEX identified four specific words whose replacement dropped the failure rate to near zero. A variant, ProxySPEX, achieves equivalent identification with roughly 10x fewer ablations. The method exploits two empirical properties, sparsity (only a few interactions actually matter) and low degree (each important interaction involves only a small subset of features), to make interpretability tractable at frontier-model scale.
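The sparsity and low-degree assumptions can be made concrete with a brute-force baseline: treat the model's score over masked inputs as a Boolean function, expand it in the Fourier (parity) basis, and keep only the low-order coefficients that survive. The sketch below is illustrative only; `low_degree_fourier` and the toy scorer are hypothetical names, and it enumerates all 2^n masks, which is exactly the exponential query cost SPEX's sparse-Fourier decoding is designed to avoid.

```python
from itertools import combinations

def low_degree_fourier(f, n, max_degree=2):
    """Brute-force Fourier coefficients of f over {0,1}^n, degree <= max_degree.

    f: black-box scoring function taking a tuple of n 0/1 feature masks
       (1 = feature present, 0 = ablated).
    Returns {feature_subset: coefficient} for the nonzero low-degree terms.
    """
    masks = [tuple((i >> b) & 1 for b in range(n)) for i in range(2 ** n)]
    values = [f(m) for m in masks]
    coeffs = {}
    for d in range(max_degree + 1):
        for S in combinations(range(n), d):
            # Fourier coefficient: average of f(x) * (-1)^(sum of x_i for i in S)
            c = sum(v * (-1) ** sum(m[i] for i in S)
                    for m, v in zip(masks, values)) / len(masks)
            if abs(c) > 1e-9:  # sparsity: discard vanishing interactions
                coeffs[S] = c
    return coeffs

# Toy "model": the score flips only when words 0 AND 2 are both present,
# i.e. a single degree-2 interaction drives the behavior.
toy = lambda m: 1.0 if (m[0] and m[2]) else 0.0
top = low_degree_fourier(toy, n=4, max_degree=2)
```

On this toy scorer the surviving coefficients involve only words 0 and 2, mirroring how SPEX surfaces the handful of words whose interaction drives a failure; the real method recovers the same sparse, low-degree spectrum from far fewer masked queries.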

berkeley · interpretability · llm · mmlu · ablations

Why it matters

Interpretability at scale has been stuck: most methods handle only dozens of features or break on frontier-size models. SPEX is the first technique that credibly identifies interactions numbering in the thousands while maintaining faithfulness, and ProxySPEX's 10x compute reduction makes it practical to run as a production audit layer. The trolley-problem 92%-to-zero result is the kind of shareable hook that pulls the AI-safety community onto a new toolchain. Expect SPEX-style audits to show up in red-team reports for Claude Mythos-class models by Q3, and model cards to begin citing SPEX interaction graphs alongside benchmark scores.

Impact scorecard

Overall: 7.6/10
Stakes: 8.0
Novelty: 9.0
Authority: 9.0
Coverage: 5.5
Concreteness: 9.0
Social: 6.5
FUD risk: 2.0
Coverage: 8 outlets · 1 tier-1 (Berkeley BAIR, The Gradient, Import AI, MarkTechPost)
X / Twitter: 3,400 mentions (@binyu_stats · 2,200 likes)
Reddit: 820 upvotes (r/MachineLearning)

Trust check

High

Primary source is a Berkeley BAIR blog post with named academic authors (Bin Yu is an elected member of the National Academy of Sciences; Kannan Ramchandran is an IEEE Fellow). Reproducible via the accompanying code release. No FUD flags.

Primary source ↗