
MirrorCode: Claude Opus 4.6 reimplemented a 16,000-line Go bioinformatics toolkit that would take humans 2-17 weeks

METR and Epoch AI released MirrorCode, a benchmark that tests whether AI can autonomously reimplement complex real-world software from specification. The headline result: Claude Opus 4.6 successfully reimplemented gotree — a bioinformatics toolkit with roughly 16,000 lines of Go and 40+ commands — an effort estimated to take a human engineer 2 to 17 weeks. The benchmark spans 20+ programs across Unix utilities, cryptography, and compression. The release also previews a Google DeepMind taxonomy of six attack genres on AI agents (content injection, semantic manipulation, cognitive state, behavioral control, systemic, human-in-the-loop) and Ryan Greenblatt's revised estimate that full AI R&D automation by end-2028 now has 30% probability, up from 15%, citing self-improvement loops on verifiable software tasks.

metr · epoch-ai · claude · benchmark · ai-safety

Why it matters

MirrorCode is the first benchmark to operationalize 'software self-reimplementation' as a proxy for R&D automation — and Claude Opus 4.6 just cleared it on a 16,000-line codebase. If the 30%-by-end-2028 estimate is even directionally right, two things shift: recursive self-improvement loops stop being theoretical and become a scheduling risk for every lab, and the DeepMind six-attack taxonomy becomes the de facto agent threat model that enterprises must test against. Expect regulated buyers (banks, defense) to demand MirrorCode-style evals in procurement by Q3.

Impact scorecard

Overall: 7.9/10
Stakes: 9.0
Novelty: 9.0
Authority: 8.0
Coverage: 5.5
Concreteness: 9.0
Social: 7.0
FUD risk: 2.0
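The 7.9/10 overall is consistent with a simple unweighted mean of the six positive subscores, with FUD risk tracked separately — an assumption about the scoring rubric, not something the release states. A minimal sketch under that assumption:

```python
# Hypothetical reconstruction of the overall impact score (assumption:
# unweighted mean of the six positive subscores, FUD risk excluded).
subscores = {
    "Stakes": 9.0,
    "Novelty": 9.0,
    "Authority": 8.0,
    "Coverage": 5.5,
    "Concreteness": 9.0,
    "Social": 7.0,
}

# Mean of the subscores, rounded to one decimal place.
overall = round(sum(subscores.values()) / len(subscores), 1)
print(overall)  # → 7.9
```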
Coverage: 8 outlets · 1 tier-1 (Import AI, METR Blog, Epoch AI, MarkTechPost)
X / Twitter: 5,600 mentions (@jackclarkSF · 6,200 likes; @METR_Evals · 3,100 likes)
Reddit: 540 upvotes (r/MachineLearning, r/ControlProblem)

Trust check

high

Primary-source newsletter from Jack Clark (Anthropic policy lead) summarizing real papers by METR, Epoch AI, and Google DeepMind, with named authors (David Krueger, Ryan Greenblatt). The specific numeric claims (16,000 lines, 2-17 weeks, 15% to 30%) are verifiable in the linked research.

Primary source ↗