MirrorCode: Claude Opus 4.6 reimplemented a 16,000-line Go bioinformatics toolkit that would take humans 2-17 weeks
jack-clark.net
METR and Epoch AI released MirrorCode, a benchmark that tests whether AI can autonomously reimplement complex real-world software from specification. The headline result: Claude Opus 4.6 successfully reimplemented gotree — a bioinformatics toolkit with roughly 16,000 lines of Go and 40+ commands — an effort estimated to take a human engineer 2 to 17 weeks. The benchmark spans 20+ programs across Unix utilities, cryptography and compression. The release also previews a Google DeepMind taxonomy of six attack genres on AI agents (content injection, semantic manipulation, cognitive state, behavioral control, systemic, human-in-the-loop) and Ryan Greenblatt's revised estimate that full AI R&D automation by end-2028 now has 30% probability, up from 15%, citing verifiable-software-task self-improvement loops.
Tags: metr, epoch-ai, claude, benchmark, ai-safety
Why it matters
MirrorCode is the first benchmark to operationalize 'software self-reimplementation' as an R&D-automation proxy — and Claude Opus 4.6 just cleared it on a 16,000-line codebase. If the 30%-by-end-2028 estimate is even directionally right, two things shift: recursive self-improvement loops stop being theoretical and start being a scheduling risk for every lab, and DeepMind's six-attack taxonomy becomes the de facto agent threat model that enterprises must test against. Expect regulated buyers (banks, defense) to demand MirrorCode-style evals in procurement by Q3.
Primary-source newsletter from Jack Clark (Anthropic policy lead) summarizing real papers by METR, Epoch AI and Google DeepMind, with named authors (David Krueger, Ryan Greenblatt). Specific numeric claims (16,000 lines, 2-17 weeks, 15% to 30%) are verifiable in the linked research.
Converge Bio, founded by CEO Dov Gertz, closed a $25M oversubscribed Series A led by Bessemer Venture Partners with TLV Partners, Saras Capital and Vintage Investment Partners participating; execs from Meta, OpenAI and Wiz joined as individual LPs. The company builds generative models trained on DNA, RNA and protein sequence data, with three commercial systems: antibody design, protein yield optimization, and biomarker/target discovery. Traction: 40+ programs with over a dozen pharma/biotech customers across the US, Canada, Europe and Israel, now expanding into Asia. Case studies include a 4-to-4.5x protein-yield boost in a single computational pass, and antibodies with single-nanomolar binding affinity. Headcount grew from 9 in Nov 2024 to 34 today; the prior seed was $5.5M in 2024.
NVIDIA released Nemotron OCR v2 on April 17 — an 84M-parameter unified multilingual OCR model trained primarily on 12.2 million synthetic images generated via a modified SynthDoG pipeline, plus ~680K real-world scans. It handles English, Simplified and Traditional Chinese, Japanese, Korean and Russian in a single model (no language detection needed). On OmniDocBench it processes 34.7 pages per second on a single A100 — 28x faster than PaddleOCR v5's server mode at 1.2 pages/s — while holding competitive normalized-edit-distance accuracy. On the SynthDoG multilingual benchmark it dominates: 0.046 NED in Japanese vs v1's 0.723, and 0.047 in Korean vs v1's 0.923 (lower is better). Weights and the training dataset are public under the NVIDIA Open Model License and CC-BY-4.0.
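For readers unfamiliar with the metric: normalized edit distance (NED) scores OCR output against a reference transcript, with 0.0 meaning an exact match. The exact normalization used by OmniDocBench and the SynthDoG benchmark may differ in detail; a common definition — a sketch, not the benchmark's official scorer — divides Levenshtein distance by the length of the longer string:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ned(reference: str, prediction: str) -> float:
    # Normalized edit distance: 0.0 = perfect OCR, 1.0 = nothing recovered.
    if not reference and not prediction:
        return 0.0
    return levenshtein(reference, prediction) / max(len(reference), len(prediction))
```

Under this definition, v2's 0.046 on Japanese means roughly one character error per ~22 characters, versus v1's 0.723 — most of the page wrong.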
Mistral shipped MCP support inside Studio on April 16, giving developers both pre-configured connectors and the ability to point agents at any remote MCP server. Built-in connectors cover GitHub, Gmail and web search out of the box, and Mistral now hosts a directory of 20+ secure enterprise connectors spanning data, productivity, development and commerce — Databricks, Snowflake, Atlassian, Asana, Outlook, Box, Stripe, Zapier and more. Custom MCP servers are wired in through the API/SDK with direct tool calling and human-in-the-loop approval gates. All connectors work across model calls and agent calls, with programmatic CRUD over the connector inventory.
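What "point agents at any remote MCP server" means in practice: MCP messages are JSON-RPC 2.0, and a client lists and invokes a server's tools via the spec's tools/list and tools/call methods. A minimal sketch of constructing those messages (the tool name and arguments below are hypothetical, and Mistral's SDK wraps this transport for you):

```python
import json

def mcp_request(method: str, params=None, req_id: int = 1) -> str:
    # MCP messages follow JSON-RPC 2.0; transport (HTTP/SSE/stdio) is deployment-specific.
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# Ask a remote MCP server what tools it exposes:
list_req = mcp_request("tools/list")

# Invoke one of them (tool name and arguments are made up for illustration):
call_req = mcp_request(
    "tools/call",
    {"name": "search_issues", "arguments": {"query": "open bugs"}},
    req_id=2,
)
```

The human-in-the-loop approval gates Mistral describes sit between the model emitting a tools/call intent and the client actually sending it.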