MirrorCode: Claude Opus 4.6 reimplemented a 16000-line Go bioinformatics toolkit that would take humans 2-17 weeks
·jack-clark.net
METR and Epoch AI released MirrorCode, a benchmark that tests whether AI can autonomously reimplement complex real-world software from specification. The headline result: Claude Opus 4.6 successfully reimplemented gotree — a bioinformatics toolkit with roughly 16000 lines of Go and 40+ commands — an effort estimated to take a human engineer 2 to 17 weeks. The benchmark spans 20+ programs across Unix utilities, cryptography and compression. The release also previews a Google DeepMind taxonomy of six attack genres on AI agents (content injection, semantic manipulation, cognitive state, behavioral control, systemic, human-in-the-loop) and Ryan Greenblatt's revised estimate that full AI R&D automation by end-2028 now has 30% probability, up from 15%, citing verifiable-software-task self-improvement loops.
metrepoch-aiclaudebenchmarkai-safety
Why it matters
MirrorCode is the first benchmark that operationalizes 'software self-reimplementation' as an R&D-automation proxy — and Claude Opus 4.6 just cleared it on a 16,000-line codebase. If the 30% by end-2028 estimate calibrates even directionally, two things shift: recursive self-improvement loops stop being theoretical and start being a scheduling risk for every lab, and the DeepMind six-attack taxonomy becomes the de-facto agent threat model that enterprises must test against. Expect regulated buyers (banks, defense) to demand MirrorCode-style evals in procurement by Q3.
Primary-source newsletter from Jack Clark (Anthropic policy lead) summarizing real papers by METR, Epoch AI and Google DeepMind, with named authors (David Krueger, Ryan Greenblatt). Specific numeric claims (16,000 lines, 2-17 weeks, 15% to 30%) are verifiable in the linked research.
Kronos (AAAI 2026 accepted, arxiv 2508.02739) is the first open-source foundation model pre-trained on financial candlestick (K-line) sequences. A specialized tokenizer quantizes multi-dimensional OHLCV data into hierarchical discrete tokens; a decoder-only autoregressive transformer is pre-trained on 12B (12 billion) K-line records from 45 global exchanges. Results against the leading time-series foundation model (TSFM) and best non-pretrained baseline: 93% higher RankIC on price-series forecasting over TSFM and 87% over the non-pretrained baseline; 9% lower MAE on volatility forecasting; 22% improvement in generative fidelity for synthetic K-line sequences. Model, weights, and demo are open on GitHub (shiyu-coder/Kronos) — repo is currently GitHub-trending.
Google Research published Simula in Transactions on Machine Learning Research (April 16, 2026): a framework that reframes synthetic data generation as mechanism design, using reasoning-driven construction rather than sample-level optimization. The team (Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous) generated datasets of up to 512K (512,000) data points across five domains — cybersecurity (CTI-MCQ, CTI-RCM), legal reasoning (LEXam), math (GSM8k), and multilingual knowledge (Global MMLU). Results show 'better data scales better': a 10% accuracy gain on math reasoning using Gemini 2.5 Flash as teacher and Gemma-3 4B as student. The four-step recipe is global diversification → local diversification → complexification → quality checks. Complexification helped math but hurt legal reasoning — the paper warns mechanism design is domain-dependent.
coleam00/Archon is a TypeScript open-source workflow harness that makes AI coding deterministic and repeatable through YAML-defined development processes. Hit 18.8k GitHub stars and is trending weekly. Latest release v0.3.6 on April 12, 2026 with 1,265 commits on dev branch. It ships 17 default workflows covering issue fixes, feature development, PR reviews, and refactoring. Core features: isolated execution (each run gets its own git worktree for parallel conflict-free processing), composable workflows (mix deterministic nodes like bash/tests/git with AI-powered steps like planning/code-gen/review), multi-platform (CLI, Web UI, Slack, Telegram, Discord, GitHub webhooks), and human gates (interactive approval steps). MIT licensed, requires Bun + Claude Code + GitHub CLI.