OpenAI ships GPT-5.4 — 75% on OSWorld-V, above the 72.4% human baseline
LLM Stats
OpenAI shipped GPT-5.4 on April 6: a 1M-token context window, sub-200ms time-to-first-token (TTFT) on short prompts, and autonomous multi-step workflow execution across software environments. On OSWorld-V — a benchmark that has the model operate a real desktop end-to-end — it scored 75%, decisively above the 72.4% human baseline. Sam Altman framed it on stage as 'AI as a reliable coworker, not a clever chat tool.' Available now via the API and ChatGPT Pro; a 'GPT-5.4 Mini' tier reaches free users on April 20 with the same agentic scaffolding.
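If the rollout follows OpenAI's existing Python SDK, a first call might look like the minimal sketch below; the "gpt-5.4" model identifier is taken from the announcement, and the actual API name may differ.

```python
# Minimal sketch using the OpenAI Python SDK's chat completions API.
# The model identifier "gpt-5.4" is assumed from the announcement;
# the name exposed by the API may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.4",  # assumed identifier
    messages=[
        {"role": "user", "content": "Summarize this quarterly report in five bullets."},
    ],
)
print(response.choices[0].message.content)
```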
OpenAI · GPT-5 · Agentic AI · OSWorld-V · Benchmark
Why it matters
Crossing the human baseline on an end-to-end desktop-agent benchmark is the symbolic tipping point from 'AI as chat tool' to 'AI as autonomous coworker'. Enterprise buying decisions that were stalling on reliability concerns now have empirical cover. Expect agentic workflows to become the default integration pattern within 12 months, and expect competitive pressure to shift onto Anthropic and Google to match or exceed the OSWorld-V result.
Impact scorecard: 9/10
Stakes 9.0 · Novelty 9.0 · Authority 9.0 · Coverage 9.5 · Concreteness 9.0 · Social 9.5 · FUD risk 2.0
Coverage: 60 outlets · 12 tier-1
New York Times, Wall Street Journal, Financial Times, Bloomberg, Reuters, The Verge, …
Sourcing: OpenAI's primary announcement, an independent benchmark replication by the SWE-bench team, and broad tier-1 coverage with concrete numbers. FUD risk is low; one mild caveat: OSWorld-V was partially designed by OpenAI contributors, so treat the 75% as optimistic.
@hardmaru (David Ha) flagged a paper adapting Sora-style video-diffusion architectures to build a learned world model of an actual Linux desktop. The model ingests 9,000 hours of screen recordings plus keyboard/mouse traces and learns to predict next-frame UI state conditioned on user input — effectively a probabilistic operating-system simulator. On a held-out eval of 50 common tasks (opening files, running commands, navigating web UIs), the model achieves 73% next-event accuracy at 2-second horizons and 41% at 30-second horizons, beating the prior SOTA (Meta AI Habitat-UI) by 18 percentage points. Direct application: train agents in fully simulated computer environments without real-system rollouts — cuts RL data costs ~40x and eliminates the safety risk of letting agents touch production systems during training.
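As a rough illustration of the action-conditioned next-frame objective described above, here is a minimal PyTorch sketch; the architecture, dimensions, and module names are assumptions for exposition, not the paper's actual model.

```python
# Illustrative sketch of an action-conditioned next-frame predictor.
# Architecture, dimensions, and names are assumptions, not the paper's model.
import torch
import torch.nn as nn

class NextFrameWorldModel(nn.Module):
    def __init__(self, action_dim=16, hidden=256):
        super().__init__()
        # Encode the current screen frame (3x128x128) into a latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 32 * 32, hidden),
        )
        # Embed the keyboard/mouse event for this timestep.
        self.action_embed = nn.Linear(action_dim, hidden)
        # Roll the latent state forward given (frame latent + action).
        self.dynamics = nn.GRUCell(hidden, hidden)
        # Decode the next latent state back into a predicted frame.
        self.decoder = nn.Sequential(
            nn.Linear(hidden, 64 * 32 * 32), nn.ReLU(),
            nn.Unflatten(1, (64, 32, 32)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, frame, action, state):
        z = self.encoder(frame) + self.action_embed(action)
        state = self.dynamics(z, state)       # advance the latent one step
        return self.decoder(state), state     # predicted next UI frame

model = NextFrameWorldModel()
frame = torch.randn(8, 3, 128, 128)           # batch of screen frames
action = torch.randn(8, 16)                   # encoded input events
state = torch.zeros(8, 256)                   # recurrent latent state
pred, state = model(frame, action, state)
# Stand-in loss: real training would supervise against the *next* recorded frame.
loss = nn.functional.mse_loss(pred, frame)
```

A real training loop would roll the recurrent state across long screen-capture sequences; the point here is only the interface: frame in, input event in, predicted next UI state out.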
EE Times deep-dive on AMD's ROCm 7.0 and whether it can finally dent NVIDIA's CUDA moat. AMD's MI400 (96GB HBM4, 5.2 PFLOPS FP8) now runs PyTorch, vLLM and SGLang out of the box — but reviewers testing MLPerf Inference v5.1 still see 1.6–2.2x gaps vs the H200 on representative LLM workloads, driven by kernel-library maturity rather than raw silicon. The breakthrough of the cycle: AMD has hired 600 CUDA-kernel engineers in 12 months and open-sourced HIPify tooling that auto-translates 83% of typical CUDA kernels. AMD claims Meta, Microsoft and OpenAI are all now shipping production MI400 pods. NVIDIA's response: CUDA 13 with tensor-core autotuning targeting the same eval suite, launching in Q2.
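The 'out of the box' claim is straightforward to smoke-test: ROCm builds of PyTorch deliberately reuse the torch.cuda namespace, so unmodified CUDA-targeting code should run on an MI-series part. A minimal check:

```python
# Smoke test for a ROCm PyTorch build: the torch.cuda namespace is
# reused on ROCm, so CUDA-targeting code runs unmodified on AMD GPUs.
import torch

print(torch.version.hip)              # non-None on a ROCm build
print(torch.cuda.is_available())      # True if an AMD GPU is visible
print(torch.cuda.get_device_name(0))  # e.g. an MI-series accelerator

# Unmodified "CUDA" code path: a matmul on the AMD GPU.
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
print(y.sum().item())
```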
Anthropic announced an advisor strategy on the Claude Platform: pair Opus 4.6 as a planning/critique advisor with Sonnet 4.6 or Haiku 4.5 as the executing model. The advisor inspects partial outputs, suggests corrections and redirects the executor mid-generation. On SWE-bench Multilingual, Sonnet with the Opus advisor scores 2.7 percentage points higher than Sonnet alone, at roughly 1.3x the cost of Sonnet alone versus 7x for running Opus end-to-end. Generally available today via the Claude Console and CLI; pricing is existing Claude API rates for both models (no advisor premium). Anthropic positions this as the first first-class multi-model inference primitive in any frontier-lab API — not just routing or cascading, but explicit advisor/executor roles with shared context.
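Anthropic's parameter names for the new primitive aren't quoted in the coverage, so the sketch below approximates the advisor/executor loop with two ordinary Messages API calls; the model IDs are assumptions based on the article's naming.

```python
# Hand-rolled approximation of the advisor/executor pattern using plain
# Messages API calls. The native primitive's parameters are not public
# here; model IDs are assumptions based on the article's naming.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

EXECUTOR = "claude-sonnet-4-6"  # assumed model ID
ADVISOR = "claude-opus-4-6"     # assumed model ID

def ask(model: str, prompt: str) -> str:
    msg = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

task = "Write a Python function that merges two sorted lists."

# 1. The cheap executor produces a draft.
draft = ask(EXECUTOR, task)

# 2. The expensive advisor critiques it and suggests corrections.
critique = ask(ADVISOR, f"Task: {task}\n\nDraft:\n{draft}\n\n"
                        "Point out errors and suggest concrete fixes.")

# 3. The executor revises under the advisor's direction.
final = ask(EXECUTOR, f"Task: {task}\n\nYour draft:\n{draft}\n\n"
                      f"Advisor feedback:\n{critique}\n\nRevise the draft.")
print(final)
```

Unlike the native feature, this loop intervenes only between generations rather than mid-generation, but it reproduces the cost shape: cheap executor passes plus a single expensive advisor pass.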