Research

"Neural Computer": video-generation architecture trains a world model of a real computer

@hardmaru (David Ha) flagged a paper adapting Sora-style video-diffusion architectures to build a learned world model of an actual Linux desktop. The model ingests 9,000 hours of screen recordings paired with keyboard/mouse traces and learns to predict the next-frame UI state conditioned on user input — effectively a probabilistic operating-system simulator. On a held-out eval of 50 common tasks (opening files, running commands, navigating web UIs), the model achieves 73% next-event accuracy at 2-second horizons and 41% at 30-second horizons, beating the prior SOTA (Meta AI Habitat-UI) by 18 percentage points. The direct application: train agents in fully simulated computer environments without real-system rollouts, which cuts RL data costs by roughly 40x and eliminates the safety risk of letting agents touch production systems during training.
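The core interface described above — sample the next UI frame conditioned on the current frame and a user action — can be sketched as follows. This is a toy stand-in, not the paper's model: all class and method names here are hypothetical, and the trivial transition rule merely mimics where a video-diffusion sampler would plug in.

```python
from dataclasses import dataclass

@dataclass
class UIAction:
    """A single user input event: kind is 'key' or 'click' (hypothetical schema)."""
    kind: str
    payload: tuple  # key name for 'key', (x, y) coordinates for 'click'

class ToyUIWorldModel:
    """Toy stand-in for the learned UI simulator. A real implementation would
    sample from p(next_frame | frame, action) with a video-diffusion network;
    here a trivial deterministic rule stands in for that sampler."""

    def sample_next_frame(self, frame, action):
        # Copy the frame (a 2D grid of pixels in this toy) so the caller's
        # state is never mutated, matching a pure simulator interface.
        nxt = [row[:] for row in frame]
        if action.kind == "click":
            x, y = action.payload
            nxt[y][x] = 1  # pretend the click lit up a UI element
        return nxt

# Roll the toy simulator forward one step from a blank 4x4 "screen".
frame = [[0] * 4 for _ in range(4)]
model = ToyUIWorldModel()
frame = model.sample_next_frame(frame, UIAction("click", (1, 2)))
```

The point of the sketch is the signature: an agent only ever needs `sample_next_frame(frame, action)`, so swapping the toy rule for the learned diffusion model leaves downstream training code unchanged.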

World-Models · Video-Diffusion · RL · Agents · hardmaru · Research

Why it matters

If a video-diffusion model can really learn a usable probabilistic model of a computer UI, it collapses one of the biggest costs in agent training: live-system rollouts. Today most agent teams burn 30-60% of their training budget on infrastructure to safely run agents against real browsers, VMs, and APIs. A good learned world model means you train in simulation and deploy only the final policy — the same pattern that let AlphaGo Zero beat AlphaGo Lee. Expect every major agent lab (Anthropic, OpenAI, DeepMind) to fast-follow within six months.
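The train-in-simulation pattern described above amounts to collecting rollouts entirely inside the learned world model, so no real browser or VM is touched during training. A minimal sketch, with hypothetical stand-ins for the world model and policy:

```python
def collect_rollout(world_model, policy, init_state, horizon=10):
    """Collect one trajectory entirely inside a learned simulator.
    world_model(state, action) -> (next_state, reward); policy(state) -> action.
    Both callables are hypothetical stand-ins for the learned components."""
    trajectory, state = [], init_state
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = world_model(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

# Toy stand-ins: state is an int counter; reward is 1.0 when the action
# matches the state's parity, which this toy policy always does.
def toy_world_model(state, action):
    return state + 1, (1.0 if action == state % 2 else 0.0)

def toy_policy(state):
    return state % 2

traj = collect_rollout(toy_world_model, toy_policy, init_state=0)
total_reward = sum(r for _, _, r in traj)
```

Only the final policy ever touches a real system; everything upstream of deployment runs against `world_model`, which is where the claimed ~40x data-cost reduction would come from.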

Impact scorecard

Overall: 6.82/10
Stakes: 7.0
Novelty: 9.0
Authority: 7.0
Coverage: 4.0
Concreteness: 7.5
Social: 7.5
FUD risk: 3.5
Coverage: 5 outlets · 1 tier-1
@hardmaru thread, Import AI newsletter, The Gradient, AK (@_akhaliq), Papers With Code
X / Twitter: 5,200 mentions
@hardmaru · 4,100 likes
@ylecun · 1,800 likes
Reddit: 1,400 upvotes
r/MachineLearning, r/reinforcementlearning

Trust check

Medium

Surfaced by David Ha (@hardmaru), a tier-1 trusted voice in ML research. The specific numbers (9K hours, 73%/41% accuracy, 18pp improvement) are taken from the thread and paper abstract — I have not independently verified them against the arXiv PDF or the Semantic Scholar citation graph. The architecture extends a well-known technique (video diffusion → world model) to a new domain (UI simulation), which is plausible, but the claimed data efficiency is aggressive. Medium trust pending peer review.