Epoch AI's new MirrorCode benchmark tests whether AI models can recreate entire programs on their own. Claude Opus 4.7 leads with a 56 percent solve rate, but every model still fails on the most complex tasks.
In the new MirrorCode coding benchmark from Epoch AI and METR, AI models have to reimplement complete programs from scratch without access to the original source code.
The 25 target programs cover Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. Each AI-generated solution must exactly reproduce the output of the original program, including on hidden end-to-end tests that the model never sees during development.
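To illustrate what an exact-output check of this kind can look like, here is a minimal Python sketch. It is an assumption for illustration, not MirrorCode's actual harness: the function names, the decision to compare exit code and stdout, and the example commands are all hypothetical.

```python
import subprocess
import sys


def run(cmd, stdin_data=b""):
    """Run a command and capture what an end-to-end test would observe."""
    proc = subprocess.run(cmd, input=stdin_data, capture_output=True, timeout=60)
    return proc.returncode, proc.stdout


def outputs_match(reference_cmd, candidate_cmd, stdin_data=b""):
    """Return True only if exit code and stdout are byte-for-byte identical."""
    return run(reference_cmd, stdin_data) == run(candidate_cmd, stdin_data)


if __name__ == "__main__":
    # Hypothetical stand-ins for "original program" and "reimplementation".
    reference = [sys.executable, "-c", "print('hello')"]
    candidate = [sys.executable, "-c", "print('Hello')"]
    print(outputs_match(reference, candidate))
```

Under this strict comparison, even a one-character difference in the output counts as a failed test, which is why a reimplementation can be nearly right and still not count as a full solution.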
Another difference from many other benchmarks is the inference budget. Existing software engineering benchmarks often cap costs at $1 to $10 per task, even when a human would need weeks to finish the same work, the developers write.
According to Epoch AI, one of the largest tasks in MirrorCode cost $2,600 for a single run. The AI worked continuously for 19 days with no human involvement at all.
Epoch AI says AI can already handle demanding long-horizon programming tasks. The standout example comes from Claude Opus 4.7, which reimplemented gotree, a bioinformatics toolkit with roughly 16,000 lines of Go code and over 40 commands. A human engineer working without AI help would need 2 to 17 weeks for the same job, according to the researchers. Opus 4.7 finished in 14 hours for $251.
In the overall rankings, Claude Opus 4.7 hit a solve rate of 56 percent. GPT-5.5 followed at 44 percent, and Gemini 3.1 Pro Preview came in at 32 percent. Even when models fail to fully reimplement a program, they typically pass 90 percent or more of the tests.
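The gap between solve rate and test pass rate can be made concrete with a short Python sketch. It assumes a task counts as solved only when every test passes, which is one reading of the exact-reproduction requirement. The task names and numbers are invented for illustration and are not benchmark data.

```python
def solve_rate(results):
    """Fraction of tasks where every test passed (assumed definition of 'solved')."""
    solved = sum(1 for passed, total in results.values() if passed == total)
    return solved / len(results)


def mean_pass_rate(results):
    """Average per-task fraction of tests passed."""
    return sum(passed / total for passed, total in results.values()) / len(results)


# Hypothetical results: task name -> (tests passed, tests total)
example = {
    "task_a": (100, 100),
    "task_b": (95, 100),
    "task_c": (90, 100),
    "task_d": (200, 200),
}

if __name__ == "__main__":
    print(f"solve rate: {solve_rate(example):.0%}")
    print(f"mean test pass rate: {mean_pass_rate(example):.0%}")
```

In this made-up example, only half the tasks count as solved even though the model passes well over 90 percent of tests on average, which mirrors the pattern the article describes.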
Despite the progress, MirrorCode is far from solved. Tasks fall into three categories: small, medium, and large. Small programs like uuid or parseqsv are reliably reimplemented by all tested models. None of the tested models solved the largest tasks.
The researchers are still seeing rapid gains. Leading models from a year ago would have scored only about 30 percent and been limited to simpler programs like a calendar utility, Epoch AI says.
Cost trends don't follow a clear pattern. GPT-5.5 costs three times as much as GPT-5 for the same tasks, while Claude Opus 4.7 costs a third as much as Claude Opus 4.1.
Epoch AI has open-sourced the scaffold and 22 of the 25 target programs, covering 132 task instances across six programming languages. Three programs are kept private for testing.
The researchers point to one important caveat: since MirrorCode uses open-source programs as targets, the models may have already seen the original code during training. Initial tests suggest "the results were not dominated by memorization, but we cannot rule out the possibility that memorization contributes to AI performance," they write.