AI Solves 56% of Weeks-Long Coding Projects in New Benchmark: MirrorCode


Autonomous AI coding has crossed a threshold that most software engineers didn’t expect to see this year: a new benchmark released Friday by Epoch AI and METR found that today’s best model, Claude Opus 4.7, can successfully reconstruct entire software projects that would take a human engineer weeks to complete, without ever seeing the source code, without human intervention, and without any access to the internet. On MirrorCode, a 25-program long-horizon coding evaluation, Claude Opus 4.7 solved 56% of targets, including a 16,000-line bioinformatics toolkit that four independent engineers estimated would take a skilled human 2 to 17 weeks to reimplement.

The benchmark marks the first rigorous, reproducible, multi-model demonstration that AI agents can sustain goal-directed software development across task horizons previously studied only by formal methods researchers pursuing the decades-old dream of automated program synthesis.

What Makes MirrorCode Different From Every Other Coding Benchmark

Most AI coding benchmarks, including the widely cited SWE-bench, measure how well a model can fix a single bug in an existing codebase, or implement a small feature given the full source code. These tasks typically resolve in minutes and cost a dollar or two in inference compute. They measure whether a model can perform a sensible, context-bounded action.

MirrorCode measures something structurally different: whether a model can reconstruct the full behavior of a program it cannot read. The AI receives only a compiled binary, natural language documentation, and a set of example input-output pairs. It can run the binary with arbitrary inputs to observe what it does (a setup the researchers call a “black-box oracle”), but it cannot see the source code, access the internet, or receive human guidance during the run. Every solution must produce byte-exact output on both the test cases the model could see and a separate set of held-out tests it could not see, ensuring there is no path to gaming the benchmark through memorization or lookup tables.
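
The black-box oracle loop can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual harness: `query_oracle` and `find_mismatches` are hypothetical names, and the sketch compares only standard output, since the description above does not say whether exit codes or standard error are also checked.

```python
import subprocess
import sys


def query_oracle(binary_cmd, args, stdin_bytes=b"", timeout=10):
    """Run a program once and return exactly the bytes it wrote to stdout."""
    result = subprocess.run(
        list(binary_cmd) + list(args),
        input=stdin_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    return result.stdout


def find_mismatches(reference_cmd, candidate_cmd, probes):
    """Return the probes on which the candidate's stdout is not byte-identical."""
    mismatches = []
    for args, stdin_bytes in probes:
        expected = query_oracle(reference_cmd, args, stdin_bytes)
        actual = query_oracle(candidate_cmd, args, stdin_bytes)
        if expected != actual:  # bytes comparison: one stray newline is a failure
            mismatches.append((args, stdin_bytes))
    return mismatches


if __name__ == "__main__":
    # Stand-ins for a real compiled binary and a reimplementation (hypothetical):
    # the "candidate" wrongly strips trailing whitespace from its output.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper().strip())"]
    probes = [([], b"abc"), ([], b"abc\n")]
    print(find_mismatches(reference, candidate, probes))
```

In this toy run only the second probe is reported, because the candidate drops the trailing newline. Under a byte-exact rule, that is enough to fail a test.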

The domains covered in the full release span the breadth of working software: Unix utilities, data serialization and query tools, bioinformatics toolkits, language interpreters, static analyzers, cryptography implementations, and compression utilities. Models could implement their solutions in any of six languages: Python, C, Rust, Go, OCaml, or Ada.

How the Scoring Works: Byte-Exact Output, Hidden Tests, No Cheating

The benchmark’s technical design addresses the most persistent problem in AI coding evaluation: distinguishing genuine competence from memorization.

Because MirrorCode tasks involve reimplementing real open-source programs, the models almost certainly encountered these codebases in pretraining. To prevent memorization from producing a false positive, the benchmark separates tests into visible and hidden sets. An average of 34% of tests per target are held out and never shown to the AI during its run. A solution passes only if it produces byte-exact matches on both sets simultaneously. The median test count per target is 601 individual input-output cases.
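
The pass rule described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, a candidate is modeled as a plain function from input bytes to output bytes, and the random split is an assumption, since the article gives the average held-out share (34%) but not how hidden tests are chosen.

```python
import random


def split_tests(cases, hidden_fraction=0.34, seed=0):
    """Partition (input, expected_output) cases into visible and hidden sets."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_hidden = round(len(shuffled) * hidden_fraction)
    return shuffled[n_hidden:], shuffled[:n_hidden]


def passes(candidate, cases):
    """True only if every expected output is reproduced byte for byte."""
    return all(candidate(inp) == expected for inp, expected in cases)


def is_solved(candidate, visible, hidden):
    """A target counts as solved only when both sets pass in full."""
    return passes(candidate, visible) and passes(candidate, hidden)


if __name__ == "__main__":
    # Toy target (hypothetical): a program that doubles a decimal number.
    cases = [(str(i).encode(), str(i * 2).encode()) for i in range(100)]
    visible, hidden = split_tests(cases)

    lookup = dict(visible)  # a lookup table built from the visible cases only

    def memorizer(inp):
        return lookup.get(inp, b"")

    def genuine(inp):
        return str(int(inp) * 2).encode()

    print(passes(memorizer, visible), is_solved(memorizer, visible, hidden))
    print(is_solved(genuine, visible, hidden))
```

The lookup-table candidate reproduces every visible case yet fails the hidden set, while the genuine reimplementation passes both. That is the failure mode the held-out tests are meant to expose.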

The paper includes a memorization screen: models were prompted to reproduce original source functions verbatim. The researchers found a baseline similarity score of 0.34, meaning models were not predominantly retrieving memorized code, though the authors acknowledge memorization cannot be fully ruled out and expect any quantitative inflation would not change the directional finding.

The infrastructure adds three further guardrails: models cannot wrap the reference binary to mimic its outputs (the model’s code is copied to a separate sandbox where the original binary is absent during scoring); models cannot interfere with the scoring mechanism (scoring runs in an isolated environment and requires string equality); and model code cannot access the internet during a run.

The Hardest Task: 19 Days, $2,600, One Shot

The benchmark’s most extreme data point illustrates how far from typical benchmarking this work sits. One of the 25 target programs required $2,600 in inference compute for a single attempt and kept the model working continuously for 19 uninterrupted days. Epoch AI notes this is the cost of genuine elicitation: most existing software engineering benchmarks cap inference spending at $1 to $10, even for tasks researchers estimate would take a human weeks. At that budget, the model never gets a fair chance at the hardest programs.

The standout success story is gotree, a bioinformatics toolkit with roughly 16,000 lines of Go code and more than 40 commands. Claude Opus 4.7 reimplemented it in 14 hours, passing 2,000 of 2,001 tests (99.95%) at a cost of $251. The one failing test covered an edge case for a niche date-annotation command. The researchers describe the reimplementation as effectively complete for all practical purposes.

For comparison, leading AI models from eight months ago would have scored roughly 30% on the same benchmark and were limited to simpler targets like a calendar utility. GPT-5.5 placed second overall, and Gemini 3.1 Pro Preview placed third with roughly 32%.

Where AI Still Fails: Architectural Limits on the Largest Programs

The 56% headline score obscures a meaningful pattern within the data. Benchmark programs fall into three informal size tiers, and the results differ sharply across them. Small programs are solved reliably by all tested models. Medium programs are solved by the leading models in at least some runs. Large programs, including Pkl, a configuration language interpreter with 61,461 lines of code, defeated every model tested.

The Pkl failure is technically instructive. During a run that consumed roughly 1 billion tokens of inference and cost roughly $550, Claude Opus 4.6 correctly identified that the program required a lazy evaluation architecture. The model never performed the required rewrite. With 770 million tokens still available, it continued iterating on the wrong architectural foundation. That specific failure (correct diagnosis, absent structural refactoring) represents a concrete, documented ceiling of current agentic systems rather than a general limitation of the underlying model’s reasoning.
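
For readers unfamiliar with the term, lazy evaluation means a value is computed only when something asks for it, and usually cached afterward. The Python sketch below illustrates the general technique with hypothetical `Thunk` and `LazyObject` classes. It is not Pkl's implementation, nor the design the model considered. It also does not detect cyclic definitions.

```python
class Thunk:
    """A deferred computation that runs at most once and caches its result."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
            self._compute = None  # release the closure once evaluated
        return self._value


class LazyObject:
    """A configuration-style object whose properties are evaluated on demand."""

    def __init__(self, definitions):
        # Each definition is a function of the object itself, so one property
        # can refer to another regardless of declaration order.
        self._thunks = {
            name: Thunk(lambda fn=fn: fn(self))
            for name, fn in definitions.items()
        }

    def get(self, name):
        return self._thunks[name].force()


if __name__ == "__main__":
    config = LazyObject({
        "url": lambda obj: f"http://localhost:{obj.get('port')}",
        "port": lambda obj: 8080,
        "unused": lambda obj: 1 / 0,  # never forced, so it never raises
    })
    print(config.get("url"))
```

An eager evaluator that computed every property up front, in declaration order, would fail here twice: `url` is declared before `port`, and `unused` divides by zero. Retrofitting on-demand evaluation onto an eager design typically touches every place a value is read, which is one plausible reason such a rewrite is costly to make mid-attempt.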

David Rein, a METR researcher and co-author of the benchmark, noted after the preliminary results in April that MirrorCode may already be approaching saturation. On 21 of 25 target programs, at least one model has passed 99% of tests or more. Eight targets have never been fully solved in any single run at 100%, but the difficulty is concentrated in a small number of hard edge cases rather than a fundamental absence of capability.

The Specification Gap: What This Means for Real-World Software Engineering

The researchers are precise about what MirrorCode does and doesn’t prove. The benchmark’s design requires something that is genuinely unusual in real software development: a precise, programmatically checkable specification backed by hundreds of test cases and an executable reference implementation. In a professional software project, that specification usually doesn’t exist at the start; it emerges through iteration with stakeholders, users, and product managers over time.

The benchmark demonstrates AI capability at execution, not at requirement discovery. A model that can reconstruct a 16,000-line bioinformatics toolkit from its observable behavior is demonstrating sustained architectural planning, iterative debugging, and tolerance for ambiguity across hours of uninterrupted work, which is qualitatively different from fixing a bug or generating a function. But it is not the same as being handed an ambiguous brief and producing production software from scratch.

The researchers frame this as a useful bound rather than a limitation: MirrorCode establishes what AI can do when the specification problem is solved. The remaining open question, how well AI performs when the specification itself must be discovered through stakeholder collaboration, is the next frontier, and one the benchmark is not designed to measure.

What the Full Release Includes

Epoch AI and METR have open-sourced the benchmark scaffold and 22 of the 25 target programs, covering 132 task instances across the six supported implementation languages. The remaining three programs are held back as a private test set to preserve evaluation integrity as new models arrive. A leaderboard is now live at epoch.ai/MirrorCode where researchers can submit new models for evaluation.

The MirrorCode paper is authored by Tom Adamczewski and David Owen of Epoch AI, and David Rein of METR, with additional task contributions from Florian Model, Giles Edkins, Allen Hart, and Daniel O’Connell.

The June 26 Epoch Brief also included two additional research items: an analysis of hyperscaler capital expenditure trajectories showing that major cloud providers, including Microsoft, Amazon, Alphabet, Meta, and Oracle, are on pace to spend beyond their operating cash flows before the end of 2026; and a taxonomy of more than 60 distinct tasks in frontier AI research and development, designed to track which parts of AI research remain unautomated.


Frequently Asked Questions

What is the MirrorCode benchmark and how does it work?

MirrorCode is a long-horizon coding benchmark developed by Epoch AI and METR that asks AI models to reconstruct real software programs without access to the original source code. The model receives only a compiled binary it can run, natural language documentation, and example input-output test cases. Solutions must produce byte-exact outputs on both visible and hidden test cases, making it impossible to game the benchmark through memorization or lookup tables. The 25 target programs span Unix utilities, bioinformatics, cryptography, interpreters, and other domains, with solutions implemented in any of six languages.

How does MirrorCode differ from SWE-bench?

SWE-bench gives a model the full source code of an existing project and asks it to fix a specific bug, with most tasks resolving in minutes. MirrorCode gives the model only an opaque binary, its documentation, and example input-output cases, and asks it to reconstruct the program’s entire behavior from scratch: no source code, no internet access, no human guidance. Where SWE-bench measures targeted repair capability, MirrorCode measures sustained, architect-level construction across time horizons of hours to weeks.

Can AI replace software engineers based on MirrorCode results?

Not on the basis of this benchmark alone. MirrorCode requires something unusual in real development: a precise, programmatically checkable specification with hundreds of test cases and an executable reference implementation. Professional software engineering typically begins without that level of specification clarity. What MirrorCode establishes is that when the specification problem is solved, AI can handle the execution at a professional engineer’s scale (weeks of coding work) autonomously. The remaining open question is how AI performs when specifications are ambiguous, evolving, and require stakeholder negotiation.

What are the engineering limits MirrorCode revealed?

The benchmark exposed a specific architectural ceiling: AI systems can correctly diagnose that a program requires a particular architecture, such as lazy evaluation in an interpreter, but fail to perform the structural rewrite needed to implement it, even when given substantial additional inference budget. This is distinct from a general reasoning failure; it is a documented gap in how current agentic systems handle large-scale architectural refactoring mid-attempt.