Autonomous AI Coding Clears 60,000-Line Ceiling: MirrorCode Benchmark Launched


Epoch AI and AI safety organization METR published the full results of MirrorCode on June 26, 2026. The benchmark answers a question the field has been unable to measure cleanly: how much autonomous software engineering can an AI model do, unsupervised, when given enough time and compute? The answer, as of this week, is more than the prior generation of benchmarks suggested. Claude Opus 4.7 reimplemented gotree, a bioinformatics toolkit of roughly 16,000 lines of Go code with more than 40 commands, in 14 hours at a cost of $251. Epoch AI estimates that a human engineer working without AI assistance would need two to seventeen weeks for the same task. In a separately reported result that did not appear in the benchmark’s preliminary April release, Opus 4.7 also reimplemented pkl, a configuration programming language with roughly 60,000 lines of code, the largest autonomous coding achievement documented in any public evaluation to date.

The full release, which includes the benchmark paper, results from several frontier models, and an open-source scaffold covering 22 of the benchmark’s 25 target programs, marks a significant expansion of what the April 10 preliminary publication had shown. Where the April results focused on Claude models tested against smaller programs, the June release introduces head-to-head results across Claude Opus 4.7, OpenAI’s GPT-5.5, and Google’s Gemini 3.1 Pro Preview, and raises the ceiling on what “solved” means.

How MirrorCode Works: No Source Code, No Internet, No Human Help

MirrorCode presents AI systems with 25 compiled programs drawn from real software projects spanning Unix utilities, data serialization, bioinformatics, interpreters, static analyzers, cryptographic libraries, and compression algorithms. The model receives only the program’s compiled binary and its documentation: no source code, no internet access, and no human guidance during the run. It must then autonomously write new source code that reproduces the original program’s behavior exactly.

The evaluation mechanism that makes this verifiable is a sandboxed black-box oracle: the AI can send arbitrary inputs to the reference binary and observe exactly what the original program outputs, including stdout, stderr, and exit codes, matched byte-for-byte. That is the specification. The AI must then build a reimplementation that passes not just the visible test cases provided at the start, but a set of held-out end-to-end tests it never sees during development. Success requires passing these hidden tests at a very high threshold: the benchmark treats 99–100% as the standard for a successful reimplementation.
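The byte-for-byte comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the names `observe` and `matches_reference` are hypothetical, there is no sandboxing, and small inline Python one-liners stand in for the reference binary and the reimplementation.

```python
import subprocess
import sys
from typing import NamedTuple, Sequence


class Observation(NamedTuple):
    """Everything the article says is compared: stdout, stderr, exit code."""
    stdout: bytes
    stderr: bytes
    exit_code: int


def observe(command: Sequence[str], stdin_data: bytes = b"",
            timeout: float = 30.0) -> Observation:
    # Capture raw bytes (no text decoding) so the comparison is byte-exact.
    done = subprocess.run(list(command), input=stdin_data,
                          capture_output=True, timeout=timeout)
    return Observation(done.stdout, done.stderr, done.returncode)


def matches_reference(reference_cmd: Sequence[str],
                      candidate_cmd: Sequence[str],
                      stdin_data: bytes = b"") -> bool:
    # A candidate matches only if all three channels are identical.
    return observe(reference_cmd, stdin_data) == observe(candidate_cmd, stdin_data)


# Stand-ins (assumptions): a "reference" that upper-cases stdin, a faithful
# candidate written differently, and a candidate with a wrong exit code.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
GOOD = [sys.executable, "-c",
        "import sys; d = sys.stdin.read(); "
        "sys.stdout.write(''.join(c.upper() for c in d))"]
BAD = [sys.executable, "-c",
       "import sys; sys.stdout.write(sys.stdin.read().upper()); sys.exit(1)"]

good_ok = matches_reference(REFERENCE, GOOD, b"hello\n")
bad_ok = matches_reference(REFERENCE, BAD, b"hello\n")
```

The `BAD` candidate prints the same text as the reference but exits with status 1, so it fails the comparison. That shows why exit codes and stderr belong in the specification alongside stdout.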

What separates MirrorCode from every other software engineering benchmark in current use is the inference budget. Standard evaluations, including SWE-bench, which dominates the current leaderboard landscape, cap per-task spending at roughly one to ten dollars, which corresponds to runs lasting minutes or at most a few hours. MirrorCode imposes no such constraint. The most expensive single task in the full release ran for 19 days of continuous computation at a cost of $2,600. That scale is not unusual in real software engineering; it is how the benchmark ensures the comparison to human time is fair.

Claude Leads at 56%, GPT-5.5 Trails at 44%

Across all 25 target programs, Claude Opus 4.7 achieved a 56% solve rate, defined as successfully reimplementing a program in at least one benchmark run at the 100% threshold. OpenAI’s GPT-5.5 followed at 44%, and Google’s Gemini 3.1 Pro Preview came in at 32%.

These solve rates reflect the stricter 100% test-pass threshold. At a looser 99% threshold, which Epoch AI treats as a near-perfect reimplementation, 21 of the 25 target programs were solved at least once across all tested models. The distinction matters: on gotree, for example, Claude Opus 4.7’s best run passed 2,000 of 2,001 tests, failing only a single edge case involving a niche command for manipulating date annotations. The benchmark counts that as not fully solved, but the researchers describe it as a near-perfect reimplementation covering essentially all scoped functionality.
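The difference between the two thresholds is easy to see in code. The sketch below is an illustration only: the function names are hypothetical, and every entry in `results` except the gotree figure of 2,000 out of 2,001 is made-up toy data. A program counts as solved if at least one of its runs meets the threshold, as the definition above states.

```python
from typing import Dict, List, Tuple

Run = Tuple[int, int]  # (tests passed, tests total) for one benchmark run


def run_meets(run: Run, threshold_percent: int) -> bool:
    passed, total = run
    # Integer arithmetic avoids floating-point rounding near the boundary.
    return passed * 100 >= threshold_percent * total


def solved(runs: List[Run], threshold_percent: int) -> bool:
    # Solved means at least one run reaches the threshold.
    return any(run_meets(run, threshold_percent) for run in runs)


def solve_rate(results: Dict[str, List[Run]], threshold_percent: int) -> float:
    count = sum(1 for runs in results.values() if solved(runs, threshold_percent))
    return count / len(results)


# Only the gotree figure comes from the article; the rest is toy data.
results = {
    "gotree": [(2000, 2001)],
    "toy_a": [(50, 50)],
    "toy_b": [(40, 50)],
    "toy_c": [(10, 50), (50, 50)],
}

strict = solve_rate(results, 100)  # gotree does not count here
loose = solve_rate(results, 99)    # gotree counts: 2000/2001 is about 99.95%
```

With this toy data the strict rate is 0.5 and the loose rate is 0.75. The single failing gotree test is what moves that program from one column to the other.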

All three models reliably solve the small programs: Unix utilities such as uuid and parseqsv that are well-defined and short. The medium tier is where the models diverge. Eight target programs have never been solved at the 100% threshold by any model in any run; they are the benchmark’s hardest remaining set.

The cost picture is less orderly than the performance picture. GPT-5.5 costs roughly three times as much as its predecessor GPT-5 to solve the same tasks, while Claude Opus 4.7 runs at roughly one-third the cost of Claude Opus 4.1 for equivalent work. Epoch AI does not identify a single cause for the divergence across providers; the relationship between model capability and inference cost remains unsettled across the industry.

What a 60,000-Line Autonomously Built Codebase Actually Looks Like

The gotree reimplementation drew most of the attention in the April preliminary results, but the full June release adds a more striking result: Claude Opus 4.7’s reimplementation of pkl, a configuration programming language totaling roughly 60,000 lines of code. Pkl is not a utility program; it is a full interpreter for a lazily evaluated language designed to replace JSON and YAML in complex configuration workflows. Implementing it correctly requires handling lazy evaluation (where properties are computed when accessed, not when defined), working from a behavioral specification rather than readable source code, and passing a test suite drawn partly from real-world usage examples measuring tens of thousands of characters per case.
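To make the lazy-versus-eager distinction concrete, here is a toy Python sketch. It is not pkl and makes no claim about pkl's real semantics beyond the one stated above, that properties are computed when accessed rather than when defined. The classes `LazyObject` and `EagerObject` and the example properties are hypothetical, and the sketch omits cycle detection.

```python
from typing import Any, Callable, Dict


class LazyObject:
    """Properties are stored as functions and computed on first access."""

    def __init__(self, definitions: Dict[str, Callable[["LazyObject"], Any]]):
        self._definitions = definitions
        self._cache: Dict[str, Any] = {}

    def get(self, name: str) -> Any:
        if name not in self._cache:
            # Evaluate now, then memoize so each property runs at most once.
            self._cache[name] = self._definitions[name](self)
        return self._cache[name]


class EagerObject:
    """Properties are computed immediately, in definition order."""

    def __init__(self, definitions: Dict[str, Callable[["EagerObject"], Any]]):
        self._values: Dict[str, Any] = {}
        for name, compute in definitions.items():
            self._values[name] = compute(self)

    def get(self, name: str) -> Any:
        return self._values[name]


definitions = {
    # "url" refers to properties that are defined later in the object.
    "url": lambda obj: f"https://{obj.get('host')}:{obj.get('port')}",
    "host": lambda obj: "example.org",
    "port": lambda obj: 8443,
    # Never accessed below, so the lazy version never evaluates it.
    "unused": lambda obj: 1 // 0,
}

lazy_url = LazyObject(definitions).get("url")

try:
    EagerObject(definitions)
    eager_failed = False
except KeyError:
    # "host" does not exist yet when "url" is computed eagerly.
    eager_failed = True
```

The lazy object resolves `url` even though `host` and `port` are defined after it, and it never touches the failing `unused` property. The eager object fails at construction. An interpreter built on the eager model has to work around every case like this, which is why the choice is architectural and hard to patch later.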

The pkl task is one of the benchmark’s hardest, and earlier model generations, including Claude Opus 4.6, which was the top performer in the April preliminary results, could not solve it. In Epoch AI’s documentation of the April run, Opus 4.6 committed to eager evaluation in its first draft of the pkl interpreter, then spent the remainder of the run attempting to work around that architectural mistake rather than abandoning it. Opus 4.7’s ability to clear this task represents a qualitatively different behavior from its predecessor: it succeeded where the prior generation stalled on a fundamental design decision.

AI Nearly Doubled Its Solve Rate in One Year

One of MirrorCode’s explicit design goals is to remain useful as a benchmark even as AI capabilities advance: to be hard enough that it does not saturate in months the way most current software engineering benchmarks do. The full release provides a year-on-year comparison that illustrates how quickly the underlying capabilities are moving.

Leading models from roughly one year before the June 2026 release would have scored around 30% on MirrorCode, according to Epoch AI. At that level, they were limited to the simpler programs in the suite: utilities like the calendar tool, whose deceptive complexity (byte-exact reproduction of historical calendar behavior, including September 1752, when Britain switched from the Julian to the Gregorian calendar and dropped eleven days) still yielded to the earlier generation of models.
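The September 1752 case shows why a seemingly simple utility hides edge cases. A minimal Python sketch, assuming the British changeover in which 2 September was followed directly by 14 September; the function names are hypothetical and the code does not try to reproduce any real calendar tool's output format.

```python
import datetime

# Days 3 through 13 did not exist in the British calendar for September 1752.
SKIPPED_DAYS = range(3, 14)


def september_1752_days() -> list:
    """Day numbers that actually occurred in that month."""
    return [day for day in range(1, 31) if day not in SKIPPED_DAYS]


def weekday_in_september_1752(day: int) -> int:
    """Weekday with Monday=0, following Python's date.weekday()."""
    if day in SKIPPED_DAYS or not 1 <= day <= 30:
        raise ValueError(f"{day} September 1752 did not exist")
    # Python's datetime uses the proleptic Gregorian calendar, which is
    # valid for British dates from 14 September 1752 onward.
    if day >= 14:
        return datetime.date(1752, 9, day).weekday()
    # Days 1 and 2 are Julian dates. Weekdays ran on without a gap, so
    # 2 September is the day immediately before 14 September.
    first_gregorian = datetime.date(1752, 9, 14).weekday()
    return (first_gregorian - (3 - day)) % 7


days = september_1752_days()
```

The month has 19 days, and Wednesday 2 September is followed by Thursday 14 September. A reimplementation that relies on a standard date library for the whole month gets the first two days wrong, which is the kind of mismatch a byte-exact test exposes.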

The jump from roughly 30% to 56% in a single year, while the target set expanded and task difficulty grew, is consistent with the pace of capability gains that researchers at Epoch AI and METR have documented across other long-horizon coding evaluations. METR’s May 2026 Frontier Risk Report noted that agents performing well on MirrorCode-style tasks benefit especially from the fast feedback loop: a model can run the reference binary, observe whether its output matches, and adjust its implementation continuously, a structure that amplifies the returns from longer inference runs.
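The feedback loop described above (run the reference, compare, adjust) can be sketched as a budgeted loop. This is a toy model under loud assumptions: plain Python functions stand in for the reference binary and for successive revisions of the reimplementation, and the names `first_mismatch` and `iterate_with_feedback` are hypothetical. A real agent would write each revision in response to the mismatch; here the revisions are fixed in advance.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Program = Callable[[str], str]


def first_mismatch(reference: Program, candidate: Program,
                   probes: Sequence[str]) -> Optional[str]:
    """Return the first probe input where the outputs differ, else None."""
    for probe in probes:
        if candidate(probe) != reference(probe):
            return probe
    return None


def iterate_with_feedback(reference: Program, revisions: Sequence[Program],
                          probes: Sequence[str], max_rounds: int
                          ) -> Tuple[Optional[int], List[Optional[str]]]:
    """Try revisions in order until one matches or the budget runs out."""
    log: List[Optional[str]] = []
    for round_index, candidate in enumerate(revisions[:max_rounds]):
        failing = first_mismatch(reference, candidate, probes)
        log.append(failing)  # the signal a model would act on next
        if failing is None:
            return round_index, log
    return None, log


def reference(text: str) -> str:
    # Stand-in for the reference binary: trim whitespace, then upper-case.
    return text.strip().upper()


revisions = [
    lambda text: text.upper(),          # forgets to trim
    lambda text: text.strip(),          # forgets to upper-case
    lambda text: text.strip().upper(),  # matches the reference
]
probes = ["abc", "  padded  ", ""]

winner, history = iterate_with_feedback(reference, revisions, probes, max_rounds=5)
out_of_budget, _ = iterate_with_feedback(reference, revisions, probes, max_rounds=2)
```

With a budget of five rounds the third revision succeeds; with a budget of two the loop ends unsolved. That is the sense in which longer runs pay off when every round yields a concrete failing input.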

The Ceiling: What Every Model Still Can’t Do

Eight MirrorCode programs remain unsolved at the 100% test-pass threshold after all runs across all tested models. The full release’s heatmap shows the pattern clearly: programs that fall above a certain size and complexity threshold have never been solved in any run. The largest tasks, which Epoch AI estimates would take human engineers not weeks but months of sustained work, remain beyond the reach of every model tested.

Even in failure, the models typically pass more than 90% of a program’s test suite before stalling on edge cases. What the benchmark registers as “failure” is often close: a model that passes 95% of a program’s tests is not incompetent at the task; it is missing the specific handling that separates a near-complete reimplementation from a fully verified one. The 100% threshold is deliberately strict because anything less allows a model to pass tests without having genuinely reconstructed the program’s full behavior.

Specification Completeness: The Constraint That Defines What This Actually Proves

MirrorCode’s design rests on a condition that the researchers flag explicitly and that every reader evaluating its labor-market implications must weigh: it works because the specification is complete and checkable. The AI receives a reference binary it can query with any input, comprehensive documentation, and a test suite that covers the program’s behavioral surface. Given all of that, the model can work indefinitely, checking its own progress against a verifiable standard until it succeeds or the budget runs out.

Most real software development does not offer this structure. A feature request in a production codebase comes with ambiguous requirements, evolving stakeholder expectations, and implicit constraints that are nowhere written down. A migration task involves legacy behavior that is partially documented and partially tribal knowledge. MirrorCode presupposes that the hardest part (deciding what the software should do, resolving the ambiguities, and establishing the test suite that proves it) is already done. The engineering labor it automates is execution, not judgment.

This is not a trivial portion of engineering work. Execution, including implementation, debugging, iteration, and the ability to sustain architectural decisions over thousands of lines and dozens of hours, is genuinely difficult and time-consuming. MirrorCode provides the clearest evidence to date that AI can handle much of it. But the benchmark’s own paper is careful to note that the generalization to real software development is uncertain. The skills that MirrorCode does not measure (specification writing, stakeholder communication, architectural judgment under ambiguity, and the ability to detect when a stated requirement is actually wrong) are precisely the skills that are hardest to automate and most valuable when automation handles the rest.

Why the Benchmark Matters Beyond Leaderboard Rankings

MirrorCode is co-developed with METR, an AI safety research organization whose primary research agenda concerns measuring the autonomous capabilities of frontier AI systems. Its aim is not to rank models commercially, but to understand when and whether AI can perform tasks that would previously have required sustained human expert judgment.

That framing gives MirrorCode a dual function. As a capability benchmark, it provides a controlled and independently verifiable measure of how far autonomous coding has advanced. As an evaluation tool for alignment research, it offers a setting in which researchers can study how AI systems plan, debug, and iterate over multi-day autonomous runs, including how they respond to dead ends, when they decide to restart versus patch, and whether their decision-making degrades over long inference spans.

The implication the benchmark supports is specific and falsifiable: for software projects with complete, testable specifications, AI can already perform autonomous engineering work that would take a human team weeks. For software projects without that specification structure, which describes most software, the same AI systems are measurably less capable, and human judgment remains the binding constraint.


Frequently Asked Questions

Will AI change software program engineers?

MirrorCode’s results do not support a blanket answer, and the benchmark’s own authors caution against one. What the data shows is that AI can now autonomously complete software implementation tasks that would take a human engineer two to seventeen weeks, but only when the task comes with a complete specification and a checkable test suite. Most software engineering work does not start from that position. Engineers who define specifications, resolve ambiguous requirements, and make architectural decisions under uncertainty are doing the work that MirrorCode does not measure, and that work is precisely what the benchmark’s success presupposes someone has already done.

How is MirrorCode different from SWE-bench, the standard AI coding benchmark?

SWE-bench gives AI models access to an existing codebase and asks them to fix a specific bug or implement a specific feature, with the fix validated against the repository’s existing tests. MirrorCode gives models a compiled binary with no source code and asks them to rebuild the entire program from scratch, validated against a held-out test suite the model never sees during development. SWE-bench tasks are typically solved in minutes, with spending under ten dollars per run. MirrorCode allows days of continuous computation and thousands of dollars in inference cost. The two benchmarks measure different capabilities: SWE-bench measures code comprehension and targeted editing; MirrorCode measures autonomous long-horizon implementation.

What is long-horizon AI coding, and why does it need its own benchmark?

Current benchmarks measure AI coding performance on tasks that take seconds to minutes of compute, typically bug fixes, function implementations, or targeted changes to an existing codebase. Long-horizon coding refers to tasks that would take a human expert days to weeks of sustained work: building a complete, novel program from a specification. The distinction matters because the skills required to succeed at long-horizon tasks (sustained coherent planning across thousands of lines, the ability to detect and fix deep architectural errors, and the capacity to iterate over days without human check-ins) are qualitatively different from single-turn code generation. MirrorCode is the first public benchmark specifically designed to measure these capabilities, with inference budgets scaled to match the time a human would spend.

Can the MirrorCode results be trusted, given that AI models may have seen the original source code during training?

This is the central limitation the researchers acknowledge and the one that most directly constrains how far results can be extrapolated. Because MirrorCode uses real open-source programs as targets, AI models are likely to have encountered the original code during pretraining. Epoch AI addressed this by running a memorization screen: testing whether a model’s performance on a specific target appears to rely on recall of the source code rather than genuine reimplementation. AI successfully reimplemented programs that passed the memorization screen; in programs that showed signs of memorization, performance varied more. The researchers conclude that memorization does not dominate the results but explicitly state they cannot rule out its contribution. The open-source release of the benchmark and its scaffold allows independent researchers to reproduce and scrutinize the findings.

The full MirrorCode paper, open-source scaffold, and benchmark results are available at epoch.ai/MirrorCode.