When AI Grades Its Own Coding Homework

Agoda’s new APAC technology headquarters occupies floors 27 to 33 of Tower 5 at One Bangkok, a LEED Platinum-certified real estate development project under the Frasers Property group and Frasers Property (Thailand) Public Company Limited. Twenty-six thousand square meters, almost 4,000 people, a network operations center for real-time platform monitoring, war rooms built for coordinated incident response, and content studios, all in a holistically integrated district located on the corner of Wireless and Rama 4 Roads in Bangkok.

It’s the kind of infrastructure that belongs in a telecom or an air traffic authority, not a company that books hotel rooms. This is where Idan Zalzberg, the platform’s chief technology officer, has spent the last four years solving a problem that much of the industry hasn’t yet admitted it has.

“We’re basically at 100% utilization of AI-assisted coding. Nobody is coding anymore. Nobody is writing Python or Java; that practice is gone.”

This stopped being a fringe claim some time ago. Andrej Karpathy, the OpenAI cofounder who coined the term vibe coding, has described his own manual skills quietly atrophying, saying he now mostly programs in English. Leaders at OpenAI and Anthropic say the same about their own teams. The developers remain; the typing stopped.

The pressure migrates

The issue with eliminating a bottleneck at one stage is that it reappears elsewhere.

Telemetry compiled by Faros AI from more than 10,000 developers across 1,255 teams found that heavy AI adopters merged 98% more pull requests, while the time spent reviewing them ballooned by 91%.

Squeeze one stage of any pipeline and the pressure doesn’t vanish. It migrates, usually to somewhere nobody was watching. Agoda hit this directly. Code was arriving faster than any human team could read it.

The fix was to build an AI code reviewer. It now handles over half of all code changes, with a 92% developer satisfaction rate.

Sit with that.

Who checks the checker

An AI writes the code. A different AI checks the code. And generative systems, as Zalzberg is the first to acknowledge, aren’t deterministic: they produce different outputs from identical inputs. Feed the same prompt twice, and the math changes depending on how requests are batched on the server. Same input; different output. Even when randomness is theoretically disabled.
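One commonly cited reason batching can shift results is that floating-point addition is not associative: summing the same numbers in a different grouping can round differently. The sketch below illustrates only that general arithmetic effect in plain Python; it is an assumption about the underlying cause, not a description of Agoda’s or any vendor’s serving stack.

```python
# Floating-point addition is not associative, so the grouping of a sum
# (which can depend on how work is batched) may change the rounded result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one reduction order
right_to_left = a + (b + c)   # same values, different grouping

# The two totals differ in the last bits even though the inputs are identical.
print(left_to_right == right_to_left)      # prints False
print(abs(left_to_right - right_to_left))  # a tiny, nonzero difference
```

In a large model, differences this small can occasionally tip the choice between two near-tied tokens, after which the outputs diverge.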

So, we asked Zalzberg the question this structure logically produces: when the thing checking the work is the same kind of thing that did the work, and both are probabilistic, what’s your ground truth?

“So, unfortunately, I don’t think there’s a single absolute ground truth in code review,” Zalzberg explains. He points out that humans were never consistent either: the same engineer reviewing the same code on a worse day catches something different, or misses it entirely. The ground truth was always partly a convention.

Having dispensed with certainty, Agoda built something it could actually measure: layered, verifiable trust grounded in three distinct mechanisms. Every change to the AI reviewer is rerun against hundreds of past code reviews to check whether it still flags what it should and isn’t silently creating blind spots. Every significant model update ships as a blind experiment: half the developers receive the new version, and none of them know which. And to fight the variance baked into probabilistic output, each code change gets reviewed not once but four times, the verdicts folded into a single result. “We found that this reduces variance considerably,” Zalzberg says, “with a cost and latency trade-off that we’re willing to make.”
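Why repeating a review helps can be shown with a small simulation. The article does not say how Agoda folds the four verdicts together, so this sketch assumes the simplest case: each pass returns a numeric severity score with independent noise, and the passes are averaged. The function names and the noise model are hypothetical.

```python
import random
from statistics import pvariance


def noisy_review(true_score: float, rng: random.Random) -> float:
    """Hypothetical stand-in for one AI review pass: a score plus random noise."""
    return true_score + rng.gauss(0.0, 1.0)


def folded_review(true_score: float, rng: random.Random, passes: int = 4) -> float:
    """Run several independent passes and average them into one result."""
    return sum(noisy_review(true_score, rng) for _ in range(passes)) / passes


rng = random.Random(0)  # fixed seed so the simulation is repeatable
single = [noisy_review(5.0, rng) for _ in range(10_000)]
folded = [folded_review(5.0, rng) for _ in range(10_000)]

# With independent noise, averaging n passes cuts the variance by about n.
print(f"variance, one pass:    {pvariance(single):.3f}")
print(f"variance, four passes: {pvariance(folded):.3f}")
```

Under these assumptions the four-pass variance comes out at roughly a quarter of the single-pass figure, at four times the compute, which is the cost and latency trade-off Zalzberg describes. If the passes share a systematic blind spot, averaging will not remove it.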

Sample until the noise cancels out. It’s an engineering answer to what is, at its root, a philosophical problem. And for Agoda, it works, right up to the point where you ask what “works” means.

The confidence trap

There is evidence that confidence in AI tooling shouldn’t arrive too easily. A randomized controlled trial by METR (Model Evaluation and Threat Research), with 16 experienced developers, 246 real tasks and a run from February to June 2025, found that AI coding tools made these developers 19% slower, while the developers themselves reported feeling 20% faster. The gap between perception and measured performance was not minor, and it persisted across every statistical cut the researchers tried. The developers didn’t feel slower; they felt like they were flying.

Zalzberg’s counter isn’t an algorithm. He has spoken publicly about wanting engineers who challenge AI output rather than simply accept it, treating a generated result as a draft rather than a conclusion. The moment a developer thinks they’re done is, he’s argued, the most dangerous point in any AI-assisted workflow. Protecting that instinct means rewarding the pause in a business where every incentive pushes toward shipping. “Personally, I spend much more time talking about cases where someone took an extra minute to think again or look again,” he says, “and much less time talking about people who simply got a lot of code merged.”

Performance reviews that celebrate hesitation, at a company under real competitive pressure, are not a small ask.

What Agoda actually is

Which returns you to the building, and to a question Zalzberg doesn’t fully close. Agoda won the BEHAVIOR Challenge at NeurIPS 2025, a Stanford-run robotics benchmark with no obvious connection to booking hotels. The platform processes close to a million tokens per second. Its security team has, Zalzberg mentions, successfully penetrated every commercial AI chatbot it has tested. At what point does a travel platform become something else?

Zalzberg has thought about it. “Technology is only interesting to us when it serves that purpose.” He also adds, “Our goal is not to remove travelers from decision-making. It’s to remove the parts of travel that feel like work. The agents we build are designed to recommend, explain, and execute with consent, not to make decisions silently on a traveler’s behalf.”

Yet the reasoning that built all of this started four years earlier from a different kind of clarity. The week GPT-3.5 became available, he began building. Over time, Agoda built the governance layers, visibility, controls, and most importantly, the skills and practices across the team. The logic was practical: if you give a customer an AI tool that doesn’t work, they don’t blame the AI tool but the company.

Essentially, prove it inside first: what Zalzberg calls his “inside out” AI strategy. Earn the right to ship it outward. Measure everything twice, or, if necessary, four times.

That philosophy was designed to protect a travel company’s reputation. On the 27th floor of One Bangkok and above, it has become the operating logic of something larger, and the category it belongs to is still being worked out.

Image credit: iStockphoto/Anton Vierietin