Andrej Karpathy Has Renamed Vibe Coding. Here's What Engineering Leaders Should Do About It.


On the one-year anniversary of coining “vibe coding,” Andrej Karpathy proposed replacing it with “agentic engineering.” The distinction he drew was precise: vibe coding is describing what you want and accepting what comes back. Agentic engineering is designing the system, specifying the constraints, and using AI to accelerate implementation you have already reasoned through. One is expression. The other is engineering.

Most software organizations are doing both at once and calling them the same thing. That's where the expensive mistakes are coming from.

One of my development leads put it plainly, not as a policy position but as an empirical observation. In his experience, vibe-coded PRs consistently arrive missing edge case handling, error paths, and exception logic. Not because the AI forgot them, but because the developer never specified them. They described an outcome, accepted what the agent produced because it looked right, and submitted it. The tests pass because they were written against the code that exists, not against the behavior the system actually requires.
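To make that concrete, here is a minimal, hypothetical sketch in Python. The function, its test, and the missing cases are all invented for illustration; the point is the shape of the failure, not the domain.

```python
# What the agent produced: correct for the outcome that was described,
# silent on everything the developer never specified.
def apply_discount(price: float, discount_pct: float) -> float:
    """Return the price after applying a percentage discount."""
    return price * (1 - discount_pct / 100)

# The test the developer accepted: written against the code that exists,
# so it can only confirm what the code already does.
def test_apply_discount():
    assert apply_discount(100.0, 10.0) == 90.0

# Never specified, so never generated and never tested:
#   discount_pct below 0 or above 100 (silently produces nonsense),
#   negative prices, currency rounding. The pipeline is green anyway.
```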

The agent didn't make anything up. The developer didn't know what to ask for.

His response is not to reject AI coding tools. It is to require that engineers demonstrate they understand what was generated (the edge cases, the scaling assumptions, the failure modes) before the PR gets merged. If you cannot explain why the solution is designed the way it is, you didn't design it. You accepted it.

He's right. And the data backs him up. PR review times on heavily AI-assisted teams are up 91%, not because AI is writing worse code, but because reviewers are now responsible for reconstructing the comprehension that the developer skipped. That is a harder review, not an easier one. And it's compounding.

What AI Did to the Roles, and What It Didn't

There's a common assumption among technology leaders that AI coding tools collapsed the distinction between who builds and who reviews: that the agent writes well enough that the old quality gates are a legacy of a slower era.

That assumption confuses speed with comprehension.

The developer, the tester, the architect: these roles were never primarily about producing artifacts. They were about understanding the system well enough to know when something was wrong before it became someone else's problem. The developer who spots a race condition saw it because they understood the execution model. The tester who asks “what happens when the user does the unexpected thing?” asked it because they reasoned through the system's behavior. The architect who recognizes that this solution works now but will break at scale recognized it because they held the whole system in their head.

Those aren't production tasks. They're comprehension tasks. You cannot delegate comprehension to an agent.

What changed is that you can now produce a hundred lines of code without having done the thinking that a hundred lines of code used to require. The output exists. The understanding behind it may not. An engineer reviewing a vibe-coded PR is not reviewing code; they are trying to reconstruct whether the developer who submitted it actually understood what they were building.

The roles aren't dissolving. They're being stress-tested. The developer who designed the solution, who can explain every edge case, every failure mode, every scaling assumption, is more valuable than before. The one who accepted what the agent produced because it looked right and the tests passed is now a liability at the speed the organization is moving.

Three Failure Modes Engineering Managers Need to Watch For

These aren't hypotheticals. They're patterns repeating across organizations deploying AI coding tools at scale.

The green pipeline problem. A green pipeline means the code does what it was asked to do. It doesn't mean the developer asked for the right thing, or asked completely enough. A senior engineer knows to look behind the green. A manager who has stepped too far from the work cannot tell from a dashboard whether green means safe or means fast and unexamined.

The missing path problem. The developer who doesn't understand the system's failure modes cannot specify them. The agent cannot surface what the developer didn't know to ask for. In a production system, the happy path is where things work. The unhappy paths are where you find out what the system is actually made of. AI agents, as Karpathy noted, were purpose-built for the first 80% of an application: the implementation that flows naturally from a well-described intent. The last 20% (the edge cases, the failure recovery, the scaling constraints) requires a developer who has actually thought through the system. That 20% is where vibe-coded work consistently runs out.
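A hedged sketch of that gap, with every name and error case invented for illustration: the first function is the 80% an agent produces from a well-described intent; the second is the 20% that exists only because someone specified the unhappy paths.

```python
import json
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# The first 80%: flows naturally from "fetch the user's profile."
def get_profile(url: str) -> dict:
    return json.load(urlopen(url))

# The last 20%: exists only because someone specified the unhappy paths.
def get_profile_specified(url: str, timeout: float = 5.0) -> dict | None:
    try:
        with urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (HTTPError, URLError, TimeoutError):
        # 4xx/5xx response, DNS failure, refused connection, or timeout
        return None
    except json.JSONDecodeError:
        # upstream answered, but with malformed JSON
        return None
```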

The confidence calibration problem. AI-generated code reads as authoritative. The structure is clean, the naming is coherent, the comments are present. It doesn't look like code written by someone who was uncertain, even when the underlying logic contains a bet that something will never happen. Human code carries the fingerprints of doubt: the comment that says “TODO: handle this case,” the defensive check that signals the developer wasn't sure. AI code often lacks those signals. Reviewers have to supply the doubt themselves. That requires judgment the reviewer can only exercise if they understand the system well enough to know what to doubt.
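An invented contrast (neither snippet comes from a real codebase) shows what those fingerprints look like side by side:

```python
# Reads as authoritative: clean structure, confident naming, and a
# silent bet that `orders` is never empty.
def average_order_value(orders: list[dict]) -> float:
    total = sum(o["amount"] for o in orders)
    return total / len(orders)

# Carries the fingerprints of doubt a human leaves under uncertainty:
def average_order_value_human(orders: list[dict]) -> float:
    # TODO: handle this case properly; new accounts have zero orders
    if not orders:
        return 0.0
    # Defensive check: upstream has shipped orders missing "amount" before
    total = sum(o.get("amount", 0.0) for o in orders)
    return total / len(orders)
```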

What Engineering Leaders Need to Do Differently

There's a version of technical leadership that sounds sophisticated and is quietly dangerous in this environment: the manager who has stepped back from the code to focus on delivery metrics, who measures the AI program by velocity numbers and adoption rates, and who interprets a senior engineer's insistence on deep code review as resistance to change.

That manager is optimizing for the output of the process rather than the quality of the judgment being applied to it. In a fast-moving AI environment, that is a compounding error.

Technical proximity is not micromanagement. It's not writing code or reviewing every PR. It's being close enough to the actual behavior of the systems you're accountable for that you can tell the difference between a team moving fast because they're disciplined and a team moving fast because they skipped the hard part.

The manager who cannot read a PR doesn't need to review every one. But they do need to understand what their senior engineers look for when they do. That distinction, between “this passed the tests” and “this is right,” is not available from a summary. It's available from contact.

My team runs three rituals that have nothing to do with status updates and everything to do with maintaining that contact.

Two hours every week in an architecture working session. Two hours every other week in sprint planning. Two hours each sprint demoing to the whole team.

The architecture sessions are where the system's reasoning lives: not the tickets, not the documentation, but the living conversation about why things are designed the way they are and which alternatives weren't taken. A manager who sits in those sessions for six months builds a working model of the system that no dashboard can replicate.

Sprint planning is where the disconnects surface. We use planning poker: everyone estimates independently before the reveal. When estimates diverge sharply, the conversation that follows is almost always the most valuable one of the sprint. Not because we're negotiating a number. Because divergent estimates mean divergent mental models. Someone thinks this task is a 2. Someone else thinks it's a 13. That gap is not a disagreement about effort. It's evidence that two people are not looking at the same problem.

Divergent estimates don't measure complexity. They measure where your team's understanding of the system breaks down.

The demos keep everyone honest about what was actually built versus what was intended, cross-train the team on what each person is working on, and give the manager the most important signal of all: whether the people building the system can explain what they built and why the tradeoffs they made were right.

An AI agent can produce a demo. It cannot explain its reasoning under questioning. The engineers who can are the ones you cannot afford to route around.

Karpathy's reframe from vibe coding to agentic engineering is not a terminology update. It's a professional obligation.

The organizations that ignore AI will fall behind. The ones that vibe it will ship failure at scale. The ones that engineer it, deliberately, with comprehension at every layer, are the ones building something worth running in production.

That's not a productivity conversation. That's a responsible AI conversation. The code looks finished. The pipeline is green. The PR is open.

Whether it's actually ready is still a human call. Make sure your team, and you, are close enough to the work to make it.