Your engineering org needs an AI slop registry


AI coding tools don’t just help engineers write code faster. They help engineers make the same mistake faster, at scale, across every PR that touches a given pattern. I’m not talking about AI code that’s clearly wrong; I’m talking about code that compiles, passes basic checks, and looks plausible but is subtly wrong, bloated, or misaligned with what was actually needed.

In practice, that usually looks like AI overengineering an abstraction layer for a problem that needs 10 lines, code that ignores your repo’s patterns, naming, or architecture, calls to APIs that don’t exist, or patterns copied without understanding why, like retry logic where it’s not needed.

Mistakes like these are systematic, which is what makes them preventable.

You have CLAUDE.md and Skills, but…

Most teams respond to this by trying to give the AI better instructions. They document their standards in a CLAUDE.md file, configure Skills, and describe the conventions they want the model to follow. This is the right impulse, but it doesn’t always work.

They’re asking the same non-deterministic agent that generated the code to also catch its own errors. It may follow the rules you’ve written. It may not. There’s no proof either way, no audit trail, and you can’t know in advance which run you’ll get. A CLAUDE.md file is an input to generation. It isn’t a verification system.

Catching slop reliably requires something structurally separate: a system that independently checks the output, uses a different agent, and produces the same result every time it sees the same code.

The two layers of verification

The shift we’ve been working toward at Aviator is replacing code review with verified intent. Instead of a reviewer reading a diff and asking, “Does this look right?” the team agrees on what the code is supposed to do before it’s written, and a separate verification system checks the output against that agreement.

Think of it like a building inspection. A building isn’t approved by an architect watching every nail get hammered. It’s approved by an inspector comparing the finished structure against the blueprints. Intent-driven verification follows the same pattern: the spec is the blueprint, the agent’s implementation is the construction, the verifier pipeline produces verdicts and evidence for each criterion, and the reviewer approves based on intent match and evidence quality.

The model has two layers, and understanding why there are two rather than one is the key to making it work.

User criteria are the acceptance criteria for a specific change, generated by the agent from the expressed intent or written by hand. They’re scoped to this PR only: the endpoint path, the response shape, the behavior under failure, and what’s explicitly out of scope. This is where the intent for a particular task lives.

Invariant criteria come from the team’s Invariants catalog and are rules that automatically apply to every matching change. Where user-supplied acceptance criteria describe what this change should do, invariants describe what every change should respect. They live in your account and update once for everyone.

Your Invariants should be specific about the rule but vague about the implementation:

  • “All HTTP handlers must call an authentication middleware before any business logic.”
  • “All migrations must declare a down block.”

These are defined once and checked on every run. Developers don’t need to include them in the specs because the system automatically loads the matching set.

The test for promoting a check to an invariant is recurrence: anything that you post in a review comment multiple times should become an invariant. Aviator actually does this automatically. It auto-creates invariants based on past comments.

When verification runs, both layers are assembled into a single list of acceptance criteria and flow through the same pipeline. A spec adding a subscription status endpoint might come with these user criteria:

# Add subscription status endpoint

## Acceptance Criteria

- [ ] Endpoint: GET /api/v1/subscription/status
- [ ] Response includes: status, renewal_date

The invariant catalog then adds its own criteria automatically, say, a rule that all HTTP handlers must use AuthMiddleware. Verification checks all of them:

  • ✓ Endpoint exists at the correct path (user criterion)
  • ✓ Response includes status, renewal_date (user criterion)
  • ✓ Handler uses AuthMiddleware (invariant)

All must pass. The spec author didn’t need to remember the authentication requirement. It was enforced by the catalog without anyone asking for it.
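The assembly step can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not Aviator’s implementation: the `Criterion` type, the `assemble` and `all_pass` helpers, and the verdict strings are hypothetical names, and the catalog is reduced to a list of invariant texts that have already been matched to the change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str
    source: str  # "user" or "invariant" (hypothetical labels)


def assemble(user_criteria: list[str], matched_invariants: list[str]) -> list[Criterion]:
    """Merge both layers into one flat list that a single pipeline can check."""
    merged = [Criterion(text, "user") for text in user_criteria]
    merged += [Criterion(text, "invariant") for text in matched_invariants]
    return merged


def all_pass(verdicts: dict[Criterion, str]) -> bool:
    """Accept the change only if every criterion, from either layer, passed."""
    return all(verdict == "pass" for verdict in verdicts.values())


criteria = assemble(
    ["Endpoint: GET /api/v1/subscription/status",
     "Response includes: status, renewal_date"],
    ["All HTTP handlers must use AuthMiddleware"],
)
```

Because the invariant lands in the same list as the user criteria, a failing `AuthMiddleware` check blocks the change exactly as a missing endpoint would.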

Invariants as the anti-slop registry

Invariants are what we call the ‘anti-AI slop registry,’ and they are what makes this work at scale. They address the most common class of AI slop: convention blindness, deprecated APIs, module boundaries the model doesn’t know about, and security baselines that should apply everywhere. None of these are in the model’s training data for your specific codebase. They live in the heads of your senior engineers and show up as recurring review comments.

Most invariants worth writing start as a review comment that’s been left more than twice. Here is an example of turning a real review comment into an invariant:

Comment on PR #4173:

“Please don’t write to users directly; go through UserRepository.UpdateProfile. We had a partial-write bug last quarter from a similar pattern.”

Invariant body:

Writes to the users table must go through UserRepository. Direct INSERT,
UPDATE, or DELETE statements against the users table are not allowed
outside the repository package. Schema migrations under src/db/migrations are exempt.

Conditions: file_path_glob: src/**/*.go (skip non-Go files).

Category: functional_correctness.
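As a rough sketch, an invariant like this one can be modelled as a small record whose condition decides whether it is loaded for a given change. The `Invariant` class and its `applies_to` method are hypothetical names for illustration; only the field values come from the example above, and the real catalog’s glob semantics may differ from Python’s `fnmatch`.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Invariant:
    body: str
    file_path_glob: str
    category: str

    def applies_to(self, changed_files: list[str]) -> bool:
        # fnmatch lets "*" cross "/" boundaries, so this only approximates
        # whatever glob semantics the real catalog uses (an assumption).
        return any(fnmatch(path, self.file_path_glob) for path in changed_files)


users_table_rule = Invariant(
    body="Writes to the users table must go through UserRepository.",
    file_path_glob="src/**/*.go",
    category="functional_correctness",
)
```

The condition only decides whether the rule is checked at all. The exemption for schema migrations stays in the body text, where the verifier reads it.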

You can mine historical review comments, cluster them, and generate invariant candidates for human approval. Every invariant you codify is a check that will never cost a reviewer time again.
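A toy version of that mining step might look like the sketch below. It assumes word-overlap (Jaccard) similarity and greedy grouping as a crude stand-in for whatever clustering a production system would use, most likely text embeddings. All function names and the 0.5 threshold are hypothetical.

```python
import re


def tokens(comment: str) -> frozenset[str]:
    """Lowercase word set; a crude stand-in for a real text embedding."""
    return frozenset(re.findall(r"[a-z_]+", comment.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Share of words the two comments have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(comments: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy grouping: join the first cluster whose seed comment is similar enough."""
    clusters: list[list[str]] = []
    for comment in comments:
        for group in clusters:
            if jaccard(tokens(comment), tokens(group[0])) >= threshold:
                group.append(comment)
                break
        else:
            clusters.append([comment])
    return clusters


def invariant_candidates(comments: list[str], min_count: int = 3) -> list[list[str]]:
    """Keep clusters seen at least min_count times, i.e. left more than twice."""
    return [group for group in cluster(comments) if len(group) >= min_count]
```

Each surviving cluster is only a candidate. A human still reads the comments, writes the rule, and approves it.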

I may have said that code review is a historical approval gate that no longer fits the shape of engineering work, or that we can stop reading the code, but that won’t happen overnight. In practice, over time, we move the human judgment upstream, where it’s more valuable. Not everything needs to be reviewed to the same depth. Humans should review specs, plans, constraints, and acceptance criteria, not 500-line diffs.

The other thing that sets this apart from a rules file is what happens at the time of verification. The writing agent and the verifying agent are different. They don’t share context, they don’t share blind spots, and the verifier produces a structured report per criterion (file references, reasoning, pass/fail/partial), not a gut-check opinion from the same model that wrote the code.
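To make “structured report per criterion” concrete, here is one possible shape for such a report, sketched in Python. The field names, the `summarize` helper, and the sample reasoning and file path are assumptions made for illustration. Only the three verdict values and the idea of attaching file references and reasoning come from the description above.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    PARTIAL = "partial"


@dataclass
class CriterionReport:
    criterion: str
    verdict: Verdict
    reasoning: str
    file_refs: list[str] = field(default_factory=list)


def summarize(reports: list[CriterionReport]) -> dict[str, int]:
    """Count verdicts so a reviewer sees the overall picture before the details."""
    counts = Counter(report.verdict for report in reports)
    return {verdict.value: counts[verdict] for verdict in Verdict}


report = CriterionReport(
    criterion="Handler uses AuthMiddleware",
    verdict=Verdict.PASS,
    reasoning="Route registration wraps the handler in AuthMiddleware.",  # hypothetical
    file_refs=["src/api/subscription.go"],  # hypothetical path
)
```

A reviewer can start from the `summarize` counts and then read the reasoning and file references for anything that did not pass.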

What we built, and what it found

At Aviator, we recently ran an experiment to test the intent-driven verification approach: what if the review happens before the code is written?

Instead of AI writing code and engineers reviewing it, the team spent time writing and reviewing scope, acceptance criteria, and edge cases before any implementation started. Then we handed it to an AI agent and let it build.

The result was about 6,000 lines of code. A second agent then verified the output against the 65 user criteria in the spec. It took six minutes. 60 passed, 4 failed, and 1 was partial.

Human reviewers still found issues, but design-level decisions were verified before any code was generated, and org invariants were enforced automatically throughout.

Instead of leaving the same comment for the fifteenth time, you’re identifying the pattern, writing it once, and letting the system enforce it on every change that follows. You’re not building software anymore. You’re building the machine that builds software, and quality control is part of that machine.

