Defense at AI speed: Microsoft’s new multi-model agentic security system tops leading industry benchmark

Today Microsoft announced a major step forward in AI-powered cyber defense: our new agentic security system helped researchers find 16 new vulnerabilities across the Windows networking and authentication stack, including four Critical remote code execution flaws in components such as the Windows kernel TCP/IP stack and the IKEv2 service. They used the new Microsoft Security multi-model agentic scanning harness (codename MDASH), which was built by Microsoft’s Autonomous Code Security team. Unlike single-model approaches, the harness orchestrates more than 100 specialized AI agents across an ensemble of frontier and distilled models to discover, debate, and prove exploitable bugs end-to-end.

The results speak for themselves: 21 of 21 planted vulnerabilities found with zero false positives on a private test driver; 96% recall against five years of confirmed Microsoft Security Response Center (MSRC) cases in clfs.sys and 100% in tcpip.sys; and an industry-leading 88.45% score on the public CyberGym benchmark of 1,507 real-world vulnerabilities, the top score on the leaderboard and roughly five points ahead of the next entry.

The strategic implication is clear: AI vulnerability discovery has crossed from research curiosity into production-grade defense at enterprise scale, and the durable advantage lies in the agentic system around the model rather than in any single model. Codename MDASH is being used by Microsoft security engineering teams and tested by a small set of customers as part of a limited private preview.

This post explains how codename MDASH works, what we shipped today, what we learned along the way, and how you can join the private preview.

AI-powered vulnerability discovery at hyper-scale

The Microsoft Autonomous Code Security (ACS) team was assembled to take AI-powered vulnerability research from a research curiosity to production engineering at enterprise scale. Several members of this team came to Microsoft from Team Atlanta, the team that won the $20 million DARPA AI Cyber Challenge by building an autonomous cyber-reasoning system that found and patched real bugs in complex open-source projects. The lessons from that work, especially the amount of engineering required to make frontier language models perform professional-level security auditing, are what our new multi-model agentic scanning harness (codename MDASH) is built around.

Microsoft’s code base is challenging for security auditing for a few reasons:

Vast proprietary surface. Windows, Hyper-V, Azure, and the device-driver and service ecosystems around them are private Microsoft codebases: they are not part of any commodity language model’s training corpus, and they are genuinely hard to reason about. Kernel calling conventions, IRP and lock invariants, IPC trust boundaries, and component-internal idioms don’t yield to pattern matching. On this surface, a model has to actually reason.

DevSecOps at scale. Every finding has a real owner, a triage process, and a Patch Tuesday to land on. There is no quiet drawer for speculative findings; if a tool produces noise, the noise is everyone’s problem.

High-value targets. Windows, Hyper-V, Xbox, and Azure serve billions of users. The payoff for finding a single hard bug is unusually high, and so is the cost of a false positive in a tier-one component.

The findings in this post are the result of close collaboration between ACS and Microsoft Windows Attack Research and Protection (WARP). WARP owns the deep, hard end of Windows offensive research; ACS brings the AI-powered discovery and validation pipeline. Together, the teams have built a mature harness.

Codename MDASH: Microsoft Security’s new multi-model agentic scanning harness

Codename MDASH is, at its core, an agentic vulnerability discovery and remediation system. The model is one input. The system is the product.

A useful mental model is to think of it as a structured pipeline that takes a code base and emits validated, proven findings:

Prepare stage: Ingests the source target, builds language-aware indices, and then maps the attack surface and threat models by analyzing past commits.
Scan stage: Runs specialized auditor agents over candidate code paths, emitting candidate findings with hypotheses and evidence.
Validate stage: Runs a second cohort of agents, the debaters, that argue for and against each finding’s reachability and exploitability.
Dedup stage: Collapses semantically equivalent findings (for example, patch-based grouping).
Prove stage: Constructs and executes triggering inputs where the bug class admits it. The prove stage validates the precondition dynamically and formulates bug-triggering inputs to prove that the vulnerability exists (for example, with ASan for C/C++).
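
The five stages above can be read as a chain of functions over a list of findings. The following is a minimal sketch of that shape only, not MDASH code: the Finding fields, the stage function signatures, and the toy string-matching auditor are all hypothetical stand-ins for what would be model-driven agents.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Finding:
    site: str          # hypothetical file label for the candidate bug
    hypothesis: str    # what the auditor suspects
    fix_group: str     # findings that one patch would fix share this key
    proven: bool = False


Auditor = Callable[[str, str], list[Finding]]
Debater = Callable[[Finding], bool]   # True means "refuted"
Prover = Callable[[Finding], bool]    # True means "trigger built and confirmed"


def prepare(source: dict[str, str]) -> list[str]:
    """Prepare: index the target; here every unit is a candidate path."""
    return sorted(source)


def scan(paths: list[str], source: dict[str, str], auditors: list[Auditor]) -> list[Finding]:
    """Scan: every auditor looks at every candidate path."""
    return [f for p in paths for audit in auditors for f in audit(p, source[p])]


def validate(findings: list[Finding], debaters: list[Debater]) -> list[Finding]:
    """Validate: keep a finding only if no debater refutes it."""
    return [f for f in findings if not any(refutes(f) for refutes in debaters)]


def dedup(findings: list[Finding]) -> list[Finding]:
    """Dedup: keep the first finding per fix group (patch-based grouping)."""
    groups: dict[str, Finding] = {}
    for f in findings:
        groups.setdefault(f.fix_group, f)
    return list(groups.values())


def prove(findings: list[Finding], prover: Prover) -> list[Finding]:
    """Prove: keep only findings for which a trigger could be confirmed."""
    return [replace(f, proven=True) for f in findings if prover(f)]


def run_pipeline(source, auditors, debaters, prover) -> list[Finding]:
    candidates = scan(prepare(source), source, auditors)
    return prove(dedup(validate(candidates, debaters)), prover)


# Toy stand-ins for agents (assumptions, purely for illustration).
def double_free_auditor(path: str, code: str) -> list[Finding]:
    if code.count("free(p)") >= 2:
        return [Finding(path, "p may be freed twice", fix_group="double-free:p")]
    return []


def refute_test_code(finding: Finding) -> bool:
    return finding.site.startswith("test_")   # treat test code as unreachable


source = {
    "a.c": "free(p); free(p);",
    "b.c": "free(p); if (err) free(p);",
    "test_c.c": "free(p); free(p);",
    "d.c": "free(p);",
}
report = run_pipeline(source, [double_free_auditor], [refute_test_code], prover=lambda f: True)
```

In this toy run, three units are flagged, the debater refutes the one in test code, and the two survivors collapse into a single proven finding because they share a fix group.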

Three properties make this work in practice:

An ensemble of diverse models, managed by codename MDASH. No single model is best at every stage. The multi-model agentic scanning harness runs a configurable panel of models: state-of-the-art (SOTA) models as the heavy reasoner, distilled models as a cost-effective debater for high-volume passes, and a second, separate SOTA model as an independent counterpoint. Disagreement between models is itself a signal: when an auditor flags something as suspect and the debater can’t refute it, that finding’s posterior credibility goes up.
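
One way to read "posterior credibility goes up" is as a Bayes update on the event "the finding survived the debate." The sketch below is an interpretation, not a description of how MDASH scores findings, and the prior and both likelihoods are made-up numbers chosen only to show the direction of the effect.

```python
def posterior(prior: float, p_survive_if_real: float, p_survive_if_false: float) -> float:
    """P(finding is real | debater failed to refute it), by Bayes' rule."""
    evidence = prior * p_survive_if_real + (1.0 - prior) * p_survive_if_false
    return prior * p_survive_if_real / evidence


# Assumed numbers: 30% of raw auditor flags are real; a real bug survives
# debate 90% of the time; a false alarm survives 20% of the time.
prior = 0.30
after_debate = posterior(prior, p_survive_if_real=0.90, p_survive_if_false=0.20)
```

With these assumed numbers the credibility rises from 0.30 to about 0.66. The update is only informative when the debater refutes false alarms more often than real bugs, which is one argument for using a different model as the counterpoint.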

Specialized agents. An auditor doesn’t reason like a debater, which doesn’t reason like a prover. Each pipeline stage has its own role, prompt regime, tools, and stop criteria. We don’t expect one prompt to do everything; we don’t expect one agent to recognize, validate, and exploit a bug in a single pass. Codename MDASH has more than 100 specialized agents, built through deep research into past Common Vulnerabilities and Exposures (CVEs) and their patches. They work independently to find bugs, and their auditing results are ensembled into a single report.

End-to-end pipeline with extensible plugins. The pipeline is opinionated, but it is not closed. Plugins let domain experts inject context the foundation models can’t see on their own: kernel calling conventions, IRP rules, lock invariants, IPC trust boundaries, codec state machines. The CLFS proving plugin we describe below is one such example: a domain plugin that knows how to construct a triggering log file given a candidate finding. As another example, the Windows team extended the harness’s reasoning with a custom code analysis database; a CodeQL database can also be leveraged.

The payoff for this architecture is portability across model generations. The pipeline’s targeting, validation, dedup, and prove stages are model-agnostic by construction, which allows the harness to get the best of what any model has to offer. When a new model lands, A/B testing it against the current panel is one configuration flip. When a model improves, the customer’s prior investment (scope files, plugins, configurations, calibrations) carries over, allowing customers to ride the frontier of security value.
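
To make "one configuration flip" concrete, here is a small sketch in which the model panel is plain data and the pipeline is untouched when a model is swapped. The Panel fields mirror the three roles named earlier in this post; the model names, the `changed_roles` helper, and the `ab_test` function are hypothetical and do not reflect MDASH’s real configuration format.

```python
from dataclasses import dataclass, fields, replace
from typing import Callable


@dataclass(frozen=True)
class Panel:
    reasoner: str       # heavy reasoner
    debater: str        # cost-effective debater for high-volume passes
    counterpoint: str   # independent second opinion


def changed_roles(a: Panel, b: Panel) -> list[str]:
    """List the roles whose model differs between two panels."""
    return [f.name for f in fields(Panel) if getattr(a, f.name) != getattr(b, f.name)]


def ab_test(evaluate: Callable[[Panel], object], baseline: Panel, candidate: Panel) -> dict:
    """Run the same evaluation under both panels; only the panel varies."""
    return {"baseline": evaluate(baseline), "candidate": evaluate(candidate)}


baseline = Panel(reasoner="model-a", debater="model-a-distilled", counterpoint="model-b")
candidate = replace(baseline, reasoner="model-c")   # the one-field flip


# Placeholder evaluator: a real one would run the fixed pipeline on a
# labeled corpus and return a metric such as recall.
def describe(panel: Panel) -> str:
    return f"evaluated with reasoner={panel.reasoner}"


results = ab_test(describe, baseline, candidate)
```

Because the panels are immutable values, the baseline is never modified by the experiment, and the difference between the two arms is exactly the fields reported by `changed_roles`.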

Using codename MDASH for security research

To evaluate the bug-finding capabilities of the multi-model agentic scanning harness, it is necessary to first ground on code that has never been seen by a model. This eliminates the possibility that a model “learned the answers to the test.” We scanned StorageDrive, a sample device driver used in Microsoft interviews for offensive security researchers. The driver contains 21 deliberately injected vulnerabilities, including kernel use-after-frees (UAFs), integer handling issues, IOCTL validation gaps, and locking errors. Because StorageDrive is a private codebase that has never been published, we can safely assume it was not included in the training data of modern language models.

We ran the harness on StorageDrive using its default configuration. The results were striking: all 21 ground-truth vulnerabilities were correctly identified, with zero false positives in this run.

This simple test shows that the reasoning and vulnerability discovery capabilities of codename MDASH can approximate those of professional offensive researchers.

We then used the harness to audit the most security-critical part of Windows, namely the TCP/IP network stack.

The 5.12.2026 Patch Tuesday cohort

Across the Windows network stack and adjacent services, today’s Patch Tuesday includes 16 CVEs our engineering teams found using codename MDASH.

Component	Description	CVE	Severity	Type
tcpip.sys	Remote unauth SSRR IPv4 packets causing UAF	CVE-2026-33827	Critical	Remote Code Execution
tcpip.sys	NULL deref via crafted IPv6 extension headers	CVE-2026-40413	Important	Denial of Service
tcpip.sys	Kernel DoS via ESP SA refcount underflow	CVE-2026-40405	Important	Denial of Service
ikeext.dll	Unauth IKEv2 SA_INIT double-free triggers LocalSystem RCE	CVE-2026-33824	Critical	Remote Code Execution
tcpip.sys	Use-after-free in Ipv4pReassembleDatagram leading to disclosure	CVE-2026-40406	Important	Information Disclosure
tcpip.sys	IPsec cross-SA fragment splicing via reassembly	CVE-2026-35422	Important	Security Feature Bypass
tcpip.sys	Unauthenticated local Windows Filtering Platform (WFP) RPC disables name cache	CVE-2026-32209	Important	Security Feature Bypass
ikeext.dll	Memory leak	CVE-2026-35424	Important	Denial of Service
telnet.exe	Out-of-bounds (OOB) read in FProcessSB via malformed TO_AUTH	CVE-2026-35423	Important	Information Disclosure
tcpip.sys	IPv6+TCP MDL-split packet triggers NULL deref	CVE-2026-40414	Important	Denial of Service
tcpip.sys	ICMPv6 packet triggers NdisGetDataBuffer NULL deref	CVE-2026-40401	Important	Denial of Service
tcpip.sys	Pre-auth remote UAF via SA double-decrement	CVE-2026-40415	Important	Remote Code Execution
http.sys	Unauth remote QUIC control-stream OOB read	CVE-2026-33096	Important	Denial of Service
tcpip.sys	Kernel stack buffer overflow via RPC blob	CVE-2026-40399	Important	Elevation of Privilege
netlogon.dll	Unauthenticated CLDAP User= filter stack overflow	CVE-2026-41089	Critical	Remote Code Execution
dnsapi.dll	Crafted UDP DNS response triggers heap OOB	CVE-2026-41096	Critical	Remote Code Execution

Of these vulnerabilities, 10 are in kernel mode and 6 are in user mode. The majority are reachable from a network position with no credentials. Let’s take a closer look.

Two deep dives

The two findings below are characteristic of what the new Microsoft Security multi-model agentic scanning harness pipeline can do that a single-model harness cannot. The first is a kernel race-condition use-after-free that requires reasoning about object lifetime across non-trivial control flow and three independent concurrent free paths. The second is an aliasing double-free that spans six source files and is only visible against the contrast of a correctly handled site elsewhere in the same code base.

CVE-2026-33827: Remote unauthenticated UAF in tcpip.sys via SSRR

The vulnerability arises in the Windows IPv4 receive path due to improper lifetime management of a reference-counted Path object within Ipv4pReceiveRoutingHeader. After invoking a routing lookup, the function drops its sole owned reference to the Path through a dereference operation, but later reuses the same pointer when handling Strict Source and Record Route (SSRR) processing. Because the object’s reference count may reach zero at the earlier release point, the underlying memory can be returned to a per-processor lookaside allocator and subsequently reused, turning the later access into a classic use-after-free in kernel context.

This occurs on a network-triggerable path that processes attacker-controlled packet metadata, making it reachable at elevated IRQL within the networking stack. The core issue is compounded by the concurrency model of the path cache and associated cleanup routines. Once the caller relinquishes ownership, the Path object’s liveness depends entirely on external references held by shared data structures. Multiple independent subsystems, including the path-cache scavenger, explicit flush routines, and interface state-driven garbage collection, can concurrently remove the object and drop the final reference. These operations are not synchronized with the receive-side execution window in this function, and no lock is held to serialize access. As a result, on SMP systems the freed object can be reclaimed and overwritten before the subsequent dereference, converting a simple ordering bug into a race-driven use-after-free that is feasible to trigger in practice.

From an exploitation standpoint, the vulnerability is reachable by a remote, unauthenticated attacker through crafted IPv4 packets carrying the SSRR option that pass standard validation checks. The stale pointer dereference can trigger a chain of accesses through freed memory, potentially leading to controlled reads and a stronger corruption primitive if the reclaimed allocation is attacker-influenced. Although exploitation requires winning a narrow timing window and shaping allocator reuse, the combination of remote reachability, kernel execution context, and the potential for controlled memory manipulation elevates the issue to Critical severity.
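
The ordering problem described above can be reproduced in a toy model. The sketch below is not tcpip.sys code: the Path class, its methods, and the cleanup callback are hypothetical, and a Python exception stands in for what would be silent memory reuse in the kernel. It only shows why "release, then use" is unsafe once another owner can drop the last reference in between, and why reading the needed data before the release is safe.

```python
class FreedObjectError(RuntimeError):
    """Raised by the toy model where real code would touch freed memory."""


class Path:
    """Toy reference-counted object; not the real kernel structure."""

    def __init__(self, next_hop: str):
        self._next_hop = next_hop
        self.refcount = 1          # the reference held by the path cache
        self.freed = False

    def _check(self) -> None:
        if self.freed:
            raise FreedObjectError("use after free")

    def reference(self) -> "Path":
        self._check()
        self.refcount += 1
        return self

    def dereference(self) -> None:
        self._check()
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True      # memory would return to the allocator here

    @property
    def next_hop(self) -> str:
        self._check()
        return self._next_hop


def receive_buggy(path: Path, concurrent_cleanup) -> str:
    path.dereference()             # caller drops its only reference
    concurrent_cleanup()           # scavenger or flush runs in the window
    return path.next_hop           # stale pointer used afterwards


def receive_fixed(path: Path, concurrent_cleanup) -> str:
    next_hop = path.next_hop       # read everything while the reference is held
    path.dereference()
    concurrent_cleanup()
    return next_hop


def lookup() -> tuple[Path, object]:
    """Return a Path referenced by both cache and caller, plus a cache-eviction callback."""
    cached = Path("192.0.2.1")
    caller_ref = cached.reference()
    return caller_ref, cached.dereference


path, evict = lookup()
fixed_result = receive_fixed(path, evict)

path, evict = lookup()
try:
    receive_buggy(path, evict)
    buggy_outcome = "no error"
except FreedObjectError:
    buggy_outcome = "use after free"
```

In the real bug the cleanup is a race rather than a guaranteed call, so the stale access only sometimes lands on freed memory; the toy model makes the worst-case interleaving deterministic.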

Why single-model systems missed this bug

A single-model harness tends to miss this bug because the lifetime violation is not locally visible even within the same function. The release of the Path reference and its later reuse are separated by non-trivial control flow (an alternate branch, multiple validation checks, and several early-drop conditions), which breaks the simple “release-then-use” pattern most detectors rely on. Without tracking reference ownership across these intermediate states, the model sees two independent operations rather than a temporal dependency. As a result, the dereference does not look suspicious in isolation, even though the reference-count semantics mean the pointer may already be invalid.

The decisive signal also lives outside the immediate context. The same logical operation appears elsewhere with the correct order: all needed data is derived from the object before the reference is dropped. This makes this call site an inconsistency rather than an obvious misuse.

Detecting that requires cross-file reasoning: identifying analogous patterns, aligning their intent, and noticing the deviation. On top of that, reachability depends on composing multiple conditions: an input that sets the SSRR flag, a default configuration that enables the path, and concurrent subsystems that can reclaim the object during the exposed window. A single-shot analysis collapses these steps and loses the interaction between them, whereas a staged approach can connect the ownership violation, the concurrency model, and the externally controlled trigger into a coherent exploitation path.

Disclosure. CVE-2026-33827, patched in April Patch Tuesday.

CVE-2026-33824: Unauthenticated IKEv2 SA_INIT + fragmentation → double-free → LocalSystem RCE

The vulnerability lived in the IKEEXT service, the Windows component responsible for IKE and AuthIP keying for IPsec, and was reachable by a remote, unauthenticated attacker over UDP/500 on any host configured as an IKEv2 responder (RRAS VPN, DirectAccess, Always-On VPN infrastructure, or any machine with an inbound connection security rule). By sending a crafted IKE_SA_INIT carrying Microsoft’s “IPsec Security Realm Id” vendor-ID payload, followed by a single IKEv2 fragment (RFC 7383 SKF) that reassembles immediately, an attacker could trigger a deterministic double-free of a 16-byte heap allocation inside the service.

Because IKEEXT runs as LocalSystem inside svchost.exe, this represents a pre-authentication remote code execution path into one of the highest-privilege contexts on the system. The root cause is a textbook ownership bug. When IKEEXT reinjects a reassembled fragment back through its receive pipeline, it duplicates the packet’s receive context with a flat memcpy. This is a shallow copy: it clones the struct’s bytes but not the heap allocations it points to. One of those allocations is the attacker-supplied security-realm identifier, and after the copy, both the queued context and the live Main Mode SA hold the same pointer, and both believe they own it.

On teardown, each one frees it, resulting in a double-free. The trigger sequence is two UDP packets, with no race and no special timing. A double-free of a fixed-size heap chunk is a well-understood corruption primitive in modern Windows; we are not publishing further exploitation details. Reachability requires that the host has an IKEv2 responder policy that accepts the proposed transforms: the bug is reachable on RRAS VPN, DirectAccess, Always-On VPN, and IPsec connection security rules in their typical configurations, but a bare Start-Service IKEEXT with no responder policy is not vulnerable. The IKEEXT service is DEMAND_START by default; where responder policy exists, BFE will start it on the first inbound IKE packet, so the attacker does not need IKEEXT to already be running.

Why single-model systems missed this bug

The bug is an aliasing lifecycle bug spanning six files: ike_A.c (the bad memcpy), ike_B.c (the alias origin and the first stack-local copy), ike_C.c (the wrong free), ike_D.c (both the right pattern and the second free), ike_E.c (where the buffer gets populated remotely), and ike_F.c (the IKEv2 dispatcher and the UAF read site that precedes the second free). No single-file analysis sees it. The strongest piece of evidence that the bug is real is the correct version of the same pattern, in the same code base, in ike_D.c, immediately after the memcpy of the selector. Catching this requires the auditor to recognize the missing step at one site by reference to the present step at another. Our specialized auditor agents are designed to surface exactly these comparisons; the debate stage forces them to stand up under cross-examination.

Disclosure. CVE-2026-33824, patched in April Patch Tuesday.

How capable is codename MDASH?

The Patch Tuesday cohort and the StorageDrive results are forward-looking signals. Two retrospective benchmarks tell us how the system performs against ground truth on real, well-reviewed code.

Recall on historical MSRC cases. We re-ran codename MDASH against pre-patch snapshots of two heavily reviewed Windows components and measured whether the historical MSRC-confirmed bugs would have been (re-)discovered:

clfs.sys: 96% recall on 28 MSRC cases spanning five years.
tcpip.sys: 100% recall on 7 MSRC cases spanning five years.
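
The figures above give percentages and case totals but not the number of cases recovered. As a small arithmetic check, the sketch below lists which integer counts are consistent with each rounded percentage; the helper name is made up, and the resulting counts are an inference from the published numbers rather than figures stated in this post.

```python
def consistent_counts(total: int, reported_percent: int) -> list[int]:
    """Counts k such that k / total rounds to the reported whole percentage."""
    return [
        found
        for found in range(total + 1)
        if round(100 * found / total) == reported_percent
    ]


clfs_counts = consistent_counts(total=28, reported_percent=96)
tcpip_counts = consistent_counts(total=7, reported_percent=100)
```

Only 27 of 28 rounds to 96% for clfs.sys, and only 7 of 7 gives 100% for tcpip.sys. With case counts this small, a single missed case moves the CLFS figure by more than three percentage points, which is why the finite case count matters when reading these numbers.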

These are the strongest internal numbers we publish, and they are meaningful for a specific reason: the MSRC case database is the ground truth for what real attackers exploited, what required a Patch Tuesday, and what defenders had to react to. A system that recovers 96% of a five-year MSRC backlog in a heavily reviewed kernel component is not finding theoretical weaknesses; it is finding the bugs that mattered.

We are deliberate about what these numbers do and do not claim. They are retrospective recall benchmarks on internal code with a finite case count. They tell us that the system would have been useful had it existed at the time. They do not, by themselves, predict that the next 38 bugs in CLFS will be found at the same rate. The forward-looking signal is the Patch Tuesday cohort itself.

The CLFS proving extension as a worked example. The 96% CLFS recall number is partly a story about the prove stage. Many CLFS findings look interesting until you try to construct a triggering log file; a candidate finding without a proof is, in practice, an entry on a triage backlog. The CLFS-specific proving plugin we wrote knows how to construct triggering logs given a candidate finding: it understands the on-disk container layout, the block-validation sequence, and the in-memory state machine well enough to drive a candidate path to its sink. This is precisely what plugin extensibility is for: the foundation models don’t, and shouldn’t be expected to, internalize Microsoft-specific filesystem invariants. The plugin embeds them, the model uses them, and the outcome is bugs that survive being proven, not bugs that get filed and forgotten.

CyberGym. On the public CyberGym benchmark, a corpus of 1,507 real-world vulnerability reproduction tasks drawn from across 188 OSS-Fuzz projects, the Microsoft Security multi-model agentic scanning harness reaches an 88.45% success rate, the highest score on CyberGym’s published leaderboard at the time of writing and roughly five points above the next entry, 83.1%. This result was obtained using generally available models. The strong results suggest that the surrounding agentic system contributes substantially to end-to-end performance, beyond raw model capability. For evaluation, we used CyberGym’s default configuration (level 1), which provides the vulnerable source code and a high-level vulnerability description. To interface with CyberGym’s evaluation protocol, we extended the harness’s prove stage to autonomously submit proof-of-concept (PoC) inputs and retrieve flags.

Our failure analysis of the remaining roughly 12% reveals two notable structural patterns. First, among findings that targeted the wrong code area, 82% came from tasks with vague descriptions that also lacked function or file identifiers, suggesting that description quality is a major factor in scan accuracy. Second, we found cases where the agent constructed libFuzzer-style inputs but the benchmark task actually required honggfuzz-format inputs, so otherwise sound reproductions failed on a harness-format mismatch.

What this all means

We are at a moment in the industry where AI-powered vulnerability discovery stops being speculative and starts being an engineering problem. The findings in this Patch Tuesday and the retrospective recall on five years of CLFS MSRC cases are evidence that AI vulnerability finding can scale.

What we have learned building MDASH and using it across Microsoft is more portable: the harness does the work, and the model is one input.

This matters in three concrete ways.

First, discovery requires composition that no single prompt can achieve. The bugs in this post (the tcpip.sys race, the ikeext.dll alias chain) are not visible to a model handed a single function. They are visible to a system that can sequence cross-file pattern comparison, multi-step reachability analysis, debate between specialized agents, and end-to-end proof construction. Single-model harnesses undersell what models can do; over-trusted single agents overshoot what models can do reliably. The art is the harness around the model, and the harness is most of the engineering.

Second, validation is the difference between a finding and a fix. A scanner that flags candidate bugs is a scanner that produces a triage backlog. The Patch Tuesday cohort is what it is because the system that produced it does not stop at candidates: it debates, dedups, and proves. Validation is not a checkbox; it is its own pipeline of agents and plugins, and it is where most of the day-to-day engineering ends up.

Third, the system absorbs model improvements, which is what makes it durable. When a new model lands, the targeting, debating, dedup, and proof stages do not need to be rewritten; we change a configuration and re-run an A/B test. The customer’s investment (per-project context, scan plugins, proving agents) carries over. This is the architectural property that matters most over time, because the model lottery is going to keep playing out, and any system whose value is gated on a particular model is a system that has to be rebuilt every six months.

For defenders, at any scale and on any code they own, the implication is the same. The right question to ask of an AI vulnerability tool is not “which model does it use?” but “what does it do with the model, and what survives when the next model arrives?”

Conclusion

The Microsoft Security multi-model agentic scanning harness (codename MDASH) helps our engineering teams meaningfully improve security outcomes using generally available AI models, today. It is also being tested by customers as part of our limited private preview. To join the private preview, please sign up here.

Many thanks to the teams across Microsoft working to improve the security of our customers, including the Autonomous Code Security team and the Microsoft Windows Attack Research & Protection (WARP) team, whose work led to the findings in this post.

We look forward to sharing more updates with customers and the industry as we work to make the world a safer place for all.