GLM-5.2 Tops Claude Code in Semgrep IDOR Benchmark

TL;DR

Benchmark Result: GLM-5.2 beats Anthropic’s Claude Code in Semgrep’s prompt-only IDOR benchmark.
Harness Caveat: Semgrep Multimodal still scored higher, showing that endpoint discovery and guided navigation remain central.
Deployment Tradeoff: Open weights and a low per-vulnerability cost help security teams, but self-hosting may need serious GPU capacity.
Scope Limit: The result is one access-control test, not a general ranking across AI coding models.

GLM-5.2, Z.ai’s open-weight model, has reached 39% F1 on Semgrep’s IDOR benchmark, beating Anthropic’s Claude Code coding assistant in the prompt-only lane. Claude Code scored 37% F1 with Opus 4.6 and 28% with Opus 4.8 or 4.7.

Semgrep Multimodal, the company’s harnessed security pipeline, still reached 53% to 61% F1, which keeps the result narrow.

IDOR, short for Insecure Direct Object Reference, is a common access-control bug in which an application exposes an internal object ID, such as a user, invoice, or file identifier, without checking whether the requester is allowed to access it. Semgrep’s test measures how well AI coding systems detect that flaw, using F1 to balance missed bugs against false positives. The result matters because GLM-5.2 is open weight, meaning teams can run its released model parameters more directly than a closed API, while Semgrep’s own harness adds security-specific context gathering around models before they report a finding.
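To make the bug class concrete, here is a minimal Python sketch of an IDOR and its fix. It is an illustration only, not code from Semgrep’s dataset: the `Invoice` type, the `INVOICES` store, and both function names are hypothetical, and a real application would do this inside a web framework’s route handler.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    invoice_id: int
    owner_id: int
    amount_cents: int


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(invoice_id=1, owner_id=100, amount_cents=2500),
    2: Invoice(invoice_id=2, owner_id=200, amount_cents=9900),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # IDOR: the caller-supplied ID is used directly and the caller's
    # identity is never compared with the invoice owner.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES.get(invoice_id)
    # Raise the same error for "missing" and "not yours" so that callers
    # cannot use the response to enumerate valid IDs.
    if invoice is None or invoice.owner_id != current_user_id:
        raise PermissionError("invoice not found or not accessible")
    return invoice
```

With this data, user 100 can read invoice 2 through `get_invoice_vulnerable`, while `get_invoice_checked` raises `PermissionError` for the same request.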

Security teams can read the result as a workflow decision rather than a general model ranking. A prompt-only model receives the same vulnerability prompt without Semgrep’s endpoint-discovery scaffolding. A harness can find routes, surface parameters, collect nearby authorization checks, and turn model output into findings that a reviewer can evaluate.

Rank	Configuration	Harness	F1
1	Semgrep Multimodal (GPT 5.5)	Semgrep Multimodal	61%
2	Semgrep Multimodal (Opus 4.8)	Semgrep Multimodal	53%
3	GLM 5.2	Pydantic AI (prompt only)	39%
4	Claude Code (Opus 4.6)	Claude Code SDK	37%
5	Claude Code (Opus 4.8/4.7)	Claude Code SDK	28%
6	MiniMax M3	Pydantic AI (prompt only)	23%
7	Kimi K2.7 Code	Pydantic AI (prompt only)	22%
8	GPT-5.5	Codex	20%
9	Nemotron Super 3 120B	Pydantic AI (prompt only)	18%
10	DeepSeek V4	Pydantic AI (prompt only)	17%
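
The F1 column is the harmonic mean of precision and recall, so a configuration is penalized both for missed bugs and for false alarms. The sketch below shows the standard formula only; the counts in the usage note are made up for illustration and are not Semgrep’s data, and the article does not describe how Semgrep matched findings to ground truth.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    predicted = true_positives + false_positives  # everything the tool flagged
    actual = true_positives + false_negatives     # every real bug in the dataset
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)
```

For example, a hypothetical run that flags 10 findings, 6 of them real, against a dataset of 12 real bugs has precision 0.6 and recall 0.5, which `f1_score(6, 4, 6)` turns into an F1 of about 0.55.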

What Semgrep’s IDOR Test Measured in Practice

Semgrep tested coding models against the same dataset, evaluation methodology, and system prompt. Prompt-only models ran without external data augmentation or search functions, keeping the comparison focused on what the model could infer from the supplied prompt and code context.

Semgrep’s harnessed lane added endpoint discovery, guided navigation, and output parsing around the model. Important: A model-to-model benchmark and an end-to-end security workflow measure different things. Semgrep’s security research team put the caveat plainly: “The harness still matters more than the model.”

In an IDOR scan, a model can classify code only after the relevant route, parameter, user context, and authorization check are in view. Endpoint discovery and guided navigation help explain why the harnessed lane scored higher than bare-prompt models. Vulnerability detection depends on context collection and post-processing as well as token prediction.

IDOR bugs often sit in ordinary route handlers rather than exotic exploit chains. A detector that does not see the object identifier, caller identity, and permission check together can miss the bug class even when its model understands access control. The result points to security-tool plumbing as much as model capability.
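
One way to picture that plumbing is a small structure that packages the route, the object identifier, the caller identity, and any nearby permission checks before the model is asked anything. The sketch below is hypothetical and is not Semgrep’s harness: `EndpointContext`, `build_review_prompt`, and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EndpointContext:
    """Evidence a harness might gather for one endpoint (hypothetical shape)."""
    route: str                      # e.g. "GET /invoices/{invoice_id}"
    object_id_param: str            # parameter that names the object
    caller_identity_source: str     # where the handler gets the current user
    authorization_checks: List[str] = field(default_factory=list)


def build_review_prompt(ctx: EndpointContext, handler_source: str) -> str:
    # List every check found near the handler; say so explicitly when none
    # were found, since that absence is the signal an IDOR review needs.
    if ctx.authorization_checks:
        checks = "\n".join(f"- {check}" for check in ctx.authorization_checks)
    else:
        checks = "- none found near the handler"
    return (
        f"Route: {ctx.route}\n"
        f"Object ID parameter: {ctx.object_id_param}\n"
        f"Caller identity: {ctx.caller_identity_source}\n"
        f"Authorization checks:\n{checks}\n\n"
        f"Handler source:\n{handler_source}\n\n"
        "Question: can a caller read or modify an object they do not own?"
    )
```

A prompt-only run would hand the model the handler source and a generic instruction; a bundle like this puts the object identifier, caller identity, and permission evidence in view at the same time.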

Why Open Weights Change the Security-Tool Calculation

Zhipu AI’s GLM-5.1 predecessor had already made the line a developer-facing benchmark contender before GLM-5.2 reached GLM Coding Plan members on June 13, 2026. GLM-5.2 open weights and release notes arrived on June 16, 2026. The model pairs a mixture-of-experts architecture with a one-million-token context window, roughly 750 billion total parameters, and about 40 billion active parameters per token.

OpenAI’s 2025 open-weight language model release gave developers trained parameters they could run outside a vendor API. GLM-5.2 raises the same control question when code, credentials, or internal endpoints are part of the workload. Local weights can simplify data handling for code review teams, but they do not remove the need for endpoint discovery, authorization reasoning, or scanner-ready output.

Cost adds a second decision point. In the prompt-only IDOR run, GLM-5.2 cost roughly $0.17 per vulnerability found. Spare accelerator capacity, model serving, and review-integration work still determine whether local control is operationally useful.

Broader Benchmarks and Competitor Context Stay Messy

Other GLM-5.2 comparisons reinforce the workload caveat. GLM-5.2 also matched Anthropic Opus 4.7 and 4.8 at solve-rate level in Graphistry’s CyBT-CTF cybersecurity investigation comparison. Controlled vulnerability tasks can shift model rankings when benchmarks, guardrails, and review workflows differ.

For procurement teams, the Semgrep result turns on a narrower question: whether a model can find access-control bugs when the route, parameter, and authorization evidence are packaged well enough for review. GLM-5.2’s prompt-only lead is real within that test, but the leading path still paired models with security-specific scaffolding.

What Security Teams Should Watch Next

The June 22 IDOR table sets the baseline for GLM-5.2’s security-tool case. Until Semgrep or another security lab publishes a second vulnerability-class result above Claude Code’s 37% F1 mark, GLM-5.2’s 39% F1 lead remains an IDOR procurement cue rather than a workflow-wide benchmark.

Replication across non-IDOR vulnerability classes and production security harnesses would show whether the result travels beyond one access-control test. Security teams should treat GLM-5.2 as a credible open-weight candidate for testing, not as proof that a bare model can replace the context engineering around application-security review.