GLM-5.2 Tops Claude Code in Semgrep IDOR Benchmark


TL;DR

  • Benchmark Result: GLM-5.2 tops Semgrep’s prompt-only IDOR benchmark against Anthropic’s Claude Code.
  • Harness Caveat: Semgrep Multimodal still scored higher, showing that endpoint discovery and guided navigation remain central.
  • Deployment Tradeoff: Open weights and low per-vulnerability cost help security teams, but self-hosting may need serious GPU capacity.
  • Scope Limit: The result is one access-control test, not a general ranking across AI coding models.

GLM-5.2, Z.ai’s open-weight model, has reached 39% F1 on Semgrep’s IDOR benchmark, beating Anthropic’s Claude Code coding assistant in the prompt-only lane. Claude Code scored 37% F1 with Opus 4.6 and 28% with Opus 4.8 or 4.7.

Semgrep Multimodal, the company’s harnessed security pipeline, still reached 53% to 61% F1, which keeps the result narrow.

IDOR, short for Insecure Direct Object Reference, is a common access-control bug in which an application exposes an internal object ID, such as a user, invoice, or file identifier, without checking whether the requester is allowed to access it. Semgrep’s test measures how well AI coding systems detect that flaw, using F1 to balance missed bugs against false positives. The result matters because GLM-5.2 is open weight, meaning teams can run its released model parameters more directly than a closed API, while Semgrep’s own harness adds security-specific context gathering around models before they make a finding.
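To make the bug class concrete, here is a minimal Python sketch of an IDOR and its fix. Everything in it is hypothetical and for illustration only: the `Invoice` type, the in-memory `INVOICES` store, and both function names are assumptions, not code from Semgrep’s benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    owner_id: int
    amount_cents: int


# Hypothetical in-memory store standing in for a database.
INVOICES = {
    1: Invoice(invoice_id=1, owner_id=100, amount_cents=2500),
    2: Invoice(invoice_id=2, owner_id=200, amount_cents=9900),
}


def get_invoice_vulnerable(requester_id: int, invoice_id: int) -> Invoice:
    """IDOR: returns any invoice by ID and never looks at requester_id."""
    return INVOICES[invoice_id]


def get_invoice_checked(requester_id: int, invoice_id: int) -> Invoice:
    """Same lookup, plus an ownership check before returning the object."""
    invoice = INVOICES.get(invoice_id)
    # Treat "missing" and "not yours" identically so IDs cannot be probed.
    if invoice is None or invoice.owner_id != requester_id:
        raise PermissionError("invoice not found or not accessible")
    return invoice
```

In this sketch, user 100 can read invoice 2 through `get_invoice_vulnerable` even though user 200 owns it, while `get_invoice_checked` raises `PermissionError` for the same request. A detector has to notice that the first function is missing a check, which is harder than spotting a dangerous call that is present.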

Security teams can read the result as a workflow decision rather than a general model ranking. A prompt-only model receives the same vulnerability prompt without Semgrep’s endpoint-discovery scaffolding. A harness can find routes, surface parameters, collect nearby authorization checks, and turn model output into findings that a reviewer can evaluate.
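As a rough illustration of the first harness steps (finding routes and surfacing their parameters), the following Python sketch scans Flask-style source text with regular expressions. It is an assumption-laden toy, not Semgrep’s implementation: the `discover_endpoints` helper, both patterns, and the `SAMPLE` source are hypothetical, and a real harness would more plausibly rely on syntax-aware or framework-aware analysis.

```python
import re

# Hypothetical patterns: match Flask-style @app.route("...") decorators
# and URL parameters such as <invoice_id> or <int:invoice_id>.
ROUTE_PATTERN = re.compile(r"@app\.route\(\s*['\"]([^'\"]+)['\"]")
PARAM_PATTERN = re.compile(r"<(?:[a-z]+:)?([A-Za-z_][A-Za-z0-9_]*)>")


def discover_endpoints(source: str) -> list[dict]:
    """Return each route path with the URL parameters it exposes."""
    endpoints = []
    for match in ROUTE_PATTERN.finditer(source):
        path = match.group(1)
        endpoints.append({"path": path, "params": PARAM_PATTERN.findall(path)})
    return endpoints


# Illustrative input only; load_invoice is never called here.
SAMPLE = '''
@app.route("/invoices/<int:invoice_id>")
def show_invoice(invoice_id):
    return load_invoice(invoice_id)

@app.route("/health")
def health():
    return "ok"
'''

endpoints = discover_endpoints(SAMPLE)
```

Routes that carry an object identifier, such as `/invoices/<int:invoice_id>` here, are the ones worth handing to a model together with any nearby authorization code, while a parameterless route like `/health` can be deprioritized.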

Rank  Configuration                   Harness                     F1
1     Semgrep Multimodal (GPT 5.5)    Semgrep Multimodal          61%
2     Semgrep Multimodal (Opus 4.8)   Semgrep Multimodal          53%
3     GLM 5.2                         Pydantic AI (prompt only)   39%
4     Claude Code (Opus 4.6)          Claude Code SDK             37%
5     Claude Code (Opus 4.8/4.7)      Claude Code SDK             28%
6     MiniMax M3                      Pydantic AI (prompt only)   23%
7     Kimi K2.7 Code                  Pydantic AI (prompt only)   22%
8     GPT-5.5                         Codex                       20%
9     Nemotron Super 3 120B           Pydantic AI (prompt only)   18%
10    DeepSeek V4                     Pydantic AI (prompt only)   17%

What Semgrep’s IDOR Test Measured in Practice

Semgrep tested coding models against the same dataset, evaluation methodology, and system prompt. Prompt-only models ran without external data augmentation or search capabilities, keeping the comparison focused on what the model could infer from the supplied prompt and code context.
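For readers unfamiliar with the metric, F1 is the harmonic mean of precision (the share of reported findings that are real) and recall (the share of real bugs that were reported). The Python sketch below uses the standard formula; the counts in the example are invented for illustration and are not taken from Semgrep’s dataset.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        # No correct findings: precision and recall are both zero or undefined.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 3 real bugs found, 1 false alarm, 2 real bugs missed.
example = f1_score(true_positives=3, false_positives=1, false_negatives=2)
```

With these counts, precision is 0.75 and recall is 0.60, so F1 comes out at about 0.67. Because the harmonic mean is pulled toward the lower of the two values, a tool cannot score well by flagging everything or by reporting only the few findings it is sure of.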

Semgrep’s harnessed lane added endpoint discovery, guided navigation, and output parsing around the model. Important: A model-to-model benchmark and an end-to-end security workflow measure different things. Semgrep’s security research team put the caveat plainly: “The harness still matters more than the model.”