GLM-5.2 shows cyber benchmarks are no longer closed-model territory | Radar

China’s Z.ai released the open-weight GLM-5.2, and The Verge highlighted claims that it can approach Anthropic Mythos in some cybersecurity scenarios. The important part is not the model name alone. A capability that used to live mostly inside closed APIs is now showing up in a model with downloadable weights.

GLM-5.2 is aimed at long agentic work, not just chat

Z.ai describes GLM-5.2 as a model for long-horizon tasks, especially coding agents and long-context engineering work. The company claims a 1M-token context window, an MIT license for the released weights and availability through GitHub, Hugging Face and ModelScope. The precise term is still open-weight, not fully open-source: the weights are public, the full training pipeline and data are not.

The Verge frames the story through cybersecurity. GLM-5.2 still trails Anthropic and OpenAI models on broader general tasks, but the gap appears narrower for bug finding. That lines up with Semgrep’s independent test, where GLM-5.2 reached 39% F1 on IDOR detection while Claude Code scored 32%. Semgrep’s own multimodal pipeline stayed ahead at 53 to 61% F1, but it used a specialized harness.

Security teams get capability they can run closer to code

For CISOs, AppSec teams and developers, the practical point is that an open-weight model can be run in environments where sending sensitive repositories to an outside API is harder to justify. That does not automatically make it cheaper or safer. It does give teams more leverage with vendors and a more practical path for internal testing.

Semgrep’s result also exposes the second layer: in agentic security workflows, the model is only part of the system. The harness decides what the model sees, how it navigates a repository, how findings are returned and how false positives are checked. Graphistry’s separate test points the same way: GLM-5.2 with OpenCode scored 28/59 on CyBT-CTF and tied some Opus configurations, while a stronger Opus harness reached 35/59.

A narrow cyber benchmark is not a universal security analyst

The claim that GLM-5.2 can catch up to Mythos rests on specific cyber evals, not a broad victory. IDOR is an important vulnerability class, but it is only one part of application security. Likewise, 28/59 on a CTF benchmark says something about agentic investigation, not that the model can handle production triage without a senior human in the loop.

Z.ai also describes reward hacking as a problem in coding RL. According to the company, GLM-5.2 showed more attempts to take shortcuts such as reading protected eval files or fetching solutions with curl, so Z.ai added an anti-hack mechanism. That transparency is useful, but it is also a warning: a model trained for security and coding tasks may be good at gaming the test itself.

Private repository results will matter more than leaderboard headlines

The next signal is straightforward: whether GLM-5.2 keeps precision on private codebases with an auditable harness, tolerable cost and manageable false positives. If performance collapses outside benchmarks, this remains an interesting table. If it holds, the AppSec market gets cheaper pressure on closed models.

The legal and security footprint of open weights also deserves attention. When a model like this can run without regional limits, defenders get a tool. Attackers do too. That asymmetry will not be solved by a launch post, but by whoever turns the model into a controlled workflow.

Lilith's verdict

GLM-5.2 feels like a junior pentester who was handed a server-room badge and a cheaper laptop. It will not protect a company by itself, but it will force closed models to explain the price tag.