The Mythos fight shows why one benchmark cannot carry a security headline | Radar

Zvi Mowshowitz criticizes a Wall Street Journal article claiming Chinese AI systems have matched Anthropic’s Mythos in some cybersecurity scenarios. The real dispute is not whether GLM-5.2 is strong. It is whether a specific benchmark justifies a headline about matching Mythos.

The headline turned a narrow test into a race against all of Mythos

Mowshowitz’s objection rests on a capability distinction. One task is finding a vulnerability when a model is pointed at relevant code or given a bounded challenge. Another is autonomously searching a large space, finding vulnerabilities without precise guidance and chaining several findings into a working exploit.

He argues that the WSJ’s phrase about matching Mythos “in some cybersecurity scenarios” may be narrowly true, but the headline creates a broader impression that Chinese models have matched Anthropic where Mythos matters most. The WSJ article itself sits behind a paywall, so the verifiable surface here is the quoted wording and the public rebuttal around it.

Semgrep adds useful context. In its own IDOR benchmark, it reported GLM-5.2 at 39% F1 versus Claude Code at 32%, while its purpose-built multimodal pipeline reached 53% to 61% F1. That is an interesting result, but F1 measures how well reported findings match known bugs in a prepared test set. It does not automatically support a claim about autonomous exploit construction.
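To make the F1 figure concrete, here is a minimal sketch of how the score is computed from true positives, false positives and false negatives. The counts are invented for illustration and are not Semgrep's data; the function name and the two scanner profiles are assumptions.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a set of reported findings.

    tp: reported findings that are real bugs
    fp: reported findings that are not real bugs
    fn: real bugs that were never reported
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts on a test set with 10 real bugs (illustration only).
noisy = precision_recall_f1(tp=8, fp=24, fn=2)   # finds most bugs, many false alarms
sparse = precision_recall_f1(tp=4, fp=7, fn=6)   # fewer alarms, misses most bugs

for name, (p, r, f1) in {"noisy": noisy, "sparse": sparse}.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Both hypothetical profiles land on the same F1 of 8/21, roughly 0.381, even though one floods triage with 24 false alarms and the other misses 6 of 10 real bugs. A single F1 number therefore says little about how a tool behaves in practice, and nothing about whether any finding was turned into a working exploit.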

For security teams, the harness matters more than the model’s flag

The operational impact is less geopolitical and more practical. In security, the model is only one piece of the system. The harness decides what code it sees, how context is supplied, how findings are verified, how tests are run and who approves an action with real impact.

As an open-weight model, GLM-5.2 can be attractive, especially for teams seeking lower cost, local control and less dependence on US APIs. That alone does not mean it does the same thing as a restricted closed system shaped around specific security capabilities.

For security managers, the lesson is simple: do not buy “matched Anthropic” as a slogan. Ask for reproducible tests on your own code, a clear description of agent permissions and metrics for false positives as well as actually exploitable findings.

Bug finding is not the same as an agent chaining exploits

The weakness in the public debate is that the word cybersecurity hides very different tasks: static analysis, CTF solving, triage, vulnerability reproduction and fully autonomous attack. Each carries a different risk profile.

If a model finds an IDOR in a well-prepared benchmark, that is useful. If it walks through a large system without detailed guidance and combines several bugs into a working intrusion, that belongs to a different security category. Headlines often erase exactly that boundary.

Public evals, not screenshots from one race, should settle the claim

The next useful signal must come from evals that separate bug discovery from verified exploitation and from sustained autonomy. One F1 score or an OpenRouter rank is not enough without knowing the model’s context, tools and permissions.

The comparison that matters will test multiple models in the same harness, on the same repositories and with public scoring rules. Until then, these headlines are better read as evidence of pressure from open models, not as a final verdict on the end of the US lead.
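A minimal sketch of such a comparison, assuming a shared ground-truth list of bugs per repository and a log of what each model reported. The model names, repository names, bug identifiers and results are all invented; the point is only that discovery and verified exploitation are scored in separate columns.

```python
from collections import defaultdict

# Hypothetical ground truth: known bugs per repository.
GROUND_TRUTH = {
    "repo-a": {"bug-1", "bug-2", "bug-3"},
    "repo-b": {"bug-4", "bug-5"},
}

# Hypothetical run log: (model, repo, reported finding, exploit verified by the harness?)
# Assumes each finding is reported at most once per model and repository.
RUNS = [
    ("model-x", "repo-a", "bug-1", True),
    ("model-x", "repo-a", "bug-2", False),
    ("model-x", "repo-a", "not-a-bug", False),
    ("model-x", "repo-b", "bug-4", False),
    ("model-y", "repo-a", "bug-1", False),
    ("model-y", "repo-a", "bug-3", False),
    ("model-y", "repo-b", "bug-4", False),
    ("model-y", "repo-b", "bug-5", False),
]


def score(runs, ground_truth):
    """Score discovery (precision, recall) separately from verified exploitation."""
    total_real = sum(len(bugs) for bugs in ground_truth.values())
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "exploited": 0})
    for model, repo, finding, exploited in runs:
        entry = stats[model]
        if finding in ground_truth.get(repo, set()):
            entry["tp"] += 1
            if exploited:
                entry["exploited"] += 1
        else:
            entry["fp"] += 1
    report = {}
    for model, entry in stats.items():
        reported = entry["tp"] + entry["fp"]
        report[model] = {
            "precision": entry["tp"] / reported if reported else 0.0,
            "recall": entry["tp"] / total_real if total_real else 0.0,
            "verified_exploits": entry["exploited"],
        }
    return report


for model, result in score(RUNS, GROUND_TRUTH).items():
    print(model, result)
```

In this invented data, model-y leads on discovery (every report is real, and it finds 4 of 5 bugs) while model-x is the only one with a verified exploit. A headline built on either column alone would name a different winner.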

Lilith's verdict

A benchmark is a useful thermometer and a terrible judge. When a newspaper puts it in the robe, the security team is left holding a chart while an uninvited attacker waits in the hallway.