ITBench-AA: frontier models score below 50 % on Kubernetes SRE diagnostics | Radar

ITBench-AA is the first benchmark that tests frontier models on enterprise IT tasks in a realistic environment. The result is uncomfortably specific: the top model, Claude Opus 4.7, reached 47 %. GPT-5.5 scored 46 %. No frontier model exceeded 50 %.

Kubernetes incident diagnostics under a strict scoring rule: nobody crossed the halfway mark

IBM Research released the benchmark in collaboration with Artificial Analysis on 27 May 2026. It contains 59 SRE (Site Reliability Engineering) tasks: 40 public and 19 held-out. Tasks simulate Kubernetes incident diagnostics in a sandboxed environment. The agent receives a system snapshot, has shell access, reads logs, traces dependencies and must identify root-cause entities (Deployments, Services, Pods) without damaging the system. There is a 100-turn cap and each task is repeated 3 times.

Scoring is strict. If the agent misses even one true root cause, the score is 0. Otherwise the score is the precision of its answer (precision at full recall), so every wrongly named entity lowers it. Concrete results: Claude Opus 4.7 with Adaptive Reasoning 47 %, GPT-5.5 (xhigh) 46 %, Qwen3.7 Max 42 %, Gemini 3.5 Flash (high) 40 %, DeepSeek V4 Pro 38 %.
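To make the scoring rule concrete, here is a minimal sketch of "precision at full recall" for a single run. It is an illustration of the rule as described in this article, not the benchmark's actual scoring code. The function name `score_diagnosis` and the entity names are hypothetical.

```python
def score_diagnosis(predicted, true_root_causes):
    """Precision at full recall for one run (illustrative sketch only).

    Returns 0.0 if any true root cause is missing from the prediction;
    otherwise returns the fraction of predicted entities that are correct.
    """
    predicted = set(predicted)
    truth = set(true_root_causes)
    if not truth:
        raise ValueError("a task needs at least one true root cause")

    # Strict rule: a single missed root cause zeroes the score.
    if not truth.issubset(predicted):
        return 0.0

    # Full recall reached: every correct prediction is a true root cause,
    # so precision is |truth| / |predicted|.
    return len(truth) / len(predicted)


# Hypothetical entity names, for illustration only.
truth = {"Deployment/checkout", "Service/payments"}

exact = score_diagnosis({"Deployment/checkout", "Service/payments"}, truth)
padded = score_diagnosis(
    {"Deployment/checkout", "Service/payments", "Pod/cart-0", "Pod/cart-1"}, truth
)
missed = score_diagnosis({"Deployment/checkout", "Pod/cart-0"}, truth)
```

Under this reading, an agent cannot protect itself by listing every suspicious entity: padding the answer halves the score in the example above, and leaving out one true cause drops it to zero.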

For enterprise IT adoption this sends a clear message: the gap is real

Companies treat agents as candidates for IT support, configuration, incident management and routine administration. This domain does not tolerate creative mistakes. A wrong step in Kubernetes can change permissions, break configuration or create a security incident.

ITBench-AA moves the debate from general impressions to operational capability. The results say that the gap between a demo agent and a reliable enterprise agent in the SRE context is real and still wide. A product that sounds smart in chat can still fail on the details of enterprise workflow.

ITBench-AA measures SRE diagnostics, not the full breadth of enterprise IT

ITBench-AA measures a specific type of task: Kubernetes SRE diagnostics. It does not address how agents handle other enterprise IT domains such as ITSM, IAM or network configuration.

One thing the benchmark does not yet measure precisely is process safety. In enterprise IT a badly executed action is worse than no action. If an agent correctly identifies the root cause but changes an unrelated configuration along the way, the score does not capture that. The benchmark measures the final result, not the cleanliness of the process.

The real breakthrough will come when model labs cite this benchmark in release notes

The first signal to watch: whether ITBench-AA or similar operational benchmarks start appearing as targets in the release notes of model labs and agent platforms. If such a benchmark becomes part of the standard eval stack, it will pressure teams to improve tool use, audit logging and sandboxing.

The second signal is progress from specialised agents. Enterprise IT may be less about the largest model and more about the right environment, permissions and safe operating procedures.

Lilith's verdict

A frontier model at 47 % on SRE diagnostics is not a model failure. It is a hype failure. For anyone signing enterprise contracts for an AI agent in IT operations this year, these numbers are the first dose of reality.