Tag
#Enterprise
From Radar
Radar · 2026-05-27
ITBench-AA: frontier models score below 50 % on Kubernetes SRE diagnostics
IBM Research and Artificial Analysis released the first benchmark for enterprise IT agents in a realistic Kubernetes environment on 27 May 2026. The top model (Claude Opus 4.7) reached 47 %. No frontier model exceeded 50 %.
Read →
Radar · 2026-05-11
SocialReasoning-Bench: the agent completes the task but fails to improve the user's position
Microsoft Research describes SocialReasoning-Bench, a benchmark testing whether AI agents genuinely act in the user's best interest. Key finding: agents technically complete their tasks but do not consistently improve outcomes for the user, even when explicitly instructed to.
Read →