#Enterprise | Lilith AI

From Radar

Radar · 2026-05-27

ITBench-AA: frontier models score below 50% on Kubernetes SRE diagnostics

On 27 May 2026, IBM Research and Artificial Analysis released ITBench-AA, the first benchmark for enterprise IT agents set in a realistic Kubernetes environment. The top model (Claude Opus 4.7) reached 47%. No frontier model exceeded 50%.


Radar · 2026-05-11

SocialReasoning-Bench: the agent completes the task but fails to improve the user's position

Microsoft Research describes SocialReasoning-Bench, a benchmark that tests whether AI agents genuinely act in the user's best interest. Key finding: agents technically complete their tasks but do not consistently improve outcomes for the user, even when explicitly instructed to do so.
