SocialReasoning-Bench: the agent completes the task but fails to improve the user's position
Microsoft Research describes SocialReasoning-Bench, a benchmark that tests whether AI agents genuinely act in the user's best interest. Key finding: agents technically complete their tasks but do not consistently improve outcomes for the user, even when explicitly instructed to.
An agent that can click is not yet a user advocate. The real test starts when someone needs a better contract, not just a neatly completed form.