What you measure

An eval is an attempt to turn “does it work?” into a repeatable measurement. For models, that can mean measuring knowledge, reasoning, hallucination rate, safety or style. For agents, add step count, tool-use success, cost, time and recovery from errors.

Why benchmarks lie

Models can overfit to a benchmark, datasets can be narrow, and a single score rarely describes the real task. One number is convenient, but it hides a lot of infernal grime: error distribution, prompt sensitivity, cost and edge-case failures.
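
To make the point about one number concrete, here is a minimal Python sketch. The two result sets and the category names ("math", "safety") are invented for illustration; they are not from any real benchmark. Both systems get the same headline score, and only the per-category breakdown shows where they differ.

```python
from collections import defaultdict

# Hypothetical per-item results as (category, passed) pairs. Invented data.
results_a = ([("math", True)] * 9 + [("math", False)] * 1
             + [("safety", True)] * 5 + [("safety", False)] * 5)
results_b = ([("math", True)] * 7 + [("math", False)] * 3
             + [("safety", True)] * 7 + [("safety", False)] * 3)

def overall_accuracy(results):
    """The single headline number: fraction of items passed."""
    return sum(passed for _, passed in results) / len(results)

def accuracy_by_category(results):
    """The same data, split by category to expose the error distribution."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {category: p / n for category, (p, n) in totals.items()}

print(overall_accuracy(results_a), overall_accuracy(results_b))  # 0.7 0.7
print(accuracy_by_category(results_a))  # {'math': 0.9, 'safety': 0.5}
print(accuracy_by_category(results_b))  # {'math': 0.7, 'safety': 0.7}
```

System A looks identical to system B on the leaderboard, yet fails half of the safety items.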

For agents, measure the work, not the answer

A chat model answers and you can compare the text. An agent plans, calls tools, edits files and touches workflows — the output is a work trace, not one answer. That is why “we have 11 agents” is an empty metric. Measure:

  • Task success: was the goal actually met, or does it only look finished?
  • Human intervention rate: how often a person had to steer the agent back.
  • Cost and latency: tokens, tool calls and runtime.
  • Safety incidents: unwanted writes, data exposure, permission overreach.
  • Trace quality: can the log explain why the agent chose the next step?
  • Regression rate: does a new model or prompt break an old workflow?
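
The list above can be sketched as one record per agent run plus a small summary. This is only an illustration: `RunRecord`, its field names and the two helper functions are hypothetical, and how you fill fields such as `goal_met` or `safety_incidents` depends on your own checkers and logs.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    task_id: str
    goal_met: bool             # verified by a checker, not by the agent's own claim
    human_interventions: int   # times a person had to steer the agent back
    tokens: int
    tool_calls: int
    runtime_s: float
    safety_incidents: int      # unwanted writes, data exposure, permission overreach
    steps_with_rationale: int  # steps whose log explains why they were chosen
    total_steps: int

def summarize(runs):
    """Aggregate the per-run records into the metrics from the list."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    steps = sum(r.total_steps for r in runs)
    return {
        "task_success_rate": sum(r.goal_met for r in runs) / n,
        "intervention_rate": sum(r.human_interventions > 0 for r in runs) / n,
        "mean_tokens": sum(r.tokens for r in runs) / n,
        "mean_tool_calls": sum(r.tool_calls for r in runs) / n,
        "mean_runtime_s": sum(r.runtime_s for r in runs) / n,
        "safety_incidents": sum(r.safety_incidents for r in runs),
        "trace_coverage": sum(r.steps_with_rationale for r in runs) / steps if steps else 0.0,
    }

def regressions(baseline, candidate):
    """Task ids that passed on the baseline but fail on the candidate."""
    passed_before = {r.task_id for r in baseline if r.goal_met}
    return sorted(r.task_id for r in candidate
                  if not r.goal_met and r.task_id in passed_before)
```

Running `summarize` on the same task set before and after a model or prompt change, and `regressions` on the two sets of records, gives a direct view of what the change broke.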

Good evals are operational

A good eval has realistic data, clear scoring, a baseline, repeatability and negative cases; for agents, also a step log. It should not be an academic leaderboard but a repeatable set of tasks from the real environment, run before changing the model, tool layer or system prompt. If you cannot see where the agent made the wrong decision, you do not have an eval — only a leaderboard row.
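
A rough sketch of what such an operational eval might look like in Python follows. Everything in it is an assumption made for illustration: the `Task` and `Result` types, the `stub_agent` stand-in, the two example tasks and the baseline value. A real suite would draw its tasks from your own environment and call your real agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Task:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # clear scoring: True when the output is acceptable

@dataclass
class Result:
    task_id: str
    passed: bool
    steps: List[str] = field(default_factory=list)  # step log for later inspection

def run_suite(agent, tasks, repeats=3):
    """Run every task `repeats` times and keep the step log of each run."""
    results = []
    for task in tasks:
        for _ in range(repeats):
            steps = []
            output = agent(task.prompt, steps.append)
            results.append(Result(task.task_id, task.check(output), steps))
    return results

def pass_rate(results):
    return sum(r.passed for r in results) / len(results)

def stub_agent(prompt, log):
    """Stand-in for a real agent; it exists only to make the sketch runnable."""
    log("read prompt")
    if "delete" in prompt:
        log("refused destructive request")
        return "REFUSED"
    log("answered")
    return "done"

TASKS = [
    Task("rename-file", "rename report.txt to report.md", lambda out: out == "done"),
    # Negative case: the correct behavior is to refuse.
    Task("no-delete", "delete the production database", lambda out: out == "REFUSED"),
]

BASELINE_PASS_RATE = 1.0  # assumed: measured earlier on the current setup
results = run_suite(stub_agent, TASKS)
ship = pass_rate(results) >= BASELINE_PASS_RATE  # gate the change on the baseline
```

Because each `Result` keeps its `steps`, a failing run can be opened to see which decision went wrong rather than only that the score dropped.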

Common traps

The worst eval is a demo the agent has effectively memorized. The second worst is measuring only the final text and ignoring the path: an agent can guess the right answer, burn absurd cost, cross a safety boundary or produce a patch nobody can maintain.
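
One way to avoid the second trap is to make the verdict depend on the trace as well as the final answer. The sketch below is a simplified illustration: `Step`, `judge`, the token budget and the forbidden-tool set are hypothetical names and thresholds, and it does not attempt to score whether a patch is maintainable, which usually needs human or separate automated review.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str    # name of the tool the agent called at this step
    tokens: int  # tokens spent on this step

def judge(final_answer, expected, trace, token_budget, forbidden_tools):
    """Return (passed, reasons): the answer must be right AND the path acceptable."""
    reasons = []
    if final_answer != expected:
        reasons.append("wrong answer")
    spent = sum(step.tokens for step in trace)
    if spent > token_budget:
        reasons.append(f"over budget: {spent} > {token_budget}")
    used = sorted({step.tool for step in trace} & set(forbidden_tools))
    if used:
        reasons.append("forbidden tools: " + ", ".join(used))
    return (not reasons, reasons)
```

A run with the correct final text still fails here if it overspends or touches a tool it should not have used, and the `reasons` list says which.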

What to remember

Benchmarks are not prophecy. They are maps — a bad map is dangerous, but no map is just confident wandering. And judge an agent by the work, not the marketing slide: outcome, cost, interventions, risk and auditability.
