
Evals and benchmarks — measurement instead of vibes

A benchmark is not truth carved in stone. It is an instrument with error bars. Without it, though, you are only guessing whether a model or agent works.

What you measure

An eval is an attempt to turn “does it work?” into a repeatable measurement. For models that can mean knowledge, reasoning, hallucination, safety or style; for agents, add step count, tool-use success, cost, time and recovery from errors.
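The core of that repeatability is a fixed set of cases and a fixed grader. A minimal sketch, with hypothetical names throughout (`run_eval`, exact-match grading stand in for whatever your task needs):

```python
# Minimal eval harness sketch: `model` is any callable mapping prompt -> string.
def run_eval(model, cases):
    """Score a model on (prompt, expected) pairs with exact-match grading."""
    results = []
    for prompt, expected in cases:
        answer = model(prompt)
        results.append({
            "prompt": prompt,
            "expected": expected,
            "answer": answer,
            "correct": answer.strip() == expected.strip(),
        })
    score = sum(r["correct"] for r in results) / len(results)
    return score, results

# Usage: a fake "model" that always answers "4".
cases = [("2+2=", "4"), ("3+3=", "6")]
score, results = run_eval(lambda p: "4", cases)
print(score)  # 0.5 — one of two cases matched
```

Exact match is the crudest grader; the point is that the cases and the scoring rule are pinned down, so two runs disagree only because the model changed.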

Why benchmarks lie

Models can overfit to a benchmark, datasets can be narrow, and a single score rarely describes the real task. One number is convenient, but it hides a lot of grime underneath: the error distribution, prompt sensitivity, cost and edge-case failures.
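Even before any of that, a score on a finite test set is an estimate, not a constant. A quick way to see the error bars is bootstrap resampling over per-item outcomes — a sketch, assuming per-item correctness is stored as a list of 0/1 values:

```python
# A benchmark score is an estimate with error bars, not a constant.
# Bootstrap resampling over per-item outcomes yields a confidence interval.
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Approximate 95% CI for accuracy from a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# 70% accuracy on 50 items looks precise until you see the interval.
outcomes = [1] * 35 + [0] * 15
print(bootstrap_ci(outcomes))  # a wide interval around 0.7
```

On 50 items the interval spans well over ten accuracy points, which is why small leaderboard gaps between models are often noise.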

A good eval

A good eval has realistic data, clear scoring, a baseline, repeatability and negative cases. For agents it also needs a step log. If you cannot see where the agent made the wrong decision, you do not have an eval — only a leaderboard row.
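A step log does not need to be elaborate — it needs to answer "which step went wrong?" A sketch of one possible record shape, with all names (`Step`, `AgentRun`, `first_failure`) hypothetical:

```python
# Sketch of an agent eval record with a per-step log.
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str        # e.g. "tool_call:search"
    ok: bool           # did the step succeed
    cost_usd: float    # spend attributed to this step

@dataclass
class AgentRun:
    task_id: str
    steps: list = field(default_factory=list)
    final_correct: bool = False

    def log(self, action, ok, cost_usd=0.0):
        self.steps.append(Step(action, ok, cost_usd))

    def first_failure(self):
        """Locate where the run went wrong — the point of keeping a step log."""
        return next((i for i, s in enumerate(self.steps) if not s.ok), None)

run = AgentRun("book-flight-001")
run.log("tool_call:search", ok=True, cost_usd=0.002)
run.log("tool_call:book", ok=False, cost_usd=0.001)
print(run.first_failure())  # 1 — the booking step
```

With records like this you can aggregate failures by step type instead of staring at a single pass rate.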

What to remember

Benchmarks are not prophecy. They are maps. A bad map is dangerous, but having no map at all is just confident wandering.