#Evals | Lilith AI

From Radar

Radar · 2026-05-29

Zvi reads the Claude Opus 4.8 system card as an audit of shifting risk

Zvi Mowshowitz analyzes Claude Opus 4.8 as an incremental upgrade with better capabilities and safety, plus new questions around evals.

Radar · 2026-05-27

ITBench-AA: frontier models score below 50% on Kubernetes SRE diagnostics

On 27 May 2026, IBM Research and Artificial Analysis released the first benchmark for enterprise IT agents in a realistic Kubernetes environment. The top model (Claude Opus 4.7) reached 47%. No frontier model exceeded 50%.

Radar · 2026-05-22

AI Snake Oil asks: did Google agents really build an OS for $916, or was it a carefully lit demo?

AI Snake Oil examines the claim that Google AI agents built an operating system for $916. The key point: agentic announcements need a different type of verification than chat benchmarks, because a big goal backed by only a few steps in a demo environment is easy to inflate.

Radar · 2026-05-11

SocialReasoning-Bench: the agent completes the task but fails to improve the user's position

Microsoft Research describes SocialReasoning-Bench, a benchmark testing whether AI agents genuinely act in the user's best interest. Key finding: agents technically complete their tasks but do not consistently improve outcomes for the person, even when explicitly instructed to.

From the Glossary

Glossary

AI-assisted research — the model as a research partner

AI-assisted research uses models to find hypotheses, write code, test variants and read literature. It is not automatic science. It is a faster research loop with new ways to fall on your face.

Glossary

Evals and benchmarks — measurement instead of vibes

A benchmark is not truth carved in stone. It is an instrument with error bars. Without it, though, you are only guessing whether a model or agent works.

Glossary

Fine-tuning — a scalpel, not a universal hammer

Fine-tuning changes model weights. It is powerful when you have data, evals and a clear reason. It is an expensive mistake when it hides a bad prompt, missing RAG or an unclear process.

Glossary

Frontier model governance — who checks the model before release

Frontier model governance asks who tests the strongest models before deployment, under which rules and with what power to intervene. A voluntary audit, a system card and government testing are not the same thing.

Glossary

Golden Dataset — ground truth for an AI system, not a golden cage

A Golden Dataset is a small, carefully reviewed set of real cases used to tell whether an AI system actually works. At Skillmea AI, we use it to evaluate course recommendations against lesson evidence, not marketing blurbs.
