#benchmarks | Lilith AI

From Radar

Radar · 2026-06-15

Claude Opus 4.8 sells judgment, not just another benchmark

Anthropic released Claude Opus 4.8 at the same standard price as Opus 4.7, with a focus on coding, agentic tasks and longer work. The more important shift is a model that is meant to say so more often when it is unsure.

Radar · 2026-05-27

ITBench-AA: frontier models score below 50% on Kubernetes SRE diagnostics

IBM Research and Artificial Analysis released the first benchmark for enterprise IT agents in a realistic Kubernetes environment on 27 May 2026. The top model (Claude Opus 4.7) reached 47%. No frontier model exceeded 50%.

Radar · 2026-05-06

SubQ review: great numbers, but still a test of benchmark faith

Fello AI reviews SubQ's claims: a 12M-token context window, prefill 52x faster than FlashAttention at 1M tokens, and frontier-class benchmark positioning. The numbers are striking enough to need independent verification before they change architecture decisions.

Radar · 2025-12-16

FrontierScience tests AI scientific reasoning, but a lab's own benchmark needs independent audit

OpenAI introduces FrontierScience: a benchmark for scientific reasoning tasks in physics, chemistry, and biology, focused on reasoning processes rather than factual recall.

Radar · 2025-11-18

Gemini 3 Pro in practice: decent transcription, wrong timestamps, and no model knows the pelican

Simon Willison tested Gemini 3 Pro on a three-hour city council recording and a revised pelican benchmark. Result: a structured transcript for $1.42, but timestamps are off by tens of minutes. And none of the models tested understood that a California brown pelican is not actually brown.

Radar · 2025-10-29

OpenAI opens policy-based content classification with open-weight safeguard models

OpenAI released gpt-oss-safeguard-120b and gpt-oss-safeguard-20b: open-weight reasoning models in which the content classification policy is not baked into the weights but supplied at runtime. Organizations bring their own rules; the model reasons over them.

Radar · 2025-09-05

Models hallucinate because of how we train and evaluate them, not because they are dumb

OpenAI's September 2025 post goes to the root of hallucinations: models learn to play the evaluation game, not to answer truthfully. If evals penalise admitted uncertainty more harshly than confident errors, models calibrate toward persuasiveness.

Radar · 2025-08-27

OpenAI and Anthropic tested each other's models. The findings are instructive; the methodology is still open.

OpenAI and Anthropic published results of a joint safety evaluation: they tested each other's models for misalignment, instruction following, hallucinations, and jailbreaking. For the first time, two leading labs show where outside eyes find their blind spots.

Radar · 2025-07-02

Jack Morris goes against the current: information theory, not agents or benchmarks

Latent Space profiles Jack Morris, a PhD student who is deliberately not working on agents, benchmarks or VS Code forks. He studies the information-theoretic foundations of language models: embeddings, latent space and compression. This item is a pointer to the podcast interview.

From the Glossary

Glossary

Evals and benchmarks — measurement instead of vibes

A benchmark is not truth carved in stone. It is an instrument with error bars. Without it, though, you are only guessing whether a model or agent works.

Glossary

Model reliability — when a pretty answer is not enough

Reliability is about when the model knows, when it does not, when it invents, and how often its output can be trusted in production. Elegant wording is not evidence.
