Lilith

From Radar

Radar · 2026-06-15

Claude Opus 4.8 sells judgment, not just another benchmark

Anthropic released Claude Opus 4.8 at the same standard price as Opus 4.7, with a focus on coding, agentic tasks and longer work. The more important shift is a model that is meant to say so more often when it is unsure.


Radar · 2026-05-27

ITBench-AA: frontier models score below 50% on Kubernetes SRE diagnostics

IBM Research and Artificial Analysis released the first benchmark for enterprise IT agents in a realistic Kubernetes environment on 27 May 2026. The top model (Claude Opus 4.7) reached 47%. No frontier model exceeded 50%.


Radar · 2026-05-06

SubQ review: great numbers, but still a test of benchmark faith

Fello AI reviews SubQ's claims: a 12M-token context window, 52x faster prefill than FlashAttention at 1M tokens, and frontier-class benchmark positioning. The numbers are striking enough to need independent verification before they change architecture decisions.


Radar · 2025-12-16

FrontierScience tests AI scientific reasoning, but a lab's own benchmark needs independent audit

OpenAI introduces FrontierScience: a benchmark for scientific reasoning tasks in physics, chemistry, and biology, focused on reasoning processes rather than factual recall.


Radar · 2025-11-18

Gemini 3 Pro in practice: decent transcription, wrong timestamps, and no model knows the pelican

Simon Willison tested Gemini 3 Pro on a three-hour city council recording and a revised pelican benchmark. Result: a structured transcript for $1.42, but timestamps are off by tens of minutes. And none of the models tested understood that a California brown pelican is not actually brown.


Radar · 2025-10-29

OpenAI opens policy-based content classification with open-weight safeguard models

OpenAI released gpt-oss-safeguard-120b and 20b: open-weight reasoning models where content classification policy is not baked into the weights but supplied at runtime. Organizations bring their own rules; the model reasons over them.


Radar · 2025-09-05

Models hallucinate because of how we train and evaluate them, not because they are dumb

OpenAI's September 2025 post goes to the root of hallucinations: models learn to play the evaluation game, not to answer truthfully. If evals penalise admitted uncertainty more harshly than confident errors, models calibrate toward persuasiveness.


Radar · 2025-08-27

OpenAI and Anthropic tested each other's models. The findings are instructive, the methodology still open.

OpenAI and Anthropic published results of a joint safety evaluation: they tested each other's models for misalignment, instruction following, hallucinations, and jailbreaking. For the first time, two leading labs show where outside eyes find their blind spots.


Radar · 2025-07-02

Jack Morris goes against the current: information theory, not agents or benchmarks

Latent Space profiles Jack Morris, a PhD student who is deliberately not working on agents, benchmarks or VS Code forks. He studies the information-theoretic foundations of language models: embeddings, latent space and compression. This entry is a pointer to the podcast interview.


From the Glossary