Opus 4.8 misses code flaws four times less often and introduces mid-conversation instruction updates | Radar

Anthropic shipped Opus 4.8 with one concrete metric attached: the model is four times less likely to miss code flaws than version 4.7. The deliberately restrained framing in the announcement, "modest but tangible improvement", is honest precisely because it does not paper over limits with superlatives. The improvement comes mainly through abstaining: the model prefers to decline an uncertain question rather than produce a confidently wrong answer.

Four times fewer missed code flaws and mid-conversation instruction updates

Opus 4.8 brings three specific changes. First, the model overlooks code flaws four times less often than 4.7. Second, it supports mid-conversation system messages: instructions can be updated partway through a conversation without losing prompt cache efficiency. For agent loops this is practically useful, because an agent can receive updated instructions without restarting the conversation. Third, the minimum prompt cache size drops from 4,096 to 1,024 tokens, which lowers the barrier to caching for shorter conversations.
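To make the third change concrete, here is a minimal sketch that treats the minimum cache size as a simple token cutoff, which is an assumption based on the figures above. The function name and the model labels are hypothetical illustrations, not Anthropic API identifiers.

```python
# Minimum cacheable prompt sizes in tokens, as quoted in the article.
# The dictionary keys are illustrative labels, not real API model IDs.
MIN_CACHE_TOKENS = {"opus-4.7": 4096, "opus-4.8": 1024}


def is_cacheable(prompt_tokens: int, model: str) -> bool:
    """Return True if a prompt of this size meets the model's minimum cache size.

    Assumption: eligibility is a plain ">= minimum" check on token count.
    """
    return prompt_tokens >= MIN_CACHE_TOKENS[model]


if __name__ == "__main__":
    # Compare eligibility for a few prompt sizes under both thresholds.
    for tokens in (800, 1500, 5000):
        print(tokens, is_cacheable(tokens, "opus-4.7"), is_cacheable(tokens, "opus-4.8"))
```

Under this assumption, a 1,500-token prompt falls below the old threshold but clears the new one, which is the range where shorter conversations would start to benefit.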

Pricing is unchanged at $5 per million input tokens and $25 per million output tokens. Context window is 1 million tokens, maximum output is 128,000 tokens.
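As a rough illustration of what those prices mean per request, the sketch below turns the quoted figures into a cost estimate. The function name is hypothetical, and the calculation deliberately ignores any prompt cache discounts or other billing details, which the figures above do not cover.

```python
# Prices and limits as quoted in the article.
INPUT_USD_PER_MTOK = 5.00     # dollars per million input tokens
OUTPUT_USD_PER_MTOK = 25.00   # dollars per million output tokens
MAX_CONTEXT_TOKENS = 1_000_000
MAX_OUTPUT_TOKENS = 128_000


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the list-price cost of one request in US dollars.

    Assumption: flat per-token pricing with no cache discounts.
    """
    if input_tokens > MAX_CONTEXT_TOKENS:
        raise ValueError("input exceeds the 1M-token context window")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("output exceeds the 128K-token maximum")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # A 200K-token code review prompt with a 4K-token answer.
    print(f"${request_cost_usd(200_000, 4_000):.2f}")
```

With output priced at five times the input rate, long answers weigh on the bill much more than long prompts do.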

For developers, the change is who gets the confidently wrong answer

Simon Willison's llm-anthropic plugin added Opus 4.8 support on day one with its 0.25.1 update, which also brought a fast mode option for organizations that have it enabled and updated default max_tokens values per model.

Willison's angle is practical: a new model matters when it can be quickly tested in tools developers already use. Support in the LLM CLI shows how a model moves from announcement to real use, through integration layers, scripts and repeatable experiments.

An incremental release without personal verification is still a promise, not a result

"Modest but tangible" means this is probably not a dramatic leap. The code-flaw metric is concrete, but worth validating on your own tasks. The best test will not be a marketing chart, but the same work where Opus 4.7 hit limits.

Anthropic's restrained language is refreshing for the AI industry. The risk is that buyers start treating it as proof of excellence rather than as a promise to verify.
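Validating the code-flaw metric on your own tasks can be as simple as scoring two runs over the same labeled set. The sketch below is one possible way to do that, not Anthropic's methodology: the ReviewResult type, the function names, and the toy data are all hypothetical, and it assumes an abstention is tracked separately rather than counted as a miss.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewResult:
    flaw_present: bool       # ground truth: the snippet really contains a flaw
    flagged: bool            # the model reported a flaw
    abstained: bool = False  # the model declined to give a verdict


def miss_rate(results: List[ReviewResult]) -> float:
    """Share of truly flawed cases the model answered without flagging a flaw."""
    flawed = [r for r in results if r.flaw_present]
    if not flawed:
        raise ValueError("need at least one case with a real flaw")
    missed = sum(1 for r in flawed if not r.flagged and not r.abstained)
    return missed / len(flawed)


def abstention_rate(results: List[ReviewResult]) -> float:
    """Share of all cases where the model declined to answer."""
    if not results:
        raise ValueError("need at least one case")
    return sum(1 for r in results if r.abstained) / len(results)


if __name__ == "__main__":
    # Toy data for illustration only, not measured results.
    old_run = ([ReviewResult(True, True)] * 4 + [ReviewResult(True, False)] * 4
               + [ReviewResult(False, False)] * 2)
    new_run = ([ReviewResult(True, True)] * 5 + [ReviewResult(True, False)]
               + [ReviewResult(True, False, abstained=True)] * 2
               + [ReviewResult(False, False)] * 2)
    print("old miss rate:", miss_rate(old_run))
    print("new miss rate:", miss_rate(new_run))
    print("new abstention rate:", abstention_rate(new_run))
```

Reporting the abstention rate next to the miss rate matters here, because a model can lower its miss rate simply by declining more often. Whether that trade is acceptable depends on the task.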

The signal will be how many teams upgrade to 4.8 from experience, not from the announcement

Watch whether improvements show up in long-context work, coding tasks and answer stability. It also matters how quickly mid-conversation system messages get adopted in agent loop tooling and whether fast mode becomes a standard team option.

The real signal will be in production data: if teams report lower hallucination rates on their own tasks, the model delivered.

Lilith's verdict

Opus 4.8 did not arrive with a keynote effect, but with a receipt: four times fewer missed code flaws and a model that prefers silence to a confidently wrong answer. That is exactly the kind of honesty worth $25 per million output tokens.