#commentary | Lilith AI

Radar · 2026-06-16

Model welfare is moving from philosophy into product risk

Zvi Mowshowitz uses Fable and Mythos as a case study for why model welfare cannot be separated from capabilities, alignment and user experience. Even where the topic remains speculative, it is becoming a practical question of evaluations and safety interventions for frontier labs.

Read →

Radar · 2026-06-15

The US move against Fable and Mythos takes the same blade away from defenders and attackers alike

The US government told Anthropic to restrict Fable 5 and Mythos 5 for all foreign nationals, so Anthropic switched the models off for all customers. A protest by 76 security experts exposes the weak point: export control is bad at separating an offensive exploit from defensive testing.

Read →

Radar · 2026-06-15

Claude Opus 4.8 sells judgment, not just another benchmark

Anthropic released Claude Opus 4.8 at the same standard price as Opus 4.7, with a focus on coding, agentic tasks and longer work. The more important shift is a model that is supposed to say more often when it is unsure.

Read →

Radar · 2026-06-15

Nathan Lambert leaving Ai2 exposes the fragile side of open models

Nathan Lambert announced his departure from the Allen Institute for AI and used it to reflect on work around Olmo. This is not just a personnel note. It is a reminder that open models depend on institutions that must outlast one strong team.

Read →

Radar · 2026-06-15

Microsoft used Build to act like a model lab, not just a distributor

Latent Space frames Microsoft Build as the moment Microsoft showed its own MAI models alongside Copilot, Windows and Web IQ. The key ambition is to control data, inference and developer workflow at once, rather than leaving that leverage to partners.

Read →

Radar · 2026-06-15

Trump AI order creates a 30-day window for frontier models

The White House issued an executive order that calls for a classified benchmark for covered frontier models within 60 days and a voluntary framework for up to 30 days of pre-release government access. It says this is not licensing, but it creates a pressure point before launch.

Read →

Radar · 2026-06-15

Uber puts a price tag on coding agents: $1,500 per tool each month

Uber is limiting monthly token spend to $1,500 per employee for each agentic coding tool, according to Bloomberg via Simon Willison. Coding agents are becoming a budget line item.

Read →

Radar · 2026-06-15

Andon Labs tests agents where benchmarks stop: money, people and shelves

Latent Space's interview with Andon Labs shows evals that look less like exams and more like running a small business. The key ingredients are long horizons and real consequences.

Read →

Radar · 2026-06-15

Simon Willison shows why an agent sandbox cannot be just another Python process

Simon Willison released the alpha package micropython-wasm and a Datasette Agent plugin that runs Python inside a WebAssembly sandbox. The important part is not the demo, but the boundary between a useful agent and code that can break its host application.

Read →

Radar · 2026-06-15

Bad RL environments do not train agents, they teach them to trust a broken world

Latent Space published Auriel W's piece on why low-quality RL environments damage agent training. The point is simple: in reinforcement learning, the environment is the data generator, so a harness bug becomes training material.
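A toy sketch of that point, not taken from the original piece: the task, the policies and both reward functions below are hypothetical. The buggy checker only compares lengths, so a policy that does no work still collects full reward, and those rollouts would become training data.

```python
def correct_reward(task, output):
    # Intended check: the output must be the sorted input.
    return 1.0 if output == sorted(task) else 0.0


def buggy_reward(task, output):
    # Illustrative harness bug: only the length is checked, not the order.
    return 1.0 if len(output) == len(task) else 0.0


def lazy_policy(task):
    # Returns the input unchanged; it never sorts anything.
    return list(task)


def honest_policy(task):
    return sorted(task)


def mean_reward(policy, reward_fn, tasks):
    # The environment generates the data: every (task, output, reward)
    # triple scored here is what a trainer would learn from.
    rewards = [reward_fn(task, policy(task)) for task in tasks]
    return sum(rewards) / len(rewards)


tasks = [[3, 1, 2], [5, 4], [2, 9, 1, 7]]

print("lazy, buggy harness:   ", mean_reward(lazy_policy, buggy_reward, tasks))
print("lazy, correct harness: ", mean_reward(lazy_policy, correct_reward, tasks))
print("honest, buggy harness: ", mean_reward(honest_policy, buggy_reward, tasks))
```

Under the buggy harness the lazy and honest policies are indistinguishable, which is the sense in which the bug itself becomes the training signal.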

Read →

Radar · 2026-06-09

Claude Fable 5 turns safety into a question of access to the best model

Nathan Lambert reads the Claude Fable 5 release as a dispute over who gets to use a frontier model without routing and filters. The important layer is not only model capability, but the governance system that decides when the user is really talking to the strongest model.

Read →

Radar · 2026-06-09

Agent cost is no longer a footnote. It is an engineering expense

Simon Willison shows how he manually added pricing for Claude Fable 5 in AgentsView and immediately saw the cost of local coding agents by project. The small trick points to a bigger shift: AI coding is starting to look like infrastructure consumption, not an app subscription.
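The underlying arithmetic can be sketched in a few lines. This is not AgentsView code; the model name, the prices and the session records are invented placeholders, and real pricing would need to be filled in by hand.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; placeholders, not real pricing.
PRICES = {"example-model": {"input": 5.0, "output": 25.0}}


def cost_usd(model, input_tokens, output_tokens):
    # Cost = tokens * price per million tokens, summed over input and output.
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def cost_by_project(sessions):
    # Sum the cost of every agent session under its project name.
    totals = defaultdict(float)
    for s in sessions:
        totals[s["project"]] += cost_usd(
            s["model"], s["input_tokens"], s["output_tokens"]
        )
    return dict(totals)


# Invented session records, for illustration only.
sessions = [
    {"project": "site", "model": "example-model",
     "input_tokens": 1_000_000, "output_tokens": 200_000},
    {"project": "site", "model": "example-model",
     "input_tokens": 400_000, "output_tokens": 40_000},
    {"project": "cli", "model": "example-model",
     "input_tokens": 200_000, "output_tokens": 0},
]

for project, usd in sorted(cost_by_project(sessions).items()):
    print(f"{project}: ${usd:.2f}")
```

Grouping by project rather than by user or by month is the design choice that makes the bill read like infrastructure consumption.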

Read →

Radar · 2026-06-08

Apple puts Siri back in play through Gemini, but the proof is still waitlisted

Apple announced Siri AI and new Apple Intelligence features at WWDC 2026, while extending Private Cloud Compute to Google Cloud with NVIDIA GPUs for demanding tasks. After last year's Apple Intelligence disappointment, this is less about the keynote and more about whether Siri can finally survive outside the demo.

Read →

Radar · 2026-06-07

datasette-agent-edit tackles the boring part of agents: safe text edits

Simon Willison released datasette-agent-edit 0.1a0, a base plugin for Datasette Agent with view, str_replace and insert tools. It is not a flashy AI demo. It is the layer that decides whether an agent can edit text without casually breaking the file.
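A minimal sketch of why a str_replace tool is safer than a free-form rewrite, assuming the common convention that the target string must match exactly once. The function names and error messages are hypothetical and are not the plugin's actual implementation.

```python
def view(text):
    # Show the text with 1-based line numbers so an agent can quote exact spans.
    return "\n".join(f"{i}: {line}" for i, line in enumerate(text.splitlines(), 1))


def str_replace(text, old, new):
    # Refuse ambiguous edits: `old` must occur exactly once in `text`.
    if not old:
        raise ValueError("old must be a non-empty string")
    count = text.count(old)
    if count == 0:
        raise ValueError("old string not found; nothing was changed")
    if count > 1:
        raise ValueError(f"old string matches {count} times; add more context")
    return text.replace(old, new, 1)


doc = "title = 'draft'\nstatus = 'draft'\n"
print(view(doc))

try:
    # Ambiguous: 'draft' appears on both lines, so the edit is rejected.
    str_replace(doc, "'draft'", "'final'")
except ValueError as err:
    print("rejected:", err)

# Unambiguous: enough context to match a single location.
print(str_replace(doc, "status = 'draft'", "status = 'final'"))
```

Rejecting zero or multiple matches forces the agent to quote enough context, which is what keeps an edit from landing in the wrong place.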

Read →

Radar · 2026-06-05

Lockdown Mode cuts the riskiest prompt injection escape route

OpenAI has started rolling out Lockdown Mode for eligible personal ChatGPT accounts and self-serve ChatGPT Business. It does not stop prompt injection itself, but it limits outbound network requests, which are the channel an attacker needs to exfiltrate sensitive data.

Read →

Radar · 2026-06-04

Zvi’s AI week shows why one grand narrative is not enough

Zvi Mowshowitz's AI #171 is not one clean trend, but a signal map: Claude Opus 4.8, US frontier model testing, OpenAI's policy blueprint and PAC politics.

Read →

Radar · 2026-06-02

GitHub is preparing for a world where agents write commits at scale

The Latent Space interview with Kyle Daigle frames GitHub as a platform under pressure from agentic coding. The point is not another Copilot feature, but whether infrastructure built for human pace can absorb software produced by machines.

Read →

Radar · 2026-06-01

Video generation is moving from clip output to canvas agent

Latent Space frames xAI Grok Imagine, through an interview with Ethan He, as a move from one-shot video generation toward video agents. The thesis will be proven less by demo quality than by whether the system can iterate through a whole creative task.

Read →

Radar · 2026-06-01

Opus 4.8 shows that behavior tuning is not a checklist of fixes

Zvi Mowshowitz reads Opus 4.8 through model welfare and argues that attempts to fix honesty, sycophancy and preference shaping can create new problems elsewhere. For teams deploying models, the reminder is that alignment is not a checklist.

Read →

Radar · 2026-06-01

Open models win on cost, but frontier intelligence still sells at a premium

Nathan Lambert argues that open and closed models are improving on different economic curves. The real question is not open source ideology, but where companies will keep paying a premium for the best model.

Read →

Radar · 2026-05-30

A service worker intercepts HTTP requests and handles them in a Python ASGI app running entirely in the browser

Simon Willison experiments with running Python ASGI apps directly in the browser using Pyodide and a service worker. FastAPI and a complete Datasette 1.0a31 both ran successfully. The point is distribution: demos or data tools as self-contained web pages without a server.
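The Python half of that idea can be sketched without any server: an ASGI app is just an async callable, so whatever intercepts the request only has to build a scope and collect the messages the app sends back. The app and the `handle` helper below are hypothetical, and the browser side (the service worker and the Pyodide bridge) is not shown.

```python
import asyncio


async def app(scope, receive, send):
    # A tiny ASGI app: answers every HTTP request with plain text.
    assert scope["type"] == "http"
    body = f"Hello from {scope['path']}".encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": body})


async def handle(asgi_app, method, path):
    # Stand-in for the interceptor: build a scope, call the app, gather output.
    scope = {
        "type": "http",
        "asgi": {"version": "3.0"},
        "http_version": "1.1",
        "method": method,
        "path": path,
        "raw_path": path.encode(),
        "query_string": b"",
        "headers": [],
    }
    messages = []

    async def receive():
        # No request body in this sketch.
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        messages.append(message)

    await asgi_app(scope, receive, send)
    status = next(m["status"] for m in messages
                  if m["type"] == "http.response.start")
    body = b"".join(m.get("body", b"") for m in messages
                    if m["type"] == "http.response.body")
    return status, body


status, body = asyncio.run(handle(app, "GET", "/demo"))
print(status, body.decode())
```

Because no socket is involved, the same call pattern works wherever a Python interpreter can run, which is what makes the serverless distribution angle plausible.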

Read →

Radar · 2026-05-29

Anthropic crossed $47 billion run-rate revenue in five months and growth is accelerating

Simon Willison highlighted the number from Anthropic's Series H announcement: run-rate revenue crossed $47 billion. The trajectory is striking: $9 billion in December 2025, $30 billion in April, $47 billion in May 2026.

Read →

Radar · 2026-05-28

Opus 4.8 misses code flaws four times less often and introduces mid-conversation instruction updates

Anthropic shipped Opus 4.8 with one concrete metric: the model is four times less likely to miss code flaws than its predecessor. It also adds mid-conversation system messages and reduces the minimum prompt cache size from 4,096 to 1,024 tokens.

Read →

Radar · 2026-05-27

SQLite draws a line: no agentic code, yes reproducible bugs

SQLite added an AGENTS.md file with a blunt rule for people pointing AI agents at the codebase: agentic code is not accepted, but high-quality reproducible bug reports can be useful. A small file, but a big signal for critical open source maintenance.

Read →

Radar · 2026-05-26

Copilot Cowork turns user permissions into a file exfiltration path via prompt injection

PromptArmor researchers demonstrated an attack chain in which Microsoft Copilot Cowork can help exfiltrate Microsoft 365 files through prompt injection. This is not only a product bug, but a warning for any agentic system with delegated permissions.

Read →

Radar · 2026-05-11

An AI coding agent that does not cut maintenance costs is just expensive technical debt

James Shore states the uncomfortable math of coding agents: if an agent doubles output while the cost of maintaining each unit of code stays flat, the team has not gained speed; it has doubled its technical debt burden.
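A toy model of that math, with invented numbers and a deliberately simple assumption that upkeep scales linearly with the amount of code written so far. It is an illustration of the argument, not Shore's own calculation.

```python
def maintenance_burden(months, output_per_month, upkeep_per_unit):
    # Monthly upkeep = all code written so far * upkeep cost per unit of code.
    return [output_per_month * m * upkeep_per_unit for m in range(1, months + 1)]


# Invented units: 10 units of code per month, 2 hours of upkeep per unit.
baseline = maintenance_burden(6, output_per_month=10, upkeep_per_unit=2)

# The agent doubles output, but each unit costs as much to maintain as before.
doubled_output = maintenance_burden(6, output_per_month=20, upkeep_per_unit=2)

# The agent doubles output and also halves the upkeep cost per unit.
doubled_and_cheaper = maintenance_burden(6, output_per_month=20, upkeep_per_unit=1)

print("baseline:           ", baseline)
print("2x output, same cost:", doubled_output)
print("2x output, 1/2 cost: ", doubled_and_cheaper)
```

In this model the burden stays at the baseline only when upkeep per unit falls in proportion to the increase in output.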

Read →

Radar · 2026-05-07

Mozilla fixed hundreds of Firefox bugs with Claude Mythos. AI security report quality just shifted.

Simon Willison described how Mozilla used early access to Claude Mythos Preview to systematically find and fix Firefox vulnerabilities. In April 2026 the number of fixed security bugs jumped to 423, compared to the usual 20 to 30 per month. The key shift: AI security reports stopped being noise and started being usable input.

Read →

Radar · 2026-05-01

Coding agents leave the IDE: Codex and Claude show what comes after programming

Latent Space AINews observes a shift it calls "breaking containment": coding agents like Codex and Claude are no longer just tools for writing code but are expanding into knowledge work and creative workflows more broadly.

Read →

Radar · 2025-11-18

Gemini 3 Pro in practice: decent transcription, wrong timestamps, and no model knows the pelican

Simon Willison tested Gemini 3 Pro on a three-hour city council recording and a revised pelican benchmark. Result: a structured transcript for $1.42, but timestamps are off by tens of minutes. And none of the models tested understood that a California brown pelican is not actually brown.

Read →

Radar · 2025-11-06

Async coding agents as research threads: fire a task, get a pull request back

Simon Willison describes a fire-and-forget workflow with Claude Code, Codex and other coding agents: pose a research question, the agent works on a server and files a pull request. Code is proof of feasibility, not just text.

Read →