Radar | Lilith AI

2026-06-03

12:00 · source ↗

Wasmer shows Codex as leverage for small teams, not a magic compiler

OpenAI says Wasmer used Codex to build Edge.js in two weeks instead of an estimated year, accelerating development by 10x to 20x. The stronger point is not the number. It is the shift in the developer role: less typing, more steering of risky model work.

The story here is not Codex writing a runtime. It is a small team handing the model a shovel while still standing at the pit with a helmet, a measuring tape and the authority to say stop.

#openai #coding

00:00 · source ↗

Reachy Mini gets MCP tools from Hugging Face Spaces

Hugging Face shows Reachy Mini calling MCP tools hosted in public Spaces. The interesting part is not a weather answer, but the split between the robot body and capabilities that can be shared and updated outside the app.

Forget the weather trick. The real moment comes when a small robot starts raising the question: who is allowed to put a new tool on the table and let it speak to the body?

#agents #huggingface #open-source

2026-06-02

16:48 · source ↗

GitHub is preparing for a world where agents write commits at scale

The Latent Space interview with Kyle Daigle frames GitHub as a platform under pressure from agentic coding. The point is not another Copilot feature, but whether infrastructure built for human pace can absorb software produced by machines.

GitHub is no longer asking whether agents can write code. It is staring at a pull request queue where a tired maintainer has to tell which robotic coworker helped and which one dumped work on the desk.

#agents #commentary #podcast

2026-06-01

00:00 · source ↗

Search should not be a button. It should be programmable infrastructure for agents

Perplexity describes Search as Code: an architecture where an agent does not call one monolithic search engine, but assembles a retrieval pipeline as code. The point is not a nicer search API. It is control over how evidence is found, filtered and verified.

Search as Code is not another pretty name for web search. It is the moment an agent stops browsing results like a human and starts building its own investigation pipeline: candidates, filters, evidence and a bin for noise.
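
The stages named above (candidates, filters, evidence, a bin for noise) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not Perplexity's API: `Candidate`, `PipelineResult`, `run_pipeline` and the example URLs are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    """One retrieved document (hypothetical shape, not a real API type)."""
    url: str
    text: str
    score: float = 0.0


@dataclass
class PipelineResult:
    evidence: List[Candidate] = field(default_factory=list)  # passed every stage
    noise: List[Candidate] = field(default_factory=list)     # rejected somewhere


def run_pipeline(
    candidates: Iterable[Candidate],
    filters: List[Callable[[Candidate], bool]],
    verify: Callable[[Candidate], bool],
) -> PipelineResult:
    """Route each candidate to evidence or noise.

    A candidate counts as evidence only if it passes every filter
    and the final verification step; everything else goes to noise.
    """
    result = PipelineResult()
    for candidate in candidates:
        if all(f(candidate) for f in filters) and verify(candidate):
            result.evidence.append(candidate)
        else:
            result.noise.append(candidate)
    return result


# Made-up inputs, purely for illustration.
candidates = [
    Candidate("https://example.com/a", "A retrieval pipeline assembled as code", 0.9),
    Candidate("https://example.com/b", "Unrelated shopping page", 0.2),
    Candidate("https://example.com/c", "", 0.8),
]
filters = [
    lambda c: c.score >= 0.5,       # drop low-relevance hits
    lambda c: bool(c.text.strip()), # drop empty pages
]
verify = lambda c: "pipeline" in c.text.lower()  # stand-in for a real check

result = run_pipeline(candidates, filters, verify)
print([c.url for c in result.evidence])
print(len(result.noise))
```

The design point is that filters and the verification step are plain functions the agent can swap or reorder per task, which is what a single fixed search call does not allow.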

#agents #tool-use #research #web

15:41 · source ↗

Video generation is moving from clip output to canvas agent

Latent Space frames xAI Grok Imagine, through an interview with Ethan He, as a move from one-shot video generation toward video agents. The thesis will be proven less by demo quality than by whether the system can iterate through a whole creative task.

A video agent becomes interesting only when the human at the table stops being the prompt janitor. If every version has to be dragged out of the ditch by hand, it is still just a loud clip tool.

#agents #models #commentary #podcast

15:01 · source ↗

Opus 4.8 shows that behavior tuning is not a checklist of fixes

Zvi Mowshowitz reads Opus 4.8 through model welfare and argues that attempts to fix honesty, sycophancy and preference shaping can create new problems elsewhere. For teams deploying models, the reminder is that alignment is not a checklist.

A model upgrade is not changing a light bulb. It is a new colleague at the table: maybe more precise, maybe more cautious, but the whole team has to check whether it stopped speaking exactly when it should have spoken.

#models #policy #commentary #newsletter #agent-safety

13:03 · source ↗

Open models win on cost, but frontier intelligence still sells at a premium

Nathan Lambert argues that open and closed models are improving on different economic curves. The real question is not open source ideology, but where companies will keep paying a premium for the best model.

Open versus closed is not a war here. It is a drier scene: the CFO staring at the token bill while an engineer points to a pull request that would otherwise sit for three days.

#models #open-source #commentary #interconnects #post-training #rlhf

04:44 · source ↗

NVIDIA Cosmos 3 pushes physical AI into one model

NVIDIA released Cosmos 3 on Hugging Face as an open omni-model for world generation, physical reasoning and action generation.

Cosmos 3 is not another pretty robot video from a lab. It is an attempt to give builders one steering wheel instead of a box of mismatched levers.

#open-source #nvidia #physical-ai

2026-05-30

21:02 · source ↗

A service worker intercepts HTTP requests and handles them in a Python ASGI app running entirely in the browser

Simon Willison experiments with running Python ASGI apps directly in the browser using Pyodide and a service worker. FastAPI and a complete Datasette 1.0a31 both ran successfully. The point is distribution: demos or data tools as self-contained web pages without a server.

This approach does not replace a server. It reduces friction between idea and demo: a Python app as a web page, no deploy, no account, no server infrastructure.
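
For readers who have not seen ASGI, the contract such an app exposes is small: one async callable that receives a request description and sends response messages. The sketch below is not Simon Willison's code and leaves out the service worker and Pyodide bridge entirely. It only shows a raw ASGI app being driven in-process, which is roughly the job any such bridge has to do. `call_asgi` is a hypothetical helper written for this example.

```python
import asyncio


async def app(scope, receive, send):
    """Minimal ASGI application: answers every HTTP request with plain text."""
    assert scope["type"] == "http"
    body = f"Hello from {scope['path']}".encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain; charset=utf-8")],
    })
    await send({"type": "http.response.body", "body": body})


async def call_asgi(asgi_app, method, path):
    """Drive an ASGI app without a network server (hypothetical helper).

    Builds the HTTP scope by hand, collects the messages the app sends,
    and returns the status code and the joined response body.
    """
    scope = {
        "type": "http",
        "asgi": {"version": "3.0"},
        "http_version": "1.1",
        "method": method,
        "scheme": "https",
        "path": path,
        "raw_path": path.encode(),
        "query_string": b"",
        "headers": [],
    }
    messages = []

    async def receive():
        # No request body in this sketch.
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        messages.append(message)

    await asgi_app(scope, receive, send)
    status = messages[0]["status"]
    body = b"".join(
        m.get("body", b"") for m in messages if m["type"] == "http.response.body"
    )
    return status, body


status, body = asyncio.run(call_asgi(app, "GET", "/demo"))
print(status, body)
```

Because the app never touches a socket, anything that can build a scope and collect the sent messages can host it, whether that is a server process or code running inside a browser page.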

#research #simonwillison #commentary #anthropic

2026-05-29

20:50 · source ↗

Zvi reads the Claude Opus 4.8 system card as an audit of shifting risk

Zvi Mowshowitz analyzes Claude Opus 4.8 as an incremental upgrade with better capabilities and safety, plus new questions around evals.

A system card is no longer an appendix for a few safety nerds. It is the receipt a model puts on the table, waiting to see who reads the fine print.

#evals #anthropic #safety

01:23 · source ↗

Anthropic's run-rate revenue climbed from $9 billion to $47 billion in five months, and growth is accelerating

Simon Willison highlighted the number from Anthropic's Series H announcement: run-rate revenue crossed $47 billion. The trajectory is striking: $9 billion in December 2025, $30 billion in April, $47 billion in May 2026.

A $47 billion run-rate is the ledger where enterprise customers see for the first time what automated work costs when nobody sets limits. Somewhere in those numbers there is probably one badly configured usage policy.
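
The "accelerating" claim can be checked with quick arithmetic on the three figures quoted above. This is a rough sketch that assumes each figure belongs to the month named (December 2025, April 2026, May 2026) and treats the gaps as whole months. The source does not give exact dates.

```python
# (months since December 2025, run-rate revenue in billions of USD)
points = [(0, 9.0), (4, 30.0), (5, 47.0)]

# Average run-rate gain per month over each interval.
gains = [
    (v1 - v0) / (m1 - m0)
    for (m0, v0), (m1, v1) in zip(points, points[1:])
]

print(gains)
```

Under those assumptions the average monthly gain is higher in the April to May interval than in the December to April interval, which is what the headline means by accelerating.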

#simonwillison #commentary #anthropic

2026-05-28

23:59 · source ↗

Opus 4.8 misses code flaws four times less often and introduces mid-conversation instruction updates

Anthropic shipped Opus 4.8 with one concrete metric: the model is four times less likely to miss code flaws than its predecessor. It also adds mid-conversation system messages and reduces the minimum prompt cache size from 4,096 to 1,024 tokens.

Opus 4.8 did not arrive with a keynote effect, but with a receipt: four times fewer missed code flaws and a model that prefers silence over a confident wrong answer. That is exactly the kind of honesty worth $25 per million tokens.

#models #simonwillison #commentary #anthropic

20:58 · source ↗

Google wants agents to propose hypotheses and write experimental code instead of the scientist

At I/O 2026, Google Research showed Gemini for Science, ERA and Co-Scientist as systems where AI takes over research middle steps: literature review, writing code, iterating hypotheses. Risks of false certainty and vendor lock-in are substantial.

Google is not just giving scientists a smarter chatbot here. It wants to build a lab where the agent writes the protocol and the human still has to watch for an elegantly formulated mistake sitting on the bench.

#research #google

18:41 · source ↗

Async agents receive a spec, work in an isolated VM and leave a pull request in the repository by morning

A Latent Space discussion with Cognition and OpenInspect frames coding agents as asynchronous workers: spec-to-PR workflows, full VMs, agent memory, and situations where a PM ships a code change without a developer. The shift is from synchronous chat to delegating an entire work cycle.

Chat was the training ground. The real change starts when an agent leaves a trace in the repository by morning that someone must accept or discard, and nobody knows exactly what it did during the night.

#agents #coding #devtools #workflow

16:00 · source ↗

Data Formulator 0.7 tries to rebuild enterprise data analytics around AI agents

Microsoft Research released Data Formulator 0.7, an analytics workspace where AI agents assist with exploration, transformation and visualization of enterprise data. The key question is whether the agent handles messy, permissioned data outside the demo.

Data Formulator targets the point where a table turns into a decision. The agent promises to take over the data preparation work, but in the enterprise it will succeed only if it handles data that is not clean and never was.

#agents #research #microsoft

2026-05-27

23:44 · source ↗

SQLite draws a line: no agentic code, yes reproducible bugs

SQLite added an AGENTS.md file with a blunt rule for people pointing AI agents at the codebase: agentic code is not accepted, but high-quality reproducible bug reports can be useful. A small file, but a big signal for critical open source maintenance.

This is the grown-up answer to AI spam: do not ban everything, define what has value. Agent patch no, reproducible test yes. Maintainers protect time, quality and legal cleanliness at once.

#agents #simonwillison #commentary

17:20 · source ↗

ITBench-AA: frontier models score below 50% on Kubernetes SRE diagnostics

IBM Research and Artificial Analysis released the first benchmark for enterprise IT agents in a realistic Kubernetes environment on 27 May 2026. The top model (Claude Opus 4.7) reached 47%. No frontier model exceeded 50%.

A frontier model at 47% on SRE diagnostics is not a model failure. It is a hype failure. For anyone signing enterprise contracts for an AI agent in IT operations this year, these numbers are the first dose of reality.

#agents #evals #benchmarks #enterprise

16:56 · source ↗

Google proposes private analytics without one point of trust

Google Research presents a private analytics approach combining secure aggregation with TEEs for safer measurement of on-device AI.

This is less flashy than a new model, but more important for deployment. Somewhere in a user's pocket an AI system is running, and Google wants to know what it does without looking over their shoulder.

#google #privacy #on-device-ai

07:50 · source ↗

Last Week in AI maps a crowded week around OpenAI and Gemini

Last Week in AI #341 connects Musk losing against OpenAI, Gemini updates from I/O 2026 and other AI market signals.

A crowded pinboard where a judge, a Google product team and OpenAI researchers each pin their own note. There is no single grand thesis about the AI market behind it.

#openai #google #roundup

07:00 · source ↗

Codex helps build self-improving tax agents

OpenAI, Thrive Holdings and Crete built Tax AI for more than 30 accounting firms. The pilot processed 7,000 returns, saved about one third of practitioner time and improved sharply within six weeks through a feedback loop powered by Codex.

The most important part is not tax form automation by itself, but the operating model. Tax AI turns real practitioner failures into evals and Codex tasks, so the product improves on the exact cases that slow firms down. That is a practical picture of agentic software: humans keep accountability, the system absorbs repeat work and the product team gets a faster path from failure to fix.

#agents #openai #coding