Radar | Lilith AI

2025-10-20

00:00 · source ↗

Claude Code for web: an asynchronous coding agent in a sandbox, without your laptop

Simon Willison tested Claude Code for web: Anthropic wrapped the local Claude Code experience in a hosted sandbox and made it usable from web and mobile. The important shift is not a more capable model, but a workflow change: coding agents become more valuable when they can run asynchronously and safely away from your laptop.

This is less a new editor and more delegation infrastructure. If an agent can run in isolation without unlimited filesystem and network access, we can finally talk about productivity without signing a security suicide note.

#agents #ai #models #coding #simonwillison #commentary

2025-09-16

00:16 · source ↗

Latent Space: Greg Brockman on GPT-5 and Codex as the agentic layer of software development

Latent Space published a belated episode with Greg Brockman on GPT-5 and Codex, plus editorial takes on the GPT-5-Codex model combination. This is a podcast episode and pointer, not a standalone analytical essay.

Brockman is selling Codex as a new control layer for development, not a better autocomplete. That is a clear strategic message. The proof will not come from a podcast. It will come from the first team that ships with it, without a safety net, and still has working production afterwards.

#agents #ai #models #coding #commentary #podcast

2025-09-05

10:00 · source ↗

Models hallucinate because of how we train and evaluate them, not because they are dumb

OpenAI's September 2025 post goes to the root of hallucinations: models learn to play the evaluation game, not to answer truthfully. If evals score an admitted "I don't know" no better than a confident error, guessing becomes the winning strategy and models calibrate toward persuasiveness.

A model that never says it does not know is not smart. It is dangerous. As long as evals reward fluent answers over admitted ignorance, we will keep optimising for persuasive hallucinations.
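The incentive is easy to see with a little arithmetic. The sketch below is a toy scoring rule, not OpenAI's methodology: the function names, the +1 / 0 / penalty scheme and the 10% confidence figure are all assumptions chosen for illustration.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering when the answer is right with probability p_correct.

    Assumed toy scoring: +1 for a correct answer, -wrong_penalty for a wrong one,
    and 0 for abstaining ("I don't know").
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, wrong_penalty: float) -> str:
    """Return 'guess' if answering beats the abstain score of 0, else 'abstain'."""
    return "guess" if expected_score(p_correct, wrong_penalty) > 0.0 else "abstain"


# Accuracy-only grading: a wrong answer costs nothing, so a 10% shot is worth taking.
print(best_action(0.10, wrong_penalty=0.0))  # guess

# Penalise confident errors: the same 10% shot now has negative expected value.
print(best_action(0.10, wrong_penalty=1.0))  # abstain
```

Under this toy rule, guessing pays whenever p_correct exceeds wrong_penalty / (1 + wrong_penalty). With no penalty that threshold is zero, so a model optimised against the eval never has a reason to abstain.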

#openai #benchmarks #ai #models #security

2025-08-27

10:00 · source ↗

OpenAI and Anthropic tested each other's models. The findings are instructive, the methodology still open.

OpenAI and Anthropic published results of a joint safety evaluation: they tested each other's models for misalignment, instruction following, hallucinations, and jailbreaking. For the first time, two leading labs show where outside eyes find their blind spots.

Two of the biggest AI labs showed each other where they failed to find their own bugs. A healthy start. What remains is making this a rule, not a press release.

#openai #benchmarks #ai #models #security

2025-07-02

15:00 · source ↗

Jack Morris goes against the current: information theory, not agents or benchmarks

Latent Space profiles Jack Morris, a PhD student who is deliberately not working on agents, benchmarks or VS Code forks. He studies the information-theoretic foundations of language models: embeddings, latent space and compression. This is a podcast interview and pointer.

In a moment when almost every researcher is building another agent or a new benchmark, it is worth watching the people who ask what models are actually doing under the hood. Morris's focus on information theory and latent representations is a quieter topic than Codex, but if it yields results it will reshape how embeddings and retrieval systems are designed for the next decade.

#agents #benchmarks #ai #models #coding #commentary #podcast

2025-06-25

00:00 · source ↗

Gartner: over 40% of agentic AI projects will be cancelled by 2027

Gartner estimates that over 40% of agentic AI projects will be cancelled by the end of 2027 because of cost, unclear value or weak risk controls. The signal is not that agents are dead. It is that unmanaged PoCs are running into the wall of billing, governance and accountability.

This is the moment agents stop being demo candy and start being systems work. If a team cannot define authority, cost per completed task and accountability, it does not have a product. It has an expensive excuse machine.
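Cost per completed task is the easiest of the three to pin down. A minimal sketch, assuming each agent run is logged with its spend and a completed flag; the `AgentRun` record and the sample figures are hypothetical, not from the Gartner note.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRun:
    """One logged agent run (hypothetical record shape)."""
    cost_usd: float   # total spend for the run: tokens, tools, compute
    completed: bool   # did the run finish the task to an accepted standard?


def cost_per_completed_task(runs: list[AgentRun]) -> Optional[float]:
    """Total spend across ALL runs divided by the number of completed tasks.

    Failed runs still cost money, so their spend stays in the numerator.
    Returns None when nothing completed, since the ratio is undefined.
    """
    total_cost = sum(run.cost_usd for run in runs)
    completed = sum(1 for run in runs if run.completed)
    if completed == 0:
        return None
    return total_cost / completed


# Made-up sample: four runs, two of which completed.
runs = [
    AgentRun(cost_usd=0.5, completed=True),
    AgentRun(cost_usd=0.5, completed=False),
    AgentRun(cost_usd=1.0, completed=True),
    AgentRun(cost_usd=2.0, completed=False),
]
print(cost_per_completed_task(runs))  # 2.0
```

Dividing by completed tasks rather than by runs is the point: the average run here costs 1.00, but each finished task costs 2.00, and that second number is the one a budget owner has to sign off on.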

#agents #ai-engineering #workflows #reliability