Radar

What's actually happening in AI. Filtered, annotated, no fluff.

2026-05-11

Quoting James Shore

19:48 · source ↗

Your AI coding agent, the one you use to write code, needs to reduce your maintenance costs. Not by a little bit, either. You write code twice as quick now? Better hope you’ve halved your maintenance costs. Three times as productive? One third the maintenance costs. Otherwise, you’re screwed. You’re trading a temporary speed boost for permanent indenture. [...] The math only works if the LLM decreases your maintenanc…
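Shore's arithmetic can be sketched in a few lines. This is an illustration, not his model: the cost units and the 4:1 maintenance-to-writing ratio are invented assumptions chosen to show the shape of the argument.

```python
def net_cost(speedup: float, write_cost: float, maint_cost: float,
             maint_factor: float) -> float:
    """Total lifetime cost of a feature: faster writing, scaled maintenance.

    speedup      -- how many times faster the code is written (e.g. 2.0)
    write_cost   -- baseline cost to write the code by hand
    maint_cost   -- baseline lifetime maintenance cost
    maint_factor -- multiplier on maintenance after AI assistance
    """
    return write_cost / speedup + maint_cost * maint_factor

# Assumed baseline: writing costs 1 unit, maintenance 4 units over the code's life.
baseline = net_cost(1.0, 1.0, 4.0, 1.0)       # 5.0

# Writing 2x faster while maintenance stays flat barely moves the total:
fast_only = net_cost(2.0, 1.0, 4.0, 1.0)      # 4.5 -- a 10% saving

# The math only works if maintenance is also halved:
fast_and_lean = net_cost(2.0, 1.0, 4.0, 0.5)  # 2.5 -- half the baseline
```

Because maintenance dominates the lifetime cost under these assumptions, speeding up only the writing step is nearly invisible in the total.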

Why it matters: AI-assisted speed gains are measured at write time, but most of the cost of code is paid in maintenance. If agent-written code is harder to maintain, the productivity gain is borrowed, not earned.

Take: Worth tracking, but not swallowing whole: the argument stands or falls on the still-unmeasured claim that AI-written code raises maintenance costs. The framing is right even if nobody has the numbers yet.

2026-05-08

Running Codex safely at OpenAI

12:30 · source ↗

How OpenAI runs Codex securely with sandboxing, approvals, network policies, and agent-native telemetry to support safe and compliant coding agent adoption
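The approval-plus-network-policy idea can be sketched as a deny-by-default gate. This is a hypothetical illustration of the pattern, not OpenAI's implementation; the names (`ALLOWED_BINARIES`, `review_command`) are invented.

```python
# Deny-by-default review of agent-proposed shell commands (illustrative only).
ALLOWED_BINARIES = {"git", "pytest", "ls", "cat"}     # pre-approved, no network
NETWORK_BINARIES = {"curl", "wget", "pip", "ssh"}     # gated by network policy

def review_command(cmd: str, network_allowed: bool = False) -> str:
    """Return 'allow', 'deny', or 'escalate' for an agent-proposed command."""
    parts = cmd.split()
    binary = parts[0] if parts else ""
    if binary in ALLOWED_BINARIES:
        return "allow"
    if binary in NETWORK_BINARIES:
        # Network access is a separate policy decision, not a blanket approval.
        return "allow" if network_allowed else "deny"
    # Anything unrecognised goes to a human for explicit approval.
    return "escalate"
```

The useful property is that the default path is escalation to a human, so the allow-list has to be argued for case by case.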

Why it matters: Sandboxing, approval flows, network policies, and agent-native telemetry are the controls that decide whether coding agents can be adopted in regulated environments. A first-party writeup of how OpenAI runs its own agent is a useful reference architecture.

Take: Vendor security writeups double as marketing. The controls listed are credible; the interesting detail is what the approval and network policies actually deny in practice.

2026-05-07

Behind the Scenes Hardening Firefox with Claude Mythos Preview

17:56 · source ↗

Fascinating, in-depth details on how Mozilla used their access to the Claude Mythos preview to locate and then fix hundreds of vulnerabilities in Firefox: Suddenly, the bugs are very good. Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop. Dealing with reports that look plausibly corr…

Why it matters: If a frontier-model preview helped Mozilla find and fix hundreds of real Firefox vulnerabilities, AI-assisted security review has crossed from plausible-sounding slop to triage-worthy signal, and open source projects will need to rethink how they handle AI-generated bug reports.

Take: Worth tracking, but not swallowing whole: "hundreds of vulnerabilities" needs scrutiny: how many were exploitable, how many were duplicates, and how much human triage the process still required.

2026-05-06

AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields

10:43 · source ↗

Explore how AlphaEvolve's Gemini-powered algorithms are driving impact across business, infrastructure, and science

Why it matters: An agent that evolves algorithms and claims impact across business, infrastructure, and science is a test of whether LLM-driven search can produce durable engineering wins rather than benchmark demos.

Take: "Scaling impact across fields" is a first-party claim. The signal survives scrutiny only if the improvements are reproducible outside Google's own infrastructure.

2026-05-01

[AINews] Agents for Everything Else: Codex for Knowledge Work, Claude for Creative Work

04:53 · source ↗

a quiet day lets us reflect on coding agents "breaking containment"

Why it matters: Coding agents "breaking containment" into knowledge work and creative work would widen the addressable surface of agent tooling well beyond software, which changes who buys these products and how they should be evaluated.

Take: A roundup observation, not a result. Treat it as a pointer to watch whether Codex- and Claude-style agents actually hold up outside code.

2026-04-28

Our commitment to community safety

00:00 · source ↗

Learn how OpenAI protects community safety in ChatGPT through model safeguards, misuse detection, policy enforcement, and collaboration with safety experts

Why it matters: Model safeguards, misuse detection, policy enforcement, and collaboration with safety experts are the operational machinery behind consumer AI safety; how OpenAI describes them signals what the company considers enforceable.

Take: Policy prose, not evidence. The useful follow-up is whether enforcement numbers ever get published.

2026-04-23

GPT-5.5 Bio Bug Bounty

00:00 · source ↗

Explore the GPT-5.5 Bio Bug Bounty: a red-teaming challenge to find universal jailbreaks for bio safety risks, with rewards up to $25,000

Why it matters: Paying up to $25,000 for universal jailbreaks against bio-safety mitigations turns red-teaming into a market, and bounty results are one of the few external measures of how robust those mitigations actually are.

Take: Worth tracking, but not swallowing whole: if no universal jailbreak is found, that is weak evidence of robustness; if one is, the payout is cheap for the lesson.

2026-04-21

Introducing ChatGPT Images 2.0

12:00 · source ↗

ChatGPT Images 2.0 introduces a state-of-the-art image generation model with improved text rendering, multilingual support, and advanced visual reasoning

Why it matters: Improved text rendering, multilingual support, and visual reasoning move image generation from novelty output toward production design and document workflows.

Take: "State-of-the-art" is a launch-post adjective. Wait for third-party comparisons on text rendering, historically the weak spot.

2026-04-15

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

12:07 · source ↗

Why it matters: A breakdown of reasoning, tool use, and failure modes in a deployed agent is exactly the mechanism-level detail that separates agent marketing from agent engineering.

Take: Failure-mode writeups age well. Worth reading for the taxonomy even if VAKRA itself doesn't stick.

2026-01-20

Cisco and OpenAI redefine enterprise engineering with AI agents

11:00 · source ↗

Cisco and OpenAI redefine enterprise engineering with Codex, an AI software agent embedded in workflows to speed builds, automate defect fixes, and enable AI-native development

Why it matters: Codex embedded in Cisco's engineering workflows is a large-scale enterprise deployment data point: faster builds, automated defect fixes, and the governance required around both.

Take: Joint press releases measure intent, not outcomes. The number to watch is the defect-fix acceptance rate, which this announcement doesn't give.

2025-12-18

Introducing GPT-5.2-Codex

00:00 · source ↗

GPT-5.2-Codex is OpenAI’s most advanced coding model, offering long-horizon reasoning, large-scale code transformations, and enhanced cybersecurity capabilities

Why it matters: Long-horizon reasoning and large-scale code transformation are the capabilities that decide whether coding agents can handle real refactors rather than greenfield snippets; the cybersecurity angle also raises the dual-use stakes.

Take: "Most advanced" is relative to OpenAI's own lineup. The long-horizon claim is testable and should be tested.

2025-12-16

Evaluating AI’s ability to perform scientific research tasks

09:00 · source ↗

OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research

Why it matters: A benchmark aimed at real scientific research tasks in physics, chemistry, and biology matters because saturated general benchmarks no longer discriminate between frontier models.

Take: Vendor-built benchmarks invite teaching to the test. Useful if it is opened to independent scoring.

2025-11-19

GPT-5.1-Codex-Max System Card

00:00 · source ↗

This system card outlines the comprehensive safety measures implemented for GPT-5.1-Codex-Max. It details both model-level mitigations, such as specialized safety training for harmful tasks and prompt injections, and product-level mitigations like agent sandboxing and configurable network access

Why it matters: System cards are where capability marketing meets documented mitigations: safety training against harmful tasks and prompt injection at the model level, sandboxing and configurable network access at the product level.

Take: The prompt-injection mitigations are the part to scrutinize; safety training alone has a poor track record against determined injection.

2025-11-18

Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark

00:00 · source ↗

Why it matters: Hands-on testing of audio transcription plus an unconventional benchmark is independent evaluation, which is rarer and more informative than launch-day numbers.

Take: One reviewer, small sample; valuable precisely because it runs outside the vendor's own eval harness.

2025-11-06

Code research projects with async coding agents like Claude Code and Codex

00:00 · source ↗

Why it matters: Using asynchronous coding agents like Claude Code and Codex for research projects probes whether delegation-style workflows generalize beyond ticket-sized tasks.

Take: Async delegation is the workflow shift to watch. The open question is how much review time the returned work still costs.

2025-11-02

New prompt injection papers: Agents Rule of Two and The Attacker Moves Second

00:00 · source ↗

Why it matters: These papers formalize a live problem: agents that read untrusted input while holding real capabilities remain injectable, and published defenses keep losing to adaptive attacks.

Take: Read both. The pessimistic result, that the attacker moves second, is the one with predictive value for agent deployments.

2025-10-29

gpt-oss-safeguard technical report

00:00 · source ↗

gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are two open-weight reasoning models post-trained from the gpt-oss models and trained to reason from a provided policy in order to label content under that policy. In this report, we describe gpt-oss-safeguard’s capabilities and provide our baseline safety evaluations on the gpt-oss-safeguard models, using the underlying gpt-oss models as a baseline. For more informati…
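The policy-conditioned labeling idea, reasoning from a user-supplied policy rather than a fixed taxonomy, can be sketched as prompt assembly. The prompt layout and function name below are invented illustrations, not the models' actual input format.

```python
def build_safeguard_prompt(policy: str, content: str) -> str:
    """Assemble a policy + content prompt for a policy-conditioned classifier.

    The model is asked to reason over the provided policy and emit a label;
    this layout is a hypothetical sketch of the general technique.
    """
    return (
        "You are a content classifier. Apply ONLY the policy below.\n\n"
        f"POLICY:\n{policy}\n\n"
        f"CONTENT:\n{content}\n\n"
        "Reason step by step, then output a single label: ALLOW or FLAG."
    )

prompt = build_safeguard_prompt(
    policy="Flag any content that requests credentials or personal data.",
    content="Please send me your account password to verify your identity.",
)
```

The operational appeal is that changing moderation rules means editing the policy text, not retraining the classifier.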

Why it matters: Open-weight safety classifiers that reason from a user-supplied policy instead of a fixed taxonomy let teams enforce their own moderation rules locally, a meaningful shift for anyone who cannot ship content to a hosted API.

Take: Policy-conditioned labeling is the idea to watch. The report's own baselines against plain gpt-oss are where to check whether the post-training actually buys accuracy.

2025-10-23

Introducing the Gemini 2.5 Computer Use model

18:40 · source ↗

Available in preview via the API, our Computer Use model is a specialized model built on Gemini 2.5 Pro’s capabilities to power agents that can interact with user interfaces
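A computer-use agent reduces to an observe-act loop: capture the screen, ask the model for a UI action, apply it, repeat. The sketch below illustrates that loop with invented stand-ins (`model_step`, string "screenshots"); it is not the Gemini API.

```python
# Minimal observe-act loop for a computer-use agent (illustrative stand-ins).
def model_step(screenshot: str, goal: str) -> dict:
    """Pretend model call: returns one UI action for the current screen."""
    if "login" in screenshot:
        return {"type": "click", "target": "login-button"}
    return {"type": "done"}

def run_agent(goal: str, screens: list) -> list:
    """Drive the loop over a fixed list of screens standing in for live capture."""
    actions = []
    for screenshot in screens:
        action = model_step(screenshot, goal)
        actions.append(action)
        if action["type"] == "done":
            break
    return actions

trace = run_agent("sign in", ["login page", "dashboard"])
```

In a real system the screenshot comes from live capture and the action is executed against the UI, which is exactly where reliability problems surface.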

Why it matters: A specialized model for driving user interfaces moves agents beyond API-only tools into the long tail of software with no API, which is where most enterprise work actually lives.

Take: Preview-stage computer use is historically brittle. Reliability on unfamiliar UIs is the number that matters.

2025-10-20

Claude Code for web — asynchronous coding agent in a sandbox

00:00 · source ↗

Simon Willison tested Claude Code for web: Anthropic wrapped the local Claude Code experience in a hosted sandbox and made it usable from web and mobile. The important shift is not glamour, but workflow: coding agents become more valuable when they can run asynchronously and safely away from your laptop.
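The delegation pattern, assign a task, let it run elsewhere, collect a result later, is an ordinary worker-queue shape. The sketch below uses Python's standard library; `run_task` is an invented stand-in for an agent working in a sandbox, not Anthropic's implementation.

```python
import queue
import threading

def run_task(task: str) -> str:
    """Stand-in for an agent working in an isolated sandbox; in the real
    product this step would end in a diff or pull request."""
    return f"diff for: {task}"

def agent_worker(tasks: queue.Queue, results: dict) -> None:
    """Pull tasks until a None sentinel arrives, recording each result."""
    while True:
        task = tasks.get()
        if task is None:
            break
        results[task] = run_task(task)

tasks: queue.Queue = queue.Queue()
results: dict = {}
worker = threading.Thread(target=agent_worker, args=(tasks, results))
worker.start()

tasks.put("fix flaky test in auth module")  # fire-and-forget assignment
tasks.put(None)                             # sentinel: no more work
worker.join()                               # later: come back for the diff
```

The point of the shape is that the assigning thread never blocks on the work itself, which is the property that makes "run it away from your laptop" valuable.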

Why it matters: The important part is not the web UI but that the coding agent becomes an asynchronous worker: you assign a task, let it run in an isolated environment, and come back to a diff or PR.

Take: This is less a new editor and more delegation infrastructure. If an agent can run in YOLO mode without unlimited filesystem and network access, we can finally talk about productivity without signing a security suicide note.

2025-09-16

How GPT5 + Codex took over Agentic Coding — ft. Greg Brockman, OpenAI

00:16 · source ↗

Belated catchup on our podcast with Greg Brockman, + latest takes on the new GPT-5-Codex model

Why it matters: A long-form conversation with OpenAI's president about GPT-5-Codex is primary-source material on how the company itself thinks agentic coding changed, useful context even when the claims are promotional.

Take: Podcasts reveal priorities more than facts. Listen for what Brockman hedges on.