#AI | Lilith AI

Radar · 2026-05-14

AgentMail gives AI agents their own email inbox as a first-class identity

AgentMail provides real email inbox infrastructure for AI agents: inbox creation, sending, receiving, threads, attachments, webhooks, WebSockets, search, custom domains and MCP integration. The company announced a $6 million seed round led by General Catalyst, with Y Combinator participating.

Read →

Radar · 2026-05-13

“11 AI agents” is an empty metric

Simon Willison highlighted Boris Mann's point that saying "11 AI agents" is meaningless by itself. It says about as much as counting spreadsheets or browser tabs. The useful questions are outcomes, responsibility boundaries, workflow, handoff, observability, failure handling, permissions and human review.

Read →

Radar · 2026-05-11

CodexBar unifies limit tracking for 29 AI coding tools in one icon

CodexBar is an open-source macOS menu-bar app that unifies limit tracking, credits, reset windows, and incident status across 29 AI coding providers including Codex, Claude, Cursor, Gemini, Copilot and OpenRouter.

Read →

Radar · 2026-05-11

An AI coding agent that does not cut maintenance costs is just expensive technical debt

James Shore states the uncomfortable math of coding agents: if an agent doubles output but the cost of maintaining each unit of code stays flat, the team has not gained speed. It has doubled its maintenance burden, which is technical debt by another name.
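The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration, assuming a deliberately simple linear model in which ongoing maintenance effort equals code volume times a per-unit maintenance cost. The function name, the units and the numbers are hypothetical and are not taken from Shore's post.

```python
def maintenance_burden(code_units: float, cost_per_unit: float) -> float:
    """Ongoing maintenance effort under an assumed linear model:
    code volume multiplied by the cost of maintaining one unit."""
    return code_units * cost_per_unit


# Hypothetical baseline: 100 units of code, 1.0 effort per unit.
baseline = maintenance_burden(code_units=100, cost_per_unit=1.0)

# The agent doubles output, per-unit maintenance cost is unchanged.
doubled_output = maintenance_burden(code_units=200, cost_per_unit=1.0)

# The agent doubles output AND halves per-unit maintenance cost.
doubled_output_halved_cost = maintenance_burden(code_units=200, cost_per_unit=0.5)

print(baseline, doubled_output, doubled_output_halved_cost)
```

Under this assumed model, doubling output only leaves the burden where it started if the per-unit maintenance cost is halved at the same time.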

Read →

Radar · 2026-05-08

Codex gets a safety architecture, not just a README disclaimer

OpenAI details how Codex runs in isolated environments: per-repo sandboxes, network restrictions, approval gates, and agent-native telemetry for safe enterprise adoption.

Read →

Radar · 2026-05-07

Mozilla fixed hundreds of Firefox bugs with Claude Mythos. AI security report quality just shifted.

Simon Willison described how Mozilla used early access to Claude Mythos Preview to systematically find and fix Firefox vulnerabilities. In April 2026 the number of fixed security bugs jumped to 423, compared to the usual 20 to 30 per month. The key shift: AI security reports stopped being noise and started being usable input.

Read →

Radar · 2026-05-06

AlphaEvolve finds algorithms in days that teams spent months on, with production numbers

DeepMind introduced AlphaEvolve as a Gemini-powered evolutionary loop that automatically discovers better algorithms. Concrete production results: 30% fewer errors in genomics, 20% lower write amplification for Spanner, and doubled transformer training speed at Klarna.

Read →

Radar · 2026-05-06

SubQ review: great numbers, but still a test of benchmark faith

Fello AI reviews SubQ's claims: 12M token context window, 52x faster prefill than FlashAttention on 1M tokens and frontier-class benchmark positioning. The numbers are striking enough to need independent verification before they change architecture decisions.

Read →

Radar · 2026-05-05

Subquadratic raises $29M for 12M-token context windows

Subquadratic has launched with $29 million in seed funding and introduced SubQ, a model built on a subquadratic architecture and sparse attention to push context windows as high as 12 million tokens. The promise is longer context, higher speed, better accuracy and lower cost. The proof still needs independent benchmarks.

Read →

Radar · 2026-05-01

Coding agents leave the IDE: Codex and Claude show what comes after programming

Latent Space AINews observes a shift they call "breaking containment": coding agents like Codex and Claude are no longer just tools for writing code but are expanding into knowledge work and creative workflows broadly.

Read →

Radar · 2026-04-28

OpenAI layers ChatGPT safety from model to abuse detection, but the numbers are missing

OpenAI outlines its layered approach to ChatGPT community safety: model safeguards, abuse detection, policy enforcement, and collaboration with external safety experts.

Read →

Radar · 2026-04-23

OpenAI pays up to $25,000 for bio jailbreaks in GPT-5.5, but the proof will be in the aggregate results

OpenAI launches a bio bug bounty targeting universal jailbreaks in GPT-5.5, with rewards up to $25,000 for critical biological safety findings.

Read →

Radar · 2026-04-21

ChatGPT Images 2.0 finally handles text in graphics, but production needs independent testing

ChatGPT Images 2.0 brings improved image generation focused on text accuracy in graphics, multilingual support, and advanced visual reasoning for production workflows.

Read →

Radar · 2026-04-15

VAKRA benchmark reveals where agents actually fail: tool selection, arguments, multi-step planning

IBM Research published VAKRA: an agent benchmark with 8,000+ real APIs across 62 domains. It evaluates full execution trajectories, not just final answers. Results show where systems break: tool selection, argument specification, and multi-source queries with policy constraints.

Read →

Radar · 2026-01-20

Cisco deployed Codex for enterprise defect fixes, but hard numbers are still missing

Cisco and OpenAI describe deploying Codex as an agent in enterprise engineering workflows: build automation, defect fixes, and a shift toward agent-native development.

Read →

Radar · 2025-12-18

GPT-5.2-Codex targets long-horizon refactors; the proof will be independent production tests

GPT-5.2-Codex targets long-horizon coding tasks across large contexts: large-scale code transformations, security fixes, and multi-file consistency.

Read →

Radar · 2025-12-16

FrontierScience tests AI scientific reasoning, but a lab's own benchmark needs independent audit

OpenAI introduces FrontierScience: a benchmark for scientific reasoning tasks in physics, chemistry, and biology, focused on reasoning processes rather than factual recall.

Read →

Radar · 2025-11-19

GPT-5.1-Codex-Max system card is worth reading, but trust it in proportion to how specific it is about its limits

The GPT-5.1-Codex-Max system card describes two safety layers. At the model level: safety training and prompt injection protection. At the product level: sandboxing with configurable network access.

Read →

Radar · 2025-11-18

Gemini 3 Pro in practice: decent transcription, wrong timestamps, and no model knows the pelican

Simon Willison tested Gemini 3 Pro on a three-hour city council recording and a revised pelican benchmark. Result: a structured transcript for $1.42, but timestamps are off by tens of minutes. And none of the models tested understood that a California brown pelican is not actually brown.

Read →

Radar · 2025-11-06

Async coding agents as research threads: fire a task, get a pull request back

Simon Willison describes a fire-and-forget workflow with Claude Code, Codex and other coding agents: pose a research question, the agent works on a server and files a pull request. Code is proof of feasibility, not just text.

Read →

Radar · 2025-11-02

Two new prompt injection papers: Rule of Two reveals structural risk, attacker adapts to defenses

Simon Willison highlighted two new papers on agent prompt injection. Meta's Rule of Two states that a system is safe only when it has at most two of three properties at the same time: accepting untrusted input, accessing sensitive data, and changing state or communicating externally. A second paper from researchers at OpenAI, Anthropic, and DeepMind showed that 12 published defenses were bypassed by adaptive attacks with a success rate above 90%.
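The Rule of Two as summarized here reduces to a simple count, which can be sketched as a configuration check. This is an illustrative sketch only: the class, field and function names are hypothetical, and nothing below comes from Meta's paper beyond the three properties listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """The three properties from the Rule of Two (field names are hypothetical)."""
    accepts_untrusted_input: bool
    accesses_sensitive_data: bool
    changes_state_or_communicates_externally: bool


def satisfies_rule_of_two(caps: AgentCapabilities) -> bool:
    """Return True when at most two of the three properties are enabled."""
    enabled = sum([
        caps.accepts_untrusted_input,
        caps.accesses_sensitive_data,
        caps.changes_state_or_communicates_externally,
    ])
    return enabled <= 2


# An agent that reads the open web and private mail but cannot act externally.
reader = AgentCapabilities(True, True, False)

# An agent with all three properties at once.
full_access = AgentCapabilities(True, True, True)

print(satisfies_rule_of_two(reader))       # True
print(satisfies_rule_of_two(full_access))  # False
```

A check like this only flags a risky combination of capabilities. It says nothing about whether a given defense holds up against an adaptive attacker, which is the concern of the second paper.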

Read →

Radar · 2025-10-29

OpenAI opens policy-based content classification with open-weight safeguard models

OpenAI released gpt-oss-safeguard-120b and 20b: open-weight reasoning models where content classification policy is not baked into the weights but supplied at runtime. Organizations bring their own rules; the model reasons over them.

Read →

Radar · 2025-10-23

Gemini 2.5 Computer Use: DeepMind builds a dedicated model for agents that click instead of calling an API

Google DeepMind released Gemini 2.5 Computer Use in preview: a specialized model for agents that drive user interfaces. Unlike general-purpose Gemini 2.5 Pro, this model was trained specifically for screen interaction, not just reasoning about it.

Read →

Radar · 2025-10-20

Claude Code for web: an asynchronous coding agent in a sandbox, without your laptop

Simon Willison tested Claude Code for web: Anthropic wrapped the local Claude Code experience in a hosted sandbox and made it usable from web and mobile. The important shift is not a more capable model, but a workflow change: coding agents become more valuable when they can run asynchronously and safely away from your laptop.

Read →

Radar · 2025-09-16

Latent Space: Greg Brockman on GPT-5 and Codex as the agentic layer of software development

Latent Space published a belated episode with Greg Brockman on GPT-5 and Codex, plus editorial takes on the GPT-5-Codex model combination. This is a podcast episode and pointer, not a standalone analytical essay.

Read →

Radar · 2025-09-05

Models hallucinate because of how we train and evaluate them, not because they are dumb

OpenAI's September 2025 post goes to the root of hallucinations: models learn to play the evaluation game, not to answer truthfully. If evals penalise admitted uncertainty more harshly than confident errors, models calibrate toward persuasiveness.
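The incentive can be made concrete with a small expected-score calculation. This is a sketch under stated assumptions: a correct answer scores 1, an abstention ("I don't know") scores 0, and a wrong answer costs a configurable penalty. The function names and the numbers are hypothetical and are not taken from OpenAI's post.

```python
def expected_guess_score(p_correct: float, penalty_wrong: float) -> float:
    """Expected score of answering when the model is right with probability
    p_correct: +1 for a correct answer, -penalty_wrong for a wrong one."""
    return p_correct * 1.0 - (1.0 - p_correct) * penalty_wrong


def best_action(p_correct: float, penalty_wrong: float) -> str:
    """Pick the score-maximizing action; abstaining is assumed to score 0."""
    return "guess" if expected_guess_score(p_correct, penalty_wrong) > 0.0 else "abstain"


# Accuracy-only grading: wrong answers cost nothing, so even a 20% guess pays.
print(best_action(p_correct=0.2, penalty_wrong=0.0))  # guess

# Wrong answers cost as much as correct ones earn: a 20% guess no longer pays.
print(best_action(p_correct=0.2, penalty_wrong=1.0))  # abstain

# With the same penalty, a well-founded 70% answer is still worth giving.
print(best_action(p_correct=0.7, penalty_wrong=1.0))  # guess
```

Under accuracy-only grading, a confident guess is never worse than admitting uncertainty, which is the calibration pressure the summary describes.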

Read →

Radar · 2025-08-27

OpenAI and Anthropic tested each other's models. The findings are instructive, the methodology still open.

OpenAI and Anthropic published results of a joint safety evaluation: they tested each other's models for misalignment, instruction following, hallucinations, and jailbreaking. For the first time, two leading labs show where outside eyes find their blind spots.

Read →

Radar · 2025-07-02

Jack Morris goes against the current: information theory, not agents or benchmarks

Latent Space profiles Jack Morris, a PhD student who is deliberately not working on agents, benchmarks or VS Code forks. He studies the information-theoretic foundations of language models: embeddings, latent space and compression. This is a podcast interview and pointer.

Read →