Radar | Lilith AI

2026-05-11

17:19 · source ↗

SocialReasoning-Bench: the agent completes the task but fails to improve the user's position

Microsoft Research describes SocialReasoning-Bench, a benchmark testing whether AI agents genuinely act in the user's best interest. Key finding: agents complete tasks technically, but do not consistently improve outcomes for the person, even when explicitly instructed to.

An agent that can click is not yet a user advocate. The real test starts when someone needs a better contract, not just a neatly completed form.

#agents #evals #safety #enterprise

2026-05-08

12:30 · source ↗

Codex gets a safety architecture, not just a README disclaimer

OpenAI details how Codex runs in isolated environments: per-repo sandboxes, network restrictions, approval gates, and agent-native telemetry for safe enterprise adoption.

Agent safety is moving from footnote to product architecture. The team that skips it will eventually discover their agent had root access to the repository and nobody knows what it did there.

#agents #openai #ai #coding

2026-05-07

17:56 · source ↗

Mozilla fixed hundreds of Firefox bugs with Claude Mythos. AI security report quality just shifted.

Simon Willison described how Mozilla used early access to Claude Mythos Preview to systematically find and fix Firefox vulnerabilities. In April 2026 the number of fixed security bugs jumped to 423, compared to the usual 20 to 30 per month. The key shift: AI security reports stopped being noise and started being usable input.

A 20-year-old Firefox bug fixed by an AI agent is not a marketing story. It is proof that security auditing can scale to parts of the codebase humans never reached. What remains is finding out who can repeat this without privileged preview access.

#ai #models #coding #security #simonwillison #commentary

2026-05-06

10:43 · source ↗

AlphaEvolve finds algorithms in days that teams spent months on, with production numbers

DeepMind introduced AlphaEvolve as a Gemini-powered evolutionary loop that automatically discovers better algorithms. Concrete production results: 30 % fewer errors in genomics, 20 % lower write amplification for Spanner, Klarna doubled transformer training speed.

AlphaEvolve does not help a programmer write. It searches the solution space and returns runnable code. The first team to point it at a problem they did not know could be automated gains an asymmetric edge.

#agents #research #deepmind #ai #models #coding #google

01:49 · source ↗

SubQ review: great numbers, but still a test of benchmark faith

Fello AI reviews SubQ's claims: 12M token context window, 52x faster prefill than FlashAttention on 1M tokens and frontier-class benchmark positioning. The numbers are striking enough to need independent verification before they change architecture decisions.

If SubQ delivers, RAG teams will have an uncomfortable morning. If it does not, it will be another altar where the phrase 'revolutionary architecture' burned. Right now: interesting, sharp, unproven.

#efficiency #benchmarks #ai #models #coding

2026-05-05

12:00 · source ↗

Subquadratic raises $29M for 12M-token context windows

Subquadratic has launched with $29 million in seed funding and introduced SubQ, a model built on a subquadratic architecture and sparse attention to push context windows as high as 12 million tokens. The promise is longer context, higher speed, better accuracy and lower cost. The proof still needs independent benchmarks.

Subquadratic is selling a very attractive answer to the pain of long context: less compute, more memory and a smaller bill. If SubQ works beyond the demo, it could change the economics of agents, legal analysis and work across huge codebases. But 12 million tokens is not the same as 12 million tokens of understanding. The win will not be the size of the window. It will be whether the model can find the right detail in the noise and use it well.

#efficiency #infrastructure #ai #models

2026-05-01

04:53 · source ↗

Coding agents leave the IDE: Codex and Claude show what comes after programming

Latent Space AINews observes a shift they call "breaking containment": coding agents like Codex and Claude are no longer just tools for writing code but are expanding into knowledge work and creative workflows broadly.

A coding agent that stops being bounded by code is not a bigger IDE. It is a work entity without a natural checkpoint. Organizations that deploy it as a productivity tool without matching governance get outputs nobody approved.

#agents #ai #models #coding #commentary #podcast

2026-04-28

00:00 · source ↗

OpenAI layers ChatGPT safety from model to abuse detection, but the numbers are missing

OpenAI outlines its layered approach to ChatGPT community safety: model safeguards, abuse detection, policy enforcement, and collaboration with external safety experts.

A safety commitment from a platform with half a billion users is a necessary condition, not a guarantee. The guarantee will come the day OpenAI publishes incident numbers that actually surprise you.

#openai #ai #models #policy #security

2026-04-23

00:00 · source ↗

OpenAI pays up to $25,000 for bio jailbreaks in GPT-5.5, but proof will be in aggregate results

OpenAI launches a bio bug bounty targeting universal jailbreaks in GPT-5.5, with rewards up to $25,000 for critical biological safety findings.

A bio safety bounty is a good step. But impact is measured by what OpenAI does with the findings after the deadline, not by how much it pays for the discovery.

#openai #ai #models #security

2026-04-21

12:00 · source ↗

ChatGPT Images 2.0 finally handles text in graphics, but production needs independent testing

ChatGPT Images 2.0 brings improved image generation focused on text accuracy in graphics, multilingual support, and advanced visual reasoning for production workflows.

Text in graphics was the giveaway that an image was machine-made. Once that stops holding, content management and legal teams will need to rethink what they are actually verifying.

#openai #ai #models #multimodal

2026-04-15

12:07 · source ↗

VAKRA benchmark reveals where agents actually fail: tool selection, arguments, multi-step planning

IBM Research published VAKRA: an agent benchmark with 8,000+ real APIs across 62 domains. It evaluates full execution trajectories, not just final answers. Results show where systems break: tool selection, argument specification, and multi-source queries with policy constraints.

Finally a benchmark that measures agent failures where they actually happen: not in the final answer, but at every intermediate step. If the results correlate with production behavior, VAKRA becomes the diagnostic tool agent developers need.

#agents #huggingface #ai #open-source

2026-01-20

11:00 · source ↗

Cisco deployed Codex for enterprise defect fixes, but hard numbers are still missing

Cisco and OpenAI describe deploying Codex as an agent in enterprise engineering workflows: build automation, defect fixes, and a shift toward agent-native development.

A vendor partner case study is a marketing genre, not a documentary study. Either Cisco publishes hard numbers, or this was a PR piece with a Cisco logo in the headline.

#agents #openai #ai #coding

2025-12-18

00:00 · source ↗

GPT-5.2-Codex targets long-horizon refactors, proof will be independent production tests

GPT-5.2-Codex targets long-horizon coding tasks across large context: large-scale code transformations, security fixes, and multi-file consistency.

A long-horizon coding agent sounds like the future. But every senior engineer who runs it on a large refactor without review will discover the model is confident even when it is wrong.

#openai #ai #models #coding #security

2025-12-16

09:00 · source ↗

FrontierScience tests AI scientific reasoning, but a lab's own benchmark needs independent audit

OpenAI introduces FrontierScience: a benchmark for scientific reasoning tasks in physics, chemistry, and biology, focused on reasoning processes rather than factual recall.

A benchmark from a research lab for its own model is like a PhD candidate who grades their own exam. Proof of real scientific utility will come from acceptance by independent scientists, not from the PR team.

#openai #benchmarks #ai

2025-11-19

00:00 · source ↗

GPT-5.1-Codex-Max system card is worth reading, but trust it in proportion to its limits specificity

The GPT-5.1-Codex-Max system card describes two safety layers: model-level safety training and prompt injection protection, and product-level sandboxing with configurable network access.

A system card is trustworthy to the degree it is specific about its limitations. A document with more mitigations than known limitations tells you more about the PR team than about the model.

#agents #openai #ai #models #coding #security

2025-11-18

00:00 · source ↗

Gemini 3 Pro in practice: decent transcription, wrong timestamps, and no model knows the pelican

Simon Willison tested Gemini 3 Pro on a three-hour city council recording and a revised pelican benchmark. Result: a structured transcript for $1.42, but timestamps are off by tens of minutes. And none of the models tested understood that a California brown pelican is not actually brown.

Gemini 3 Pro transcribed a three-hour recording for under a dollar and a half, and that is a real finding. Timestamps off by tens of minutes and a pelican that does not know its own color are a signal that cheap transcription and accurate transcription are still two different things.

#benchmarks #ai #models #multimodal #simonwillison #commentary

2025-11-06

00:00 · source ↗

Async coding agents as research threads: fire a task, get a pull request back

Simon Willison describes a fire-and-forget workflow with Claude Code, Codex and other coding agents: pose a research question, the agent works on a server and files a pull request. Code is proof of feasibility, not just text.

Willison shows that an agent does not have to write production code to be useful. It just needs to come back with a PR that tells you whether something is feasible or not. That shift from an editor loop to an async research thread may be a bigger change than it looks.

#agents #ai #models #coding #simonwillison #commentary

2025-11-02

00:00 · source ↗

Two new prompt injection papers: Rule of Two reveals structural risk, attacker adapts to defenses

Simon Willison highlighted two new papers on agent prompt injection. Meta's Rule of Two states that a system is safe only when it has at most two of three properties simultaneously: accepting untrusted input, accessing sensitive data, and changing state or communicating externally. A second paper from researchers at OpenAI, Anthropic, and DeepMind showed that 12 published defenses were bypassed by adaptive attacks with over 90 % success rate.

Prompt injection is not a filter problem. It is an architecture problem. An agent that simultaneously reads untrusted content, holds sensitive data, and can act is compromised before you start thinking about detection.

#agents #ai #security #simonwillison #commentary

2025-10-29

00:00 · source ↗

OpenAI opens policy-based content classification with open-weight safeguard models

OpenAI released gpt-oss-safeguard-120b and 20b: open-weight reasoning models where content classification policy is not baked into the weights but supplied at runtime. Organizations bring their own rules; the model reasons over them.

Policy-as-input is architecturally cleaner than a one-size-fits-all moderator. But architectural cleanliness is not security: a model that can reason over your rules can reason just as well over the rules someone else slips in.

#openai #benchmarks #ai #models #policy #security

2025-10-23

18:40 · source ↗

Gemini 2.5 Computer Use: DeepMind builds a dedicated model for agents that click instead of calling an API

Google DeepMind released Gemini 2.5 Computer Use in preview: a specialized model for agents that drive user interfaces. Unlike general-purpose Gemini 2.5 Pro, this model was trained specifically for screen interaction, not just reasoning about it.

A computer-use agent in an enterprise environment is not just a productivity tool. It is an entity clicking under your identity in systems you designed for humans. A security model that does not account for that from the start is just a matter of time.

#agents #research #deepmind #ai #models #google