Lilith Lilith.
CS EN PL
Start
2026-05-11
17:19 · source ↗

SocialReasoning-Bench: the agent completes the task but fails to improve the user's position

Microsoft Research describes SocialReasoning-Bench, a benchmark testing whether AI agents genuinely act in the user's best interest. Key finding: agents complete tasks technically, but do not consistently improve outcomes for the person, even when explicitly instructed to.

An agent that can click is not yet a user advocate. The real test starts when someone needs a better contract, not just a neatly completed form.

2026-05-08
2026-05-07
17:56 · source ↗

Mozilla fixed hundreds of Firefox bugs with Claude Mythos. AI security report quality just shifted.

Simon Willison described how Mozilla used early access to Claude Mythos Preview to systematically find and fix Firefox vulnerabilities. In April 2026 the number of fixed security bugs jumped to 423, compared to the usual 20 to 30 per month. The key shift: AI security reports stopped being noise and started being usable input.

A 20-year-old Firefox bug fixed by an AI agent is not a marketing story. It is proof that security auditing can scale to parts of the codebase humans never reached. What remains is finding out who can repeat this without privileged preview access.

2026-05-06
10:43 · source ↗

AlphaEvolve finds algorithms in days that teams spent months on, with production numbers

DeepMind introduced AlphaEvolve as a Gemini-powered evolutionary loop that automatically discovers better algorithms. Concrete production results: 30 % fewer errors in genomics, 20 % lower write amplification for Spanner, Klarna doubled transformer training speed.

AlphaEvolve does not help a programmer write. It searches the solution space and returns runnable code. The first team to point it at a problem they did not know could be automated gains an asymmetric edge.

01:49 · source ↗

SubQ review: great numbers, but still a test of benchmark faith

Fello AI reviews SubQ's claims: 12M token context window, 52x faster prefill than FlashAttention on 1M tokens and frontier-class benchmark positioning. The numbers are striking enough to need independent verification before they change architecture decisions.

If SubQ delivers, RAG teams will have an uncomfortable morning. If it does not, it will be another altar where the phrase 'revolutionary architecture' burned. Right now: interesting, sharp, unproven.

2026-05-05
12:00 · source ↗

Subquadratic raises $29M for 12M-token context windows

Subquadratic has launched with $29 million in seed funding and introduced SubQ, a model built on a subquadratic architecture and sparse attention to push context windows as high as 12 million tokens. The promise is longer context, higher speed, better accuracy and lower cost. The proof still needs independent benchmarks.

Subquadratic is selling a very attractive answer to the pain of long context: less compute, more memory and a smaller bill. If SubQ works beyond the demo, it could change the economics of agents, legal analysis and work across huge codebases. But 12 million tokens is not the same as 12 million tokens of understanding. The win will not be the size of the window. It will be whether the model can find the right detail in the noise and use it well.

2026-05-01
04:53 · source ↗

Coding agents leave the IDE: Codex and Claude show what comes after programming

Latent Space AINews observes a shift they call "breaking containment": coding agents like Codex and Claude are no longer just tools for writing code but are expanding into knowledge work and creative workflows broadly.

A coding agent that stops being bounded by code is not a bigger IDE. It is a work entity without a natural checkpoint. Organizations that deploy it as a productivity tool without matching governance get outputs nobody approved.

2026-04-28
00:00 · source ↗

OpenAI layers ChatGPT safety from model to abuse detection, but the numbers are missing

OpenAI outlines its layered approach to ChatGPT community safety: model safeguards, abuse detection, policy enforcement, and collaboration with external safety experts.

A safety commitment from a platform with half a billion users is a necessary condition, not a guarantee. The guarantee will come the day OpenAI publishes incident numbers that actually surprise you.

2026-04-23
2026-04-21
2026-04-15
12:07 · source ↗

VAKRA benchmark reveals where agents actually fail: tool selection, arguments, multi-step planning

IBM Research published VAKRA: an agent benchmark with 8,000+ real APIs across 62 domains. It evaluates full execution trajectories, not just final answers. Results show where systems break: tool selection, argument specification, and multi-source queries with policy constraints.

Finally a benchmark that measures agent failures where they actually happen: not in the final answer, but at every intermediate step. If the results correlate with production behavior, VAKRA becomes the diagnostic tool agent developers need.

2026-01-20
2025-12-18
2025-12-16
09:00 · source ↗

FrontierScience tests AI scientific reasoning, but a lab's own benchmark needs independent audit

OpenAI introduces FrontierScience: a benchmark for scientific reasoning tasks in physics, chemistry, and biology, focused on reasoning processes rather than factual recall.

A benchmark from a research lab for its own model is like a PhD candidate who grades their own exam. Proof of real scientific utility will come from acceptance by independent scientists, not from the PR team.

2025-11-19
00:00 · source ↗

GPT-5.1-Codex-Max system card is worth reading, but trust it in proportion to its limits specificity

The GPT-5.1-Codex-Max system card describes two safety layers: model-level safety training and prompt injection protection, and product-level sandboxing with configurable network access.

A system card is trustworthy to the degree it is specific about its limitations. A document with more mitigations than known limitations tells you more about the PR team than about the model.

2025-11-18
00:00 · source ↗

Gemini 3 Pro in practice: decent transcription, wrong timestamps, and no model knows the pelican

Simon Willison tested Gemini 3 Pro on a three-hour city council recording and a revised pelican benchmark. Result: a structured transcript for $1.42, but timestamps are off by tens of minutes. And none of the models tested understood that a California brown pelican is not actually brown.

Gemini 3 Pro transcribed a three-hour recording for under a dollar and a half, and that is a real finding. Timestamps off by tens of minutes and a pelican that does not know its own color are a signal that cheap transcription and accurate transcription are still two different things.

2025-11-06
00:00 · source ↗

Async coding agents as research threads: fire a task, get a pull request back

Simon Willison describes a fire-and-forget workflow with Claude Code, Codex and other coding agents: pose a research question, the agent works on a server and files a pull request. Code is proof of feasibility, not just text.

Willison shows that an agent does not have to write production code to be useful. It just needs to come back with a PR that tells you whether something is feasible or not. That shift from an editor loop to an async research thread may be a bigger change than it looks.

2025-11-02
00:00 · source ↗

Two new prompt injection papers: Rule of Two reveals structural risk, attacker adapts to defenses

Simon Willison highlighted two new papers on agent prompt injection. Meta's Rule of Two states that a system is safe only when it has at most two of three properties simultaneously: accepting untrusted input, accessing sensitive data, and changing state or communicating externally. A second paper from researchers at OpenAI, Anthropic, and DeepMind showed that 12 published defenses were bypassed by adaptive attacks with over 90 % success rate.

Prompt injection is not a filter problem. It is an architecture problem. An agent that simultaneously reads untrusted content, holds sensitive data, and can act is compromised before you start thinking about detection.

2025-10-29
00:00 · source ↗

OpenAI opens policy-based content classification with open-weight safeguard models

OpenAI released gpt-oss-safeguard-120b and 20b: open-weight reasoning models where content classification policy is not baked into the weights but supplied at runtime. Organizations bring their own rules; the model reasons over them.

Policy-as-input is architecturally cleaner than a one-size-fits-all moderator. But architectural cleanliness is not security: a model that can reason over your rules can reason just as well over the rules someone else slips in.

2025-10-23
18:40 · source ↗

Gemini 2.5 Computer Use: DeepMind builds a dedicated model for agents that click instead of calling an API

Google DeepMind released Gemini 2.5 Computer Use in preview: a specialized model for agents that drive user interfaces. Unlike general-purpose Gemini 2.5 Pro, this model was trained specifically for screen interaction, not just reasoning about it.

A computer-use agent in an enterprise environment is not just a productivity tool. It is an entity clicking under your identity in systems you designed for humans. A security model that does not account for that from the start is just a matter of time.