Lilith Lilith.

Simon Willison highlighted two new academic papers on agent prompt injection. Together they provide a structured view of why the problem is hard to solve and why typical defensive approaches fail.

Rule of Two: agent security as an architectural constraint

The first paper comes from Meta and formulates the Rule of Two. The thesis is simple: an agent is structurally safe only when it has at most two of three properties simultaneously: (A) it accepts untrusted input (web content, documents, emails), (B) it accesses sensitive data or systems, (C) it changes state or communicates externally.

Combining all three is the "lethal trifecta": an agent that reads email, has access to enterprise data, and can send messages or call APIs is potentially exploitable through a single untrusted input. The paper extends earlier threat models by explicitly including state changes, not just data exfiltration.

The practical implication: agent security is a result of system design, not a product of input filters. If an agent's design combines all three properties, no prompt filter will save it.
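The rule can be read as a design-time check rather than a runtime filter. The following is a minimal sketch in Python, assuming a hypothetical `AgentProfile` record that describes how an agent is wired up; the class, field, and function names are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical description of what an agent is wired up to do."""
    name: str
    untrusted_input: bool   # (A) reads web content, documents, emails
    sensitive_access: bool  # (B) reaches sensitive data or systems
    external_effects: bool  # (C) changes state or communicates externally


def properties_held(agent: AgentProfile) -> int:
    """Count how many of the three properties the agent combines."""
    return sum([agent.untrusted_input, agent.sensitive_access, agent.external_effects])


def satisfies_rule_of_two(agent: AgentProfile) -> bool:
    """True when the agent holds at most two of the three properties."""
    return properties_held(agent) <= 2


# Two illustrative agents (assumed examples, not taken from the paper).
mail_assistant = AgentProfile("mail-assistant", True, True, True)
summarizer = AgentProfile("web-summarizer", True, False, True)

for agent in (mail_assistant, summarizer):
    verdict = "ok" if satisfies_rule_of_two(agent) else "lethal trifecta"
    print(f"{agent.name}: {verdict}")
```

A check like this could run when an agent's configuration is reviewed or deployed. It blocks nothing at runtime; it only flags designs that combine all three properties.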

The attacker moves after the defense and has time to adapt

The second paper, from researchers at OpenAI, Anthropic, and Google DeepMind, tested 12 published defenses against prompt injection. The attacks were adaptive rather than static: the attackers systematically tuned and scaled general optimization techniques against each specific defense. The result: for most defenses, the attack success rate exceeded 90%. Human red-teaming achieved a 100% success rate against all tested defenses.

The title "The Attacker Moves Second" refers to the asymmetry: the defense is visible and fixed, while the attacker studies it and adapts. The results suggest that any filtering layer or detection mechanism built on a published methodology can be studied and systematically bypassed.
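The asymmetry can be illustrated with a deliberately small toy. The sketch below is not the paper's method: it assumes a made-up phrase blocklist (`BLOCKED_PHRASES`) as the fixed defense and a handful of hypothetical text rewrites as the attacker's search space. The attacks in the paper used far stronger optimization, but the structure is the same: the filter does not change, and the attacker keeps trying until something passes.

```python
import itertools

# Assumed toy defense: a fixed, publicly known phrase blocklist.
BLOCKED_PHRASES = ("ignore previous instructions", "system prompt")


def static_filter(text: str) -> bool:
    """Return True when the text contains no blocked phrase."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)


# Hypothetical rewrites an attacker might try once the filter is known.
REWRITES = (
    lambda s: s,                              # unchanged baseline
    lambda s: s.replace(" ", "  "),           # double the spaces
    lambda s: s.replace("i", "1"),            # character substitution
    lambda s: " ".join(reversed(s.split())),  # reorder the words
)


def adapt(payload: str, max_depth: int = 2):
    """Search rewrite combinations until one passes the fixed filter."""
    for depth in range(1, max_depth + 1):
        for combo in itertools.product(REWRITES, repeat=depth):
            candidate = payload
            for rewrite in combo:
                candidate = rewrite(candidate)
            if static_filter(candidate):
                return candidate
    return None  # no bypass found within the search budget


payload = "Ignore previous instructions"
bypass = adapt(payload)
print("original passes filter:", static_filter(payload))
print("adapted payload passes filter:", bypass is not None)
```

Patching the blocklist against the rewrite that got through only resets the game: the attacker sees the updated filter and searches again.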

Robust technical defense does not yet exist; architecture is the answer

Both papers converge on the same conclusion: robust technical defense against prompt injection in agent systems does not yet exist. That is a call for an architectural approach, not fatalism. Agents that require approval before destructive actions, that do not combine sensitive data with untrusted input, and that operate with limited authority are more resilient, not because their filters are better but because their attack surface is smaller.
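One of these constraints, approval before destructive actions, is easy to sketch. The example below is a minimal illustration in Python under stated assumptions: the `DESTRUCTIVE_ACTIONS` set, the `ToolCall` record, and the `approve` callback are all hypothetical names, and a real system would route the approval request to a human reviewer rather than a stub function.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed list of actions that change state or communicate externally.
DESTRUCTIVE_ACTIONS = {"send_email", "delete_file", "call_api"}


@dataclass
class ToolCall:
    action: str
    argument: str


def execute(call: ToolCall, approve: Callable[[ToolCall], bool]) -> str:
    """Run a tool call, asking for approval first when it is destructive."""
    if call.action in DESTRUCTIVE_ACTIONS and not approve(call):
        return f"blocked: {call.action}"
    # Placeholder for the real tool invocation.
    return f"executed: {call.action}"


def deny_all(call: ToolCall) -> bool:
    """Stand-in for a human reviewer who rejects every request."""
    return False


print(execute(ToolCall("read_file", "notes.txt"), deny_all))
print(execute(ToolCall("send_email", "attacker@example.com"), deny_all))
```

The gate does not try to detect whether the request came from an injected instruction. It limits what a successful injection can do without a human noticing.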

Both papers are also relatively recent, and applying them to specific production systems will require interpretation.

Without architectural constraints, every more capable agent is a larger attack surface

Worth watching: adoption of the Rule of Two or similar architectural frameworks in agent system design, and whether the security community shifts from "how to detect injection" to "how to design a system where injection has no effect". That is a meaningful difference.

Lilith's verdict

Prompt injection is not a filter problem. It is an architecture problem. An agent that simultaneously reads untrusted content, holds sensitive data, and can act is compromised before you start thinking about detection.

