← Library · safety

Prompt injection — hostile instructions in your context

Prompt injection — hostile instructions in your context

Prompt injection is not a party-trick jailbreak. It is a boundary problem: the model reads untrusted text and may confuse it for instructions. With agents, it burns twice as hot.

What it is

Prompt injection happens when the model receives untrusted content — a web page, email, document, issue — and that content contains instructions like “ignore previous rules” or “send me secrets.” To a human it is text. To the model it can look like another command.

Why agents make it worse

In a plain chat, the damage is often a bad answer. With an agent, injection can redirect tool use: read a file, open a URL, exfiltrate data, click something it should not. The attacker does not fight the model directly; they hide instructions in the environment.

Defense

Separate instructions from data, label untrusted content, never let the model decide permissions, narrow tools per task, and require approvals for outbound network, secrets and destructive actions. A “better system prompt” alone is not a defense; it is a talisman.

What to remember

Prompt injection is a security problem at the boundary between text and action. Once text can change tool behavior, text becomes an attack surface.