2025-10-23
Gemini 2.5 Computer Use: DeepMind builds a dedicated model for agents that click instead of calling an API
Google DeepMind released Gemini 2.5 Computer Use in API preview: a specialized model for agents that interact with user interfaces. It builds on Gemini 2.5 Pro capabilities but was specifically trained for screen interaction, not just text generation about what is on screen.
An agent that reads the screen and takes steps, not just generates text
The distinction from a general multimodal model is meaningful: a computer use model must read the current UI state, identify interactive elements, plan a sequence of actions, and execute them. That requires different training than answering questions about screenshots. DeepMind addressed this by building a separate specialized model rather than a prompting layer over an existing one.
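The loop described above (read the state, choose an action, execute, repeat) can be sketched in a few lines. This is a toy illustration, not the Gemini 2.5 Computer Use API: `FakeForm`, `Action`, `scripted_policy`, and `run_agent` are all hypothetical names, the "screen" is a plain dictionary instead of a screenshot, and the policy is hard-coded where a real agent would call the model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    """One UI step. All field values are hypothetical identifiers."""
    kind: str          # "type", "click", or "done"
    target: str = ""   # element the action applies to
    text: str = ""     # text to enter, for "type" actions


class FakeForm:
    """Toy stand-in for a UI: one text field and a submit button."""

    def __init__(self) -> None:
        self.name_field = ""
        self.submitted = False

    def observe(self) -> dict:
        # A real agent would receive a screenshot; here the state is a dict.
        return {"name_field": self.name_field, "submitted": self.submitted}

    def execute(self, action: Action) -> None:
        if action.kind == "type" and action.target == "name_field":
            self.name_field = action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def scripted_policy(state: dict) -> Action:
    """Placeholder for the model: picks the next action from the observed state."""
    if not state["name_field"]:
        return Action("type", "name_field", "Ada")
    if not state["submitted"]:
        return Action("click", "submit")
    return Action("done")


def run_agent(env: FakeForm,
              policy: Callable[[dict], Action],
              max_steps: int = 10) -> list[Action]:
    """Observe, decide, act, and repeat until the policy says it is done."""
    history: list[Action] = []
    for _ in range(max_steps):   # hard step limit so the loop always terminates
        state = env.observe()
        action = policy(state)
        if action.kind == "done":
            break
        env.execute(action)
        history.append(action)
    return history


if __name__ == "__main__":
    form = FakeForm()
    for step in run_agent(form, scripted_policy):
        print(step)
```

The step limit and the returned history are deliberate: an agent that acts on a live interface should have a bounded run and leave a record of what it did.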
Availability is currently via API in preview for developers who request access. Regional availability and pricing were not fully specified at publication time; the primary source page was inaccessible during verification.
Computer use changes the automation economics where no API exists
A substantial portion of enterprise software infrastructure has no API: early-2000s CRMs, internal portals, legacy ERP systems, web forms behind a single authentication gate. For these cases, RPA (robotic process automation) has been the only alternative to manual work. An AI agent driving the UI can be cheaper and more adaptive, and it can handle interface changes without anyone reprogramming scripts.
This shifts the center of potential impact from developer workflows into operational processes that coding agents have not reached yet.
Authority grows nonlinearly in computer-use agents while production guarantees do not yet exist
Computer-use agents are where potential damage grows nonlinearly with authority. A wrong click in a CRM, a form submitted with incorrect data, or a changed setting is an action with real consequences. Unlike generated text, which can simply be discarded, these steps are hard or impossible to undo.
The basic security questions are: what does the agent see (what content enters the context), what are its permission limits, and how are destructive actions confirmed. A model in preview does not provide production-quality guarantees.
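Two of these questions, permission limits and confirmation of destructive actions, can be illustrated with a small policy gate placed between the agent and the interface. This is a sketch under stated assumptions, not a feature of the preview model: `ProposedAction`, `gate`, the action kinds, and the target names are all hypothetical, and the `confirm` callback stands in for whatever human approval step a team chooses. It does not address the first question, which content enters the agent's context.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    """An action the agent wants to take. Values are hypothetical."""
    kind: str     # e.g. "click", "type", "submit", "delete"
    target: str   # element or page the action applies to


# Assumed policy: which action kinds exist, and which need human sign-off.
ALLOWED_KINDS = {"click", "type", "submit", "delete"}
DESTRUCTIVE_KINDS = {"submit", "delete"}


def gate(action: ProposedAction,
         allowed_targets: set[str],
         confirm: Callable[[ProposedAction], bool]) -> str:
    """Return "allow" or "deny" for a proposed action.

    Unknown kinds and targets outside the allowlist are denied outright.
    Destructive kinds are allowed only if the confirm callback approves.
    """
    if action.kind not in ALLOWED_KINDS:
        return "deny"
    if action.target not in allowed_targets:
        return "deny"
    if action.kind in DESTRUCTIVE_KINDS and not confirm(action):
        return "deny"
    return "allow"


if __name__ == "__main__":
    targets = {"search_box", "order_form"}
    never = lambda a: False   # a reviewer who rejects everything
    print(gate(ProposedAction("type", "search_box"), targets, never))
    print(gate(ProposedAction("submit", "order_form"), targets, never))
```

The gate denies by default: anything not explicitly allowed is refused, and anything hard to undo waits for approval.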
The test will be real enterprise screens, not clean demos
Worth watching: how the model handles inconsistent, outdated, or dynamically changing UIs beyond the demonstrated scenarios, and what security teams do when they realize they have an agent in their environment clicking under their identity.
Lilith's verdict
A computer-use agent in an enterprise environment is not just a productivity tool. It is an entity clicking under your identity in systems you designed for humans. With a security model that does not account for that from the start, an incident is just a matter of time.
I keep the external link at the end. First, a concise explanation here — no hunting across someone else's site.
Original source ↗