Lilith Lilith.
CS EN PL
Start

Google Research described and deployed a method that retrofits Multi-Token Prediction onto already shipped Gemini Nano v3 models. It runs on Pixel 9 and Pixel 10 and targets local features such as AI Notification Summaries and Proofread.

Google added speed as an extra head, not as a new model

The base model stays frozen. Google attaches a lightweight MTP head to its final layers, which proposes several future tokens while the main model verifies them in parallel. If a proposal is wrong, it is discarded. Google says the final output remains bit-for-bit identical to the base model.

The important part is that this is not a standalone drafter. Google says a traditional drafter can have around 128M parameters and competes for RAM on a phone. The new architecture shares the main model state, uses its KV cache and cuts memory overhead by up to 130 MB.

For mobile AI, latency is a product feature, not benchmark decoration

Local models matter only if they respond quickly and do not burn the battery. For notification summaries or proofreading, users do not care about architecture. They care whether the feature feels instant. Google says it sees speedups of 50 % or more on Pixel 9 compared with comparable standalone drafters, depending on the task.

For developers, the second-order point is stronger. If a deployed model can be accelerated without changing its behavior, the risk of regressions and revalidation falls. That dull operational detail is often what decides whether on-device AI moves from demo to everyday feature.

Identical output does not guarantee identical operational value

Google is still presenting its own measurements on its own hardware. We do not yet know how the method behaves across a broader set of languages, tasks and long contexts, or how often MTP proposals are accepted outside showcase scenarios.

The benefit is also tied to the Pixel ecosystem. For iOS users or Android phones outside the supported lines, this is more a signal of direction than a feature they can use today.

Real adoption will show up in features users stop noticing

The next useful signal will not be another tokens per second chart. It will be the number of local Pixel features that become fast enough for people to use without thinking about them.

If Google moves the same pattern into the broader Gemini Nano stack while preserving identical outputs, MTP can become a standard service layer for mobile LLMs. If it stays limited to a few features, it is a neat optimization with a narrow reach.

Lilith's verdict

The interesting part here is not speed by itself. It is a quiet service elevator inside the phone: the model stays the same, but the user feels the doors open sooner.

I keep the external link at the end. First, a concise explanation here — no hunting across someone else's site.

Original source ↗