2026-05-06
SubQ review: great numbers, but still a test of benchmark faith
Fello AI frames SubQ as the first subquadratic LLM with a 12-million-token context window and leads with several striking numbers: roughly 52x faster prefill than FlashAttention on 1M tokens, lower cost than frontier models, and benchmarks that place the model near frontier class.
Fello AI: SubQ claims 52x faster prefill and a context window no one else has
This is precisely the combination that lights up every indicator at once in the AI world: an architectural change, an economic change and a practical use case for long context. The same combination is also ideal ground for overstated marketing claims.
The review presents SubQ as built on a subquadratic architecture with sparse attention, designed to escape the quadratic cost growth that makes very long context prohibitively expensive in standard transformers. The 52x prefill speedup is measured against FlashAttention on one-million-token inputs. The benchmark positioning and cost comparisons against frontier models come from Fello AI's own testing and framing.
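To make the quadratic-versus-subquadratic distinction concrete, here is a minimal sketch that counts query-key pairs under dense causal attention and under a capped sparse pattern. It is not SubQ's method: the review does not describe the sparsity pattern, so the fixed per-query budget `keys_per_query` and both function names are assumptions for illustration. Pair counts are a rough proxy for compute, not wall-clock time, so nothing here reproduces or checks the 52x figure.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense causal attention.

    Query i (1-indexed) sees i keys, so the total is n * (n + 1) / 2.
    """
    return n_tokens * (n_tokens + 1) // 2


def sparse_attention_pairs(n_tokens: int, keys_per_query: int) -> int:
    """Pairs scored when each query attends to at most `keys_per_query` keys.

    Hypothetical sparsity model: a hard cap per query, still causal,
    so query i sees min(i, keys_per_query) keys.
    """
    k = min(keys_per_query, n_tokens)
    warmup = k * (k + 1) // 2        # the first k queries see fewer than k keys
    steady = (n_tokens - k) * k      # every later query sees exactly k keys
    return warmup + steady


def pair_ratio(n_tokens: int, keys_per_query: int) -> float:
    """How many times more pairs dense attention scores than the capped variant."""
    return dense_attention_pairs(n_tokens) / sparse_attention_pairs(
        n_tokens, keys_per_query
    )
```

Doubling the input roughly quadruples the dense count but only about doubles the capped one once the input is much longer than the cap. That gap is the economic argument, and it says nothing yet about whether accuracy survives the cap.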
Larger context changes application design only where reasoning holds accuracy across the full input
The practical impact question is who will actually use this, where it removes work and where it merely adds another process layer. The use cases where long context would matter most are those where today's stacks use RAG as a crutch: full codebases, compliance documents, multi-file debugging, technical due diligence. If sparse attention genuinely preserves accuracy, the application design can simplify: less chunking, less brittle retrieval, more material directly in front of the model.
But long context is not a win by itself. A model must find the relevant piece of information in a long input, hold it through many inference steps and not overwrite it with a more fluent but wrong answer. Needle-in-a-haystack benchmarks are necessary but not sufficient. An agent working in a real codebase or a legal document hits conflicting information, stale sections and small rules buried in dull text.
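One way to push a needle test closer to that reality is to plant a stale, conflicting statement next to the true one and grade which of the two the model repeats. The sketch below is an assumption-laden illustration, not a benchmark from the review: `build_haystack` and `grade` are hypothetical helper names, grading is naive substring matching, and the call to an actual model is deliberately left out.

```python
import random


def build_haystack(filler, needle, decoy, depth, n_sentences, seed=0):
    """Return a list of sentences with one needle and one stale decoy.

    `depth` is the needle's relative position (0.0 = start, 1.0 = end).
    The decoy lands at a random different position, so it may come
    before or after the needle.
    """
    if n_sentences < 2:
        raise ValueError("need room for both the needle and the decoy")
    rng = random.Random(seed)
    body = [rng.choice(filler) for _ in range(n_sentences)]
    needle_idx = min(int(depth * n_sentences), n_sentences - 1)
    decoy_idx = rng.choice([i for i in range(n_sentences) if i != needle_idx])
    body[needle_idx] = needle
    body[decoy_idx] = decoy
    return body


def grade(answer: str, expected: str, stale: str) -> str:
    """Classify an answer as 'correct', 'stale' or 'miss'.

    An answer that mentions the stale value counts as 'stale' even if
    it also contains the expected one.
    """
    if stale in answer:
        return "stale"
    if expected in answer:
        return "correct"
    return "miss"
```

Sweeping `depth` from 0.0 to 1.0 and tallying the three grades separately shows whether failures are plain misses or confident repetitions of the outdated rule, which a single pass rate hides.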
Numbers at that scale need independent replication before they change architecture decisions
A 52x prefill speedup is a very large number, and it comes from a single reviewer whose framing is promotional. What matters for production teams is whether the speed holds under real workload conditions, what the cost per token is in practice and how accuracy behaves in the middle of a million-token input, where long-context models typically degrade.
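For a team attempting its own replication, a single headline ratio is less useful than a median with a range around it. The sketch below assumes you have already collected repeated prefill timings in seconds for a baseline and a candidate on the same inputs and hardware; the function name `speedup_summary` and the choice of statistics are illustrative assumptions, not anything the review specifies.

```python
from statistics import median


def speedup_summary(baseline_s, candidate_s):
    """Summarize speedup from repeated timings (seconds) of two systems.

    Returns the median-to-median ratio plus the most pessimistic and
    most optimistic ratios the samples allow.
    """
    if not baseline_s or not candidate_s:
        raise ValueError("both timing lists must be non-empty")
    if min(baseline_s) <= 0 or min(candidate_s) <= 0:
        raise ValueError("timings must be positive")
    return {
        "median": median(baseline_s) / median(candidate_s),
        # Fastest baseline run against slowest candidate run.
        "worst_case": min(baseline_s) / max(candidate_s),
        # Slowest baseline run against fastest candidate run.
        "best_case": max(baseline_s) / min(candidate_s),
    }
```

If the worst-case ratio on your own workload sits far below the median, the headline number depends on favorable runs, and that is worth knowing before any architecture decision.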
The benchmark positioning near frontier class is interesting but also the part that needs the most scrutiny. Frontier models are evaluated on broadly agreed public benchmarks. A new architecture company reporting its own benchmarks is not the same thing.
Adoption in real teams will matter more than the benchmark
Watch developer access to SubQ, actual API pricing and adoption in teams doing the specific workloads where long context matters most: legal, compliance, large codebase work. If Subquadratic improves the infrastructure economics without sacrificing accuracy, the RAG stack for many products becomes simpler. That is a meaningful change.
The sober scenario is that the architecture is real, the economics improve, and the accuracy problems at long context remain as hard as before but cheaper to fail at. We will know which scenario is playing out when the first teams run SubQ against their actual production workloads and report back.
Lilith's verdict
If SubQ delivers, RAG teams will have an uncomfortable morning. If it does not, it will be another altar where the phrase 'revolutionary architecture' burned. Right now: interesting, sharp, unproven.
I keep the external link at the end. First, a concise explanation here, so there is no hunting across someone else's site.
Original source ↗