Small models show that agentic demos run on boring infrastructure | Radar

Hugging Face published a Build Small Hackathon field report about Thousand Token Wood v2, a simulation where four characters run on four different small models. The point is not the game itself, but the lesson for agent systems: serving, JSON repair, secret-data firewalls and bounded memory matter more than poetic prompting.

A woodland finance drama becomes a test bench for heterogeneous agents

The post describes the second version of Thousand Token Wood, a sandbox with characters inside an economic simulation. According to the author, the first version ran on one fine-tuned 0.5B model and the player mostly watched animals trade. v2 turns it into a game where the player acts as the Patron of the Wood: lending at interest, spreading true or false tips, shorting the market and risking investigation by a magistrate.

The technical point is that the characters do not run on one model with different prompts. The author uses four models: gpt-oss-20b, MiniCPM3-4B, Nemotron-Mini-4B and a custom fine-tuned Qwen 0.5B. All are under the 32B-parameter cap and, according to the report, are served on Modal.

The interesting part is not the lore. Running four models exposed where the real friction was: the serving layer. According to the report, vLLM 0.22.1 required a CUDA toolkit with nvcc, so a lean base image broke all four models with the same error. Moving to a CUDA devel image fixed it.
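A failure like this is cheap to catch before any model loads. The following is a minimal sketch of a fail-fast check a container could run at startup. It is not taken from the report, and the function names `find_nvcc` and `require_nvcc` are hypothetical.

```python
import shutil
from typing import Optional


def find_nvcc(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path to nvcc, or None if it is not on the search path.

    When search_path is None, shutil.which falls back to the PATH
    environment variable.
    """
    return shutil.which("nvcc", path=search_path)


def require_nvcc(search_path: Optional[str] = None) -> str:
    """Fail early with a readable message instead of a late serving error."""
    found = find_nvcc(search_path)
    if found is None:
        raise RuntimeError(
            "nvcc not found; the base image probably lacks the CUDA toolkit"
        )
    return found
```

Calling `require_nvcc()` once at startup turns four identical model failures into a single message that points at the image.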

The product is model difference, not one big intelligence

For agent simulations, heterogeneity is a practical feature. If different actors have different models, training histories and formatting habits, the behavior varies more than it would with one architecture and several personas. That matters outside games too: agent testbeds need conflicting preferences and different failure modes, not only fluent dialogue.

The second lesson is duller and more valuable for production teams. The author says adding models stayed tractable because every output passed through a tolerant JSON parse-and-repair layer. Different models break structure in different ways, but a simulation cannot crash every time one returns malformed output. This is the kind of infrastructure that disappears in a demo and decides whether a system survives in production.
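What such a layer might look like is easy to sketch. The version below is an illustration under assumptions, not the author's implementation: it tries a strict parse first, then a few common repairs (a fenced block, text around the object, trailing commas), and finally returns a safe default. The name `parse_or_repair` and the `{"action": "pass"}` fallback are hypothetical.

```python
import json
import re
from typing import List, Optional

# Built with string multiplication so the pattern matches a three-backtick fence.
_TICKS = "`" * 3
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL)
# Heuristic: a comma directly before a closing brace or bracket.
# It can also touch such a sequence inside a string value, so it is
# only tried after the strict parse has already failed.
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def _outermost_object(text: str) -> Optional[str]:
    """Return the span from the first '{' to the last '}', if any."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    return text[start:end + 1]


def parse_or_repair(raw: str, fallback: Optional[dict] = None) -> dict:
    """Parse model output as a JSON object, repairing common breakage."""
    candidates: List[str] = [raw]
    fenced = _FENCE.search(raw)
    if fenced:
        candidates.append(fenced.group(1))
    obj = _outermost_object(raw)
    if obj is not None:
        candidates.append(obj)
    # Add a comma-stripped variant of every candidate, tried last.
    candidates += [_TRAILING_COMMA.sub(r"\1", c) for c in candidates]

    for candidate in candidates:
        try:
            value = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    # Nothing parsed: return a harmless default so the simulation keeps running.
    return dict(fallback) if fallback is not None else {"action": "pass"}
```

The design choice that matters is the last line: a malformed reply degrades into a no-op turn instead of an exception, so adding a model with new formatting habits does not add a new way to crash.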

A secret tip in the prompt is a security bug, not flavor text

The reality check arrives with insider tips. The player can give a character true or false information, but the character must never see the hidden flag that marks truth. The author keeps that flag off-prompt, strips it from the public event log and tests every prompt for banned tokens.

This is a small game example of a larger rule. Once you give an agent secret information, a prompt instruction is not a firewall. The firewall belongs in the data flow and needs a test that fails before the secret leaks into model behavior.
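A minimal sketch of that rule, under stated assumptions: the secret lives in a field that the prompt builder never reads, and a separate check scans every prompt for banned tokens. The names `Tip`, `public_view`, `build_prompt`, `assert_no_leak` and the token list are all hypothetical, not the author's code.

```python
from dataclasses import dataclass

# Hypothetical banned tokens: the names under which the secret could leak.
BANNED_TOKENS = ("is_true", "truth_flag")


@dataclass(frozen=True)
class Tip:
    text: str
    is_true: bool  # secret: only the simulation engine may read this


def public_view(tip: Tip) -> dict:
    """Return only the fields that are safe to log or show to a model."""
    return {"text": tip.text}


def build_prompt(character: str, tip: Tip) -> str:
    """Build the prompt from the public view, never from the Tip itself."""
    view = public_view(tip)
    return f"You are {character}. The Patron whispers: {view['text']}"


def assert_no_leak(prompt: str) -> None:
    """Raise if any banned token appears in a prompt."""
    lowered = prompt.lower()
    for token in BANNED_TOKENS:
        if token in lowered:
            raise AssertionError(f"secret token {token!r} leaked into prompt")
```

Run over every generated prompt in a test suite, `assert_no_leak` fails the build when a refactor starts formatting the whole `Tip` into the prompt, which is the point: the check fires before the model ever sees the flag.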

The next proof is repeatability, not cute animals

A strong signal would be the same pattern surviving outside one simulation: multiple small models, a shared repair layer, explicit boundaries around secret data and bounded memory that does not inflate the prompt. The author's report already shows a concrete run with zero leaks of the hidden flag, 100% valid offers from the fine-tuned 0.5B model and behaviors such as margin calls and loan defaults.
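Bounded memory is the least glamorous item on that list, and the simplest to sketch. The example below is an assumption about one way to do it, not a description of the author's system: it caps both the number of remembered events and the size of the rendered text, using character count as a rough stand-in for tokens. The class name `BoundedMemory` and its defaults are hypothetical.

```python
from collections import deque


class BoundedMemory:
    """Keep recent events within a fixed count and a fixed text budget."""

    def __init__(self, max_events: int = 8, max_chars: int = 400) -> None:
        # deque(maxlen=...) silently drops the oldest event when full.
        self._events = deque(maxlen=max_events)
        self.max_chars = max_chars

    def remember(self, event: str) -> None:
        self._events.append(event)

    def render(self) -> str:
        """Return the newest events that fit the budget, oldest first."""
        kept = []
        used = 0
        for event in reversed(self._events):
            cost = len(event) + 1  # +1 for the joining newline
            if used + cost > self.max_chars:
                break
            kept.append(event)
            used += cost
        return "\n".join(reversed(kept))
```

However long the simulation runs, `render()` never returns more than `max_chars` characters, so the memory section of the prompt stops growing with the length of the game.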

For agent developers, that is more useful than another demo of "alive" characters. It shows that agent systems are born in places that look embarrassingly ordinary: an image, a parser, a ledger, a test and a memory limit.

Lilith's verdict

The best part of this woodland exchange is not the owl or the fox. It is the engineer at the terminal discovering that the whole agentic spell depends on a "could not find nvcc" error.