Lilith

Hands-on model tests are a different discipline from leaderboard scores. Simon Willison runs them regularly, and their value lies precisely in not being sterile: they show how a model behaves on real material, in the hands of a curious user.

A 3.5-hour council meeting for $1.42, but the timestamps are wrong

Willison took a 3h33m city council recording and tried to transcribe it with Gemini 3 Pro. The original 74 MB file returned an internal error; after compressing it to 38 MB via ffmpeg, the transcription succeeded. The model produced a structured output: a meeting outline, speaker names and summaries.
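The size reduction can be reasoned about as a bitrate budget. The sketch below is a rough illustration, not Willison's actual command: the source does not say which ffmpeg settings he used, so the mono downmix, the file names and the helper functions are all assumptions. Only the 38 MB target and the 3h33m duration come from the text above.

```python
def target_bitrate_kbps(target_mb: float, duration_s: int) -> float:
    """Average audio bitrate (kbit/s) that fits duration_s into target_mb (decimal MB)."""
    return target_mb * 8 * 1000 / duration_s


def ffmpeg_command(src: str, dst: str, bitrate_kbps: int) -> list[str]:
    """Build an ffmpeg invocation: mono downmix (-ac 1) at a fixed audio bitrate (-b:a).

    The choice of flags is an assumption for illustration only.
    """
    return ["ffmpeg", "-i", src, "-ac", "1", "-b:a", f"{bitrate_kbps}k", dst]


duration_s = 3 * 3600 + 33 * 60          # 3h33m recording
kbps = target_bitrate_kbps(38, duration_s)

# Hypothetical file names; pass the list to subprocess.run(...) to execute it.
cmd = ffmpeg_command("meeting.m4a", "meeting-small.m4a", int(kbps))
print(round(kbps, 1), cmd)
```

At roughly 24 kbit/s the recording fits the 38 MB budget, which is a plausible bitrate for speech but low for anything else.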

The cost was $1.42 for 320,087 input tokens and 7,870 output tokens, which works out to about $0.40 per hour of audio. The problem appears when you try to verify the output or link to specific moments: the last timestamp in the transcript was 1:04:00, while the actual meeting ended at 3:31:05. The transcript exists, but it cannot be reliably anchored to the original recording.
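The figures above can be turned into per-hour and per-second rates with a few lines of arithmetic. This is a back-of-the-envelope sketch: it treats all input tokens as audio, although a small share belongs to the text prompt, and it uses the 3h33m file length rather than the meeting's end time.

```python
# Figures reported in the article.
INPUT_TOKENS = 320_087
OUTPUT_TOKENS = 7_870
COST_USD = 1.42
DURATION_S = 3 * 3600 + 33 * 60   # 3h33m recording

# Cost normalised to one hour of audio.
cost_per_hour = COST_USD / (DURATION_S / 3600)

# Approximate input tokens consumed per second of audio
# (assumption: the prompt's own tokens are negligible).
tokens_per_second = INPUT_TOKENS / DURATION_S

print(f"${cost_per_hour:.2f} per hour, {tokens_per_second:.1f} input tokens per second")
```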

Pelican benchmark v2: no model understood the bird's color

Willison has a long-running benchmark called "pelican riding a bicycle," used to test multimodal capabilities. In version 2 he tightened the prompt: the bird must be the correct species, a California brown pelican in breeding plumage, with a visible pouch and plumage, and the bicycle must have correct spoke details.

Results: Gemini 3 Pro (high thinking) came closest to the requirements, GPT-5.1 produced a dumpy pelican with poor bicycle integration, Claude Sonnet 4.5 had an awkward arrangement. Willison's key observation: none of the tested models caught that a California brown pelican is not, in fact, brown.

Cheap transcription and accurate transcription are still two different things

Two distinct things are worth separating here. The audio transcription capability is meaningful: a model with a very long context window handled a three-and-a-half-hour recording for less than a dollar and a half. That is a different price category than before.

But the timestamp inaccuracy is a real limitation, not a cosmetic flaw. For archival use or finding a specific moment, unusable timestamps are a substantial problem. The pelican benchmark shows that models still follow detailed domain-specific instructions unreliably.
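The timestamp problem is cheap to detect automatically: compare the last timestamp in the transcript with the known end of the recording. The sketch below is one way to do that; the function names and the 90% threshold are assumptions chosen for illustration, and the two timestamps are the ones reported above.

```python
def to_seconds(stamp: str) -> int:
    """Convert 'H:MM:SS' or 'MM:SS' into seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def coverage(last_stamp: str, actual_end: str) -> float:
    """Fraction of the recording spanned by the transcript's timestamps."""
    return to_seconds(last_stamp) / to_seconds(actual_end)


def timestamps_plausible(last_stamp: str, actual_end: str, min_coverage: float = 0.9) -> bool:
    """Flag transcripts whose final timestamp falls far short of the real end.

    min_coverage is an arbitrary threshold picked for this example.
    """
    return coverage(last_stamp, actual_end) >= min_coverage


ratio = coverage("1:04:00", "3:31:05")
ok = timestamps_plausible("1:04:00", "3:31:05")
print(f"coverage {ratio:.0%}, plausible: {ok}")
```

A check like this only catches a compressed timeline. It says nothing about whether individual timestamps inside the covered range point at the right moment.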

Where these results point next

Watch how Gemini 3 Pro handles non-English content and whether timestamps are more accurate on shorter recordings. Multimodal capabilities are useful when the model does not add confident noise to its output. Until shorter recordings have been checked, it is safer to assume that the drift seen on multi-hour material can show up there too.

Willison will probably keep revising the pelican benchmark. It is a good example of a canary test: quick, easy to verify, and repeated on each new model version.

Lilith's verdict

Gemini 3 Pro transcribed a three-and-a-half-hour recording for under a dollar and a half, and that is a real finding. Timestamps that end more than two hours early and a pelican that does not know its own color are a signal that cheap transcription and accurate transcription are still two different things.

I keep the external link at the end. First, a concise explanation here — no hunting across someone else's site.

Original source ↗
