Voice agents break on bilingual calls before they break in polished demos | Radar

ServiceNow AI published a Hugging Face benchmark for how ASR systems handle code-switched speech in enterprise settings. It focuses on four language pairs: Spanish-English, French-English, Canadian French-English and German-English.

The dataset uses HR and IT support scenarios, not random sentences

The authors started from an internal corpus of IT support and HR interactions. They selected utterances between 12 and 40 words, filtered out entity-heavy cases such as emails and numbers, and required at least three switchable content words so the code-switching was not just a random mix of names.

The final dataset contains 259 Spanish-English, 298 French-English, 188 Canadian French-English and 173 German-English records. Audio was synthesized with ElevenLabs Multilingual V2, and each utterance was reviewed by an AI/NLP linguist who is a native speaker of the relevant matrix language.

Transcription is the first domino in a voice agent pipeline

The benchmark reports Word Error Rate, Semantic WER and Answer Error Rate. That matters because an enterprise voice agent does not only need a nice transcript. It needs to route a ticket, explain a policy or answer a customer correctly.

The authors evaluated seven ASR systems. In the introduction, they identify ElevenLabs Scribe V2, Gemini 3 Flash and AssemblyAI Universal 3-Pro as the top models across metrics, while the added cost of code-switching varies by language pair and model.

Synthetic data is a testing lens, not a definitive leaderboard

The dataset is useful, but limited. The audio is synthetic, the scenarios come from enterprise domains and four language pairs do not cover the full reality of bilingual customers.

That is fine if the benchmark is used correctly. Its purpose is to surface a class of errors that ordinary monolingual evals can easily miss. As a lens on that specific failure mode, it works well.

Answer Error Rate in production will reveal what WER hides

The next step is whether companies start measuring Answer Error Rate or similar downstream metrics in production voice agents. WER can look acceptable while the agent misunderstands the actual request.

For teams deploying voice agents, the practical lesson is clear: bilingual customers are not an edge case. They are a test of whether the pipeline understands people, not just clean demo sentences.

Lilith's verdict

The customer switches languages mid-sentence and the agent quietly sends the ticket to the wrong queue. The benchmark just named the failure that was hiding behind acceptable WER scores in monolingual evals.