What 400,000 phone calls taught us about who AI hears — and who it hangs up on.
The drive-thru is the hardest listening environment in commerce — engines idling, wind, every accent in America, all in a hurry. I've built voice AI for that. So when I show you what's inside these systems, it's not secondhand.
Every accent in America calls a transmission shop. The 20-year truck owner. The customer on her third language. Fast talkers, slow talkers, Brooklyn and Birmingham.
The question isn't whether AI is smart. It's whether it hears your customers.
Misheard → wrong answer → “Sorry, can you repeat that?” → repeat → repeat → click.
Stanford tested the speech engines behind Amazon, Apple, Google, IBM, and Microsoft — same phrases, different speakers.
Nearly double the errors for Black speakers compared with white speakers. All five systems. Same words, same phrases: the gap is in the machine, not the caller. And today's newest models still show the gap.
And race? The big training sets don't even publish it. The Stanford authors attribute the racial gap to the training data: the machines fail exactly the voices the data leaves out.
Imagine a tech who only ever rebuilt one transmission model.
Now put every car in America in his bay.
That's not malice. That's a training problem. And training problems can be fixed.
The Stanford researchers' #1 recommendation: measure accuracy for every group of speakers — not just the average. A system can score 90% overall and still fail your Thursday-morning regulars every single time.
You already run your shop this way.
Why would you accept an AI with no scoreboard? You can't fix what you don't measure.
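Here is roughly what a per-group scoreboard looks like in code. This is a minimal Python sketch, not our production system: the function names (`word_errors`, `scoreboard`), the group labels, and the ten sample calls are all invented for illustration. It scores word error rate, the measure the Stanford study reports, once overall and once per group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance: substitutions + insertions + deletions."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def scoreboard(calls: Iterable[Tuple[str, str, str]]) -> Tuple[float, Dict[str, float]]:
    """calls: (group, what was said, what the system heard).

    Returns the overall word error rate and the rate for each group.
    """
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for group, said, heard in calls:
        errors[group] += word_errors(said, heard)
        words[group] += len(said.split())
    total_words = sum(words.values())
    if total_words == 0:
        raise ValueError("no reference words to score")
    per_group = {g: errors[g] / words[g] for g in words if words[g]}
    return sum(errors.values()) / total_words, per_group


# Hypothetical data: nine calls heard perfectly, one call heard completely wrong.
calls = [("group_a", "my truck will not shift", "my truck will not shift")] * 9
calls += [("group_b", "my truck will not shift", "right turn signal is stiff")]

overall, per_group = scoreboard(calls)
print(f"overall error rate: {overall:.0%}")
for group, rate in sorted(per_group.items()):
    print(f"{group}: {rate:.0%}")
```

On this made-up data the overall error rate is 10%, which is the "90% overall" system, while every word from `group_b` is heard wrong. The average hides it; the per-group line shows it.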
An AI with no scoreboard doesn't know it's hanging up on your customers. It has no way to know.
We don't grade ourselves on “sounded good.”
We see how the AI performs region by region — because we measure every call.
December 2025: our AI agents go live at Jonathan Tow's two AAMCO shops — Brooklyn and Garden City Park. First month: 4 in 10 appointment conversations end in a booking.
The scoreboard says the agents can do better. We retune them for how these customers actually talk.
Every month since: 6+ in 10 booked — a 56% improvement over launch, across 1,383 agent calls. Outcome quality scores up too.
And it never stops: call patterns shift by region and by season, from winter no-starts to summer overheats. The tuning is continuous, not a one-time fix. We only caught the launch-month shortfall because we look, and that's the whole point of this talk.
A generic bot transcribes the call. A relatable agent carries the conversation — and callers regularly can't tell they're talking to AI. When they can't tell, they stay on the line.
Accuracy gets the words right. Relatability gets the car in the bay. You need both — and you should demand both.
None of that happens if the AI can't understand the caller. Every accent it mishears is a car that drives to the shop down the street.
Use these questions on us too. If we can't answer one, don't buy from us either.
AI is what you feed it and what you check. We're the company that checks.
If your AI doesn't have a scoreboard, it's hanging up on customers you'll never meet. Demand the scoreboard — from us or anyone.
Mahalo. — Questions? · olelo-ai.com · 5-question checklist at the booth
Koenecke et al., “Racial disparities in automated speech recognition,” PNAS, 2020 (Stanford). Five commercial engines (Amazon, Apple, Google, IBM, Microsoft); avg. word error rate 35% for Black speakers vs 19% for white speakers; >20% of Black speakers' snippets unusable (WER ≥ 50%) vs <2%; gap persisted on identical phrases.
“Evaluating OpenAI's Whisper ASR,” JASA Express Letters, 2024. Significant performance differences across accents persist in modern models.
“Self-supervised speech models still struggle with AAVE,” arXiv 2408.14262, 2024. Elevated error rates on African American Vernacular English in wav2vec 2.0 / HuBERT / Whisper-family models.
Whisper UK regional-dialect adaptation, arXiv 2501.08502, 2025. Off-the-shelf models show elevated error on regional dialects; dialect-specific fine-tuning closes much of the gap.
ASR bias survey, arXiv 2211.09511; data/predictive bias, arXiv 2202.12603. Training-data distribution identified as root cause; cohort-level evaluation recommended.
Garnerin et al., LREC 2020. Audit of 66 open speech corpora: real-world (“found”) speech runs 68.1% male / 31.9% female speakers. Mozilla Common Voice reports — gender-labeled English clips skew roughly 3:1–4:1 male. UNESCO, “I'd blush if I could,” 2019 — women are 12% of AI researchers. Note: no major corpus publishes a racial breakdown; the racial-gap link to training data is the PNAS authors' attribution, not a counted statistic.
Martin & Tang, Interspeech 2020 — AAVE grammatical features drive elevated error rates. JAMIA Open 2024 — clinical transcription worse for Black patients. Pacific Northwest corpus study, 2025 — largest errors for African American speakers across all commercial systems tested.
Zendesk CX Trends (industry survey). 74% of customers report repeating themselves as a top frustration.
Olelo platform data (as of June 2026) — calls processed, invoice-matched calls, configurations. Olelo Revenue Recovery Analysis, January 2026 — per-location recovered revenue and ROI.