An honest talk about the machines answering America's phones

IS AI RACIST?

The truth behind the tech.

What 400,000 phone calls taught us about who AI hears — and who it hangs up on.

Olelo Intelligence
Miki Hardisty  ·  CEO, Olelo Intelligence
INTERNAL DRAFT: PENDING FOUNDER CONFIRMATION
Your speaker

Teaching machines to listen —
since before the AI wave.

Miki Hardisty
Co-Founder & CEO
Olelo Intelligence
Jack in the Box
Global Corporate CTO — built one of the industry's first conversational-AI drive-thrus, years before the AI wave
Dell
Global enterprise architecture & big-data strategy at world scale
LPL Financial
SVP — launched the firm's first data science team; early machine learning in financial services
ProService Hawaii
CTO & COO — technical strategist on the Silver Lake investment deal
Olelo
Today: writes Olelo's AI engine herself — patent-pending call-scoring methodology, 400,000+ calls analyzed

The drive-thru is the hardest listening environment in commerce — engines idling, wind, every accent in America, all in a hurry. I've built voice AI for that. So when I show you what's inside these systems, it's not secondhand.

Olelo Intelligence
The stakes

Your callers don't all sound the same.

Brooklyn · Atlanta · Houston · El Paso · Boise · Birmingham · Queens

Every accent in America calls a transmission shop. The 20-year truck owner. The customer on her third language. Fast talkers, slow talkers, Brooklyn and Birmingham.

$1,200+
average job riding on every single call that rings your counter
15–20
calls a busy shop misses every week — each one is a car that may never reach your bay

The question isn't whether AI is smart. It's whether it hears your customers.

Heard it with your own eyes

What the caller said vs. what the AI heard

What the caller said → What a generic AI heard
“My transmission's slipping when she shifts into third.” → “My transition is slipping when sheets into bird.”
“It's the '09 Silverado, the work truck.” → “It's the oh nine silver auto, the word truck.”
“Can y'all look at it Thursday?” → “Can you book it there stay?”
Illustrative examples of the error patterns documented in speech-recognition research.

Misheard → wrong answer → “Sorry, can you repeat that?” → repeat → repeat → click.

74%
of customers name repeating themselves as a top frustration (Zendesk CX industry survey)
A hang-up never shows up in your call log as a lost job. But it is one.
The receipts

Researchers measured it. The answer is yes.

Stanford tested the speech engines behind Amazon, Apple, Google, IBM, and Microsoft — same phrases, different speakers.

White speakers: words misheard 19 / 100
Black speakers: words misheard 35 / 100

Nearly double the errors. All five systems. Same words, same phrases — the gap is in the machine, not the caller. And today's newest models still show the gap.

Also misheard
Southern accents
New York speech
Non-native English
The machines weren't trained on how America actually sounds. Sound familiar? That's your customer list.
Koenecke et al., PNAS 2020 (Stanford); gap still measurable in 2024–2025 studies (full sources: appendix)
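The “19 / 100” and “35 / 100” figures are word error rates (WER): the number of word-level edits (substitutions, deletions, insertions) needed to turn the machine's transcript into what the caller actually said, divided by the number of words the caller said. Below is a minimal Python sketch of that standard calculation. It assumes simple lowercase, punctuation-stripped word matching; the function names are ours, and real evaluations, the Stanford study included, normalize transcripts more carefully before scoring.

```python
import re


def words(text: str) -> list[str]:
    # Assumed normalization: lowercase, keep letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference transcript has no words")
    # dist[i][j]: fewest edits turning the first i reference words into
    # the first j hypothesis words (word-level Levenshtein distance).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            same = ref[i - 1] == hyp[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,  # caller's word dropped (deletion)
                dist[i][j - 1] + 1,  # extra word invented (insertion)
                dist[i - 1][j - 1] + (0 if same else 1),  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


said = "My transmission's slipping when she shifts into third."
heard = "My transition is slipping when sheets into bird."
print(word_error_rate(said, heard))  # 0.625: 5 edits over 8 spoken words
```

Because invented words count as errors too, WER can exceed 1.0 on a badly garbled call; “19 words misheard per 100” is the everyday reading of a WER of 0.19.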
Why this happens

Look at who taught the machines to listen.

2 of 3
voices in real-world speech collections are male — audit of 66 training corpora
~3 of 4
gender-labeled English clips in the largest crowd-sourced voice dataset come from male speakers (Mozilla Common Voice)
12%
women among the AI researchers who build these systems (UNESCO)

And race? The big training sets don't even publish it. The Stanford researchers attribute the racial gap to the training data: the machines fail the voices the data leaves out.

Imagine a tech who only ever rebuilt one transmission model.
Now put every car in America in his bay.

That's not malice. That's a training problem. And training problems can be fixed.

Garnerin et al., LREC 2020 · Mozilla Common Voice · UNESCO 2019
Act II — The Fix

The fix isn't a smarter robot.
It's a scoreboard.

A core recommendation from this research: measure accuracy for every group of speakers, not just the average. A system can score 90% overall and still fail your Thursday-morning regulars every single time.

ARO · Car count · Close rate by advisor

You already run your shop this way.

Why would you accept an AI with no scoreboard? You can't fix what you don't measure.
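Measuring “for every group of speakers, not just the average” comes down to computing the same score per group alongside the pooled number. Below is a minimal Python sketch using made-up call counts (not Olelo or Stanford data), chosen so the overall score lands at 90%. The group labels, the `accuracy_by_group` name, and the 10-point flag threshold are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(calls):
    """calls: iterable of (group, understood) pairs.

    Returns the pooled accuracy across all calls and a per-group breakdown.
    """
    totals = defaultdict(int)  # calls seen per group
    hits = defaultdict(int)    # calls understood per group
    for group, understood in calls:
        totals[group] += 1
        hits[group] += int(understood)
    if not totals:
        raise ValueError("no calls to score")
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {group: hits[group] / totals[group] for group in totals}
    return overall, per_group


# Hypothetical scoreboard: 90 calls from most callers (88 understood)
# and 10 calls from one group of regulars (only 2 understood).
calls = (
    [("most callers", True)] * 88
    + [("most callers", False)] * 2
    + [("Thursday regulars", True)] * 2
    + [("Thursday regulars", False)] * 8
)

overall, per_group = accuracy_by_group(calls)
print(f"overall: {overall:.0%}")  # overall: 90%
for group, acc in per_group.items():
    # Hypothetical rule: flag any group more than 10 points below the pooled score.
    flag = "  <-- gap" if acc < overall - 0.10 else ""
    print(f"{group}: {acc:.0%}{flag}")
```

The pooled 90% looks healthy on its own; only the per-group lines show that one set of regulars was understood on just 2 calls in 10.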

Buyer beware

Why most “AI receptionists” can't fix this

Thin shell on a rented engine

  • Same five speech engines from the Stanford study under the hood
  • A shell can't retune the engine
  • No accuracy scores by region. No testing across accents. No tie to your repair orders.

An instrumented system — gauges on every corner

  • Every call measured: what was said, what was understood, what was booked
  • Performance visible by region and location — gaps get caught
  • Closed loop to the repair order — graded on cars, not vibes

They don't know they're hanging up on your customers. They have no way to know.

What a scoreboard looks like

400,000 calls. Closed loop. City by city.

400K+
calls processed for shops like yours — about 50,000 a month
110K+
calls matched to actual repair orders — graded on cars that hit your bays
100+
live voice-agent configurations — each location tuned to its market
[SCREEN GRAB — Ed: platform regional performance view. Every call scored — all 46 of Tuesday's calls, not the one somebody cherry-picked.]

We don't grade ourselves on “sounded good.”
We see how the AI performs region by region — because we measure every call.

Platform data as of June 2026
Continuous refinement — our scoreboard at work

Two New York shops. Live data.
Watch what tuning does.

1. December 2025: our AI agents go live at Jonathan Tow's two AAMCO shops in Brooklyn and Garden City Park. First month: 4 in 10 appointment conversations end in a booking.

2. The scoreboard says the agents can do better. We retune them for how these customers actually talk. [CONFIRM: what was changed]

3. Every month since: 6+ in 10 booked, a 56% improvement over launch, across 1,383 agent calls. Outcome quality scores up too.

Booked, when an appointment came up
Launch month: 40%
After tuning (months 2–6): 63%
Olelo platform data, Dec 2025 – May 2026 · 1,680 AI-agent calls, two NY locations

And it never stops: call patterns shift by region and by season, from winter no-starts to summer overheats. The tuning is continuous, not a one-time fix. We only caught the launch-month shortfall because we measure every call. That's the whole point of this talk.

DRAFT: confirm "what we changed" wording with Miki/Alisa before presenting
The other half of accuracy

Understood gets them heard.
Relatable gets them booked.

A generic bot transcribes the call. A relatable agent carries the conversation — and callers regularly can't tell they're talking to AI. When they can't tell, they stay on the line.

Remembers the last call
“How'd that brake job hold up?” — context from prior calls shapes the next step
Books — then confirms by text
Appointment goes on the schedule and the confirmation text goes out. No-shows drop, no extra work for your team
Follows up outbound
Declined work and unfinished estimates get a call back — the leak gets chased, not logged

Accuracy gets the words right. Relatability gets the car in the bay. You need both — and you should demand both.

Revenue at risk — what hearing every customer is worth

Accuracy isn't a feature. It's the revenue.

BROOKLYN AAMCO
322 calls
taken by the AI agent in two months — nights, weekends, overflow — on track for ~$450K/year in recovered work
$125K/yr
median recovered revenue per location across shops on the platform
20x+
return on what the platform costs a shop per year

None of that happens if the AI can't understand the caller. Every accent it mishears is a car that drives to the shop down the street.

Olelo Revenue Recovery Analysis, January 2026 · Brooklyn figures from platform data
Take this with you

Five questions for anyone selling you AI

“Show me your accuracy by region and accent — not your average.”
If they only have one number, the failures are hidden inside it.
“When the AI gets a caller wrong, how do you find out?”
Hope is not a system.
“Can you retune it for MY market — and have you ever actually done it?”
Ask for the story.
“Do you tie calls to my repair orders, or just count answered calls?”
Answered ≠ booked ≠ in the bay.
“What happens to the caller the AI can't understand?”
Dead end — or a path to a human?

Use these on us too. If we can't answer one, don't buy from us either.

Printed card + QR download at the booth
So — is AI racist?

AI is what you feed it and what you check. We're the company that checks.

If your AI doesn't have a scoreboard, it's hanging up on customers you'll never meet. Demand the scoreboard — from us or anyone.

Every caller heard.
Every car counted.

Olelo Intelligence  ·  Mahalo. Questions?  ·  olelo-ai.com  ·  5-question checklist at the booth
Appendix — Sources (not presented)

The research behind this talk

Koenecke et al., “Racial disparities in automated speech recognition,” PNAS, 2020 (Stanford). Five commercial engines (Amazon, Apple, Google, IBM, Microsoft); avg. word error rate 35% for Black speakers vs 19% for white speakers; >20% of Black speakers' snippets unusable (WER ≥ 50%) vs <2%; gap persisted on identical phrases.

“Evaluating OpenAI's Whisper ASR,” JASA Express Letters, 2024. Significant performance differences across accents persist in modern models.

“Self-supervised speech models still struggle with AAVE,” arXiv 2408.14262, 2024. Elevated error rates on African American Vernacular English in wav2vec 2.0 / HuBERT / Whisper-family models.

Whisper UK regional-dialect adaptation, arXiv 2501.08502, 2025. Off-the-shelf models show elevated error on regional dialects; dialect-specific fine-tuning closes much of the gap.

ASR bias survey, arXiv 2211.09511; data/predictive bias, arXiv 2202.12603. Training-data distribution identified as root cause; cohort-level evaluation recommended.

Garnerin et al., LREC 2020. Audit of 66 open speech corpora: real-world (“found”) speech runs 68.1% male / 31.9% female speakers. Mozilla Common Voice reports — gender-labeled English clips skew roughly 3:1–4:1 male. UNESCO, “I'd blush if I could,” 2019 — women are 12% of AI researchers. Note: no major corpus publishes a racial breakdown; the racial-gap link to training data is the PNAS authors' attribution, not a counted statistic.

Martin & Tang, Interspeech 2020 — AAVE grammatical features drive elevated error rates. JAMIA Open 2024 — clinical transcription worse for Black patients. Pacific Northwest corpus study, 2025 — largest errors for African American speakers across all commercial systems tested.

Zendesk CX Trends (industry survey). 74% of customers report repeating themselves as a top frustration.

Olelo platform data (as of June 2026) — calls processed, invoice-matched calls, configurations. Olelo Revenue Recovery Analysis, January 2026 — per-location recovered revenue and ROI.
