For people with accents — even within the United States — the artificially intelligent speakers can seem inattentive, unresponsive, even isolating. Experts say too many of the people training, testing and working with the systems all sound the same.
When Meghan Cruz says “Hey, Alexa,” her Amazon smart speaker bursts to life, offering the kind of helpful response she expects from her automated assistant.
With a few words in her breezy West Coast accent, the lab technician in Vancouver, B.C., gets Alexa to tell her the weather in Berlin (70 degrees), the world’s most poisonous animal (a geography cone snail) and the square root of 128, which it offers to the ninth decimal place.
But when Andrea Moncada, a college student and fellow Vancouver resident who was raised in Colombia, says the same in her light Spanish accent, Alexa offers only a virtual shrug. She asks it to add a few numbers, and Alexa says sorry. She tells Alexa to turn the music off; instead, the volume turns up.
“People will tell me, ‘Your accent is good,’ but it couldn’t understand anything,” she said.
Amazon’s Alexa and Google’s Assistant are spearheading a voice-activated revolution. But for people with accents — even the regional lilts, dialects and drawls native to various parts of the United States — the artificially intelligent speakers can seem very different: inattentive, unresponsive, even isolating. For many, the wave of the future has a bias problem, and it’s leaving them behind.
The Washington Post teamed up with two research groups to study the smart speakers’ accent imbalance, testing thousands of voice commands dictated by more than 100 people across nearly 20 cities. The systems, they found, showed notable disparities in how people from different parts of the U.S. are understood.
People with Southern accents, for instance, were 3 percent less likely to get accurate responses from a Google Home device than those with Western accents. And Alexa understood Midwestern accents 2 percent less accurately than East Coast ones.
People with nonnative accents faced the biggest setbacks. In one study comparing what Alexa thought it heard with what the test group actually said, speech from people with nonnative accents produced about 30 percent more inaccuracies.
People who spoke Spanish as a first language, for instance, were understood 6 percent less often than people who grew up around California or Washington, where the tech giants are based.
“These systems are going to work best for white, highly educated, upper-middle-class Americans, probably from the West Coast, because that’s the group that’s had access to the technology from the very beginning,” said Rachael Tatman, a data scientist who has studied speech recognition and was not involved in the research.
At first, all accents are new and strange to voice-activated AI, including the accent some Americans think is no accent: the predominantly white, nonimmigrant, nonregional dialect of TV newscasters, which linguists call “broadcast English.”
The AI is taught to comprehend different accents by processing data from lots and lots of voices, learning their patterns and forming clear bonds among phrases, words and sounds.
To learn different ways of speaking, the AI needs a diverse range of voices — and experts say it’s not getting them because too many of the people training, testing and working with the systems all sound the same. That means accents that are less common or prestigious end up more likely to be misunderstood, met with silence or the dreaded, “Sorry, I didn’t get that.”
Tatman, who works at the data-science company Kaggle but was not speaking on the company’s behalf, said, “I worry we’re getting into a position where these tools are just more useful for some people than others.”
Company officials said the findings, while informal and limited, highlighted how accents remain one of their key challenges.
“The more we hear voices that follow certain speech patterns or have certain accents, the easier we find it to understand them. For Alexa, this is no different,” Amazon said in a statement. “As more people speak to Alexa, and with various accents, Alexa’s understanding will improve.” (Amazon Chief Executive Jeff Bezos owns The Washington Post.)
Google said it “is recognized as a world leader” in natural-language processing and other forms of voice AI. “We’ll continue to improve speech recognition for the Google Assistant as we expand our data sets,” the company said in a statement.
The researchers did not test other voice platforms, such as Apple’s Siri or Microsoft’s Cortana, which have far lower at-home adoption rates.
Nearly 100 million smart speakers will have been sold around the world by the end of the year, the market-research firm Canalys said. Alexa now speaks English, German, Japanese and, as of last month, French; Google’s Assistant speaks all those plus Italian and is on track to speak more than 30 languages by the end of the year.
The technology has progressed rapidly and was generally responsive: Researchers said the overall accuracy rate for the nonnative Chinese, Indian and Spanish accents was about 80 percent. But as voice becomes one of the central ways humans and computers interact, even a slight gap in understanding could mean a major hurdle.
The findings also back up a more anecdotal frustration among people who say they’ve been embarrassed by having to constantly repeat themselves to the speakers — or have chosen to abandon them altogether.
“When you’re in a social situation, you’re more reticent to use it because you think, ‘This thing isn’t going to understand me and people are going to make fun of me, or they’ll think I don’t speak that well,’” said Yago Doson, 33, a marine biologist in California who grew up in Barcelona and has spoken English for 13 years.
Doson said some of his friends do everything with their speakers, but he has resisted buying one because he’s had too many bad experiences.
Smart speakers like the Amazon Echo and Google Home have rapidly created a place for themselves in daily life. One in five U.S. households with Wi-Fi now has a smart speaker, up from one in 10 last year, the media-measurement firm ComScore said.
The companies offer ways for people to calibrate the systems to their voices. But many speaker owners have still taken to YouTube to share their battles in conversation. In one viral video, an older Alexa user pining for a Scottish folk song was instead played The Black Eyed Peas.
Matt Mitchell, a comedy writer in Birmingham, Alabama, whose sketch about a drawling “southern Alexa” has been viewed more than 1 million times, said he was inspired by his own daily tussles with the device.
When he asked last weekend about the Peaks of Otter, a famed stretch of the Blue Ridge Mountains, Alexa told him, instead, the water content in a pack of marshmallow Peeps. “It was surprisingly more than I thought,” he said with a laugh. “I learned two things instead of just one.”
The companies run their AI through a series of sometimes-oddball language drills. Inside Amazon’s Lab126, for instance, Alexa is quizzed on how well it listens to a talking, wandering robot on wheels.
The teams that worked with The Post on the accent study, however, took a more human approach.
Globalme, a language-localization firm in Vancouver, asked testers across the United States and Canada to say 70 preset commands, including “Start playing Queen,” “Add new appointment” and “How close am I to the nearest Walmart?”
The company grouped the video-recorded talks by accent, based on where the testers had grown up or spent most of their lives, and then assessed the devices’ responses for accuracy. The testers also offered other impressions: People with nonnative accents, for instance, told Globalme they thought the devices had to “think” for longer before responding to their requests.
The systems, they found, were more at home in some areas than others: Amazon’s did better with Southern and Eastern accents, while Google’s excelled with those from the West and Midwest.
The tests often proved a comedy of errors, full of bizarre responses, awkward interruptions and Alexa apologies. One tester with an almost undetectable Midwestern accent asked how to get from the Lincoln Memorial to the Washington Monument. Alexa told her, in a resoundingly chipper tone, that $1 is worth 71 pence.
A second study, by the voice-testing startup Pulse Labs, asked people to read three different Post headlines — about President Donald Trump, China and the Winter Olympics — and then examined the raw data of what Alexa thought the people said.
The difference between what people read aloud and what Alexa transcribed was about 30 percent greater for people with nonnative accents than for native speakers, the researchers found.
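The article does not specify how the researchers measured that gap, but a standard way to quantify the difference between a spoken sentence and a machine's transcription is word error rate: the number of word-level substitutions, insertions and deletions needed to turn the transcript back into the original, divided by the original's length. A minimal sketch (the example strings are drawn from the article's "bulldozed" anecdote; the metric itself is an assumption, not the study's stated method):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of words in the reference sentence."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# A perfect transcription scores 0; the article's garbled headline scores high.
said = "trump bulldozed fox news host"
heard = "trump bull diced a fox news heist"
print(word_error_rate(said, said))    # 0.0
print(word_error_rate(said, heard))   # 0.8
```

Comparing each accent group's average score against a native-speaker baseline yields exactly the kind of "30 percent greater" gap the researchers describe.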
People with nearly imperceptible accents, in the computerized mind of Alexa, often sounded like gobbledygook, with words like “bulldozed” coming across as “boulders” or “burritos.”
When a speaker with a British accent read one headline — “Trump bulldozed Fox News host, showing again why he likes phone interviews” — Alexa dreamed up a more imaginative story: “Trump bull diced a Fox News heist showing again why he likes pain and beads.”
Accents, some engineers say, pose one of the stiffest challenges for companies working to develop software that answers questions and carries on natural conversations and chats casually, like a part of the family.
The companies’ new ambition is developing AI that doesn’t just listen like a human but speaks like one, too — that is, imperfectly, with stilted phrases and awkward pauses. In May, Google unveiled one such system, Duplex, that can make dinner reservations over the phone with a robotic, lifelike speaking voice — complete with automatically generated “speech disfluencies,” also known as “umms” and “ahhs.”
In the meantime, people like Moncada, the Colombian-born college student, feel stuck. “I’m a little sad about it,” she said. “The device can do a lot of things. … It just can’t understand me.”