Voice Coach

How one Google exec is leveling language barriers around the world

Author: ARNIE COOPER | Illustration: Alex Nabaum

WHEN MIKE COHEN WAS 2, his parents bought him a little white piano for Hanukkah. Forget the fact that Cohen actually recalls this, which is a feat unto itself — what’s most intriguing is that this children’s toy embodied two concepts that would end up guiding his life. “I remember a sense of fascination I had with the piano,” Cohen says, “by both the mathematics of it, in a very abstract sense, and the idea that you could combine all these patterns of threes and twos into sounds. It seemed like this wondrous machine, this world of potential.”

That same sense of potential colors his days at Google’s Mountain View, Calif., headquarters (a.k.a. the Googleplex), where the 50-something Brooklyn, N.Y., native has been manager of speech technology since 2004. Cohen oversees the internationalization project for the speech recognition team — a mouthful, no doubt, but it boils down to some pretty cool stuff you can now do with your mobile phone or computer, provided it’s running Google’s Chrome browser.

Ironically, Cohen, who took French in high school, has little facility with foreign languages (unless you count the inflections of his native Brooklynese), but he does have an ear for music. After graduating from Boston’s Berklee College of Music with a degree in composition, Cohen spent seven years playing guitar with his own sextet, and at one point even traveled to Haiti to study voodoo rhythms.

In the end, however, his scientific side won out. Cohen got a Ph.D. in computer science from the other Berkeley — at the University of California — and spent the next two decades working in speech technology both at Stanford and at a company of his own. “I was always partly a scientist, partly a composer,” he says. “But as I got interested in computation, I was drawn to speech recognition because the same set of cognitive questions came up.” Speech and music, he explains, “are the two complex auditory signals that humans have evolved to communicate with.”

In 2004, it became clear to Cohen that speech technology was going mobile. Though he thought about starting another firm, he ultimately went with Google, impressed by the company’s “focus on big data,” he says. He’s been working on its speech recognition technology ever since.

Here’s how it works: Say you’re looking for a place to grab dinner in L.A., but you’re in your car and it’s unsafe to type in your request. After you click the “mic” button on your phone, a box pops up onscreen instructing you to “speak now.” You say “Los Angeles restaurants” and — voilà! — the voice search software transforms your words to text and links you to any number of relevant websites. (This is also much the same functionality that the iPhone’s Siri software famously provides.)

But what if you’re one of the 5.5 billion people who don’t speak English? Cohen knew it wouldn’t be enough to offer voice search in just one language, so in 2009 the speech recognition team began the arduous process of bringing the tech to the rest of the globe. You’d think they’d start with something easy, like Spanish. “We chose Mandarin,” Cohen says, laughing. The idea was to get a head start on grappling with every possible linguistic obstacle in this notoriously complex variety of Chinese before moving on to other languages. “Besides, there were a couple of people in the Beijing office really raring to go,” he says, “and there’s nothing like a motivated engineer to make things happen.”

The language modeling process, which took almost a year for Mandarin Chinese, can now be carried out in a few weeks. Operating like a linguistic hit squad, a team of native speakers travels to the target country armed with 30 or 40 Android phones, which they distribute to temp employees who take them out into the community to record locals. A couple of days and 250,000 or so utterances later, the data is used to create a statistical model that “learns” enough of the vocabulary, grammar and syntax to be deployed on the country’s Google search page. As word spreads and millions of Koreans, Indians or Russians, for example, discover they can Google with just their voices, the model actually starts training itself, through what Cohen calls “unsupervised learning.”

At last count, 27 language models have been completed, including five variants of English and four of Spanish, along with more exotic languages like Afrikaans and Bahasa Malay, and even pig Latin, done as a stunt for April Fools’ Day.

And it goes even deeper. “Let me demo this for you,” Cohen says as he grabs my iPhone (his Android was on the blink that day, which should please the ghost of Steve Jobs to no end) and clicks on my Google Translate app. “Say I want to go from English to Spanish. And I want to do it by talking.” Cohen says “good night” into the phone and the text “buenas noches” appears (or, if he had hit the mic button, it would have been spoken). Selecting the program’s conversation mode will activate its “turn taking” feature, so if your Spanish comprises nothing beyond, well, “buenas noches,” you’ll still be able to converse with your Chilean colleague by sticking the phone in the middle of the table and waiting a few seconds for the device to translate what each of you says.

The technology remains a work in progress, but the goal is as simple as it is ambitious: to allow people who speak different languages to communicate seamlessly in real time, a sort of inverse Babel. “The user should never have to wonder whether they can accomplish their current task by speaking,” Cohen says. “If they want to speak, they should assume they can.”

ARNIE COOPER, a Santa Barbara, Calif.–based writer and part-time ESL instructor, uses Google Voice to help his students improve their accents — driving them crazy in the process.
