Walt Mossberg and Microsoft CEO Satya Nadella at the Code Conference
RANCHO PALOS VERDES, California — Microsoft's new CEO, Satya Nadella, just steered the company into exciting and relatively uncharted territory: near real-time speech-to-speech translation.
"It’s been a dream of humanity ever since we started to speak and we wanted to cross the language boundary," said Nadella.
Speaking at Re/Code's inaugural Code Conference (formerly the "D Conference") in Southern California, Nadella and Skype Corporate VP Gurdeep Singh Pall made a Skype call to a non-English-speaking German friend. Then both parties spoke to and understood each other thanks to the live translation capabilities of the Skype Translator pre-beta.
"No one else does this," Pall told me, adding, "It's the first time something like this has been attempted." And it’s probably something we need.
English is not the most commonly spoken language in the world. By some estimates, it's third behind Chinese (and all its variants) and Hindi. However, our increasingly globalized society all but demands that we find a way to communicate across language barriers. Skype already, by Microsoft’s measure, boasts more than 300 million active members and handles roughly one third of international call traffic. Imagine what it could do with built-in voice translation.
Microsoft is not new to the speech recognition game. You'll find the same technology in the recently launched Cortana personal assistant in Windows Phone 8.1 and in the speech recognition that has been live on Xbox 360, and now Xbox One, for over a year. Skype Translator, which comes out of Microsoft Research, is actually three technologies: speech recognition, text-to-speech and machine translation.
"The Skype community is big — REALLY big," wrote Peter Lee, Head of Microsoft Research, in an email to Mashable. "To make Skype Translator a reality, it needed great research to get the science right, and great engineering to make it practical and scalable."
Here's how Skype Translator works: Speaker A starts talking. Skype Translator recognizes the words and actually transcribes them into text. The text transcription of Speaker A is then translated into the language of Speaker B. It's then voice-synthesized into Speaker B's language.
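To make those three steps concrete, here is a minimal sketch of the recognize, translate, synthesize chain in Python. It is not Microsoft's code: the class name, the function names and the tiny English-to-German phrasebook are all hypothetical stand-ins, and "audio" is faked as UTF-8 bytes so the example runs without any speech engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranslationPipeline:
    """Chains the three stages the article describes (hypothetical design)."""
    recognize: Callable[[bytes], str]            # speech recognition: audio -> text
    translate: Callable[[str, str, str], str]    # machine translation: text -> text
    synthesize: Callable[[str, str], bytes]      # text-to-speech: text -> audio

    def run(self, audio: bytes, source_lang: str, target_lang: str) -> bytes:
        text = self.recognize(audio)                                # 1. transcribe Speaker A
        translated = self.translate(text, source_lang, target_lang)  # 2. translate for Speaker B
        return self.synthesize(translated, target_lang)              # 3. speak it


# Toy stand-ins so the sketch runs end to end. All of this is illustrative only.
PHRASEBOOK = {("en", "de"): {"hello": "hallo", "friend": "freund"}}


def fake_recognize(audio: bytes) -> str:
    # Pretend the audio is already text encoded as UTF-8.
    return audio.decode("utf-8")


def fake_translate(text: str, source_lang: str, target_lang: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    table = PHRASEBOOK.get((source_lang, target_lang), {})
    return " ".join(table.get(word.lower(), word) for word in text.split())


def fake_synthesize(text: str, lang: str) -> bytes:
    # Pretend the synthesized speech is just the text encoded as bytes.
    return text.encode("utf-8")


def demo_pipeline() -> TranslationPipeline:
    return TranslationPipeline(fake_recognize, fake_translate, fake_synthesize)


if __name__ == "__main__":
    print(demo_pipeline().run(b"hello friend", "en", "de"))
```

Keeping each stage behind its own function mirrors the article's point that Skype Translator is really three technologies working in sequence; a real system would replace each stand-in with an actual speech or translation engine.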
This sounds slow, and Pall told me you do wait a little for the translation to happen. However, he insists that this is not a "tech latency problem." The process can go quite fast, but since there is a video component here, the system works to make it all seem natural.
While Speaker A is talking, Speaker B will actually hear Speaker A's voice, at a lower volume, even as Skype Translator begins to do its work and starts delivering translated, spoken words. Moreover, the system looks for natural pauses in speech, or, as Pall explained it, "silence detection," to decide when to start translating. The length of time it takes to translate is totally dependent on the length of the sentence or phrase. The alternative would have been to have the speaker hold a button while speaking and let it go when they wanted to deliver a sentence or phrase. This approach should be more natural.
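Microsoft has not said how its silence detection works, but the general idea can be sketched in a few lines of Python. The example below assumes the audio has already been reduced to one loudness value per frame, and it treats a run of quiet frames as the pause that ends a phrase. The function name, the threshold and the frame count are all hypothetical choices for illustration.

```python
from typing import List, Sequence, Tuple


def split_on_silence(
    energies: Sequence[float],
    threshold: float = 0.1,        # assumed loudness cutoff between speech and silence
    min_silence_frames: int = 3,   # assumed number of quiet frames that counts as a pause
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of speech, with end exclusive."""
    segments: List[Tuple[int, int]] = []
    start = None   # index where the current phrase began, or None if not in a phrase
    quiet = 0      # consecutive quiet frames seen inside the current phrase

    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i          # first loud frame opens a new phrase
            quiet = 0              # any speech resets the pause counter
        elif start is not None:
            quiet += 1
            if quiet >= min_silence_frames:
                # The pause is long enough: close the phrase at its last loud frame.
                segments.append((start, i - quiet + 1))
                start = None
                quiet = 0

    if start is not None:
        # Audio ended mid-phrase: close it, dropping any trailing quiet frames.
        segments.append((start, len(energies) - quiet))
    return segments


if __name__ == "__main__":
    frames = [0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.9, 0.0, 0.4, 0.0]
    print(split_on_silence(frames))
```

Each range returned here would be a phrase ready to hand off for translation, which fits Pall's point that the wait depends on how long the sentence or phrase is. Note that the single quiet frame in the middle of the second phrase is too short to count as a pause, so it does not split the phrase.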
As for how Skype Translator knows which languages to use, you'll set your preferred language in preferences. No on-the-fly language detection, for now.
Like other speech systems, Skype Translator will learn over time and its translation and speech may improve. It will, however, be a little while before we're calling our cousins in Italy and saying in their native tongue, "Come stai amico mio?" Skype Translator will appear as a standalone Windows 8.1 app later this year. The goal is to bake it into Skype proper on all platforms, though Microsoft has not set a timetable for this full-scale rollout.