Why Baidu's breakthrough on speech recognition may be a game changer
Deep Speech 2, a speech recognition network developed by China's answer to Google, is so stunningly accurate it can transcribe Chinese better than a person, writes Will Knight
Stroll through Sanlitun, a bustling neighbourhood in Beijing filled with tourists, karaoke bars and luxury shops, and you'll see plenty of people using the latest smartphones from Apple, Samsung and Xiaomi. Look closely, however, and you might notice some of them ignoring the touch screens on these devices in favour of something much more efficient and intuitive: their voice.
A growing number of China's 691 million smartphone users now regularly dispense with swipes, taps and tiny keyboards when looking things up on the country's most popular search engine, Baidu. China is an ideal place for voice interfaces to take off, because Chinese characters were hardly designed with tiny touch screens in mind. But people everywhere should benefit as Baidu advances speech technology and makes voice interfaces more practical and useful. That could make it easier for all of us to communicate with the machines around us.
"I see speech approaching a point where it could become so reliable that you can just use it and not even think about it," says Andrew Ng Yan-tak, Baidu's chief scientist and an associate professor at Stanford University, in the United States. "The best technology is often invisible and, as speech recognition becomes more reliable, I hope it will disappear into the background."
Voice interfaces have been a dream of technologists (not to mention science-fiction writers) for many decades. But in recent years, thanks to some impressive advances in machine learning, voice control has become a lot more practical.
No longer limited to a small set of predetermined commands, it now works even in noisy environments, such as the streets of Beijing, or when you're speaking from across a room. Voice-operated virtual assistants such as Apple's Siri, Microsoft's Cortana and Google Now come bundled with most smartphones, and newer devices, such as Amazon's Alexa-powered Echo, offer a simple way to look up information, cue up songs and build shopping lists with your voice. These systems are not perfect, sometimes mishearing and misinterpreting commands in comedic fashion, but they are improving steadily, and they offer a glimpse of a graceful future in which there's less need to learn a new interface for every new device.