How machines' speech recognition capability has been evolving

State-of-the-art systems are enabling the practical use of voice commands for professionals such as surgeons and pilots

I recently arrived in Pittsburgh late at night, after 24 hours of travel from Hong Kong, to visit my daughter. We needed a stiff drink, I told her when she picked me up from the airport.

She told her iPhone: "Find liquor stores near Pittsburgh Airport." Within seconds, a map appeared on the screen with a list of liquor stores nearest the airport.

Of course I had talked to machines before - calling directory enquiries and phone banking - but the ability of her smartphone to meet my need for alcohol made me realise that machines that recognise and respond to speech will only get smarter. That was some techno-epiphany.

In the rapidly developing technology of Automatic Speech Recognition (ASR), machines are "hearing" and understanding spoken language, and acting on verbal commands.

ASR actually predates the invention of the computer by 50 years: in the 1870s, Alexander Graham Bell experimented with transmitting speech for his wife, who was deaf. He had hoped to create a device that would transform a spoken word into a picture that a deaf person could see; that line of research eventually led to his invention of the telephone.

Conceptually, ASR is simply the machine-matching of sounds with words. By using models of the sounds of a language to build a library of words, speech can theoretically be matched with words. If the words can be fitted into a certain set of grammatical and syntactical rules, they can then be arranged into sentences, a process known as rule-based pattern recognition.
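
To make the idea concrete, here is a minimal sketch of rule-based matching in Python. It is an illustration only, not how any real product works: the phoneme labels, the three-word lexicon, the word categories and the allowed patterns are all invented for this example, and the speech is assumed to have been split into per-word sound sequences already.

```python
# Toy "library of words": each word maps to a sequence of sound symbols.
# The symbols and the word list are illustrative assumptions.
LEXICON = {
    "find": ("F", "AY", "N", "D"),
    "liquor": ("L", "IH", "K", "ER"),
    "stores": ("S", "T", "AO", "R", "Z"),
}

# Invert the lexicon so a sound sequence can be looked up directly.
SOUNDS_TO_WORD = {sounds: word for word, sounds in LEXICON.items()}

# Toy grammar: a category for each word and the category sequences we accept.
CATEGORY = {"find": "VERB", "liquor": "NOUN", "stores": "NOUN"}
ALLOWED_PATTERNS = {("VERB", "NOUN"), ("VERB", "NOUN", "NOUN")}


def match_words(segments):
    """Map each sound sequence to a word, or None if it is not in the lexicon."""
    return [SOUNDS_TO_WORD.get(tuple(segment)) for segment in segments]


def is_grammatical(words):
    """Return True if every word is known and the sequence fits an allowed pattern."""
    if any(word is None or word not in CATEGORY for word in words):
        return False
    return tuple(CATEGORY[word] for word in words) in ALLOWED_PATTERNS


heard = [
    ["F", "AY", "N", "D"],
    ["L", "IH", "K", "ER"],
    ["S", "T", "AO", "R", "Z"],
]
words = match_words(heard)
print(words, is_grammatical(words))
```

The lookup only succeeds when the sounds match a lexicon entry exactly. That brittleness is why such exact matching copes poorly with real speech.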

However, human language has infinite variety in the way sounds are made and strung together to form words. In any single language, accents, dialects and mannerisms vary from region to region and across social and economic groups, and can vastly change the way certain words or phrases are spoken. To encompass such numerous variations, state-of-the-art ASR systems are based on complex statistical methods known as Hidden Markov Models and neural networks.
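
As a rough illustration of the statistical approach, the sketch below decodes a tiny Hidden Markov Model with the Viterbi algorithm, a standard way of finding the most probable sequence of hidden states. Everything specific here is an assumption made for the example: the two hidden sounds ("S" and "IY"), the two coarse acoustic labels ("hiss" and "tone") and all of the probabilities. Real systems use far larger models trained on recorded speech.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its probability."""
    # best[t][s] holds (probability of the best path ending in s at time t,
    # the state that preceded s on that path).
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                (prev[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)

    # Pick the most probable final state, then follow the back-pointers.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical model: hidden sounds and the acoustic labels they tend to emit.
states = ("S", "IY")
start_p = {"S": 0.8, "IY": 0.2}
trans_p = {"S": {"S": 0.6, "IY": 0.4}, "IY": {"S": 0.1, "IY": 0.9}}
emit_p = {"S": {"hiss": 0.9, "tone": 0.1}, "IY": {"hiss": 0.2, "tone": 0.8}}

path, prob = viterbi(["hiss", "hiss", "tone", "tone"], states, start_p, trans_p, emit_p)
print(path)  # ['S', 'S', 'IY', 'IY']
```

Unlike exact matching, the model scores every possible explanation of what was heard and keeps the likeliest, so a sound that is slightly "off" lowers the probability rather than causing the match to fail outright.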
