Why speech recognition will (probably) never work

Speech-to-text software is one of personal computing’s final frontiers. The dream of sitting in a room and talking to your computer (and having it understand, compute, and respond accordingly) is, apparently, unlikely ever to become a reality. The problems are manifold. The biggest are that many words sound alike, so we instinctively interpret them from context and expression, and that certain words have an array of meanings.

Here are a couple of snippets from this fascinating article, which ends up being more about language than voice recognition (you might also notice in it a couple of things I’ve posted recently).

In 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. Adding data or computing power made no difference. Researchers at Carnegie Mellon University checked again in 2006 and found the situation unchanged. With human discrimination as high as 98%, the unclosed gap left little basis for conversation. But sticking to a few topics, like numbers, helped. Saying “one” into the phone works about as well as pressing a button, approaching 100% accuracy. But loosen the vocabulary constraint and recognition begins to drift, turning to vertigo in the wide-open vastness of linguistic space…

Many spoken words sound the same. Saying “recognize speech” makes a sound that can be indistinguishable from “wreck a nice beach.” Other laughers include “wreck an eyes peach” and “recondite speech.” But with a little knowledge of word meaning and grammar, it seems like a computer ought to be able to puzzle it out. Ironically, however, much of the progress in speech recognition came from a conscious rejection of the deeper dimensions of language. As an IBM researcher famously put it: “Every time I fire a linguist my system improves.” But pink-slipping all the linguistics PhDs only gets you 80% accuracy, at best…

Researchers have also tried to endow computers with knowledge of word meanings. Words are defined by other words, to state the seemingly obvious. And definitions, of course, live in a dictionary. In the early 1990s, Microsoft Research developed a system called MindNet which “read” the dictionary and traced out a network from each word out to every mention of it in the definitions of other words.
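
One way to picture that network is as an inverted index over a dictionary: for every word, record which other entries mention it in their definitions. Here is a minimal sketch in Python. The three-entry dictionary is made up for illustration, and the names `TOY_DICTIONARY`, `tokenize` and `build_mention_network` are hypothetical; nothing here is taken from MindNet itself, which was surely far richer.

```python
import re
from collections import defaultdict

# Toy dictionary, invented for illustration only.
TOY_DICTIONARY = {
    "golf": "a game in which a player uses a club to hit a ball into a hole",
    "club": "a stick used to hit a ball in golf",
    "ball": "a round object that is thrown, kicked or hit in a game",
}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_mention_network(dictionary):
    """Map each word to the set of headwords whose definitions mention it."""
    network = defaultdict(set)
    for headword, definition in dictionary.items():
        for word in tokenize(definition):
            if word != headword:  # skip a word mentioning itself
                network[word].add(headword)
    return network


network = build_mention_network(TOY_DICTIONARY)
# Headwords whose definitions mention "ball"
print(sorted(network["ball"]))
```

With this toy data, looking up "ball" leads to both "golf" and "club", which is the kind of link the golf example in the article relies on.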

Words have multiple definitions until they are used in a sentence which narrows the possibilities. MindNet deduced the intended definition of a word by combing through the networks of the other words in the sentence, looking for overlap. Consider the sentence, “The driver struck the ball.” To figure out the intended meaning of “driver,” MindNet followed the network to the definition for “golf” which includes the word “ball.” So driver means a kind of golf club. Or does it? Maybe the sentence means a car crashed into a group of people at a party.
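
The "looking for overlap" step can be sketched in a few lines of Python. This is plain word-overlap counting in the spirit of the Lesk algorithm, not MindNet's actual method, which the article does not spell out. The two definitions of "driver", the stopword list and the function names (`content_words`, `pick_sense`) are all assumptions made up for the example.

```python
import re

# Hypothetical sense inventory for "driver"; definitions invented for illustration.
SENSES = {
    "golf club": "a golf club with a large head used to hit the ball "
                 "a long distance from the tee",
    "motorist": "a person who drives a car, bus or other vehicle on the road",
}

# Small, arbitrary stopword list so that words like "the" do not count as overlap.
STOPWORDS = {"a", "an", "the", "to", "of", "or", "on", "who",
             "with", "from", "used", "other"}


def content_words(text):
    """Return the set of lowercase words in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def pick_sense(target, sentence, senses):
    """Score each sense by how many context words also appear in its definition."""
    context = content_words(sentence) - {target}
    scores = {name: len(context & content_words(gloss))
              for name, gloss in senses.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = pick_sense("driver", "The driver struck the ball.", SENSES)
print(best, scores)
```

On this toy data the golf sense wins on the single shared word "ball". That is also the weakness the article points to: counting overlap has no way of noticing that "ball" might itself mean a party.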

To guess meanings more accurately, MindNet expanded the data on which it based its statistics much as speech recognizers did. The program ingested encyclopedias and other online texts, carefully assigning probabilistic weights based on what it learned. But that wasn’t enough. MindNet’s goal of “resolving semantic ambiguities in text,” remains unattained. The project, the first undertaken by Microsoft Research after it was founded in 1991, was shelved in 2005.