Talking Face To Interface -- Recent Advances In Voice Technology Allow Computer Users To Have The Final Say
Seattle Times Staff Reporter
You could excuse voice-technology researchers for having a beef with "Star Trek." After all, no matter what kind of advances they make in their work, their progress is always measured against the movie moment when the Enterprise escapes obliteration thanks to an omniscient, talking computer working at breakneck speed.
Even so, with an eye toward making computers and consumer electronics more user-friendly, a number of companies - large and small - are boldly going where speech technology has not been before. Some pleasant surprises are just around the corner, including a new program that gets past the robotlike talk required by existing speech-recognition programs.
In two major addresses last month, in Miami and Seattle, Microsoft Chairman Bill Gates singled out progress in speech technology, saying coming advances will lower barriers that have stood between ordinary people and computers. Microsoft, he noted, is spending hundreds of millions of dollars on voice-recognition research.
He pointed to the day when, instead of using keywords to find things on the Internet, "You'll say the same thing to your computer. . . . And all the richness of linguistics and common sense will be applied in helping perform that operation."
Articulating the goal of speech recognition and speech synthesis is easy: Getting computers to understand our voices, and to talk back. Achieving it is not, experts say.
It may be years before we can give a rapid-fire command:
"Computer: How much time before the mortgage is due? Make me a latte. Set the DVD to record the World Series . . . and draw the shades while you're at it."
And when, in the voice font and personality of your choosing, the computer responds:
"No problemo: 15 days until a penalty. Did you want 1% or 2% today? How 'bout those M's . . . Want the pregame show? Shall I close the windows before drawing the shades?"
Nevertheless, though they may seem meager by comparison, these early adaptations are here now:
-- Telephone directory-assistance systems that recognize cities and names and assist in placing and receiving collect calls.
-- Car-navigation systems that let drivers ask for directions, and hands-free cellular phone systems for cars.
-- Charles Schwab & Co.'s "VoiceBroker" system, which allows investors to check on the status of any one of more than 13,000 stocks.
-- "Talk To Me" products, launched earlier this year by Globalink, to help learn Spanish, French, German and English.
Due to ship by the end of the month is a product from Dragon Systems called Naturally Speaking, which several analysts consider significant. For the first time, they say, a company will have produced a general-purpose dictation program that does not require the speaker to pause between words for the computer to follow.
For more than a decade, Dragon, IBM and Kurzweil Applied Intelligence (which is being acquired by Lernout & Hauspie Speech Products) have offered "discrete speech" or "isolated-word" dictation programs. These require the user to pronounce each word distinctly, inserting at least a slight pause between words. Other companies, such as Voice Pilot Technologies, are licensing proven speech-recognition engines, and creating new user interfaces in the march toward hands-free computing.
Lightweight consumer versions of these programs, available for under $100 including a microphone, are selling at the rate of 200,000 a month, according to Bill Meisel of TMA Associates, a speech-recognition consulting firm. "That puts them in a pretty significant category," he said. (IBM is expected to announce tomorrow that it is dropping the suggested retail price of its Simply Speaking model to $49.99, meaning a street price as low as $29).
While these programs work, their shortcomings tend to include a daunting training regimen, awkward editing controls, or a steep learning curve (on the computer's part) to adapt to a particular user's voice patterns. Perhaps their biggest drawback has been the discrete speech requirement.
By contrast, Naturally Speaking users need endure only what Dragon describes as a "fun" 18-minute training session, in which the user reads from Arthur C. Clarke's "3001: The Final Odyssey" or Dave Barry's "Dave Barry in Cyberspace." The product also promises better editing commands and - voilà! - no need to hesitate between words.
"Continuous speech really changes the landscape," said Meisel. "A lot of people say they've been waiting for this."
Green light for fast talkers
Dragon's new product, which runs on Windows 95 and which IBM and L&H/Kurzweil expect to match later this year, requires a 133-megahertz Pentium processor and 32 megabytes of RAM. That's a fairly high bar given all the computers already deployed, but it's becoming increasingly standard for new units. The extra horsepower (discrete-speech programs can operate on 486 machines) is necessary to sort individual words coming in a faster speech stream.
Dragon Systems claims the product, which comes with a 30,000-word vocabulary, can keep up with fast talkers, capturing speech well in excess of 100 words per minute. It will cost $700 for the standard version.
The company maintains Naturally Speaking has an accuracy rate of 95 to 97 percent, though that claim was difficult to validate during a brief demonstration earlier this spring.
For example, when a company representative uttered "pronunciations," the computer heard and typed "For Nancy Asians."
Still, on the whole, the test version developers demonstrated in April performed admirably.
Given context-specific meanings and the peculiar complexity of English (French is also a pain, while Italian and Spanish are much easier), it's impossible to completely eradicate errors, experts say. To a computer's ears, for example, there's not a heckuva lot of difference between "recognize speech" and "wreck a nice beach."
"Phonetically," observed Xuedong "XD" Huang, senior speech-technology researcher at Microsoft, "they are virtually identical."
Likewise, consider the challenge programmers face getting computers to distinguish "Imagine world peace" from "Imagine whirled peas," or picking the correct "rights" in this sentence: "Dr. Wright will write a note after turning right."
Experts say the English language consists of about 50 "phonemes" - basic sound units like the hard "c" in "cake" or the "d" in "doctor." Speech-recognition engines are puzzle solvers: They identify phonemes, connect them to form words and eliminate possibilities based on context.
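In miniature, that "eliminate possibilities based on context" step can be sketched in code. The sound spellings, homophone sets and scores below are invented for illustration - a stand-in for the statistical language models real engines use - applied to the "Dr. Wright will write a note after turning right" example:

```python
# Toy sketch of context-based disambiguation: one sound maps to several
# homophones; a tiny table of word-pair scores (a stand-in for a real
# language model) picks the likeliest candidate given the previous word.
# All spellings and scores here are invented for illustration.

HOMOPHONES = {
    "rayt": ["right", "write", "Wright", "rite"],
}

# Hand-made bigram scores: (previous word, candidate) -> plausibility.
BIGRAM = {
    ("dr.", "Wright"): 0.9,
    ("will", "write"): 0.8,
    ("turning", "right"): 0.9,
}

def disambiguate(prev_word, sound):
    """Pick the homophone that scores best after the previous word."""
    candidates = HOMOPHONES[sound]
    return max(candidates, key=lambda w: BIGRAM.get((prev_word.lower(), w), 0.0))

sentence = [("Dr.", "rayt"), ("will", "rayt"), ("turning", "rayt")]
print([disambiguate(prev, snd) for prev, snd in sentence])
# -> ['Wright', 'write', 'right']
```

The same acoustic evidence yields three different words purely because the surrounding words shift the odds - which is exactly why "recognize speech" and "wreck a nice beach" can come apart despite sounding nearly identical.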
Beyond ferreting out homonyms and idioms, recognizers face these challenges: ambient noise, accents, regionalisms, colds, stress and oddball variances. ("Think of a 12-year-old boy going through puberty," said one expert.) The challenge is to develop technology capable of locking on to an individual's voice despite such variables.
"A computer is not robust," said Huang, a former computer-science professor at Carnegie Mellon University. "It is sensitive." That means it will make stupid mistakes unless properly programmed.
Each year, he said, there's an 8 to 10 percent reduction in the error rate relative to the previous year. "We'll never achieve (an) error-free" system, Huang said, but "if the error rate is below 2 percent, I'll be happy."
Huang wasn't specific about when he expects to get happy, but said that Microsoft was approaching speech technology broadly. "Dictation is not our goal," he said.
Instead, his research team (which has grown from one to 20 in the past four years) sees the perfection of speech recognition and speech synthesis as a means to a more sophisticated end: The development of programs permitting two-way communications between people and computers with a "personality."
Tell computer to phone home
That means building computers able to detect and emulate speech patterns, voice inflections and moods. For example, Microsoft program manager Kevin Schofield cited entertainment possibilities such as adding interactive laughter and singing to MSN's TV-model program content.
The Microsoft team sees no reason why voice mail and e-mail shouldn't eventually merge, so that you could call your computer on a cell phone, tell the computer to check your e-mail, and have it read you those messages you wish to hear.
IBM, too, is working on developing a variety of speech applications. Spokeswoman Susan Scott-Ker noted IBM already has a continuous speech product for radiologists - MedSpeak.
The company is also working on various prototypes of its general-purpose continuous speech technology but, like Microsoft, is looking to develop speech as a major new interface with applications for all sorts of appliances, she said.
A hint is evident in IBM's voice-enabled Home Director technology, announced in February, which permits "networked" homes to control lights and garage doors through the Aptiva line of personal computers.
Lernout & Hauspie is developing speech-compression techniques to encode speech into small digital files for storage and transmission, then decode them instantly with no audible loss of quality.
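The article doesn't say which techniques L&H uses, but one classic building block of speech coding gives the flavor: mu-law companding, the trick standard telephony (ITU-T G.711) uses to squeeze a sample's dynamic range so it fits in fewer bits. A minimal sketch, not L&H's method:

```python
import math

# Mu-law companding, a classic speech-coding building block (shown for
# illustration; not necessarily the technique L&H is developing).
# It compresses a sample's dynamic range so quiet speech keeps more
# resolution when the signal is later quantized to fewer bits.

MU = 255.0  # the standard mu value for 8-bit telephony

def mu_law_encode(x):
    """Compress a sample in [-1, 1] into the companded range [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_decode(y):
    """Invert the companding (quantization, the lossy step, is omitted)."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

sample = 0.3
roundtrip = mu_law_decode(mu_law_encode(sample))
print(abs(roundtrip - sample) < 1e-9)  # companding alone round-trips cleanly
```

In a real codec the companded value would then be quantized to 8 bits - that quantization, not the companding, is where any loss of quality creeps in.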
Bob Kutnick, L&H's chief technology officer, said the company, in partnership with another firm, plans to offer by Christmas hand-held devices for international travelers. The tourist would speak one of 100 or 200 programmed statements into the device, which would spit out the foreign-language equivalent.
Such advances impress even those who follow the research and development. "I wasn't expecting products like this for another . . . five to 10 years," said Amy Wohl, editor of TrendsLetter, a monthly newsletter on computer software.
But there's a difference, she noted, between a computer swallowing a big dictionary and a grammar-rule book and really understanding "the meaning of language."
She figures we're at least a decade away from the day when computers will do what we want them to do.
Trekkies take heart: the Enterprise still rules.
Copyright (c) 1997 Seattle Times Company, All Rights Reserved.