COMPUTER VOICE INPUT/OUTPUT

By: Vanessa Mitchell

INTRODUCTION

When you think of computer input and output, devices such as microphones, speakers, monitors, and keyboards come to mind. However, one of the most important types of input and output involves our voice, or speech. Speech is one of our foremost ways of communicating; without it, many people would struggle to get even the most basic ideas across. Computer voice input and output products are making communication easier for many people. Voice recognition and speech synthesis are helping to meet the needs of people with and without disabilities, and together they make hands-free computing possible.

WHAT IS COMPUTER VOICE INPUT

Voice input, also known as accessing the computer by voice, can be useful to anyone. Computer voice input products use speech recognition to learn how a particular user says a word and then use this information to predict which word is being spoken. The most powerful speech recognition systems can recognize thousands of words. Many different techniques are used to perform speech recognition, but the process usually has three stages. First comes the digital sampling of speech. Second comes acoustic signal processing; common techniques include spectral analysis, Linear Predictive Coding, Mel Frequency Cepstral Coefficients, and cochlea modeling. Third comes the recognition of phonemes and words. Phonemes are the distinguishable sounds of a language; English has roughly forty-five of them. Advanced systems combine several techniques to complete this stage, and the most commonly used approach is based on Hidden Markov Models (HMMs). Depending on the system, users report dictation speeds of up to 100 words per minute. Continuous speech recognition systems let you speak more naturally and more quickly than discrete speech systems. Before the late 1990s, however, dictation relied on discrete voice recognition, which requires you to speak each word slowly and pause after it. For instance, compare the following with speaking continuously:

Imagine…how…much…longer…this…would…take…to…pause…after…each…word…rather…than…continuously…speaking…

Today's advances in computer voice input technology allow you to input information continuously, cutting the time it takes to dictate or give commands to your computer. Other advances include support for people with speech impediments or strong accents and recognition of more than one voice per program. Certain software even lets you play games by voice, find out what time it is, surf the web, or open documents with just a couple of words. As on Star Trek, future computers will talk to you and listen to what you have to say.
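To make the HMM-based recognition stage mentioned above a little more concrete, here is a minimal sketch in Python of the Viterbi algorithm, a standard way to find the most likely sequence of hidden states in a Hidden Markov Model. Everything in the toy model is an assumption made for illustration: the three phonemes, the coarse acoustic labels ("burst", "vowel", "silence"), and all of the probabilities are invented, and real recognizers work with far larger models and numeric acoustic features.

```python
# Toy HMM: hidden states are phonemes, observations are coarse acoustic labels.
# All states, labels, and probabilities below are invented for illustration.

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev_col = best[-1]
        col = {}
        for s in states:
            # Pick the predecessor that gives the highest path probability.
            col[s] = max(
                (prev_col[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(col)
    # Trace the best path backwards from the most likely final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for col in reversed(best[1:]):
        path.append(col[path[-1]][1])
    path.reverse()
    return path, prob

states = ("k", "ae", "t")
start_p = {"k": 0.6, "ae": 0.2, "t": 0.2}
trans_p = {
    "k": {"k": 0.1, "ae": 0.8, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.3, "t": 0.6},
    "t": {"k": 0.3, "ae": 0.4, "t": 0.3},
}
emit_p = {
    "k": {"burst": 0.7, "vowel": 0.1, "silence": 0.2},
    "ae": {"burst": 0.1, "vowel": 0.8, "silence": 0.1},
    "t": {"burst": 0.6, "vowel": 0.1, "silence": 0.3},
}

path, prob = viterbi(["burst", "vowel", "burst"], states, start_p, trans_p, emit_p)
print(path)  # ['k', 'ae', 't']
```

In this toy run, the sounds "burst, vowel, burst" are decoded as the phonemes of the word "cat". The first and last sounds look the same to the model, so it is the transition probabilities that let it choose "k" at the start and "t" at the end.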

VOICE INPUT PRODUCTS

Some frequently used voice input products include the advanced IBM ViaVoice and Dragon Systems Naturally Speaking series.

[Product images: ViaVoice; Naturally Speaking]


Installation of the software is very basic. One thing to consider before installing either product is the time it takes. For both IBM ViaVoice and Dragon Systems Naturally Speaking, most of that time goes into creating your personal user file. On either system you are required to dictate 20 to 30 minutes' worth of prepared text, and processing the dictation takes another 25 minutes or so. This training helps to increase the recognition system's accuracy. After you install the software and create your personal user file, you merely launch the software, put on your headset, and begin dictating. Although, as previously stated, you don't have to speak extremely slowly with pauses between each word, you do have to take care to pronounce each word clearly. It isn't quite the same as talking to a family member or friend. The software provides its own program to dictate into, which is basically a word processor, or you can dictate into programs such as Microsoft Word or WordPerfect. Dictating will seem awkward at first, and even simple phrases and mathematical equations may not be easy. For example, (y - 1) = (y - 2) + 1 would be dictated as:

Open parenthesis, turn on italics, y, turn off italics, minus, one, close parenthesis, equal sign, open parenthesis, turn on italics, y, turn off italics, minus, two, close parenthesis, plus, one
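The mapping from written symbols to spoken commands in this example can be sketched in a few lines of Python. This is only an illustration of why dictating an equation takes so many words. The function name dictate and the SPOKEN table are hypothetical and are not part of ViaVoice or Naturally Speaking, and the sketch assumes single-letter variables, single-digit numbers, and that every variable is set in italics as in the example above.

```python
# Hypothetical table of spoken commands for a few symbols (illustration only).
SPOKEN = {
    "(": "open parenthesis",
    ")": "close parenthesis",
    "-": "minus",
    "+": "plus",
    "=": "equal sign",
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def dictate(equation):
    """Spell out an equation the way it would have to be dictated."""
    words = []
    for ch in equation:
        if ch.isspace():
            continue  # spaces are not spoken
        if ch.isalpha():
            # Assumption: variables are italicized, as in the example above.
            words += ["turn on italics", ch, "turn off italics"]
        else:
            words.append(SPOKEN[ch])  # raises KeyError for unsupported symbols
    return ", ".join(words)

spoken = dictate("(y - 1) = (y - 2) + 1")
print(spoken)
print(len(spoken.split(", ")), "spoken items")
```

Apart from the capital letter at the start, the printed result matches the dictation shown above. An equation that takes seventeen keystrokes (not counting spaces) turns into seventeen separate spoken items, several of them two or three words long.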

So you get the idea: dictation may not be as easy as it sounds. Another thing to take into consideration is the space these programs take up. The different dictionaries and specialty features, such as support for more than one user, can take up a lot of space on your hard drive. Cost can also be an issue. Some advanced voice input systems cost $1,000 or more, while others cost less than $100. In the past, voice input systems were not used in the mainstream. Most of their use came from people with disabilities who could not use the keyboard or mouse. However, because of improvements in quality and more competitive prices, voice recognition systems are moving further into the mainstream.

COMPUTER VOICE OUTPUT

Where computer voice input takes voice commands and produces text or completes specific functions, computer voice output allows the computer to speak for you. The computer uses speech synthesis, which allows it to produce sound that resembles a human voice. One of the earliest speech synthesizers was developed by Joseph Faber in 1846. Speech synthesis still cannot cover the wide range of emotions heard in a human voice; however, it is able to take text or files and produce intelligible voice output. There are four main ways of synthesizing speech: fixed speech, wave table synthesis, formant synthesis, and concatenative synthesis. Fixed speech is also known as digital speech. In this method a person's voice is recorded onto a chip, and it is usually used with automated systems. Wave table synthesis works by taking phonemes and looking each one up in a table of waveforms. The final voice is usually not very natural and can be hard to comprehend. Formant synthesis feeds a stream of phonemes through a mathematical model of the human vocal tract. Although the process can be very complex, the final voice is clear and easy to understand. Concatenative synthesis combines concepts from fixed speech, wave table synthesis, and formant synthesis. As many as 1,400 voice samples may need to be recorded, one phoneme at a time. Because concatenative speech synthesis is so complex, only the most powerful computers with massive amounts of memory can accomplish it. The products on the market today allow users to bypass most of these technical and complex issues and focus on the output.
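As a rough illustration of the wave table idea described above, the following Python sketch looks up each phoneme in a table of waveforms and joins the results end to end. It is a simplified sketch under stated assumptions, not how any particular product works: the table holds plain sine tones as stand-ins for recorded speech sounds, and the sample rate, frequencies, durations, and names such as WAVE_TABLE and synthesize are all invented for the example.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed value)

def tone(freq_hz, duration_ms):
    """Stand-in waveform: a pure sine tone, not real recorded speech."""
    n = SAMPLE_RATE * duration_ms // 1000  # number of samples in the tone
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical wave table: one stored waveform per phoneme.
WAVE_TABLE = {
    "k": tone(300, 50),
    "ae": tone(700, 100),
    "t": tone(400, 50),
}

def synthesize(phonemes):
    """Look up each phoneme's waveform and join them end to end."""
    samples = []
    for p in phonemes:
        samples.extend(WAVE_TABLE[p])
    return samples

audio = synthesize(["k", "ae", "t"])
print(len(audio), "samples,", len(audio) / SAMPLE_RATE, "seconds")
```

Because the stored waveforms are simply placed one after another, the joins between them are abrupt. That gives some idea of why a voice built this way can sound unnatural.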

VOICE OUTPUT PRODUCTS

 

Voice output aids generate printed or spoken text. These products have been a great help to individuals who have difficulty communicating by speech or who have other disabilities. Finger touch screens, joysticks, and keyboard overlays allow users to select the letters, words, phrases, or even pictures they would like to communicate.

Other products simply convert text to spoken words; an example is TextAloud MP3. TextAloud MP3 is great for anyone who is tired of reading from the computer screen. It lets you work on other projects while listening to e-mails, Word documents, or even web pages.

Try it for yourself. These two websites offer demos for text to voice.

http://www.bell-labs.com/project/tts/voices.html

http://www.research.att.com/~ttsweb/cgi-bin/ttsdemo

Most major computer companies offer voice output products similar to these, and more. Prices for voice output products range from about $20 into the thousands of dollars, depending on your needs and expectations.

 

THE FUTURE OF VOICE INPUT/OUTPUT AND THE CHALLENGES IT WILL FACE

                                             

The use of computer voice input/output is growing. With competition driving costs down and more user-friendly products being developed, more and more people are taking advantage of this technology. The challenges it faces now and in the future include more natural-sounding voices, a wider range of emotions, more available voices, better portability, a broader range of languages, and greater awareness of the technology. However, breakthroughs are being made every day in each of these areas. For instance, both speech synthesis and speech recognition for languages other than English have continued to expand, and many people who speak other languages are now able to benefit from this technology. Computer voice input/output technology will continue to push through the barriers that stand in its way. Speech technology will become less like Star Trek and more a part of everyday life as it migrates into our homes.

CONCLUSION

As computer voice input/output continues to grow, so does opportunity. Everyone can benefit in some way from this technology. Growing awareness of computer voice input/output will be reflected in our schools, jobs, and everyday lives. Ultimately, computer voice input and output will create an easier, faster, and higher-quality way of life.


Webliography

http://www.utoronto.ca/atrc/reference/tech/voca.html

http://www.research.att.com/~ttsweb/cgi-bin/ttsdemo 

http://sun.com

http://speech.cmu.edu

http://www.research.microsoft.com

http://www.concentric.net/~Hexman/speechsynth.html - intro

http://www.nextuptech.com/

http://encarta.msn.com/