What Are the Different Speech Recognition Techniques?

Several speech recognition techniques are used to capture spoken words and convert them into data that can be used by a software program. There are three broad ways to analyze speech in an effort to determine what is being said. The first is called discrete speech, meaning only a single word is spoken at a time. The second is known as connected speech, in which words must be spoken in a certain manner to be understood. Finally, there is continuous speech, which is how most people normally speak.

The most common approach used for all types of speech recognition is the Hidden Markov Model (HMM). This system involves large data trees of phonemes, the basic sound units of a language, which are linked by the statistical probability of one sound following another. By comparing each incoming sound to the nodes in the data tree, the program can determine the most likely completed word with a high rate of accuracy in a relatively short period of time.
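The idea of picking the most probable sequence of phonemes can be sketched with the Viterbi algorithm, a standard way to decode an HMM. This is a minimal illustration, not how any particular product works: the three phonemes, the acoustic labels ("burst", "voiced", "closure") and every probability below are made-up values chosen only for the example.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    # Each column maps a state to (log-probability of best path, previous state).
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Choose the predecessor that gives the highest score for state s.
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)

    # Walk backwards from the best final state to recover the path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path

# Hypothetical toy model for the word "cat" (phonemes k, ae, t).
STATES = ["k", "ae", "t"]
START_P = {"k": 0.8, "ae": 0.1, "t": 0.1}
TRANS_P = {
    "k": {"k": 0.3, "ae": 0.6, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.4, "t": 0.5},
    "t": {"k": 0.1, "ae": 0.1, "t": 0.8},
}
EMIT_P = {
    "k": {"burst": 0.7, "voiced": 0.1, "closure": 0.2},
    "ae": {"burst": 0.1, "voiced": 0.8, "closure": 0.1},
    "t": {"burst": 0.3, "voiced": 0.1, "closure": 0.6},
}

if __name__ == "__main__":
    frames = ["burst", "voiced", "voiced", "closure"]
    print(viterbi(frames, STATES, START_P, TRANS_P, EMIT_P))
    # → ['k', 'ae', 'ae', 't']
```

Log-probabilities are used so that long utterances do not underflow to zero; the sketch assumes every probability in the tables is greater than zero.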

One problem that is difficult to overcome with some speech recognition techniques is isolating where a word starts and ends. This task is complicated by background noise in the room and the fact that some syllables have an audio signature that resembles a break between words. For this reason, discrete and connected speech recognition techniques, which constrain how words are spoken and so make word boundaries easier to find, are the most accurate.
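One simple way to look for word boundaries is to measure the energy of short frames of audio and treat a long enough quiet stretch as a gap. The sketch below assumes a list of samples scaled to the range -1.0 to 1.0; the function name, frame length, threshold and minimum silence length are all illustrative choices, not values from any real system.

```python
def find_word_spans(samples, frame_len=4, threshold=0.1, min_silence_frames=2):
    """Return (start_frame, end_frame) pairs for loud regions; end is exclusive."""
    n_frames = len(samples) // frame_len
    # Mean squared amplitude of each frame.
    energies = [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]

    spans = []
    start = None   # first frame of the word currently being tracked
    silence = 0    # consecutive quiet frames seen inside that word
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # The gap is long enough to count as a break between words.
                spans.append((start, i - silence + 1))
                start = None
                silence = 0
    if start is not None:
        # Audio ended while a word was still open.
        spans.append((start, n_frames - silence))
    return spans

if __name__ == "__main__":
    quiet, loud = [0.0] * 4, [0.5] * 4
    two_words = quiet * 2 + loud * 2 + quiet * 2 + loud + quiet
    print(find_word_spans(two_words))            # → [(2, 4), (6, 7)]
    short_pause = loud + quiet + loud + quiet * 2
    print(find_word_spans(short_pause))          # → [(0, 3)]
```

The second call shows the difficulty described above: a pause shorter than `min_silence_frames` is absorbed into the word, and lowering that setting would instead split words at quiet syllables. Background noise makes the choice of `threshold` similarly fragile.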

Another factor that separates different speech recognition techniques is the issue of software vocabulary. Software that is interpreting speech can either have a very limited vocabulary with high accuracy, or a large vocabulary that must be matched to a specific user’s individual speech patterns. When a program uses the HMM method of assembling words, the fewer words it has to tell apart, the more accurate it can be. This is the approach that most automated telephone systems use to decipher numbers or responses to questions.
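The effect of vocabulary size can be illustrated with a much simpler matcher than an HMM: compare the phonemes that were heard against each word in the vocabulary and keep the closest matches. This sketch uses plain edit distance as a stand-in, and the word lists and phoneme spellings are hypothetical examples.

```python
def edit_distance(a, b):
    """Number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]

def closest_words(heard, vocabulary):
    """Return all vocabulary words tied for the smallest distance to heard."""
    scored = {word: edit_distance(heard, phonemes)
              for word, phonemes in vocabulary.items()}
    best = min(scored.values())
    return sorted(word for word, dist in scored.items() if dist == best)

# Hypothetical vocabularies: a small digit menu and a larger general one.
SMALL_VOCAB = {
    "two": ["t", "uw"],
    "four": ["f", "ao", "r"],
    "nine": ["n", "ay", "n"],
}
LARGE_VOCAB = dict(SMALL_VOCAB, do=["d", "uw"], zoo=["z", "uw"])

if __name__ == "__main__":
    heard = ["uw"]  # the first consonant was lost in noise
    print(closest_words(heard, SMALL_VOCAB))  # → ['two']
    print(closest_words(heard, LARGE_VOCAB))  # → ['do', 'two', 'zoo']
```

With only three words to choose from, the damaged input still has a single best match. Adding similar-sounding words produces a three-way tie that the program cannot resolve without more information.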

Speech recognition techniques that understand a large vocabulary are usually designed to interact with only one user or very few users. This is because the program must be trained to understand the speech patterns of the person speaking. The training involves reading pre-made paragraphs of text to the software. The words being read are known, so the program is able to build a statistical model of phonemes specific to the user. This gives the program a much better chance of understanding the user, but it also might hinder the program’s understanding of people with whom it has not trained.
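Because the training text is known in advance, each stretch of audio can be paired with the phoneme the user was supposed to be saying, and the program can count what that phoneme sounded like. The sketch below assumes this pairing has already been done and that each audio frame has been reduced to a simple label; the function name, the labels and the add-one smoothing are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_emissions(aligned_frames, symbols, smoothing=1.0):
    """Estimate P(acoustic symbol | phoneme) from (phoneme, symbol) pairs."""
    counts = defaultdict(Counter)
    for phoneme, symbol in aligned_frames:
        counts[phoneme][symbol] += 1

    model = {}
    for phoneme, seen in counts.items():
        # Smoothing keeps unseen symbols from getting a probability of zero.
        total = sum(seen.values()) + smoothing * len(symbols)
        model[phoneme] = {s: (seen[s] + smoothing) / total for s in symbols}
    return model

if __name__ == "__main__":
    # Hypothetical frames from one user reading a known passage.
    frames = [("ae", "voiced"), ("ae", "voiced"), ("ae", "burst"), ("t", "closure")]
    model = train_emissions(frames, ["burst", "voiced", "closure"])
    print(model["ae"]["voiced"])  # → 0.5
    print(model["t"]["closure"])  # → 0.5
```

The resulting table reflects only the voice it was trained on, which is why a model tuned this way can perform worse for a different speaker.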

The most difficult of the speech recognition techniques is interpreting continuous or natural speech. Many people tend to run words together and speak at different speeds, so the accuracy of programs that transcribe continuous speech is lower than that of the other methods. Still, programs do exist that can transcribe this type of speech, some of them employing fuzzy logic and neural networks to help recognize patterns and isolate words.