STEM To Go

How Speech Recognition Technology Works

With ongoing advances in electronics, speech recognition, in which devices recognize and decipher spoken language, has become an increasingly common feature. How does it work? Continue reading to find out!

 


Have you ever called a large organization and been answered by an automated voice recording? You are asked a series of questions and can respond either by dialing numbers or by speaking. If you answer verbally, the machine automatically processes your response to carry on the interaction. This is all made possible by a speech recognition program.


Whether it's Siri, Alexa, or Google Home, there are many smart programs that you can operate solely with your voice. Nowadays, most devices have microphones that pick up what you say and transcribe it into text, whether for sending messages or for following commands. These commands can be used in different scenarios, such as automatically dialing a contact.


The Components Behind The Process


There are many names for speech recognition processes, such as automatic speech recognition (ASR) or speech to text (STT), in which computers and other devices go through many steps to decipher human words. Speech is produced as vibrations that travel through the air, and an analog-to-digital converter (ADC) translates these vibrations into digital data that a computer can process. In this process, the sound is digitized by being measured at small, regular intervals, and any excess (background) noise is filtered out. Because human voices tend to rise and fall throughout speech, the system normalizes the sound to a consistent volume level. These small normalized intervals of sound are then compared to template sound samples already stored in the system.
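The digitizing and normalizing steps described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a pure 440 Hz tone stands in for a voice, and the function names (`sample_tone`, `quantize`, `normalize_peak`), the 8,000 samples-per-second rate, and the 8-bit depth are all made up for the example rather than taken from any real speech system.

```python
import math

def sample_tone(freq_hz, duration_s, sample_rate, amplitude):
    """Measure a pure tone at evenly spaced instants (the sampling step)."""
    count = int(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(count)
    ]

def normalize_peak(samples, target_peak=1.0):
    """Scale the signal so that its loudest sample reaches target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    return [s * target_peak / peak for s in samples]

def quantize(samples, bits=8):
    """Round each measurement to a signed integer level, as an ADC would."""
    max_level = 2 ** (bits - 1) - 1  # 127 for 8 bits
    return [round(max(-1.0, min(1.0, s)) * max_level) for s in samples]

# A quiet speaker and a loud speaker producing the same (hypothetical) tone.
quiet = sample_tone(440, 0.01, 8000, amplitude=0.2)
loud = sample_tone(440, 0.01, 8000, amplitude=0.9)

# After normalization both recordings peak at the same level.
quiet_norm = normalize_peak(quiet)
loud_norm = normalize_peak(loud)

# The normalized signal stored as 8-bit integers.
digital = quantize(quiet_norm)
```

After normalization, the quiet and loud recordings have the same peak level, which is what lets the later steps compare them against the same stored samples.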


Then, the sound signal is split up further, partly based on recognition of plosive consonant sounds, such as "p" and "t", which briefly stop the airflow in speech. These segments can be as short as a few hundredths, or even thousandths, of a second, and are matched to phonemes, the basic sounds of a language, based on the specific language the system is programmed to process. This step of the process is actually the most challenging, as it is prone to inaccuracies that can cause a message to be transcribed incorrectly. Since there are many words in a language, mistaking one for another is not uncommon. Additionally, variations including accents, dialects, and regional mannerisms can be harder for voice recognition systems to decipher. Therefore, many speech recognition engineers and researchers are working to improve this part of the process.
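To make the idea of matching short segments to stored sounds more concrete, here is a toy sketch in Python. It is not how a production recognizer works: the two features (average energy and zero-crossing rate), the `TEMPLATES` values, and the phoneme labels are all hypothetical and chosen only to keep the example small. Real systems rely on much richer features and statistical models.

```python
def split_into_frames(samples, frame_size):
    """Cut the signal into consecutive, equal-length segments."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]

def frame_features(frame):
    """Describe a frame by (average energy, zero-crossing rate)."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (energy, crossings / (len(frame) - 1))

# Hypothetical stored templates: one feature pair per sound.
TEMPLATES = {
    "ah": (0.50, 0.10),       # loud, slowly varying (vowel-like)
    "s": (0.05, 0.80),        # quiet, rapidly varying (hiss-like)
    "silence": (0.00, 0.00),
}

def closest_phoneme(features, templates):
    """Return the template label with the smallest squared distance."""
    def distance(template):
        return sum((f - t) ** 2 for f, t in zip(features, template))
    return min(templates, key=lambda name: distance(templates[name]))

# A made-up signal: a vowel-like frame, a hiss-like frame, then silence.
signal = [0.7] * 4 + [-0.7] * 4 + [0.2, -0.2] * 4 + [0.0] * 8

labels = [
    closest_phoneme(frame_features(frame), TEMPLATES)
    for frame in split_into_frames(signal, frame_size=8)
]
print(labels)  # ['ah', 's', 'silence']
```

Even in this simplified form, the weakness described above is visible: a speaker whose "ah" is quieter or rougher than the stored template could easily land closer to a different label.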


Furthermore, beyond recognizing individual sounds or words, current speech recognition systems use statistical modeling systems, which predict the phrases or even sentences likely to follow certain words. Using functions and probability, they construct the most probable coherent sentence being spoken. This draws on which words and phrases are most commonly used, and on what would make sense after what has already been deciphered and selected. This part of the process relies on data, which serves as the foundation for determining common usage.
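One of the simplest forms of this kind of statistical modeling is a bigram model, which counts how often each word follows another in example text. The Python sketch below is a minimal illustration of that idea, not the method used by any particular product; the tiny `CORPUS` and the helper names `train_bigrams` and `pick_most_likely` are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def pick_most_likely(prev_word, candidates, counts):
    """Among similar-sounding candidates, pick the one seen most often after prev_word."""
    return max(candidates, key=lambda word: counts[prev_word][word])

# A tiny, made-up body of example text standing in for real usage data.
CORPUS = [
    "call my mom",
    "call my office",
    "call my mom now",
    "they left their keys over there",
    "put it over there",
    "their keys are here",
]

model = train_bigrams(CORPUS)

# Predict the most common word after "my".
next_after_my = model["my"].most_common(1)[0][0]
print(next_after_my)  # mom

# Choose between two words that sound identical, using the previous word.
print(pick_most_likely("over", ["their", "there"], model))  # there
print(pick_most_likely("left", ["their", "there"], model))  # their
```

The more example text the model has seen, the better its counts reflect common usage, which is why this stage depends so heavily on data.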


Of course, despite the many improvements in these programs, they cannot guarantee 100% accuracy. Other factors can interfere with the process, including loud background noise and overlapping speech, which make it harder for the program to isolate the verbal signals of one distinct voice. Additionally, homophones, or words that sound the same but are spelled differently, can be mistaken for one another (e.g., "there" and "their").


The Value of Speech Recognition


However, these programs are highly useful. Not only does speech recognition technology make daily tasks more convenient for the average user, it also serves various functions to aid people with special needs. For instance, it has made devices more accessible for people who are blind, who can now operate them by voice.


So the next time you talk to Siri, or an automated machine on the phone, you can recognize and appreciate the complexity and functionality of these incredible smart systems. Thank you for reading and we hope you learned something new!



 

References:


“Engineering Speech Recognition from Machine Learning.” Infosec Resources, 5 Aug. 2021, resources.infosecinstitute.com/topic/engineering-speech-recognition-from-machine-learning.


Grabianowski, Ed. “How Speech Recognition Works.” HowStuffWorks, 27 Jan. 2020, electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition.htm.


“Speech Recognition - an Overview.” ScienceDirect, www.sciencedirect.com/topics/engineering/speech-recognition.


