Speech Recognition, often called Automatic Speech Recognition (ASR) or Speech-to-Text (STT), is the capability of a machine or program to identify words spoken aloud and convert them into readable text.
Sound itself is a wave, so the input to a speech recognizer is a continuous analog signal.
Feature Extraction
First, convert the original analog signal into digital format.
Then, divide the digital audio into short frames and extract the signal features of each frame.
Finally, identify the patterns and features of each frame to determine the correct phonemes (the basic units of sound).
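The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a synthetic 440 Hz tone rather than real speech, the 16 kHz sample rate and the 400-sample frame with a 160-sample hop are illustrative choices, and energy and zero-crossing rate are simple stand-ins for whatever features an actual recognizer would extract. All function names are hypothetical.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate=16000):
    """Sample a pure sine tone and quantize it to signed 16-bit integers."""
    n = int(round(duration_s * sample_rate))
    return [int(round(32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
            for i in range(n)]

def split_into_frames(samples, frame_len, hop_len):
    """Cut the sample list into overlapping, fixed-length frames."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]

def frame_features(frame):
    """Return (energy, zero_crossing_rate) for one frame."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return energy, crossings / (len(frame) - 1)

# Assumed values: 0.1 s of a 440 Hz tone, 25 ms frames, 10 ms hop at 16 kHz.
samples = sample_and_quantize(freq_hz=440, duration_s=0.1)
frames = split_into_frames(samples, frame_len=400, hop_len=160)
features = [frame_features(f) for f in frames]
```

Each entry in features describes one frame. The last step, mapping frame features to phonemes, needs a trained model and is not shown here.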
WER & CER
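WER (Word Error Rate) and CER (Character Error Rate) are the standard metrics for measuring the accuracy of speech recognition. Both compare the recognized text against a reference transcript: WER counts errors at the word level, CER at the character level.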
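Both metrics are usually computed as (S + D + I) / N, where S, D and I are the numbers of substitutions, deletions and insertions needed to turn the reference into the recognized text, and N is the number of words (for WER) or characters (for CER) in the reference. A minimal sketch, assuming whitespace tokenization, no text normalization, and a non-empty reference; the function names are hypothetical:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, with the reference "the cat sat" and the recognized text "the cat sit", one of three words is wrong, so wer(...) gives 1/3, while only one of eleven characters (spaces included) is wrong, so cer(...) gives 1/11. Because insertions are counted, both rates can exceed 1.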