Neural Networks and Glimpses for Speech-in-Noise Understanding
Humans use glimpses to identify speech in noise, whereas Automatic Speech Recognition (ASR) systems often rely on signal-to-noise ratios (SNRs) as a predictor of speech intelligibility. This research extends the studies by Zhu et al. and Cooke et al. by evaluating the importance of glimpses for the performance of an artificial neural network in noisy environments. The existing wav2vec 2.0 model by Baevski et al. is tested on both clean and noisy speech, followed by an analysis of glimpses. The results show a strong positive correlation between word accuracy and glimpse ratio, which indicates that neural networks rely on glimpses for speech-in-noise understanding. Glimpses are also shown to be a better predictor of word accuracy than signal-to-noise ratios, and to contribute more to the understanding of non-stationary than stationary noise types.
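The glimpse ratio referred to above can be illustrated with a minimal sketch, in the spirit of Cooke's glimpsing model: a "glimpse" is a time-frequency region where the local speech energy exceeds the noise energy by some threshold, and the glimpse ratio is the fraction of such regions. The function below is an assumption-laden illustration, not the exact procedure used in this research; it assumes separate access to the clean speech and noise signals, a simple Hann-windowed STFT, and a 3 dB local-SNR threshold.

```python
import numpy as np

def glimpse_ratio(speech, noise, frame_len=400, hop=160, snr_threshold_db=3.0):
    """Fraction of time-frequency bins where the local speech power
    exceeds the noise power by snr_threshold_db (a 'glimpse').

    Sketch following the glimpsing idea; parameter values are
    illustrative assumptions, not those of the original study.
    """
    def stft_power(x):
        # Frame the signal, apply a Hann window, and take per-frame power spectra.
        n_frames = 1 + (len(x) - frame_len) // hop
        win = np.hanning(frame_len)
        frames = np.stack(
            [x[i * hop : i * hop + frame_len] * win for i in range(n_frames)]
        )
        return np.abs(np.fft.rfft(frames, axis=1)) ** 2

    s_pow = stft_power(np.asarray(speech, dtype=float))
    n_pow = stft_power(np.asarray(noise, dtype=float))
    # Local SNR per time-frequency bin; small epsilon avoids log(0).
    local_snr_db = 10.0 * np.log10((s_pow + 1e-12) / (n_pow + 1e-12))
    return float(np.mean(local_snr_db > snr_threshold_db))
```

Under this definition, louder noise leaves fewer bins above the threshold, so the glimpse ratio falls as the noise level rises, which is the behaviour the correlation analysis in the abstract relies on.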
Faculteit der Sociale Wetenschappen