Audio classification using GRU

How the brain transforms real-life sound fragments into neural representations is an active area of research, and much remains unexplained. In this paper, inspired by Francl & McDermott (2022) and Van der Heijden & Mehrkanoon (2020), I investigated deep recurrent neural networks (RNNs) with gated recurrent units (GRUs) to come one step closer to understanding auditory processing in humans. This biologically inspired recurrent neural network is trained to predict the azimuth location of a sound as well as its category (i.e. speech, nature, urban, music and human sounds). Both predictions are multi-label, multi-class classification tasks, and the performance of the model is measured using the binary cross-entropy loss. The model is human-inspired in its architectural design choices: the left and right audio channels are fed in as separate inputs, and each classification task has its own pathway, mimicking the different brain areas that perform sound localisation and identification. The model was trained and tested on a set of approximately 50,000 one-second audio fragments (approximately 14 hours of audio in total). Additionally, the model was evaluated on an unseen evaluation set to ensure ecological validity. The localisation task in particular showed results that indicate generalisability, and its error patterns were similar to those of humans, as discussed in the paper. The identification task, however, did not: it neither matched human accuracy nor showed similar error patterns. Overall, the errors of this multi-task RNN were larger than those of human listeners. I suggest that more training data is needed to draw further conclusions from this human-inspired GRU model. Another way to extend this research would be to explore different types of neural networks while staying true to the biological design. For instance, incorporating spiking neural networks (SNNs) while increasing the quantity of input data would be an interesting next step in this field.
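As a rough illustration of how a binary cross-entropy loss can score two multi-label outputs at once, here is a minimal sketch in plain Python. It is not the implementation used in the paper: the function names, the four azimuth bins, the example probabilities and the equal weighting of the two tasks are all assumptions made for illustration; only the five sound categories come from the abstract.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy over independent labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        # Clamp the probability so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def multi_task_loss(loc_probs, loc_targets, cat_probs, cat_targets):
    """Sum of the localisation and identification losses.

    Equal weighting of the two tasks is an assumption of this sketch.
    """
    return (binary_cross_entropy(loc_probs, loc_targets)
            + binary_cross_entropy(cat_probs, cat_targets))


# Hypothetical outputs of the localisation pathway: 4 azimuth bins (assumed).
loc_targets = [0, 1, 0, 0]
loc_probs = [0.10, 0.80, 0.05, 0.05]

# Hypothetical outputs of the identification pathway, one label per
# category named in the abstract: speech, nature, urban, music, human.
cat_targets = [1, 0, 0, 0, 0]
cat_probs = [0.70, 0.10, 0.10, 0.05, 0.05]

loss = multi_task_loss(loc_probs, loc_targets, cat_probs, cat_targets)
print(f"combined loss: {loss:.4f}")
```

Because each label is scored independently, the same function handles both pathways regardless of how many labels each one predicts; a lower combined value means both pathways assign high probability to the correct labels and low probability to the rest.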
Faculteit der Sociale Wetenschappen