Understanding the Features of a Convnet Trained for Phone Recognition

For convolutional neural networks (convnets) trained for image recognition, it is known what the features represent. For convnets trained for phone recognition, this is not yet known. This study addressed the following question: what do the features of such a convnet represent? A convnet with three convolutional layers was trained on the TIMIT phone recognition task, and a deconvnet was applied to obtain visualizations of its features. In experiment 1, the deconvnet was applied to the activations caused by the top 4 input phones per feature. In experiment 2, it was applied to the activations caused by the top 3 average phones per feature. Phone label analysis revealed consonant-, front-vowel- and back-vowel-sensitive features in the third layer. In both experiments, the visualizations were hard to interpret. Visualizing features that represent aspects of audio may therefore not be the best way to gain insight into them, although more experiments with different convnet architectures are needed to confirm this. Future research could look for other ways to gain insight into what the features represent, for example by further exploring the possibilities of phone label analysis.
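The two selection schemes and the phone label analysis can be illustrated with a small sketch. It is not the study's code: it assumes each input phone has already been reduced to one scalar activation per feature (the abstract does not say how activations were summarized), and the names `top_k_inputs`, `top_k_average_phones`, `ACTS` and `LABELS` are hypothetical, with made-up toy values.

```python
from collections import defaultdict

# Toy data (hypothetical): one row per input phone, one column per feature.
ACTS = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.6],
]
LABELS = ["iy", "aa", "iy", "s"]  # phone label of each input row


def top_k_inputs(acts, k):
    """For each feature, return the indices of the k most activating inputs."""
    result = []
    for f in range(len(acts[0])):
        # Rank input indices by their activation for feature f, highest first.
        ranked = sorted(range(len(acts)), key=lambda i: acts[i][f], reverse=True)
        result.append(ranked[:k])
    return result


def top_k_average_phones(acts, labels, k):
    """For each feature, return the k phone labels with the highest mean activation."""
    # Group activation rows by phone label.
    by_phone = defaultdict(list)
    for row, phone in zip(acts, labels):
        by_phone[phone].append(row)

    result = []
    for f in range(len(acts[0])):
        # Mean activation of feature f over all inputs sharing a phone label.
        means = {p: sum(r[f] for r in rows) / len(rows) for p, rows in by_phone.items()}
        ranked = sorted(means, key=means.get, reverse=True)
        result.append(ranked[:k])
    return result


if __name__ == "__main__":
    print(top_k_inputs(ACTS, 2))
    print(top_k_average_phones(ACTS, LABELS, 2))
```

Under these assumptions, the first function corresponds to picking individual inputs as in experiment 1 (with k = 4), and the second to ranking phones by average activation as in experiment 2 (with k = 3). Inspecting which labels dominate a feature's ranking is one simple form of phone label analysis.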
Faculteit der Sociale Wetenschappen