Understanding the Features of a Convnet Trained for Phone Recognition

Issue Date

2015-07-13

Language

en

Abstract

For convolutional neural networks (convnets) trained for image recognition, it is known what the learned features represent. For convnets trained for phone recognition, this is not yet known. This study asks: what do the features of such a convnet represent? A convnet with three convolutional layers was trained on the TIMIT phone recognition task, and a deconvnet was applied to obtain visualizations of its features. In experiment 1, the deconvnet was applied to the activations caused by the top 4 input phones per feature. In experiment 2, it was applied to the activations caused by the top 3 average phones per feature. Phone label analysis revealed consonant-, front-vowel- and back-vowel-sensitive features in the third layer. In both experiments, the visualizations were hard to interpret. Visualization may therefore not be the best way to gain insight into features that represent aspects of audio, although further experiments with different convnet architectures are needed to confirm this. Future research could look for other ways to gain insight into what the features represent, for example by further exploring phone label analysis.
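
As an illustration of the kind of phone label analysis mentioned above, the following is a minimal sketch, not the study's actual code. It assumes each input phone has already been reduced to one scalar activation per feature (for example, the maximum of the feature map); that reduction, the function names, and the toy data are all hypothetical. It selects the top-k inputs per feature and tallies their phone labels.

```python
from collections import Counter

import numpy as np


def top_k_inputs_per_feature(activations, k):
    """Return, for each feature, the indices of the k strongest inputs.

    activations: array of shape (n_inputs, n_features), holding one scalar
    activation per input and feature (an assumption of this sketch).
    """
    # Sort input indices by ascending activation within each feature column,
    # keep the last k rows, and reverse them so the strongest input is first.
    order = np.argsort(activations, axis=0)
    return order[-k:][::-1].T  # shape (n_features, k)


def phone_labels_per_feature(top_indices, labels):
    """Count the phone labels of the selected inputs for each feature."""
    return [Counter(labels[i] for i in row) for row in top_indices]


# Toy data (hypothetical): 6 input phones, 2 features.
activations = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.1],
    [0.2, 0.8],
    [0.9, 0.3],
    [0.3, 0.7],
])
labels = ["s", "iy", "iy", "t", "ih", "s"]

top = top_k_inputs_per_feature(activations, k=3)
counts = phone_labels_per_feature(top, labels)
```

In this toy example the first feature responds most strongly to vowel inputs and the second to consonant inputs, which is the kind of pattern a label tally can surface without relying on a visualization.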

Faculty

Faculteit der Sociale Wetenschappen