Mind the Linguistic Gap: Studying the learning of linguistic properties of continuous sign language videos in an isolated sign language recognition task
The research carried out in this thesis is an initial investigation into whether a deep neural network, specifically a 3D convolutional neural network (3D-CNN), can learn any aspect of continuous sign language (SL) linguistics in an isolated sign language recognition (ISLR) task. To do so, we use Dutch SL, or Nederlandse Gebarentaal (NGT), data from the Corpus Nederlandse Gebarentaal (CNGT) and NGT Signbank. We define a Linguistic Gap (LG) as the difference between SL linguistics knowledge and the observable linguistic properties learnt by the classifier. We hypothesize that an LG exists in the difference between the intrinsic dimension (ID) of the 1024-dimensional neural representations found in the last hidden layer of our classifier and the 21 theoretical dimensions of NGT that we derive from linguistic specifications in NGT Signbank. To study the LG effectively, we design a new linguistically centered methodology that showcases the effect of linguistics on the classification. Given the isolated nature of the sign language recognition (SLR) task, we determine that phonology is the most straightforward linguistic aspect to study in this work. Thus, we use the phonological difference between pairs of signs to design and evaluate experiments that approach the binary classification task from a linguistic perspective. We freeze all layers in the model except for the last hidden layer while fine-tuning on our SL data. This confines any linguistic knowledge acquired by the network to this last hidden layer, which allows us to study the ID in relation to linguistics. To the best of our knowledge, we present the first application of ID to video data and to representations learnt on SL data.
To extract the ID of the neural representations, we use the maximum likelihood estimation (MLE) and Two-Nearest Neighbours (TwoNN) algorithms, which are the only algorithms with recorded applications to ID estimation on image data and on neural representations of image data. We carry out three experiments, in which we compare the classification of minimal pairs, i.e., two signs with different meanings that differ in only one phoneme, with that of non-minimal pairs. The first experiment highlights the effect of the phonological difference between pairs of signs on the LG when a maximum amount of data is available. We compare four classifiers trained on the two most frequent non-minimal pairs and the two most frequent minimal pairs in the dataset. The second experiment keeps the best-performing minimal pair and non-minimal pair to study the effect of input data resolution on the LG. In the last experiment, we expand on the concept of minimal pairs and introduce phonological distance, which gives us a measure of the phonological difference between non-minimal pairs. We study the effect of this distance between pairs of signs to gain further insight into how the network incorporates SL linguistic knowledge in the classification. In these experiments, we discover through the calculation of ID that the last hidden layer of our I3D model is capable of representing SL data in latent space as effectively as the representations made by linguists in NGT Signbank, although the data remain highly over-represented with respect to the dimensionality of the layer's feature vector. We also observe that the ID of the neural representations in this layer is not sensitive to the phonology of signs, but to other aspects such as the spatial and temporal resolution of the input data. These initial results suggest that, contrary to our initial hypothesis, the LG does not lie in the difference between the ID of the neural representations and the theoretical ID of NGT.
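To make the idea of ID estimation concrete, the following is a minimal sketch of a TwoNN-style estimator in pure Python. It is not the implementation used in this thesis: the function name `two_nn_id`, the `discard_fraction` parameter, and the synthetic test data are assumptions made for illustration. The sketch follows the commonly described form of the method, in which the ratio of each point's second to first nearest-neighbour distance is computed and the ID is read off as the slope of a line through the origin fitted to the empirical distribution of those ratios.

```python
import math
import random


def two_nn_id(points, discard_fraction=0.1):
    """Estimate intrinsic dimension from nearest-neighbour distance ratios.

    Illustrative sketch only; names and defaults are assumptions.
    """
    ratios = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, which give an undefined ratio
            ratios.append(r2 / r1)

    ratios.sort()
    n = len(ratios)
    # Drop the largest ratios, which are the noisiest, and never keep the
    # last point, whose empirical CDF value is 1 (log of zero below).
    keep = min(int(n * (1 - discard_fraction)), n - 1)

    # Empirical CDF F_i = i / n. Under the TwoNN model,
    # -log(1 - F(mu)) = d * log(mu), so d is the slope through the origin.
    xs = [math.log(mu) for mu in ratios[:keep]]
    ys = [-math.log(1 - (i + 1) / n) for i in range(keep)]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Synthetic check: a 2-dimensional plane linearly embedded in 5 dimensions.
rng = random.Random(0)
plane = []
for _ in range(400):
    x, y = rng.random(), rng.random()
    plane.append((x, y, x + y, x - y, 2 * x))

estimate = two_nn_id(plane)
print(f"estimated intrinsic dimension: {estimate:.2f}")
```

On the synthetic plane the estimate should land close to 2 even though each point has 5 coordinates, which mirrors the contrast drawn above between the 1024-dimensional feature vector and the much lower ID of the representations it holds.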
Finally, through the study of our phonological distance measure, we discover that the classification performance of the I3D model improves as the phonological distance between the classified pairs of signs increases, suggesting that the knowledge captured by the network is related to phonology, among other visual aspects of the data. This research contributes to the field of interpretability of SL technologies through the study of phonological aspects of SL in the representations of the last hidden layer of a binary classifier in an ISLR task. We discuss the implications of understanding how a deep neural network performs classification for improving the performance and interpretability of SL systems, and we encourage further research into linguistics and its impact on such systems.
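As an illustration of how a phonological distance between two signs could be computed, the sketch below counts the phonological features on which two signs differ, so that a minimal pair is a pair at distance 1. This is an assumption made for illustration and not necessarily the measure defined in this thesis: the feature names, the placeholder signs, and their values are all hypothetical and are not taken from NGT Signbank.

```python
def phonological_distance(sign_a, sign_b):
    """Count the phonological features on which two signs differ.

    Each sign is a dict mapping a feature name to its value. A feature
    missing from one sign counts as a difference. Hypothetical measure.
    """
    features = sign_a.keys() | sign_b.keys()
    return sum(1 for f in features if sign_a.get(f) != sign_b.get(f))


def is_minimal_pair(sign_a, sign_b):
    """A minimal pair differs in exactly one phonological feature."""
    return phonological_distance(sign_a, sign_b) == 1


# Placeholder signs with invented feature values (not real NGT entries).
SIGN_A = {"handshape": "B", "location": "chest", "movement": "straight"}
SIGN_B = {"handshape": "B", "location": "chin", "movement": "straight"}
SIGN_C = {"handshape": "5", "location": "chin", "movement": "circular"}

print(phonological_distance(SIGN_A, SIGN_B), is_minimal_pair(SIGN_A, SIGN_B))
print(phonological_distance(SIGN_A, SIGN_C), is_minimal_pair(SIGN_A, SIGN_C))
```

A count of this kind treats every feature as equally important. A weighted variant would be a natural alternative if some features are more visually salient than others.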
Faculteit der Sociale Wetenschappen