Mind the Linguistic Gap: Studying the learning of linguistic properties of continuous sign language videos in an isolated sign language recognition task
The research carried out in this thesis makes an initial investigation into whether a deep neural
network, more concretely a 3D convolutional neural network (3D-CNN), is able to learn any
aspect of continuous sign language (SL) linguistics in an isolated sign language recognition
(ISLR) task. To do so, we use Dutch SL or Nederlandse Gebarentaal (NGT) data from the Corpus
Nederlandse Gebarentaal (CNGT) and NGT Signbank. We define a Linguistic Gap (LG) as
the difference between SL linguistics knowledge and the observable linguistic properties learnt
by the classifier. We hypothesize the existence of an LG in the difference between the intrinsic
dimension (ID) of the 1024-dimensional neural representations found in the last hidden layer
of our classifier and the 21 theoretical dimensions of NGT we derive from linguistic specifications
in NGT Signbank. To study the LG effectively, we design a new linguistically centered
methodology in which the effect of linguistics on the classification is showcased. Given the
isolated nature of the sign language recognition (SLR) task, we determine that phonology is
the most straightforward linguistic aspect to study in this work. Thus, we use the phonological
difference between pairs of signs to design and evaluate different experiments that approach
the binary classification task from a linguistic perspective.
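The phonological difference between two signs can be illustrated with a minimal sketch, assuming each sign is described by a small set of categorical phonological features and that a minimal pair is a pair of signs differing in exactly one of them. The feature names and values below are hypothetical placeholders and are not taken from NGT Signbank; counting differing features is an illustrative assumption, not necessarily the measure used in the thesis.

```python
def phonological_difference(sign_a, sign_b):
    """Count the phonological features whose values differ between two signs."""
    features = set(sign_a) | set(sign_b)
    return sum(1 for f in features if sign_a.get(f) != sign_b.get(f))


def is_minimal_pair(sign_a, sign_b):
    """A minimal pair differs in exactly one phonological feature."""
    return phonological_difference(sign_a, sign_b) == 1


# Hypothetical feature specifications (illustrative values only).
sign_x = {"handshape": "B", "location": "chest", "movement": "down", "orientation": "palm-in"}
sign_y = {"handshape": "5", "location": "chest", "movement": "down", "orientation": "palm-in"}
sign_z = {"handshape": "5", "location": "chin", "movement": "circle", "orientation": "palm-in"}

print(phonological_difference(sign_x, sign_y), is_minimal_pair(sign_x, sign_y))
print(phonological_difference(sign_x, sign_z), is_minimal_pair(sign_x, sign_z))
```

In this sketch, a binary classifier would be trained on one pair of signs at a time, and pairs would be grouped by whether they form a minimal pair.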
We freeze all layers in the model except for the last hidden layer while fine-tuning on our
SL data. This confines the potential linguistic knowledge acquired by the network to this last
hidden layer, which allows us to study the ID in relation to linguistics. To the best of our knowledge,
we present the first application of ID on video data and on representations learnt on SL
data. To estimate the ID of the neural representations, we use the maximum likelihood estimation
(MLE) and Two-Nearest Neighbours (TwoNN) algorithms, the only estimators with recorded
applications to image data and to neural representations of image data. We
carry out three experiments, in which we compare the classification of minimal pairs, i.e., two
signs with different meaning that differ only in one phoneme, with non-minimal pairs. The first
experiment highlights the effect of phonological difference between pairs of signs on the LG
when a maximum amount of data is available. We compare four classifiers trained on the two
most frequent non-minimal pairs and the two most frequent minimal pairs in the dataset. The
second experiment keeps the best-performing minimal pair and non-minimal pair to study the
effect of input data resolution on the LG. In the last experiment, we expand on the concept of
minimal pairs and introduce a first notion of phonological distance, which gives us a measure
of the phonological difference between non-minimal pairs. We study the effect of this distance
between pairs of signs to gain further insight into how the network incorporates SL linguistic
knowledge in the classification.
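The idea behind TwoNN-style ID estimation can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: it assumes the maximum-likelihood form of the TwoNN estimator, in which the ratio of each point's second to first nearest-neighbour distance is treated as Pareto-distributed with the ID as its shape parameter. The function names and the synthetic data (a 2D plane embedded linearly in 5D) are hypothetical and stand in for the 1024-dimensional representations.

```python
import math
import random


def two_nearest_distances(points):
    """Return (r1, r2) per point: distances to its 1st and 2nd nearest neighbours."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append((dists[0], dists[1]))
    return result


def twonn_id(points):
    """Estimate intrinsic dimension from the ratios mu = r2 / r1.

    Assumes mu follows a Pareto law with shape d, so the maximum-likelihood
    estimate is d = N / sum(log mu). Points with a duplicate (r1 == 0) are skipped.
    """
    ratios = [r2 / r1 for r1, r2 in two_nearest_distances(points) if r1 > 0]
    return len(ratios) / sum(math.log(mu) for mu in ratios)


# Synthetic stand-in data: points on a 2D plane embedded in 5D.
random.seed(0)
plane = []
for _ in range(400):
    u, v = random.random(), random.random()
    plane.append((u, v, u + v, u - v, 2.0 * u))

estimate = twonn_id(plane)
print(f"ambient dimension: 5, estimated ID: {estimate:.2f}")
```

The estimate depends only on distances between neighbouring points, so it should land near the dimension of the plane rather than the dimension of the embedding space; the same contrast is drawn in the thesis between the ID of the representations and the size of the feature vector.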
In these experiments, we discover through the calculation of ID that the last hidden layer
of our I3D model is capable of representing SL data in latent space as effectively as the representations made by linguists in NGT Signbank, albeit remaining highly over-represented with
respect to the dimensionality of its feature vector. We also observe that the ID of the neural
representations in this layer is not sensitive to the phonology of signs, but to other aspects such as
the spatial and temporal resolution of the input data. These initial results suggest that, contrary
to our initial hypothesis, the LG does not lie in the difference between the ID of the neural
representations and the theoretical ID of NGT. Finally, through the study of our phonological
distance measure, we discover that the classification performance of the I3D model increases
with increasing phonological distance between the classified pairs of signs, suggesting that
knowledge captured by the network is related to phonology, among other visual aspects.
This research contributes to the field of interpretability of SL technologies through the study
of phonological aspects of SL in the representations of the last hidden layer of a binary classifier
in an ISLR task. We discuss the implications of understanding how a deep neural network performs
classification to improve performance and interpretability of SL systems and encourage
research to further study linguistics and its impact on them.
Faculteit der Sociale Wetenschappen