Exploring the Feasibility of Generating Language-Agnostic Emotion Representations via Speech Emotion Recognition Model
Issue Date
2024-08-27
Language
en
Abstract
Speech-generating models have traditionally been constrained by the limited number
of speakers and emotional expressivity they can handle. Expanding these models to
generate expressive emotional speech could improve communication and trust when
interacting with the public, as the agents would seem more relatable. However, current multi-speaker models that can generate emotional speech often require extensive training data, which is scarce for less commonly used languages, and necessitate
long training times. This thesis investigates a novel approach to circumvent these
limitations by focusing on the generation of robust, language-agnostic emotion embeddings. These embeddings are low-dimensional representations that encapsulate
emotional content, and their effective generation is crucial for achieving high-quality
emotional speech synthesis.
The research explores a neural network architecture capable of generating these embeddings in a language-agnostic manner. The study addresses three main questions:
(1) the feasibility of using a jointly-trained speech emotion recognition (SER) model
to generate quality emotion embeddings; (2) the impact of removing secondary features like speaker and language-specific information on embedding quality; and (3)
the potential of combining the proposed SER model with a spectral conversion model
to perform language-agnostic emotion spectral conversion.
The findings indicate that while the proposed SER model can generate useful language-agnostic emotion embeddings, the quality of these embeddings is influenced by language-specific factors. Removing secondary features such as speaker and language information did not improve the quality of the embeddings, suggesting that these features
might be crucial for accurately capturing emotional nuances. The study concludes
that although the proposed model shows promise, further research is needed to enhance its performance and generalizability across different languages.
Faculty
Faculteit der Sociale Wetenschappen
