Exploring the Feasibility of Generating Language-Agnostic Emotion Representations via Speech Emotion Recognition Model

Issue Date

2024-08-27

Language

en

Abstract

Speech-generating models have traditionally been constrained by the limited number of speakers and emotional expressivity they can handle. Expanding these models to generate expressive emotional speech could improve communication and trust when interacting with the public, as the agents would seem more relatable. However, current multi-speaker models that can generate emotional speech often require extensive training data, which is scarce for less commonly used languages, and necessitate long training times. This thesis investigates a novel approach to circumvent these limitations by focusing on the generation of robust, language-agnostic emotion embeddings. These embeddings are low-dimensional representations that encapsulate emotional content, and their effective generation is crucial for achieving high-quality emotional speech synthesis. The research explores a neural network architecture capable of generating these embeddings in a language-agnostic manner. The study addresses three main questions: (1) the feasibility of using a jointly-trained speech emotion recognition (SER) model to generate quality emotion embeddings; (2) the impact of removing secondary features like speaker and language-specific information on embedding quality; and (3) the potential of combining the proposed SER model with a spectral conversion model to perform language-agnostic emotion spectral conversion. The findings indicate that while the proposed SER model can generate useful language-agnostic emotion embeddings, the quality of these embeddings is influenced by language-specific factors. Removing secondary features such as speaker and language information did not improve the quality of the embeddings, suggesting that these features might be crucial for accurately capturing emotional nuances. The study concludes that although the proposed model shows promise, further research is needed to enhance its performance and generalizability across different languages.

Faculty

Faculteit der Sociale Wetenschappen