Speech Emotion Recognition using Self-Supervised Features

Speech emotion recognition (SER) can be applied in call center conversation analysis, mental health assessment, or spoken dialogue systems.

Audio recordings can also be used for automatic speech emotion recognition. Image credit: Alex Regan via Wikimedia, CC BY-2.0.

A recent paper published on arXiv.org formulates the SER problem as a mapping from the continuous speech domain into the discrete domain of categorical emotion labels.
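Schematically, this amounts to learning a function from variable-length speech signals to a fixed set of emotion classes. The label set below is only illustrative (a common four-class IEMOCAP setup), not necessarily the exact classes used in the paper:

```latex
% Schematic statement of the SER mapping; the label set E is an assumed example.
f_\theta : \mathcal{S} \rightarrow E, \qquad
E = \{\text{neutral}, \text{happy}, \text{sad}, \text{angry}\}
```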

The researchers use the Upstream + Downstream architecture paradigm to allow easy use and integration of a large variety of self-supervised features. The Upstream model, pre-trained in a self-supervised fashion, is responsible for feature extraction. The Downstream is a task-dependent model that classifies the features produced by the Upstream model into categorical emotion labels.
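A minimal sketch of this design, assuming a wav2vec 2.0 upstream from Hugging Face transformers, mean pooling over frames, and a linear downstream head. The model name, pooling choice, class count, and freezing option are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of the Upstream + Downstream paradigm (assumptions: wav2vec 2.0 upstream,
# mean pooling, 4 emotion classes, optional frozen upstream).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class UpstreamDownstreamSER(nn.Module):
    def __init__(self, num_emotions: int = 4, freeze_upstream: bool = True):
        super().__init__()
        # Upstream: self-supervised model that extracts frame-level features.
        self.upstream = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        if freeze_upstream:
            for p in self.upstream.parameters():
                p.requires_grad = False
        # Downstream: task-dependent classifier over utterance-level features.
        hidden = self.upstream.config.hidden_size
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz mono audio
        frames = self.upstream(waveform).last_hidden_state  # (batch, T, hidden)
        utterance = frames.mean(dim=1)                       # pool frames over time
        return self.classifier(utterance)                    # (batch, num_emotions)

# Usage: logits = UpstreamDownstreamSER()(torch.randn(2, 16000))
```

Splitting the system this way means any self-supervised feature extractor exposing frame-level representations can be swapped in as the upstream without changing the downstream classifier.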

Experimental results demonstrate that, despite using only the speech modality, the proposed approach can attain results comparable to those achieved by multimodal systems that use both Speech and Text modalities.

Self-supervised pre-trained features have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features and back-end classification networks. The proposed monomodal speech-only based system not only achieves SOTA results, but also brings to light the possibility of powerful and well fine-tuned self-supervised acoustic features that reach results similar to those obtained by SOTA multimodal systems using both Speech and Text modalities.
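The aggregation of frame-level features into utterance-level features mentioned in the abstract can be done in several ways. The sketch below shows two generic options, simple mean pooling and a learned self-attention pooling; these are common choices and not necessarily the exact aggregators evaluated in the paper:

```python
# Two illustrative frame-to-utterance aggregation strategies (assumed examples).
import torch
import torch.nn as nn

def mean_pool(frames: torch.Tensor) -> torch.Tensor:
    # frames: (batch, T, hidden) -> (batch, hidden), every frame weighted equally
    return frames.mean(dim=1)

class AttentionPool(nn.Module):
    """Weight each frame by a learned scalar score before summing."""
    def __init__(self, hidden: int):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, T, 1)
        return (weights * frames).sum(dim=1)                 # (batch, hidden)
```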

Research paper: Morais, E., Hoory, R., Zhu, W., Gat, I., Damasceno, M., and Aronowitz, H., “Speech Emotion Recognition using Self-Supervised Features”, 2022. Link: https://arxiv.org/abs/2202.03896