The University of Sheffield
Department of Computer Science

Jingzhi Liu MSc Dissertation 2015/16

An Application for Emotion Recognition in Speech

Supervised by M. Villa-Uriol

Abstract

Automatic emotion recognition systems can interpret users' voice input more comprehensively and, furthermore, understand users' real intent. An application for emotion recognition from speech has been developed in this project. The application is designed to recognize full-blown emotional states in short, cleanly recorded English speech from any speaker with any content. It is structured in modules that include speech recording (a mobile app), feature extraction, feature selection, emotion recognition (classification) and result visualization.

     A first stage trains the system, and the optimal parameters selected there are used in the design of the final application. A comprehensive set of experiments has been performed to assess the accuracy, efficiency and robustness of the proposed system. In the training phase, forward feature selection is used and compared with one-way analysis of variance (ANOVA) selection. Based on binary support vector machine (SVM) classifiers, two classification frameworks consisting of multiple classifiers are presented and compared. Combinations of kernel function and hyperplane separation method for the SVM classifiers are also compared. Leave-one-speaker-out and n-fold cross-validation are used to evaluate different aspects of performance.
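     The leave-one-speaker-out protocol mentioned above can be sketched as follows. This is a minimal illustration in Java (the language of the project's controller and app), and the class and method names are hypothetical, not taken from the dissertation's code base: each fold holds out all samples of one speaker for testing and trains on the remaining speakers, so evaluation always happens on a speaker the classifier has never seen.

```java
import java.util.*;

// Minimal sketch of leave-one-speaker-out cross-validation partitioning.
// Each sample is represented only by the id of the speaker who produced it;
// feature extraction and SVM training are outside the scope of this sketch.
public class LosoCv {

    // Group sample indices by speaker: one fold per speaker. In each round,
    // one speaker's samples form the test set and all other speakers'
    // samples form the training set.
    public static Map<String, List<Integer>> foldsBySpeaker(List<String> speakerOfSample) {
        Map<String, List<Integer>> folds = new LinkedHashMap<>();
        for (int i = 0; i < speakerOfSample.size(); i++) {
            folds.computeIfAbsent(speakerOfSample.get(i), k -> new ArrayList<>()).add(i);
        }
        return folds;
    }

    public static void main(String[] args) {
        // Toy corpus: five utterances from three speakers.
        List<String> speakers = Arrays.asList("A", "A", "B", "C", "B");
        for (Map.Entry<String, List<Integer>> fold : foldsBySpeaker(speakers).entrySet()) {
            List<Integer> test = fold.getValue();
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < speakers.size(); i++) {
                if (!test.contains(i)) {
                    train.add(i);
                }
            }
            System.out.println("held-out speaker " + fold.getKey()
                    + ": test=" + test + " train=" + train);
        }
    }
}
```

Because no speaker appears in both the training and test sets of a fold, this protocol measures speaker-independent performance, which is why it exposes the overtraining effects reported later in this abstract.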

      A thorough study of the features (prosodic, spectral and voice quality) has been carried out, and their contribution to emotion recognition has been quantified. Results show that intensity features are the most frequently selected in an optimal system, while Mel-frequency cepstral coefficient (MFCC) features are more comprehensive and can be used independently to classify emotions. Peak performance is comparable to previous studies in the literature using the LDC2002S28 database. Results also show the effects of overtraining: good and stable performance is achieved on additional speech from known speakers, but validity is lost on new (unknown) speakers.

      To increase its reliability, the developed application consists of several modules: some rely on existing software platforms such as Matlab and Praat, while others were developed in Java (the controller and the Android app). The average accuracy of the final application is above 40%, with a uniform distribution over emotions and stable performance on new speakers. The application has practical value and the potential to be further improved by using a larger spontaneous database, speaker adaptation and noise-robust algorithms.