Emotional Speech in Human-Computer Communication

Maciej Karpiński (Adam Mickiewicz University, Poznań)

The mechanisms of emotional behavior in humans have been systematically examined since the advent of contemporary psychology [1], [2]. Presently, emotions are studied not only in the context of basic survival, but also in social interactions and as a supplement to human intellect [3], [4]. Since emotions strongly affect human communicative and linguistic behavior, emotionality may soon become a vital part of many speech-based human-computer communication systems [5], [6], [7]. While many meaningful and revealing studies of emotional expression in spoken language have been carried out and many relevant features of emotional speech have been determined, more research is still urgently needed. This applies especially to studies of naturally occurring speech (as opposed to “laboratory” emotional speech, produced consciously by professional speakers), complex (mixed) emotions, and the implementation of emotional speech engines in machines and environments [8], [9], [10].

A substantial part of the present knowledge about emotional speech comes from corpus-based studies. The design and preparation of emotional speech corpora and databases is a demanding task [8], [10], [11]. In corpora of naturally (spontaneously) occurring speech, usually only a small proportion of utterances can be clearly classified as expressing particular emotions. Moreover, emotional labeling itself poses serious problems, because “pure” emotions are rarely encountered, while their blends are often difficult to describe [12], [13]. Emotional categories and possible hierarchies of emotions remain controversial [3].

While lexical and syntactic properties of emotional speech are equally important, this paper focuses on its acoustic-phonetic features, which are discussed on the basis of a number of contemporary studies. Special attention is paid to the suprasegmental component, with intonation as an extremely rich source of information.

Pitch parameters seem to be relatively easy to track instrumentally with existing phonetic software. Pitch range, average pitch level, and the character of pitch changes over time may be important cues to the emotional content of an utterance [14], [15], [16]. However, the final shape of the intonational contour is determined by many factors related to the utterance itself, the speech situation, and the speaker. In emotional speech, the normal influences of these factors may be disturbed, leading to general comprehension problems. Loudness and tempo (especially their changes) may also provide valuable information about the emotions conveyed in the speech signal (e.g., [12]). Speech rhythm (and its disfluencies) may likewise prove revealing in terms of emotional information. Finally, voice quality (e.g., harshness, breathiness, laryngealization, brilliance) [17] and segmental phenomena [18] are also relevant components of emotional speech to be included in its general model. Most contemporary emotional speech synthesizers make use of these parameters [17], [18], [19]. Emotional speech recognition, by contrast, poses many more problems [20], [21].
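
As a rough illustration of the pitch parameters mentioned above (average pitch level, pitch range, and the character of pitch changes over time), the following minimal sketch summarizes an F0 contour that has already been extracted by some pitch tracker. The function name `pitch_summary`, the 10 ms frame step, the convention that unvoiced frames are marked with 0 or None, and the use of semitones are all assumptions made for this example; they are not taken from the studies cited here.

```python
import math
from statistics import mean


def pitch_summary(f0_hz, frame_step_s=0.01):
    """Summarize an F0 contour given as per-frame values in Hz.

    Assumption: unvoiced frames are marked with 0 or None.
    """
    # Keep only voiced frames for the level and range measures.
    voiced = [f for f in f0_hz if f]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")

    # Average pitch level in Hz.
    mean_hz = mean(voiced)

    # Pitch range in semitones (12 semitones = one octave).
    range_st = 12 * math.log2(max(voiced) / min(voiced))

    # Rate of pitch change (semitones per second), computed only
    # between consecutive frames that are both voiced.
    slopes = [
        12 * math.log2(b / a) / frame_step_s
        for a, b in zip(f0_hz, f0_hz[1:])
        if a and b
    ]
    mean_abs_slope = mean([abs(s) for s in slopes]) if slopes else 0.0

    return {
        "mean_hz": mean_hz,
        "range_st": range_st,
        "mean_abs_slope_st_per_s": mean_abs_slope,
    }


if __name__ == "__main__":
    # Hypothetical contour: a rise, an unvoiced gap, then a level stretch.
    contour = [110.0, 120.0, 135.0, 0, 0, 180.0, 180.0]
    print(pitch_summary(contour))
```

Such summary values are only raw material: as noted above, the contour also depends on the utterance, the situation, and the speaker, so any interpretation in emotional terms would require normalization against the speaker's neutral speech.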

Naturally occurring emotional speech results from the action of certain underlying human emotional mechanisms [24]. Accordingly, emotional robots and virtual agents should be provided with software that would enable them to simulate emotional behavior and, consequently, to produce emotional speech in a contextually relevant and communicatively meaningful way. However, providing a machine with such abilities may mean giving it consciousness, which raises the question of whether we really need or want that [25].

Keywords: emotional speech, intonation, human-computer communication

