The goal of the work presented here is to build an automatic intonation recognition system and use it to predict utterance types in spoken dialogue.
The data are a subset of dialogues from the Canadian Map-task corpus, consisting of spontaneous goal-driven dialogues collected from a number of speakers of Canadian English. This corpus has been annotated using a dialogue analysis scheme based on the theory of game moves first developed by Power (1979) and adapted for Map-task dialogues by Carletta (1995). Each utterance in a dialogue is classified as belonging to one of 12 move types, such as query-yes/no, instruct and acknowledge. The goal of the system is to use intonation to predict automatically which of these types an utterance belongs to.
These data are hand-labelled with the intonation events ``a'' (pitch accent), ``b'' (boundary tone), ``ab'' (for when an accent and boundary co-occur), ``c'' (for ``connection'', the white space between intonational events) and ``sil'' (silence). A system of HMMs is then used to relate intonational events to game moves. In this HMM system, the observations consist of sequences of intonational events, each described by 4 continuous variables: amplitude, duration, position and shape of the event. A three-state, left-right, continuous-density HMM was trained for each move type. All the HMMs are run over each utterance, and the HMM that matches the utterance most closely is chosen as the answer. The system produces a ranked list of move types for each utterance. In the pilot study, the correct move was ranked first 44% of the time. Current speaker-independent tests using more data show a slight decrease in this accuracy rate.
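The classification step described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: it assumes diagonal-covariance Gaussian emissions and scores each utterance with the forward algorithm, and the names `log_likelihood`, `rank_moves` and `toy_hmm`, together with all parameter values, are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_likelihood(events, hmm):
    """Forward algorithm for a left-right HMM that starts in state 0.

    events: list of 4-dimensional vectors (amplitude, duration, position, shape).
    hmm: dict with 'means' and 'vars' (one vector per state) and 'stay'
         (self-loop probability per state, assumed to lie in (0, 1), or
         exactly 1.0 for the last state). From state i the model may only
         stay in i or move on to i + 1.
    """
    n = len(hmm["means"])
    alpha = [-math.inf] * n
    alpha[0] = log_gauss(events[0], hmm["means"][0], hmm["vars"][0])
    for x in events[1:]:
        new = []
        for j in range(n):
            paths = [alpha[j] + math.log(hmm["stay"][j])]
            if j > 0:
                paths.append(alpha[j - 1] + math.log(1.0 - hmm["stay"][j - 1]))
            new.append(logsumexp(paths) + log_gauss(x, hmm["means"][j], hmm["vars"][j]))
        alpha = new
    return logsumexp(alpha)

def rank_moves(events, models):
    """Return move types ordered from best to worst matching HMM."""
    scores = {move: log_likelihood(events, hmm) for move, hmm in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

def toy_hmm(centre):
    """Hypothetical three-state model whose states all share one mean."""
    return {
        "means": [[centre] * 4 for _ in range(3)],
        "vars": [[1.0] * 4 for _ in range(3)],
        "stay": [0.5, 0.5, 1.0],
    }

# Made-up models and events, for illustration only.
models = {"instruct": toy_hmm(0.0), "query-yes/no": toy_hmm(1.0)}
events = [[0.9, 1.1, 1.0, 0.8], [1.2, 0.9, 1.0, 1.1], [1.0, 1.0, 0.9, 1.1]]
ranking = rank_moves(events, models)
```

Here `ranking` plays the role of the ranked list of move types: the first entry is the move whose HMM assigns the event sequence the highest likelihood.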
Each of the HMM states can be thought of as capturing a different part of the intonation contour. To investigate this, the second state of the HMM was forced to model pre-nuclear events, the third state to model only the nuclear accent, and the fourth to model the boundary tones, in accordance with the British School's theory of intonation (see Palmer 1970).
Further experiments are to be carried out to investigate whether Pierrehumbert's system, comprising pitch accents, phrase accents and boundary tones, would provide a better model. To make use of the set of ToBI labels, a set of discrete HMMs would be used to represent the different types of intonational events.
The utterance type recogniser was combined with an n-gram model. This makes use of the fact that game moves follow one another with some degree of predictability (e.g. a reply-yes or reply-no is the most common response to a query-yes/no). The goal of this, as with any recognition system's language model, is to reduce the possibilities that the intonation model needs to consider, thus making it more likely to arrive at the correct answer.
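One common way to combine the two knowledge sources is to add the log-probability of a move given the previous move (a bigram) to the intonation model's log-likelihood. The sketch below assumes this scheme, which is not necessarily the one used in the system; the function name `rerank`, the `floor` parameter for unseen move pairs and all probabilities are hypothetical.

```python
import math

def rerank(intonation_loglik, previous_move, bigram, floor=1e-6):
    """Rank moves by log P(events | move) + log P(move | previous move).

    intonation_loglik: dict mapping move type to the HMM log-likelihood.
    bigram: dict mapping previous move to {next move: probability}.
    floor: probability assumed for move pairs absent from the bigram table.
    """
    combined = {}
    for move, loglik in intonation_loglik.items():
        prior = bigram.get(previous_move, {}).get(move, floor)
        combined[move] = loglik + math.log(prior)
    return sorted(combined, key=combined.get, reverse=True)

# Made-up numbers: intonation alone slightly prefers "instruct", but after
# a query-yes/no the bigram model favours a reply.
bigram = {"query-yes/no": {"reply-yes": 0.5, "reply-no": 0.3, "instruct": 0.2}}
intonation_loglik = {"instruct": -10.0, "reply-yes": -10.5, "reply-no": -11.0}
reranked = rerank(intonation_loglik, "query-yes/no", bigram)
```

With these illustrative values the dialogue context moves reply-yes ahead of instruct, which is the kind of correction the move-sequence model is intended to supply.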
Other phenomena that will be investigated include the clustering of moves that are recognised by similar HMMs, which would provide a finite number of tunes that can be mapped onto a number of different intonational meanings.
As well as providing a tool for investigating the many theories of intonational phonology, such a system also has important practical applications. One of the main uses is in automatic speech recognition, where we use knowledge of the utterance type to guide the recogniser's expectations of what might be said.
To download this paper, please return to Proceedings of the 1997 Postgraduate Conference