This paper describes a method for generating F0 contours from utterances labelled using the Tilt intonation theory. The method uses classification and regression trees (CART) to predict the five tilt parameters: starting F0, amplitude, duration, tilt, and peak position. Contours generated by this method from a test subset of an American English database have a correlation of 0.54 and an RMS error of 33.60Hz when compared with smoothed versions of the original F0 contours. These results are comparable to those of other F0 generation methods which use ToBI intonation labels (0.62 and 34.8Hz, 33Hz).
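As a minimal sketch of how the two reported measures can be computed, the following Python compares a generated contour with a reference contour. It assumes both contours are lists of F0 values in Hz sampled at the same time points. The function names `rmse` and `pearson` are illustrative and are not taken from the paper.

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length F0 contours (Hz)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


def pearson(predicted, reference):
    """Pearson correlation coefficient between two equal-length contours."""
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("contours must have equal length of at least 2")
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_p * var_r)
```

Both measures are needed because they capture different things: a contour with the right shape but a constant offset has a high correlation and a high RMS error.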
The experiments presented use the Tilt intonation labelling theory to produce natural F0 contours from non-intonation information. The labelling system provides four intonational events: accent, boundary, connection, and silence. The only information necessary to encode these events for generation is accentedness, prosodic phrasing, syllable content, lexical stress, and silence.
For each syllable in the database, a set of 40 features was extracted for testing. The features include, with a two-syllable window on either side, accentedness; lexical stress; onset and coda types; Tilt event type; and syllable break values. The features also include the number of syllables, stressed syllables, and accented syllables preceding and succeeding the syllable within the phrase; the distance, in syllables, from the previous event and to the next event; the number of non-major phrase breaks since the last major break; onset and rhyme length; the percentage of the syllable which is unvoiced; and the position of the syllable within a word (e.g. initial, final, medial). All of these features are available to the system at F0 generation time during synthesis. Accent, a simple binary feature, is assumed to have been predicted prior to F0 generation.
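The windowed and count-based features described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's feature extractor: each syllable is represented as a dictionary with hypothetical keys `accented` and `stressed` (0 or 1), a phrase is a list of such dictionaries, and positions outside the phrase are padded with `None`. Only two of the windowed features and the three count pairs are shown.

```python
def window_features(syllables, index, names=("accented", "stressed"), width=2):
    """Collect each named feature for the syllable at `index` and for
    `width` neighbours on either side; out-of-phrase positions get None."""
    feats = {}
    for offset in range(-width, width + 1):
        j = index + offset
        for name in names:
            key = f"{name}[{offset:+d}]"
            feats[key] = syllables[j][name] if 0 <= j < len(syllables) else None
    return feats


def count_features(syllables, index):
    """Count syllables, stressed syllables, and accented syllables that
    precede and succeed the syllable at `index` within the phrase."""
    before = syllables[:index]
    after = syllables[index + 1:]
    return {
        "syls_before": len(before),
        "syls_after": len(after),
        "stressed_before": sum(s["stressed"] for s in before),
        "stressed_after": sum(s["stressed"] for s in after),
        "accented_before": sum(s["accented"] for s in before),
        "accented_after": sum(s["accented"] for s in after),
    }


# Hypothetical four-syllable phrase used only to exercise the functions.
phrase = [
    {"accented": 0, "stressed": 1},
    {"accented": 1, "stressed": 1},
    {"accented": 0, "stressed": 0},
    {"accented": 0, "stressed": 1},
]
features = {**window_features(phrase, 1), **count_features(phrase, 1)}
```

A feature dictionary of this kind, one per syllable, is the sort of input a CART learner would take when predicting each tilt parameter.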
To download this paper, please return to Proceedings of the 1997 Postgraduate Conference