The goal of this experiment was to generate natural F0 contours from prosodic and syllabic context using the Tilt intonation theory, and the results show that this is possible. The original and synthetic contours pictured in Figure 2 are very similar. Informal listening tests reveal noticeable differences, but these do not appear to reflect any important distortion of the utterance.
The features used to predict the Tilt parameters in this study are all available prior to F0 synthesis, and most are routinely generated during synthesis by modules other than intonation. Accent assignment is treated as a step separate from contour generation and is outside the scope of this study.
The feature modelling approach taken here is very similar to that in [1]. The features used in this experiment include those in [1], and they are applied to the same dataset. While the results are an improvement, our approach has yet to be tested using ToBI labels. Future experiments will compare the results of the method described here on Tilt-labelled and ToBI-labelled data.
More features were tested in this experiment than in previous work: the twenty-four features from [1], plus seventeen new ones. In [6], a variety of segmental, syllabic, and phrasal features are combined with energy to produce similar results. The wide range of features used here may also serve to test specific hypotheses found in the literature. Several hypotheses dealing with peak alignment (cf. [7] [10] [5]) contributed to the list of tested features. Further testing of individual features under a variety of conditions is possible and is left for future experimentation.
This work represents an improvement upon previous work, with accuracy matching or bettering other studies on the same corpus. Averaged over 28 test utterances, the RMSE of 32.5Hz and correlation of 0.60 compare favorably with a ToBI-based approach [1] (34.5Hz and 0.62) and the dynamical system model [6] (33Hz): the RMSE is lower than in both, while the correlation is marginally below that of [1].
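To make the two reported measures concrete, the following is a minimal sketch of how a per-utterance RMSE and correlation might be computed and then averaged over a test set. It is not the evaluation code used in this study. The function names (`rmse`, `pearson`, `average_scores`) are hypothetical, the correlation is assumed to be Pearson's, and each pair of contours is assumed to be already aligned frame by frame and restricted to the frames being scored.

```python
import math


def rmse(reference, predicted):
    """Root-mean-square error between two equal-length F0 sequences (Hz)."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared / len(reference))


def pearson(reference, predicted):
    """Pearson correlation between two equal-length F0 sequences."""
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("contours must have equal length of at least 2")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_r == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a flat contour")
    return cov / math.sqrt(var_r * var_p)


def average_scores(utterances):
    """Average per-utterance RMSE and correlation.

    `utterances` is a list of (reference, predicted) contour pairs.
    Averaging per utterance (rather than pooling all frames) is an
    assumption of this sketch.
    """
    if not utterances:
        raise ValueError("at least one utterance is required")
    errors = [rmse(ref, pred) for ref, pred in utterances]
    correlations = [pearson(ref, pred) for ref, pred in utterances]
    return sum(errors) / len(errors), sum(correlations) / len(correlations)


if __name__ == "__main__":
    # Toy contours in Hz, invented for illustration only.
    toy = [
        ([100.0, 120.0], [103.0, 116.0]),
        ([100.0, 120.0, 140.0], [110.0, 130.0, 150.0]),
    ]
    mean_rmse, mean_corr = average_scores(toy)
    print(f"mean RMSE = {mean_rmse:.2f} Hz, mean correlation = {mean_corr:.2f}")
```

The two measures capture different things. In the second toy pair the synthetic contour is offset from the original by a constant 10Hz, so its RMSE is 10Hz although the shapes match exactly and the correlation is 1. A contour with the right register but the wrong shape would show the opposite pattern. This is why the comparison above reports both figures.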