The first step in predicting the overall contour is to predict each parameter for each event type, so each of the five Tilt parameters for accents is modelled individually. Table 1 shows the final results for each of the accent parameters, and Table 2 shows the corresponding boundary results. Note the consistently higher correlation scores in the boundary results. These reflect the tendency of boundaries to fall into categories that are more distinctive than those of accents. Table 3 shows the results for the start F0 parameter for silences and connections. The lack of systematicity in connections is reflected in their low correlation relative to that of silences.
Table 1: RMSE and Correlation of ``A'' parameter models
Table 2: RMSE and Correlation of ``B'' parameter models
Table 3: RMSE and Correlation of start F0 (Hz) parameters for connections and silences
The F0 generated from the optimized models is generally similar to the smoothed original. We obtain an RMSE of 32.5 Hz and a correlation of 0.60, averaged over the 28 test utterances. These results are comparable with [1], which uses similar features to predict F0 from ToBI labels (34.8 Hz and 0.62). They are also comparable to [6], which uses a dynamical state system incorporating energy and syllabic context (33 Hz).
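The two scores reported above can be computed along the following lines. This is a minimal sketch, not the evaluation code used for the reported figures: it assumes that each utterance is represented as two equal-length lists of F0 values in Hz (smoothed original and generated), that RMSE and Pearson correlation are computed per utterance, and that the per-utterance scores are then averaged. The function names `rmse`, `pearson` and `evaluate` are hypothetical.

```python
import math


def rmse(ref, gen):
    """Root mean squared error between two equal-length F0 tracks (Hz)."""
    if len(ref) != len(gen) or not ref:
        raise ValueError("tracks must be non-empty and of equal length")
    return math.sqrt(sum((r - g) ** 2 for r, g in zip(ref, gen)) / len(ref))


def pearson(ref, gen):
    """Pearson correlation coefficient between two equal-length F0 tracks."""
    if len(ref) != len(gen) or not ref:
        raise ValueError("tracks must be non-empty and of equal length")
    n = len(ref)
    mean_ref = sum(ref) / n
    mean_gen = sum(gen) / n
    cov = sum((r - mean_ref) * (g - mean_gen) for r, g in zip(ref, gen))
    var_ref = sum((r - mean_ref) ** 2 for r in ref)
    var_gen = sum((g - mean_gen) ** 2 for g in gen)
    if var_ref == 0 or var_gen == 0:
        raise ValueError("correlation is undefined for a constant track")
    return cov / math.sqrt(var_ref * var_gen)


def evaluate(pairs):
    """Average per-utterance RMSE and correlation over (ref, gen) pairs.

    Assumption: scores are averaged per utterance rather than pooled
    over all frames of the test set.
    """
    rmses = [rmse(ref, gen) for ref, gen in pairs]
    corrs = [pearson(ref, gen) for ref, gen in pairs]
    return sum(rmses) / len(rmses), sum(corrs) / len(corrs)
```

Calling `evaluate` on a list of `(original, generated)` track pairs returns the mean RMSE and mean correlation. Pooling all frames before scoring would weight long utterances more heavily and could give slightly different numbers.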
Figure 2 shows an original smoothed contour (above) and a contour generated from predicted Tilt parameters (below). Note that the large phrase break in the middle of the original contour is not reproduced in the generated contour. This is because the generated contour is interpolated through unvoiced segments adjacent to silence, whereas in the original, unvoiced segments next to silence are treated as part of that silence. The accent and boundary on the phrase ``the policy'' are also worth noting. In the original contour, the accent spans the phrase, with a sharp drop in F0 at the boundary. In the generated contour, the accent is restricted to the immediate vicinity of the syllable where the peak is located. While it is possible to hear the difference, the general nature of the utterance is unaffected.
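The effect of interpolating through unvoiced segments can be illustrated with a small sketch. It assumes a frame-level F0 track in which a value of 0.0 marks an unvoiced frame, and it assumes linear interpolation between the nearest voiced frames on either side; the interpolation scheme actually used is not specified here, and the function name `interpolate_unvoiced` is hypothetical.

```python
def interpolate_unvoiced(f0, unvoiced=0.0):
    """Fill interior unvoiced gaps by linear interpolation (an assumption).

    Frames equal to `unvoiced` that lie between two voiced frames are
    replaced by values on the straight line joining those voiced frames.
    Leading and trailing unvoiced frames are left unchanged.
    """
    out = list(f0)
    voiced = [i for i, value in enumerate(f0) if value != unvoiced]
    for left, right in zip(voiced, voiced[1:]):
        gap = right - left
        for k in range(1, gap):
            out[left + k] = f0[left] + (f0[right] - f0[left]) * k / gap
    return out


# A voiced stretch, an unvoiced gap, then another voiced stretch.
track = [120.0, 110.0, 0.0, 0.0, 0.0, 140.0, 130.0]
filled = interpolate_unvoiced(track)
```

After filling, the gap in `track` becomes a continuous rise from 110 Hz to 140 Hz, so a break that is visible in the original track no longer appears. Treating the unvoiced frames next to a silence as part of that silence would instead leave the gap in place.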