Results

The first step in predicting the overall contour is to predict each parameter for each event type. Thus, the five Tilt parameters for accents are individually modelled. Table 1 shows the final results for each of the accent parameters, and Table 2 shows the boundary results. Note the consistently higher correlation scores in the boundary results. These reflect the tendency of boundaries to fall into categories that are more distinct than those of accents. Table 3 shows the results for the start F₀ parameter of silences and connections. The lack of systematicity in connections is reflected in their low correlation relative to that of silences.

Table 1: RMSE and correlation of "A" parameter models

Table 2: RMSE and correlation of "B" parameter models

Table 3: RMSE and correlation of start F₀ (Hz) parameters for connections and silences

The F₀ generated from the optimized models is generally similar to the smoothed original. We obtain an RMSE of 32.5 Hz and a correlation of 0.60, averaged over the 28 test utterances. These results are comparable with those of [1], which uses similar features to predict F₀ from ToBI labels (34.8 Hz and 0.62). They are also comparable with those of [6], which uses a dynamical state system incorporating energy and syllabic context (33 Hz).
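As an illustration of the two figures quoted above, the following is a minimal sketch of how RMSE and correlation between an original and a generated F₀ track might be computed and then averaged across utterances. It assumes frame-aligned tracks of equal length and per-utterance averaging; the function names (`rmse`, `correlation`, `average_scores`) are hypothetical and do not come from the system described here.

```python
import math


def rmse(original, generated):
    """Root mean squared error (Hz) between two frame-aligned F0 tracks."""
    if len(original) != len(generated) or not original:
        raise ValueError("tracks must be non-empty and of equal length")
    squared = [(o - g) ** 2 for o, g in zip(original, generated)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(original, generated):
    """Pearson correlation coefficient between two frame-aligned F0 tracks."""
    if len(original) != len(generated) or not original:
        raise ValueError("tracks must be non-empty and of equal length")
    n = len(original)
    mean_o = sum(original) / n
    mean_g = sum(generated) / n
    cov = sum((o - mean_o) * (g - mean_g) for o, g in zip(original, generated))
    var_o = sum((o - mean_o) ** 2 for o in original)
    var_g = sum((g - mean_g) ** 2 for g in generated)
    if var_o == 0 or var_g == 0:
        raise ValueError("correlation is undefined for a constant track")
    return cov / math.sqrt(var_o * var_g)


def average_scores(utterances):
    """Average per-utterance RMSE and correlation.

    `utterances` is a list of (original, generated) track pairs.
    Averaging per utterance (rather than pooling all frames) is an
    assumption made for this sketch.
    """
    rmses = [rmse(o, g) for o, g in utterances]
    corrs = [correlation(o, g) for o, g in utterances]
    return sum(rmses) / len(rmses), sum(corrs) / len(corrs)
```

Pooling all frames before scoring would weight long utterances more heavily, so the two averaging schemes can give different numbers for the same data.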

Figure 2 shows an original smoothed contour (above) and a contour generated from predicted Tilt parameters (below). Note that the large phrase break in the middle of the original contour is not reproduced in the generated contour. This is because the generated contour is interpolated through unvoiced segments adjacent to silence, whereas in the original such segments are treated as part of the silence. The accent and boundary on the phrase "the policy" are also worth noting. In the original contour, the accent spans the phrase, with a sharp drop in F₀ at the boundary. In the generated contour, the accent is restricted to the immediate vicinity of the syllable where the peak is located. While the difference is audible, the general nature of the utterance is unaffected.
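The effect of interpolating through unvoiced segments can be sketched as follows. This is only an illustrative example: it assumes a frame-based F₀ track in which unvoiced frames are marked `None` and fills them by straight-line interpolation between the nearest voiced frames. The function name `interpolate_unvoiced` and the choice of linear interpolation are assumptions, not a description of the method actually used here.

```python
def interpolate_unvoiced(f0):
    """Fill unvoiced frames (None) by linear interpolation.

    Only gaps bounded by voiced frames on both sides are filled;
    leading and trailing unvoiced runs are left as None.
    """
    out = list(f0)
    voiced = [i for i, value in enumerate(out) if value is not None]
    # Walk over each pair of consecutive voiced frames and fill the gap.
    for left, right in zip(voiced, voiced[1:]):
        gap = right - left
        for i in range(left + 1, right):
            frac = (i - left) / gap
            out[i] = out[left] + frac * (out[right] - out[left])
    return out
```

For instance, a track such as `[None, 100.0, None, 120.0, None]` would have only its middle gap filled, since the first and last frames have no voiced neighbour on one side. A contour bridged in this way shows no break where the unvoiced frames were, which is consistent with the missing phrase break in the generated contour.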

Figure 2: Original smoothed contour (above) and contour generated from predicted Tilt parameters (below), each with accompanying synthesised speech.

Kurt Dusterhoff
Tue Jul 1 17:33:41 BST 1997