Laurence Molloy

Suprasegmental duration modelling with elastic constraints in automatic speech recognition

Although traditional Hidden Markov Model (HMM) systems have proven highly successful at acoustic classification, they inherit an implausible durational model from the mathematical behaviour of their state transition probabilities. Continuously Variable Duration Hidden Markov Models (CVDHMMs) [1] have partly overcome this problem by replacing the discrete probability associated with a state's self-transition with a continuous durational probability distribution. However, the utility of this treatment of duration is constrained by its assumption of the Markovian principle of independence at the suprasegmental level. This assumption seems to be at odds with previous theoretical studies on segmental duration ([2], [3], [4]), which focus on suprasegmental effects.
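The durational model implied by a self-transition can be made concrete. If a state has self-transition probability a, the probability of occupying it for exactly d frames is a^(d-1)(1-a), a geometric distribution whose most likely duration is always a single frame. The following minimal sketch computes that distribution; the function names are illustrative only and are not taken from any recogniser.

```python
from typing import List


def hmm_state_duration_pmf(self_loop_prob: float, max_frames: int) -> List[float]:
    """Return P(d) for d = 1..max_frames for a single HMM state.

    Staying in the state for exactly d frames requires d - 1
    self-transitions followed by one exit: P(d) = a**(d - 1) * (1 - a).
    """
    if not 0.0 <= self_loop_prob < 1.0:
        raise ValueError("self_loop_prob must lie in [0, 1)")
    a = self_loop_prob
    return [a ** (d - 1) * (1.0 - a) for d in range(1, max_frames + 1)]


def expected_state_duration(self_loop_prob: float) -> float:
    """Mean of the geometric duration distribution, in frames."""
    if not 0.0 <= self_loop_prob < 1.0:
        raise ValueError("self_loop_prob must lie in [0, 1)")
    return 1.0 / (1.0 - self_loop_prob)


if __name__ == "__main__":
    pmf = hmm_state_duration_pmf(0.8, 10)
    for frames, prob in enumerate(pmf, start=1):
        print(f"d={frames:2d}  P(d)={prob:.4f}")
    print("expected duration (frames):", expected_state_duration(0.8))
```

Whatever value a takes, the probabilities decrease monotonically with d, which is the behaviour a continuous duration distribution is meant to replace.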

This talk presents a method of integrating a model of suprasegmental duration constraints with an HMM-based recogniser by re-scoring the recogniser's N-best utterance output. The proposed durational model imposes elastic constraints on the durational behaviour of speech segments.
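As a rough illustration of N-best re-scoring in general, the sketch below re-ranks a list of hypotheses by adding a weighted duration score to each hypothesis's recogniser score. It is not the method presented in the talk: the `Hypothesis` fields, the log-domain additive combination, the `duration_weight` parameter and the toy duration scorer are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Hypothesis:
    """One entry of an N-best list (hypothetical structure)."""
    words: Tuple[str, ...]
    recogniser_log_score: float          # log score from the HMM recogniser
    segment_durations: Tuple[float, ...]  # per-segment durations in ms


def rescore_nbest(
    nbest: Sequence[Hypothesis],
    duration_log_score: Callable[[Hypothesis], float],
    duration_weight: float = 1.0,
) -> List[Tuple[float, Hypothesis]]:
    """Return (combined score, hypothesis) pairs, best first.

    Assumed combination rule:
    recogniser log score + duration_weight * duration log score.
    """
    scored = [
        (h.recogniser_log_score + duration_weight * duration_log_score(h), h)
        for h in nbest
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def toy_duration_score(h: Hypothesis) -> float:
    """Placeholder scorer: penalise deviation from a nominal 100 ms segment."""
    return -0.01 * sum(abs(d - 100.0) for d in h.segment_durations)


if __name__ == "__main__":
    nbest = [
        Hypothesis(("show", "ships"), -100.0, (50.0, 400.0)),
        Hypothesis(("show", "chips"), -101.0, (90.0, 110.0)),
    ]
    for score, hyp in rescore_nbest(nbest, toy_duration_score):
        print(f"{score:8.2f}  {' '.join(hyp.words)}")
```

Passing the duration model in as a function keeps the re-ranking step independent of how durational plausibility is actually measured.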

This concept of elastic constraints is based upon previous work in the field of speech synthesis. The elasticity hypothesis [5] suggests that, within a syllable, phonemes behave like springs of different lengths (mean durations) and elasticities (standard deviations) and follow the laws that govern such dynamical systems. That is, if the syllable lengthens, all its constituent phonemes are hypothesised to lengthen in proportion to their elasticities, i.e. the ratio of a phoneme's lengthening to its elasticity remains constant throughout the syllable.
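The constant-ratio statement can be written as (d_i - m_i) / s_i = k for every phoneme i in the syllable, where m_i is the mean duration, s_i the standard deviation and d_i the realised duration. Summing over the syllable gives k = (D - sum of m_i) / (sum of s_i) for a syllable of total duration D. The sketch below applies this directly to raw durations in milliseconds, which is an assumption made for simplicity; the function name and the example statistics are invented for illustration.

```python
from typing import List, Sequence, Tuple


def elastic_durations(
    means: Sequence[float],
    stds: Sequence[float],
    syllable_duration: float,
) -> Tuple[float, List[float]]:
    """Distribute a syllable duration over its phonemes elastically.

    Every phoneme is stretched by the same factor k, measured in units
    of its own standard deviation: d_i = m_i + k * s_i.
    Returns (k, per-phoneme durations).
    """
    if len(means) != len(stds) or len(means) == 0:
        raise ValueError("means and stds must be non-empty and equal in length")
    total_std = sum(stds)
    if total_std <= 0.0:
        raise ValueError("sum of standard deviations must be positive")
    k = (syllable_duration - sum(means)) / total_std
    return k, [m + k * s for m, s in zip(means, stds)]


if __name__ == "__main__":
    # Invented statistics (ms) for a three-phoneme syllable.
    means = [60.0, 100.0, 80.0]
    stds = [10.0, 30.0, 20.0]
    k, durations = elastic_durations(means, stds, syllable_duration=300.0)
    print("k =", k)
    print("durations =", durations)
```

A phoneme with a large standard deviation absorbs most of the lengthening, while one with a small standard deviation stays close to its mean.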

Initial recognition results for the Resource Management (RM) database will be presented, using a "strong" form of the elasticity hypothesis, where it is assumed that the phoneme durations are influenced by elasticity alone, regardless of phonetic and prosodic context. Given the explicit dependence of the elasticity hypothesis on the definition of the syllable unit, the effect of the syllable unit definition will also be investigated.

References

  1. S.E. Levinson (1986) "Continuously Variable Duration Hidden Markov Models for Automatic Speech Recognition" Computer Speech and Language, Vol. 1, pp. 29-45
  2. D.H. Klatt (1976) "Linguistic Uses of Segmental Duration in English: Acoustic and Perceptual Evidence" JASA, Vol. 59, No. 5, pp. 1208-1221
  3. C. Wightman, S. Shattuck-Hufnagel, M. Ostendorf, P. Price (1992) "Segmental Durations in the Vicinity of Prosodic Phrase Boundaries" JASA, Vol. 91, No. 3, pp. 1707-1717
  4. A.E. Turk, L.S. White (1997) "The Domain of Accentual Lengthening in Scottish English" Proc. Eurospeech '97, Vol. 2, pp. 795-798
  5. W.N. Campbell, S.D. Isard (1991) "Segment Durations in a Syllable Frame" Journal of Phonetics, Vol. 19, pp. 37-47
  6. S. Young et al. (1997) "The HTK Book" Version 2.1, Cambridge University

To download this paper, please return to Proceedings of the 1998 Postgraduate Conference