Imperfective in Japanese: 'Development by Degrees' and its main Aktionsart 'Degree Achievement'
In comparison with the English Progressive, which covers a wide imperfective area (e.g. progressive, development by degrees, future), Japanese has two types of Imperfective, the BE:stay Imperfective and the COME/GO Imperfective, which complement each other to express these areas. In the study of aspect, the aspect 'Development by Degrees' and its main Aktionsart type 'Degree Achievement' are not well-explored subjects. This is perhaps because in English this particular imperfective aspect type is sufficiently expressed by the Progressive, an aspect type that initially signals on-going activity. In Japanese, however, Degree Achievement verbs favour the COME/GO Imperfective for expressing Development by Degrees, because the BE:stay counterpart causes ambiguity. In my presentation, I will discuss the difference between COME/GO constructions and their BE:stay counterparts, and then focus on the dual nature of Degree Achievement verbs.
From conceptual structure to grammar
The paper is an exploration of theoretically viable ways of learning the basic principles of syntactic structure, keeping in mind some general empirical facts about first language acquisition. These include a) the large proportion of rote-learnt phrases in early child language; b) the fact that children overgeneralise; c) the observation that the nature of rote-learnt phrases and the nature of overgeneralization errors are not independent; and d) that children eventually recover from the errors.
The theory proposed is an eclectic one. The learning process starts with the construal of semantic representations of states-of-affairs from conceptual primitives and some 'syntax of thought.' At the same time, syntactic categories are assumed to be established on the basis of distributional evidence. (The significance of this process lies in the discovery of the existence of classes in language rather than in the labelling of categories.) As the next step, the argument positions in the semantic structures need to be assigned grammatical functions and semantic operators need to be associated with syntactic devices. The linking rules are regarded as being variable both within and across languages.
Accent Pitch-alignment in "real" speech
This paper describes an analysis of the position of pitch events in a corpus of read speech. It focuses on how the height (in terms of pitch) of rise-fall accents is affected by their position within sequences of accents relative to phrasing constructs.
It is shown that the accents in sequences of accents that occur at the beginning of phrases are generally higher than those in sequences in other parts of the phrase. This initial rising phenomenon is shown to affect more than just the first accent, but does not necessarily affect all of the accents in the sequence.
Trends in language learning in Europe: evidence from the Eurobarometer
Recent writing on EU language planning has identified tensions in EU language policy. The EU has identified language policy both as a means of building a common European identity and as a means of increasing the mobility of 'people, goods and ideas' around the single European market. However, there may be a tension between these cultural and economic aims. Many writers have called for a more overt language policy. Against this background, there is a shortage of information about actual language learning behaviour in the EU, and about how the movement towards European unity is affecting it. This talk will explore what trends in language learning can be observed in the Eurobarometer series of surveys during the 1990s. In particular, it will be suggested that EU language policy is supporting the spread of English at the expense of other languages, and the implications of this will be explored.
Objective methods for evaluating synthetic intonation
This paper describes the development and evaluation of objective methods for testing synthetic intonation. While subjective methods are available for assessing the quality of synthetic intonation, such tests consume time and resources and are not useful for day-to-day model development; objective measures of F0 modelling are therefore necessary. Currently, objective evaluation of synthetic intonation uses root mean squared error (RMSE) and correlation. However, it is unclear how large an improvement in either score must be before it is reflected perceptually, and how detailed an analysis these metrics provide. Two further metrics, both similar to a basic RMSE measurement, are therefore to be tested. All of the evaluation results are compared with a perceptual study in order to determine how the objective measures relate to perceived differences between the contours.
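For reference, the two standard measures can be computed as below. This is a minimal sketch assuming the natural and synthetic contours are already sampled at the same frames with unvoiced frames removed; the function names are illustrative, and the two additional metrics the paper tests are not reproduced because they are not specified here.

```python
import math
from typing import Sequence


def rmse(reference: Sequence[float], synthetic: Sequence[float]) -> float:
    """Root mean squared error (Hz) between two frame-aligned F0 contours."""
    if len(reference) != len(synthetic) or not reference:
        raise ValueError("contours must be non-empty and of equal length")
    squared = [(r - s) ** 2 for r, s in zip(reference, synthetic)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(reference: Sequence[float], synthetic: Sequence[float]) -> float:
    """Pearson correlation between two frame-aligned F0 contours."""
    if len(reference) != len(synthetic) or len(reference) < 2:
        raise ValueError("contours must have equal length of at least 2")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_s = sum(synthetic) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(reference, synthetic))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_s = sum((s - mean_s) ** 2 for s in synthetic)
    if var_r == 0 or var_s == 0:
        raise ValueError("correlation is undefined for a flat contour")
    return cov / math.sqrt(var_r * var_s)


# A synthetic contour with the right shape but a constant 10 Hz offset:
natural = [120.0, 140.0, 160.0, 140.0, 120.0]
shifted = [f0 + 10.0 for f0 in natural]
print(rmse(natural, shifted))         # 10.0
print(correlation(natural, shifted))  # 1.0
```

The offset example shows why the two scores are usually reported together: the shape is matched perfectly (correlation 1.0), yet the RMSE still registers a 10 Hz error.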
Latent Traits in TOEFL and IELTS: An Exploratory Factor Analysis Approach
This study investigated the factor structures of two general language proficiency tests, TOEFL and IELTS, and the similarity of factor structure(s) across the two batteries, using exploratory factor analysis. The tests were given to 178 foreign language learners studying at two different language institutes in Tehran. The analyses were based on data from TOEFL (listening, structure and reading components), IELTS (listening, reading and writing components) and EPTB (listening, cloze and grammar components); the latter was used for validation purposes. The results indicate the following: 1) TOEFL resembles a unitary trait structure, with a higher-order factor accounting for more than 63% of the total variance and two less significant second-order factors, namely listening and reading; 2) IELTS also resembles a unitary trait structure, with a higher-order factor accounting for more than 52% of the total variance and two second-order factors, namely receptive skills (listening and reading) and productive skill (writing ability); 3) across-battery analyses show a different pattern, with a general higher-order factor accounting for more than 44% of the total variance and three second-order factors: reading, listening and writing. It was also found that, despite employing different test methods, the two batteries show great similarities in the latent traits of listening and reading, and that both batteries are based on a theory of divisible language skills.
Some functions of cleft constructions in academic writing
Cleft constructions are said to be useful in writing because they are a means of signalling appropriate stress and intonation, of combining information density with high communicative dynamism, and of permitting new information to be presented as established and uncontroversial fact. They are considered particularly suited to highlighting the discourse theme, to persuasive writing and to historical narrative. Two basic types have been identified in the literature, each associated with different functions. In one type, the complement of the copula conveys new information and receives the main stress; in the other, this element conveys given information and receives weak stress or is unaccented, the main stress falling on the that-clause, which carries the main message (Delin 1989, Prince 1978). It has been claimed that the former type of cleft is used to close off a discourse segment, while the latter introduces an open-ended embedded segment (Delin and Oberlander 1992). However, an examination of clefts in academic history journal articles reveals cases where given information receives the major stress, and others where clefts of the second type do bring a discourse segment to a close.
Transcription and initial analysis in a language classroom discourse study: issues, options and questions
This talk will describe the aims and methodology of a language classroom discourse study, and will look at some of the issues related to the transcription and analysis of audio- and video-taped classroom discourse and participation. It will look at some of the possible approaches to transcription, and at how decisions regarding transcription conventions may influence subsequent analysis, paying particular attention to the inclusion of extralinguistic description. Using an example from the study, it will suggest how transcribed speech from a classroom setting might be combined with observed and recorded behaviour in an analytical framework, cross-referenced with participant interviews. As the study is at the initial stages of analysis, feedback and suggestions from the audience would be particularly valued.
Consonant Types and Vowel F0 in Korean
The aim of this research is to see if the fundamental frequency (F0) values of post-consonantal vowels can be used as a parameter for automatic speech recognition of Korean.
Some previous phonetic studies have revealed a language-independent tendency for the F0 of a vowel to change depending on the type of the preceding consonant: tense and aspirated stops, for example, raise the F0 of the following vowel, while lax stops lower it. As shown in a cross-linguistic study (Jun 1996), this micro-prosodic factor appears to be more salient, and consequently more useful, in languages such as Korean and Japanese, which have a phonological contrast between 'Lax' and 'Tense' consonants and/or between 'Aspirated' and 'Non-aspirated' consonants, than in languages such as English and French, in which such variation emerges only at the phonetic level.
I am investigating whether significantly distinctive post-consonantal F0 values can be obtained from a natural speech corpus, rather than from the highly controlled data used in most previous experiments to rule out the effects of other factors. I will then discuss how the result can be implemented in the baseline speech recognition model. The method used to obtain reliable phone labels semi-automatically will also be described.
Is there a high onset pitch accent in Warlpiri intonation?
This paper postulates a distinction between two types of high pitch accent in the intonation systems of two Australian Aboriginal languages, Warlpiri and Dyirbal. Analyses carried out in this preliminary study suggest that in both languages there may be two types of high accent which occur in contrastive distribution in clause-initial position: a high-rising pitch accent (which peaks after the onset of the stressed vowel) and a high-onset pitch accent (in which the F0 peak occurs at the onset of the accented word). The acoustic properties of these two high pitch accents are discussed in an attempt to determine whether the data contain two intonational categories or gradient variation in the realisation of a single category.
Segmental anchoring of tones as a word-boundary correlate in English
This paper reports the results of an experiment examining the possibility that the alignment of F0 minima forms a word boundary correlate in certain phonological contexts. The effect of boundary location on the alignment of F0 minima between two F0 peaks was examined in word pairs which had either an ambiguous consonant (Jay Neeson vs. Jane Eason) or an ambiguous syllable (Al Maloney vs Alma Lonie). It was found that there was a strong effect of boundary location when consonants were ambiguous, with early boundaries leading to early alignment. There was no effect of boundary location when syllables were ambiguous. This finding reinforces earlier findings that the beginnings of F0 rises are consistently aligned relative to the accented syllable. Furthermore, this lawful behaviour of interpeak valleys poses challenges for existing inventories of tonal targets, which propose that interpeak valleys are merely sagging transitions.
If when, why, and, then what? Toward cognitive evidence for a parameterisation of coherence relations
Coherence relations are cognitive devices that mark the relation between adjacent text units. They can be marked by a conjunction or can remain unmarked. The literature on coherence relations remains rather unclear about how different kinds of conjunctions affect the comprehension process, and this talk addresses that question. A parameterisation based on existing taxonomies of coherence relations is proposed, generating a total of twelve kinds of coherence relations depending on the setting of the parameters. Two sets of parameters are defined: TYPE (settings: CAUSAL, TEMPORAL, ADDITIVE) and POLARITY (settings: POSITIVE, NEGATIVE). A series of experiments has been carried out to measure the effects of the different categories of coherence relations on text processing. In the first experiment, readers were asked to read three-line paragraphs of the following kind:
My neighbour played saxophone
I did not like it
[conjunction] he practised every day
The third line was preceded by one of the following conjunctions: because, although, later, until, what's more, however. In the second experiment, the third line was not preceded by a conjunction, and subjects were asked to indicate which conjunction they used. In this talk, the results of the two reading-time experiments will be discussed in the light of the proposed parameterisation. Processing times in both experiments provided cognitive evidence for the different categories of the parameterisation.
Kalman Filtering for Speaker Characterisation
This paper describes a method for obtaining smoothed vocal tract parameters from analysis during the closed phase of the glottis. The method is based on Expectation Maximisation (EM) and uses Kalman-Rauch forward-backward iterations through a voiced segment, in which the speech data during the excitation and open phases are excluded by treating them as 'missing data'.
This approach exploits the non-independence of neighbouring spectra and compensates for small numbers of available points, while preserving speaker-characteristic information and tracking variations in it.
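The forward-backward step can be illustrated on a deliberately simplified model. The sketch below is a toy under stated assumptions, not the paper's method: the state is a single scalar parameter following a random walk, the noise variances q and r are fixed rather than re-estimated by EM, and frames outside the closed phase are marked None so that the filter skips its measurement update there, which is how 'missing data' enters a Kalman filter. The function name kalman_rts_smooth is hypothetical.

```python
from typing import List, Optional, Tuple


def kalman_rts_smooth(
    obs: List[Optional[float]], q: float, r: float, x0: float, p0: float
) -> Tuple[List[float], List[float]]:
    """Forward Kalman filter plus backward Rauch-Tung-Striebel smoother.

    Toy scalar model, assumed for illustration only:
        x[t] = x[t-1] + w,  w ~ N(0, q)   slowly varying tract parameter
        y[t] = x[t] + v,    v ~ N(0, r)   measurement (None = missing frame)
    Returns the smoothed means and variances of x[t].
    """
    n = len(obs)
    m_pred, p_pred = [0.0] * n, [0.0] * n
    m_filt, p_filt = [0.0] * n, [0.0] * n
    m, p = x0, p0
    for t, y in enumerate(obs):
        if t > 0:
            p += q  # predict: random walk keeps the mean, inflates the variance
        m_pred[t], p_pred[t] = m, p
        if y is not None:  # update only where data are observed
            k = p / (p + r)  # Kalman gain
            m += k * (y - m)
            p *= 1.0 - k
        m_filt[t], p_filt[t] = m, p
    m_s, p_s = m_filt[:], p_filt[:]
    for t in range(n - 2, -1, -1):  # backward pass
        g = p_filt[t] / p_pred[t + 1]  # smoother gain
        m_s[t] = m_filt[t] + g * (m_s[t + 1] - m_pred[t + 1])
        p_s[t] = p_filt[t] + g * g * (p_s[t + 1] - p_pred[t + 1])
    return m_s, p_s


# Two observed (closed-phase) frames around two missing (open-phase) frames.
means, variances = kalman_rts_smooth([1.0, None, None, 1.0], q=0.01, r=0.1, x0=0.0, p0=1.0)
```

Because the smoother runs backwards as well as forwards, the estimates at the missing frames draw on observed frames on both sides, one way of exploiting the non-independence of neighbouring frames.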
The vocal tract filter parameters are then used to inverse-filter the speech, yielding estimates of the source excitation. The extracted excitation signal can be used to excite other sets of parameters to produce natural-sounding speech.
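Inverse filtering with an all-pole vocal tract model can be sketched as follows. The sketch assumes the predictor convention s[n] = e[n] + a1 s[n-1] + ... + ap s[n-p], so that the inverse filter is the FIR filter A(z) = 1 - a1 z^-1 - ... - ap z^-p; the coefficient values and the pulse-train excitation are hypothetical, chosen only to show that inverse filtering undoes synthesis.

```python
from typing import List, Sequence


def synthesise(excitation: Sequence[float], a: Sequence[float]) -> List[float]:
    """All-pole synthesis: s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    s: List[float] = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * s[n - 1 - k]
        s.append(acc)
    return s


def inverse_filter(speech: Sequence[float], a: Sequence[float]) -> List[float]:
    """FIR inverse filter: e[n] = s[n] - sum_k a[k] * s[n-1-k]."""
    e: List[float] = []
    for n, x in enumerate(speech):
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * speech[n - 1 - k]
        e.append(acc)
    return e


a = [1.3, -0.6]  # hypothetical stable 2nd-order predictor (poles inside unit circle)
excitation = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # toy pulse train
speech = synthesise(excitation, a)
recovered = inverse_filter(speech, a)  # matches excitation up to rounding
```

Passing the recovered excitation to synthesise with a different coefficient list corresponds to exciting another set of parameters.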
The applicative construction in Chishona
In Bantu linguistics, the applicative construction has received a lot of attention, and the general consensus is that 'normal applicatives' add an extra object to the argument structure of the verb. Such arguments are typically interpreted as beneficiary or instrumental (Baker 1988, Bresnan and Moshi 1990, Alsina and Mchombo 1993) and are analysed as full direct objects. However, in ChiShona, a Bantu language spoken by the majority of the population in Zimbabwe, we find that this analysis captures only part of what the applicative can do. As has already been noted by others, the applicative construction can be associated with other thematic roles such as locative, goal and source (Kimenyi 1980, Ngonyani 1998). However, these types of applicatives have typically been ignored in papers dealing with theoretical issues, partly because they do not fit easily into the theoretical analyses posited. In this paper, I present evidence that, in addition to providing an extra benefactive argument, the applicative affix may, in certain circumstances, alter the syntactic properties of an argument or of an adjunct that is already licensed by the verbal stem. In other words, outer relations of a verb are 'internalised'. The data will show that locative and goal applied objects are not additional arguments and that the semantic effect of the applicative is to focus on certain aspects of the situation described by the verbal stem.
Parallel paths in first and second language acquisition: evidence from the use of case markers in Modern Greek native and non-native grammars
The aim of this paper is to account for similarities and differences in the developmental patterns of first (L1) and second (L2) language acquisition with respect to case marking in Modern Greek (MG). MG is an inflected language in which articles, (pro)nouns and adjectives are morphologically marked for case, gender and number by fused endings; no bare stems ever appear in normal adult production. The question addressed, then, is whether the early markers of MG child grammar and adult interlanguage grammar are identical to those of the target MG grammar. Production data will be presented from two groups: children acquiring MG as a native language, and adult L2 learners acquiring MG as a non-native language. The latter group is divided into two subgroups, depending on whether the learners' L1 exhibits overt (e.g. Russian) or covert (e.g. French) case marking. Preliminary findings seem to point to certain similarities between MG child language and adult interlanguage at the level of syntax, for example the use of the definite article as a case marker of whole NPs rather than as a marker of definiteness. Similarities are also found at the level of lexical knowledge: L1 and L2 acquirers seem to store word forms in the same way, assigning a default nominative value in the mental lexicon. This default form sometimes surfaces during language production, thus obscuring our understanding of the acquirers' syntactic competence.
Orthography and Sound Changes in Compound Kanji for L2 Learners of Japanese
Japanese orthography contains meaning-based kanji (characters borrowed from Chinese), which are generally considered the most problematic and time-consuming area of study for learners of Japanese as a foreign language.
Kanji in Japanese generally have multiple readings: on-yomi (Chinese readings) and kun-yomi (Japanese readings). Moreover, in compound kanji, directly applying the on-yomi or kun-yomi of each element often results in an incorrect reading because of phonological changes. Sound changes in kanji compounds occupy a large part of L2 orthography teaching, but current kanji textbooks and reference books do not cover them systematically, and L2 learners are left to memorise the readings of compound kanji one by one.
In this paper, I shall first look at examples of phonological changes in kanji compounds: gemination and Renjo (liaison) are well-known phenomena with on-yomi, while Rendaku (sequential voicing) and morpheme-final vowel alternations are typical sound changes with kun-yomi. Secondly, I shall consider how Japanese orthography represents these sound changes. I shall demonstrate that it is possible to account for sound changes orthographically, rather than by using phonetic symbols, and I shall then outline a number of practical strategies for L2 learners to deal with sound changes in kanji compounds. In conclusion, I claim that sound changes in kanji compounds can be explained both simply and systematically to L2 learners of Japanese who have no knowledge of Japanese linguistics or phonology.
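As an illustration of how an orthographic account can be made explicit, the regular Rendaku pattern can be stated directly in terms of kana and the dakuten (voicing) mark, without phonetic symbols. This is a hypothetical sketch, not the author's strategy: it voices the first kana of the second element if that kana belongs to the k-, s-, t- or h-row, and blocks voicing when the second element already contains a voiced obstruent (Lyman's law). Real Rendaku is lexically conditioned, so the function captures only the default pattern.

```python
import unicodedata

# Hiragana whose voiceless obstruent onset can undergo Rendaku (k-, s-, t-, h-rows).
VOICEABLE = set("かきくけこさしすせそたちつてとはひふへほ")
DAKUTEN = "\u3099"  # combining voiced sound mark


def has_voiced_obstruent(word: str) -> bool:
    """True if any kana in the word carries dakuten (e.g. が, ぜ, ど, ば)."""
    return DAKUTEN in unicodedata.normalize("NFD", word)


def rendaku(first: str, second: str) -> str:
    """Join two hiragana elements, voicing the second's initial kana by default.

    Voicing is blocked if the initial kana has no voiced counterpart, or if
    the second element already contains a voiced obstruent (Lyman's law).
    """
    head = second[0]
    if head in VOICEABLE and not has_voiced_obstruent(second):
        head = unicodedata.normalize("NFC", head + DAKUTEN)  # e.g. か -> が
    return first + head + second[1:]


print(rendaku("おり", "かみ"))  # おりがみ
print(rendaku("ひと", "ひと"))  # ひとびと
print(rendaku("かみ", "かぜ"))  # かみかぜ (blocked: かぜ already contains ぜ)
```

Exceptions remain: coordinate compounds typically resist voicing, so 山川 read as やまかわ means 'mountains and rivers', whereas やまがわ means 'mountain river'.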
"There's a hole in my bucket. How can I fix it?: predicting phoneme duration distributions for unseen prosodic contexts in the TIMIT database - an initial study"
Speech modelling with general-purpose speech databases typically involves a trade-off between model accuracy and sparseness of data. For example, if we wanted to model the duration of
--- nuclear-accented [ae] in phrase-final "bat" ---
a lack of data may force us to generalise the prosodic context to
--- accented [ae] in "bat" ---
Whilst this model can be contrasted with that of a non-accented [ae] in "bat" or an accented [ae] in "bad", it prevents us from modelling how phrase finality or type of accent affects a phoneme's duration.
However, previous proposals for modelling duration through systematic prosodic relations (Klatt 1976, Van Santen 1998) offer an alternative solution to the data sparsity problem, at least for duration: shifting and resizing a phoneme's duration distribution in an existing prosodic context to model that phoneme's durational behaviour in an unseen prosodic context.
In its strongest form, this approach assumes that moving from one prosodic context to another has an "average" effect on a phoneme's duration. For example, the duration of a phoneme in an accented syllable environment is longer and more variable than that of the same phoneme in an otherwise identical but unaccented syllable environment. In percentage terms, the accent-related increases in mean duration and variability are assumed to be the same for all phonemes in this environment. A weaker form of this approach would allow these accent-related increases to differ between phoneme "types" (e.g. vowels and consonants).
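The strong and weak forms of the approach can be sketched as a percentage shift of a duration distribution. In this sketch the distribution is summarised by its mean and standard deviation (an assumption; the talk does not say how variability is measured), and the percentage effects are invented placeholders, not values estimated from TIMIT.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DurationDist:
    """Duration distribution of one phoneme in one prosodic context."""
    mean_ms: float
    sd_ms: float  # variability, taken here to be the standard deviation


def shift_context(dist: DurationDist, mean_pct: float, sd_pct: float) -> DurationDist:
    """Shift the mean and rescale the SD by percentage changes."""
    return DurationDist(
        mean_ms=dist.mean_ms * (1.0 + mean_pct / 100.0),
        sd_ms=dist.sd_ms * (1.0 + sd_pct / 100.0),
    )


# Hypothetical accent effects: (% increase in mean, % increase in SD).
STRONG_EFFECT: Tuple[float, float] = (20.0, 25.0)  # one effect for every phoneme
WEAK_EFFECTS: Dict[str, Tuple[float, float]] = {   # one effect per phoneme type
    "vowel": (20.0, 25.0),
    "consonant": (10.0, 10.0),
}


def predict_accented(dist: DurationDist, phone_type: Optional[str] = None) -> DurationDist:
    """Predict an unseen accented context from the unaccented one.

    With no phone_type the strong form applies; otherwise the weak form.
    """
    mean_pct, sd_pct = STRONG_EFFECT if phone_type is None else WEAK_EFFECTS[phone_type]
    return shift_context(dist, mean_pct, sd_pct)
```

For example, an unaccented [ae] with a mean of 100 ms and an SD of 20 ms would be predicted under the strong form to have a mean of about 120 ms and an SD of 25 ms when accented.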
In this talk I present an initial analysis of the durational behaviour of phonemes in the TIMIT corpus, with a view to mathematically quantifying any systematic relations that may exist between prosodic contexts in these data. The mathematical models obtained in this way will then be used to fill in the "holes" in the data. The results of this study are intended for use in a duration model for speech recognition.
Pitch range modelling: linguistic dimensions of variation
This paper reports on a large-scale study of pitch range variation across speakers. The experiment examined the relation between a model of pitch range based on pitch level and pitch span and the perception of speaker characteristics. The key to our measure of range is that it is based on clearly defined linguistic targets in speech: sentence-initial peaks, accent peaks, post-accent valleys and sentence-final lows. The data are based on the ratings of 48 listeners judging 32 speakers of British English. The results show that a pitch range model based on linguistic dimensions of variation captures variation in listeners' judgements better than the well-established measures based on the long-term distributional properties of a speaker's F0, such as ±2 standard deviations around the mean, the 95th-5th percentile range and the 90th-10th percentile range.
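For comparison, the long-term distributional measures mentioned above can be computed from a speaker's voiced F0 values as follows. This is a sketch assuming F0 in Hz and Python's statistics module with linear ('inclusive') percentile interpolation; other interpolation methods give slightly different percentiles. The target-based level and span measures are not reproduced, since their exact definitions are not given here.

```python
import statistics
from typing import Dict, Sequence


def distributional_ranges(f0_hz: Sequence[float]) -> Dict[str, float]:
    """Long-term F0 range measures over a speaker's voiced frames (Hz)."""
    f0 = list(f0_hz)
    sd = statistics.stdev(f0)  # sample standard deviation
    # pct[k - 1] is the k-th percentile (k = 1..99), linearly interpolated
    pct = statistics.quantiles(f0, n=100, method="inclusive")
    return {
        "mean_pm_2sd": 4 * sd,  # width of the mean +/- 2 SD band
        "p95_minus_p5": pct[94] - pct[4],
        "p90_minus_p10": pct[89] - pct[9],
    }
```

With evenly spread values from 0 to 100 Hz, for instance, the 95th-5th range is 90 Hz and the 90th-10th range is 80 Hz.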
The psycholinguistics of grammatical gender
Most theories of language production converge on the assumption that lexical selection proceeds in two stages. In the first stage, a semantically and syntactically specified representation (lemma) is accessed, while in the second, a phonological representation (lexeme) is accessed. By contrast, there is far less agreement concerning, on the one hand, the precise nature of the information represented at each level and, on the other, the relationship among these representations and the manner in which they are selected.
Unlike number, which is typically part of the conceptual message, and unlike case, which is assigned on the basis of structural considerations within the clause, grammatical gender is an inherent property of a noun, encoded as part of its syntactic specification. The study of how grammatical gender is retrieved in the course of speaking is thus intertwined with the more general question of the content of lemma-level representations and the mechanisms by which they are accessed.
Previous studies have yielded different patterns of results for comprehension and production tasks and, more importantly, for different languages, suggesting that different access mechanisms may be involved in languages with different morpho-phonological properties. The potential interplay between morphological and syntactic (grammatical) features is examined in a picture-naming experiment in Greek, with particular emphasis on the way gender marking is realised in the target language. Some preliminary results will be presented and their implications for theories of lexical access discussed.
An articulatory basis for speech recognition
Debate has long raged among researchers interested in human speech perception about the role of articulation and articulatory knowledge in decoding the speech signal. At one extreme, some researchers maintain that articulation is vital for decoding the speech signal; at the other, it is often argued that human speech perception relies solely on acoustics. Naturally, it has been suggested that this debate should interest speech technology researchers, as any insight gained into human speech perception could potentially be applied to automatic speech recognition (ASR) systems.
However, the difficulties in studying this area are formidable. At a fundamental level, for example, it remains contentious whether articulation can in fact be recovered from acoustics accurately enough to be of any use.
The relatively recent development of techniques making it much more convenient to record and study human articulation offers great scope for investigating all manner of questions involving articulation. Electromagnetic articulography (EMA) is a prime example. Using EMA, the movements of small receiver coils fixed to a speaker's articulators can be measured and recorded in parallel with speech acoustics.
This paper will present current work that attempts to derive a partial acoustic-to-articulatory mapping using EMA data and neural networks. To place this work in context, we will first outline the factors motivating it, taking into account the relationship between human speech perception research and ASR system design. We will also review some previous work on using articulatory knowledge to improve ASR performance. The neural network experiment itself will then be described, and some of the results obtained so far will be presented and discussed.
A bi-directional study on child second language acquisition of temporal morphology
This paper presents new data on the role of lexical aspect in child SLA of tense-aspect morphology.
Studies of L1 and L2 acquisition have indicated that learners interpret verb morphology as marking lexical aspect rather than tense in itself. According to the "aspect hypothesis", progressive/imperfective marking is initially restricted to durative and stative predicates, whereas past and perfective marking emerges primarily with telic and punctual predicates. There are only a few relevant child SLA studies. What is new here is the detailed inclusion in the debate of "bi-directionality" in a longitudinal setting: L1-Italian children learning English in England compared with L1-English children learning Italian in Italy.
The aim of this study is to analyse learning symmetries and asymmetries in both groups of learners. Importantly, because they are based on child L2 data, these results help bridge the gap between L1 acquisition and adult L2 acquisition, a goal often called for in the literature.
Globules or cobwebs or something else: towards a possible model of lexical retrieval in bilingual readers.
An investigation of how 120 teenage Chinese/English bilingual readers comprehended a verb, a noun and a metaphor in a Chinese text, with the aim of seeing whether this could shed light on interactive reading processes as applied to Chinese text. The data are parallel translations of a short Chinese text made by the 120 teenagers, all in the British (English) education system. A brief justification of the use of translation as data is followed by an overview of approaches to the theory of meaning and of the ways in which lexis may be accessed. These include the theory of referential meaning, componential analysis, prototypical features, models of parallel distribution and the idealised cognitive model.
The data presented consist of a range of interpretations of 'moni' (to imitate), 'zawen' (literary essay), and a set of three martial terms used metaphorically across a sentence in the text to apply to literary composition. The evidence from the data and the theoretical support of the literature lead me to conclude that comprehension of the meaning of lexical items is influenced to a great extent, on the one hand by personal experience and attitudes, and on the other, by recovery of meaning from the context. Competent readers are more likely to achieve a consensus, or near-dictionary equivalent, which may be regarded as central on a continuum, while less able readers retrieve meanings which gravitate towards extremes of the continuum. Miscues at orthographic level may lead to a non-consensus reading and non-consensus readings may be woven into a pattern of coherence which is at odds with the author's intention.
Word-level durational effects in speech
It has long been suggested that there is an influence of the phonological length of a word upon the actual duration of its constituents in speech. This is typically expressed as an inverse relationship between the number of syllables in a word and the duration of the primary stressed syllable in that word. For example, the duration of the syllable "ten" should be greatest when it is a monosyllabic word, less when observed in the disyllable "tendon", and even less in the trisyllable "tendency".
This hypothesised effect has sometimes been confounded with other influences upon speech segment duration, in particular, foot-level shortening and phrase-final lengthening. Foot-level shortening refers to a tendency for the duration of a stressed syllable to decrease as the number of unstressed syllables separating it from the next stressed syllable increases, regardless of an intervening word boundary. Phrase-final lengthening is found to affect in particular the rhyme of the final syllable before a prosodic boundary, with greater lengthening at the boundaries of higher level prosodic constituents.
The present research examines durational effects at the word level while controlling for other factors such as foot length and phrase finality. By looking at the effect of unstressed syllables that precede and follow the main stress within the word, it aims to determine whether word-level durational influences do indeed operate across the whole span of the word or are best characterised as domain-edge effects. How such influences interact with lengthening due to pitch accent is also examined.
Synthesising intonation contours using hidden Markov models
This paper presents a method of generating utterance-type-specific intonation contours using a stochastic method, namely hidden Markov models (HMMs). The stochastic models capture the variation of intonation contours associated with one utterance type: for example, a yes/no question frequently has a high boundary tone, but sometimes has a low one. The categorisation of utterance types is based on the theory of conversation games and consists of 12 move types (e.g. reply to a question, wh-question, acknowledgement). Hidden Markov models are probabilistic finite-state networks that can act both as generators and as acceptors of sequences of events. In acceptor mode, they can be used to give the likelihood that an observed sequence of events was produced by a given model; this has been implemented for the automatic recognition of move type [2]. In the application presented here, they are used in generator mode to specify the distribution of intonation events. Each HMM is trained on a sequence of observation vectors consisting of 4 tilt parameters extracted from automatically identified intonation events, as described in [1]. One HMM is trained for each utterance type, and its transition and observation probabilities are used to generate a sequence of intonation events. This system would be particularly useful in human-computer interaction systems where the type of utterance to be synthesised is known to the conversational agent.
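Both modes can be illustrated with a toy discrete HMM. The sketch simplifies the paper's setup: instead of continuous vectors of 4 tilt parameters, each state emits one of three hypothetical symbols (0 = accent, 1 = high boundary tone, 2 = low boundary tone), and the probabilities are invented to mimic a yes/no question that usually, but not always, ends high. likelihood implements the forward algorithm (acceptor mode) and generate samples an event sequence (generator mode); the class name DiscreteHMM is hypothetical.

```python
import random
from typing import List, Sequence


class DiscreteHMM:
    """Toy HMM with discrete emissions (the paper's models emit tilt vectors)."""

    def __init__(self, start: List[float], trans: List[List[float]], emit: List[List[float]]):
        self.start = start  # start[i]    = P(first state is i)
        self.trans = trans  # trans[i][j] = P(next state is j | current state i)
        self.emit = emit    # emit[i][k]  = P(symbol k | state i)

    def likelihood(self, obs: Sequence[int]) -> float:
        """Acceptor mode: P(obs | model) by the forward algorithm."""
        n = len(self.start)
        alpha = [self.start[i] * self.emit[i][obs[0]] for i in range(n)]
        for o in obs[1:]:
            alpha = [
                sum(alpha[i] * self.trans[i][j] for i in range(n)) * self.emit[j][o]
                for j in range(n)
            ]
        return sum(alpha)

    def generate(self, length: int, rng: random.Random) -> List[int]:
        """Generator mode: sample a sequence of intonation-event symbols."""
        n = len(self.start)
        state = rng.choices(range(n), weights=self.start)[0]
        events = []
        for _ in range(length):
            events.append(rng.choices(range(len(self.emit[state])), weights=self.emit[state])[0])
            state = rng.choices(range(n), weights=self.trans[state])[0]
        return events


# Hypothetical yes/no-question model: state 0 = contour body, state 1 = boundary.
yn_question = DiscreteHMM(
    start=[1.0, 0.0],
    trans=[[0.6, 0.4], [0.0, 1.0]],
    emit=[[1.0, 0.0, 0.0], [0.0, 0.8, 0.2]],
)
print(yn_question.likelihood([0, 1]))  # accent then high boundary: approx 0.32
print(yn_question.likelihood([0, 2]))  # accent then low boundary: approx 0.08
print(yn_question.generate(5, random.Random(0)))
```

The same trained model thus serves both purposes: scoring an observed sequence for recognition, or sampling a plausible sequence for synthesis.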
Exceptional Case Marking constructions in Old English
Verbs of causation and perception in Old English regularly show the typical patterns of Exceptional Case Marking (ECM) constructions. In spite of syncretism in case marking between nominative and accusative nouns and the absence of overt anaphors and expletives, consideration of argument structure proves that the accusatives are subjects of the infinitives rather than matrix objects.
From a minimalist point of view, these constructions seem to provide convincing evidence for AgrOP as well as for Object Shift in Old English, in that subjects of infinitives overtly move, or are attracted, to matrix clauses for case checking. Non-AgrP-based analyses, however, can equally account for their syntactic distribution.
They can also serve as a criterion to decide the underlying order of Old English. The fact that ECM verbs never appear in the final position of subordinate clauses when followed by infinitival or clausal complements can be accommodated within SOV analysis only with the assumption of clause-union and verbal projection raising.
Plain infinitival complements in ECM constructions, moreover, contrast sharply with the inflected infinitival complements of Control verbs in Old English. Plain infinitives have no inflectional element to check the case of their subjects, whereas 'to' in inflected infinitives can license and check the case of empty pronominal subjects. This historical perspective can therefore shed some light on the asymmetry in case checking by 'to' between ECM and Control verbs in Modern English.
Tonal alignment in Cantonese
Understanding how F0 contours align and scale is important in speech production and perception. Bruce (1977) shows how the alignment of pitch targets with the segmental material distinguishes Swedish word accent I from word accent II. Similarly, Arvaniti, Ladd and Mennen (1998) show consistent alignment of H and L targets with the segmental string in Greek. These studies seem to suggest that level pitch targets, rather than dynamic targets, are primary.
The current study is a preliminary exploration of how prosodic context affects the alignment of pitch targets in a tone language, Cantonese (a Chinese dialect with 6 tonal contrasts involving both level and dynamic F0 contours). A preliminary analysis of pitch alignment in all prosodic contexts will be presented and the contexts compared.
Statistical analysis of Korean stop durations and its application for speech recognition
In this paper, we investigate durational characteristics of Korean stop sounds from the KAIST Continuous Speech Database, and discuss ways in which these characteristics can be used in automatic speech recognition.
Previous phonetic studies of Korean oral stops have reported durational differences between Tense and Lax stops, and between Aspirated and Unaspirated stops. Tense stops have much longer closure durations than Lax stops, whereas the Voice Onset Times (VOTs) of Aspirated stops are much longer than those of their unaspirated counterparts.
However, these phonetic studies were carried out on phonetically controlled data, in order to prevent other linguistic factors from interfering with the experiments. For speech recognition, we need to know whether the same picture emerges when these other factors are at work.
We built a baseline Hidden Markov Model (HMM) speech recogniser for purposes of comparison. In the baseline recogniser, the closure and VOT of each stop are combined into a single event. Phone models are initialised and trained with 628 hand-labelled data. In a second version, closure and VOT durations are trained separately. Using this second version in forced-alignment mode, the models are then trained on the remaining training data, which have no hand labels. The durational statistics on closures and VOTs are supplemented with the automatically segmented data thus obtained.
We are designing and testing a third recogniser, which uses spectral data only to identify the place of articulation of stops and distinguishes manner using closure and VOT. Its performance will be compared with that of the baseline recogniser, which treats all stops individually.
Last updated: 19th May 1999 Caroline Heycock