Workshop on “Modeling pronunciation variation for ASR”
Strik and Cucchiarini [1998] reviewed the workshop and also gave an overview of the literature on modeling pronunciation variation for ASR [Strik and Cucchiarini, 1999]. The review pays special attention to the sources of information concerning variations in speech. The two major approaches to deriving information on pronunciation variation are the data-driven and the knowledge-based methods. Both need data transcribed from the acoustic signal, either manually (e.g. [Riley et al., 1999; Saraçlar and Khudanpur, 2004]) or automatically (e.g. [Adda-Decker and Lamel, 1999; Wester and Fosler-Lussier, 2000]). In data-driven approaches, pronunciation variation information is obtained directly from the acoustic signal data. In contrast, knowledge-based methods draw pronunciation variation information from sources that already exist in the linguistic literature, i.e. phonological or phonetic knowledge.
The second aspect considered is the type of pronunciation variation: within-word and cross-word variation. As indicated in section 1.1.3, the ASR system uses a pronunciation dictionary, also called the lexicon, and pronunciation variation modeling is mostly carried out in the lexicon. At the word level, a word can have several pronunciation candidates, from the canonical pronunciation to its variant(s). Variants can arise through substitutions, insertions and deletions of phones or phonemes relative to the canonical pronunciation. This type of variation is within-word variation. For cross-word variation, multiwords (sequences of words) are treated as single entities in the lexicon [Sloboda and Waibel, 1996; Riley et al., 1999]. The study of pronunciation variation in multiwords is emphasized in [Binnenpoorte et al., 2005] as a way to improve automatic speech recognition and automatic phonetic transcription.
Particular attention is dedicated to the representation of the information concerning pronunciation variants, that is, whether or not the pronunciation variation information is formalized [Strik and Cucchiarini, 1999]. In data-driven methods, the formalizations take the form of, e.g., rewrite rules, decision trees, artificial neural networks, or phone confusion matrices. In knowledge-based methods, the formalizations of pronunciation variation information are extracted from linguistic studies. The obtained formalizations are generally added to the lexicon as optional phonological rules such as substitutions, insertions and deletions of phones or phonemes. The alternative is not to use formalizations: all possible variants are then simply listed in the pronunciation dictionary without being generated by rules.
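The expansion of a canonical pronunciation into variants by optional rules can be sketched as follows. This is a minimal illustration, not a system described in the cited works: the rule set, the phone labels and the names `OPTIONAL_RULES` and `expand_variants` are hypothetical, chosen only to show how optional substitution and deletion rules multiply lexicon entries.

```python
from itertools import product

# Hypothetical optional rules: each phone maps to its possible realizations.
# The canonical phone is always kept as one option; "" encodes deletion.
OPTIONAL_RULES = {
    "@": ["@", ""],   # optional schwa: may be deleted
}

def expand_variants(canonical):
    """Generate all pronunciation variants of a phone sequence by applying
    the optional rules independently at every position."""
    options = [OPTIONAL_RULES.get(p, [p]) for p in canonical]
    variants = {tuple(p for p in choice if p)  # drop deleted phones
                for choice in product(*options)}
    return sorted(variants)

# A made-up canonical pronunciation /p @ t i/ yields two lexicon entries:
for variant in expand_variants(["p", "@", "t", "i"]):
    print(" ".join(variant))
```

With one optional-deletion rule the word gets two entries; each additional optional rule that applies doubles the count again, which is why variant generation is usually pruned (e.g. by variant probabilities) in practice.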
Pronunciation variation modeling for French
Fouché [1969] described pronunciation variants in French as a consequence of several factors, such as speaking style, speaking rate, individual speaker habits and dialectal region. The most common pronunciation variations in French are liaison and the optional schwa /@/ [Adda-Decker et al., 1999b]. These two phenomena occur at word boundaries, and they can sometimes lead to word errors.
Fundamental frequency (f0)/Pitch
Fundamental frequency (f0) describes the rate at which the vocal folds vibrate at the level of the laryngeal prominence (around the center of the neck) and determines voice height. In perceptual terms, f0 corresponds to pitch. f0 is measured in Hz (hertz), i.e. the number of vibration cycles per second. The semitone (1/2 tone) is often used as an alternative measurement for perceptual scales, because the sensation of pitch height follows a logarithmic scale. In [Ghio, 2007], the author gives a clear example using musical notes: the difference of 130 Hz between C3 (262 Hz) and G3 (392 Hz) is perceived as equal to the difference of 261 Hz between C4 (523 Hz) and G4 (784 Hz). Thus successive octaves correspond to a doubling of frequency, e.g. 110 Hz, 220 Hz, 440 Hz, 880 Hz. The semitone is the smallest musical interval between two adjacent notes, like C and C♯ or D♭, used in Western music. One octave has 12 equally spaced semitones.
PFC corpus
For a speech type other than prepared speech, we used the PFC corpus [Delais-Roussarie and Durand, 2003; Durand et al., 2002; Durand et al., 2003; Durand et al., 2005]. The PFC project is an international project directed by Jacques Durand (ERSS, University of Toulouse-Le Mirail), Bernard Laks (MoDyCo, Paris West University Nanterre La Défense) and Chantal Lyche (University of Oslo and of Tromsø). It aims at establishing a large database of contemporary French recorded in French-speaking countries and regions. The PFC site describes 72 investigation points, 33 of which have been fully collected. Speech recordings were made following a common protocol comprising a 94-word list (including 10 minimal pairs), text reading, a directed conversation between a subject and an interviewer (formal style) and a free conversation between two or more persons who are close to each other (informal style). On average, 10 speakers were recorded at each investigation point, balanced for gender and age.
For our studies, we used part of the PFC corpus, mainly the directed interviews and free conversations, as spontaneous speech.
Table of contents:
List of Figures
List of Tables
List of Acronyms
Résumé
Introduction
I Background
1 Automatic and human speech recognition
1.1 Automatic speech recognition system
1.1.1 Voice mechanism
1.1.2 Brief history of ASR
1.1.3 ASR architecture
1.2 Pronunciation variations
1.2.1 Pronunciation variation modeling for ASR
1.2.2 Pronunciation variation modeling for French
1.3 Errors
1.3.1 Errors by ASR
1.3.2 Errors by humans
1.4 Conclusion
2 Prosody
2.1 General definition of prosody
2.1.1 Prosody of French
2.1.2 Prosody for speech technology
2.2 Acoustic correlation of prosody
2.2.1 Fundamental frequency (f0)/Pitch
2.2.2 Intensity/Loudness
2.2.3 Duration/Length
2.2.4 Formant/Timbre
2.2.5 Pauses
2.3 Prosodic structure
2.3.1 Prosodic structure of French
2.4 Prosody in perception
2.5 Conclusion
II Realized works
3 Corpora and methodology
3.1 Corpora
3.1.1 ESTER corpus
3.1.2 PFC corpus
3.2 Methodology
3.2.1 Automatic speech alignment system
3.2.2 Extraction of f0, F1, F2, F3 and intensity
3.3 Summary and Conclusion
4 Classification for homophone words
4.1 Automatic transcription errors
4.2 Automatic classification
4.2.1 Corpora for automatic classification
4.2.2 Measurements of acoustic parameters
4.2.3 Considered parameters
4.2.4 Automatic homophone classification
4.3 Perceptual transcription test
4.3.1 Corpus for perceptual evaluation
4.3.2 Perceptual evaluation
4.3.3 Discussion on perceptual evaluation
4.4 Summary and conclusion
5 Large-scale prosodic analyses of French words and phrases
5.1 Corpora and methodology
5.1.1 Corpora
5.1.2 Methodology
5.2 Lexical versus grammatical words
5.2.1 f0 profiles
5.2.2 Duration profiles
5.2.3 Intensity profiles
5.2.4 Short versus long duration impact
5.3 Noun versus noun phrase
5.3.1 f0 profiles
5.3.2 Duration profiles
5.3.3 Intensity profiles
5.3.4 Intervocalic measurements
5.3.5 Homophone noun phrases: fine phonetic detail?
5.4 Conclusion
Conclusions
III Appendix
A 62 selected attributes
A.1 Intra-phonemic attributes: 40 attributes
A.2 Inter-phonemic attributes: 22 attributes
B Homophone classification results
C Average prosodic parameters
C.1 Fundamental frequency and intensity
C.2 Duration
D f0 Profiles in Terms of POS
E f0 Profiles: PFC text reading
Author’s publications
References