Hybrid MLP/HMM acoustic models
In the previous sections, we focused on HMM-based acoustic models with GMM output probability densities. An alternative architecture that has gained considerable popularity in recent years is the hybrid MLP/HMM architecture. In such a model, the HMM state probabilities are estimated by the output layer of a multi-layer perceptron (MLP) neural network, as sketched below.
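As a minimal illustration of this idea, the following sketch shows how MLP state posteriors could replace GMM densities in an HMM. It is not taken from the thesis: the network shape, the state inventory, and the division by state priors to obtain scaled likelihoods are assumptions based on common hybrid-system practice.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class HybridMLP:
    """Toy MLP mapping an acoustic feature vector to HMM state posteriors.

    Hypothetical sizes: 39-dim features, one hidden layer, 120 HMM states.
    """
    def __init__(self, dim_in=39, dim_hidden=512, n_states=120, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (dim_in, dim_hidden))
        self.b1 = np.zeros(dim_hidden)
        self.W2 = rng.normal(0, 0.1, (dim_hidden, n_states))
        self.b2 = np.zeros(n_states)

    def state_posteriors(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return softmax(h @ self.W2 + self.b2)   # P(state | x)

def scaled_likelihoods(posteriors, state_priors):
    # In hybrid decoding the posteriors are commonly divided by the state
    # priors, yielding quantities proportional to p(x | state) that can be
    # plugged into the HMM in place of GMM densities.
    return posteriors / state_priors

mlp = HybridMLP()
x = np.zeros(39)                  # one (dummy) acoustic frame
priors = np.full(120, 1.0 / 120)  # uniform priors, for illustration only
print(scaled_likelihoods(mlp.state_posteriors(x), priors).shape)
```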
The MLP/HMM architecture was introduced for automatic speech recognition in Morgan and Bourlard (1995). Owing to the increasing power of readily available hardware and the development of new training methods, MLPs with several hidden layers, the so-called deep neural networks (DNNs), were shown to substantially outperform HMM/GMM-based acoustic models on a variety of tasks (Mohamed et al., 2011, 2012; Hinton et al., 2012). Because of this proven strength, the (deep) MLP/HMM architecture has become the new state of the art in acoustic modeling. As a drawback, DNNs are slower to train than HMM/GMM models.
Despite the recent advances reported with neural networks, the methods proposed in this thesis are assessed in an HMM/GMM-based system.
Language modeling
In the previous section, state-of-the-art acoustic modeling techniques were introduced. This section focuses on general aspects of another main component of ASR systems, the language model. For a review of this topic, the reader may refer, for example, to Rosenfeld (2000) or Goodman (2001).
This section begins with a definition of language modeling and its motivation. Next, the state-of-the-art n-gram based approach is presented. The last subsection presents other language modeling techniques, especially those based on continuous space vector representations.
N-gram based language models
The n-gram based framework has been the most widely used approach to language modeling for the past 30 years, being omnipresent in state-of-the-art speech recognition systems, either as the sole solution or in combination with other methods. The n-gram LM is derived from an approximation to the joint probability of the word sequence, assuming that the most relevant context information is encoded by a short-span history, that is, the (n−1) preceding words. In other words, it models the language as an (n−1)-th order Markov process. The word-based n-gram language model can be written as

$P(\mathbf{W}) \approx \prod_{m=1}^{M} P(w_m \mid w_{m-n+1}, \ldots, w_{m-1})$   (2.34)
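As a concrete illustration of Eq. (2.34), the sketch below scores a word sequence as a product of conditional n-gram probabilities. It is a toy example: the bigram order and the probability values are invented, not taken from the thesis.

```python
def sequence_probability(words, cond_prob, n=2):
    """Score a word sequence with Eq. (2.34): the product over positions of
    P(w_m | w_{m-n+1}, ..., w_{m-1}).  `cond_prob` maps (history, word) pairs
    to probabilities; histories shorter than n-1 occur at the sentence start.
    """
    p = 1.0
    for m, w in enumerate(words):
        history = tuple(words[max(0, m - n + 1):m])
        p *= cond_prob[(history, w)]
    return p

# Toy bigram probabilities (invented numbers, for illustration only).
cond_prob = {
    ((), "the"): 0.20,
    (("the",), "cat"): 0.05,
    (("cat",), "sat"): 0.10,
}
print(sequence_probability(["the", "cat", "sat"], cond_prob, n=2))
```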
In practice, a history of at most three words (that is, n = 4, a 4-gram language model) is used in most speech recognition tasks. In an empirical study, Goodman (2001) reported that no significant improvements could be observed using contexts longer than four words (5-grams). Despite this (heavy) approximation, the estimation of the n-gram parameters remains a sparse data problem, which is commonly addressed with a smoothing algorithm.
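To make the sparse data issue concrete, the sketch below estimates bigram probabilities with add-one (Laplace) smoothing so that unseen word pairs still receive non-zero probability. This is only a simplified stand-in: the text above merely refers to "a smoothing algorithm", and practical systems typically rely on more elaborate schemes such as Kneser-Ney smoothing.

```python
from collections import Counter

def add_one_bigram_lm(sentences):
    """Estimate P(w | v) = (c(v, w) + 1) / (c(v) + V) from a tiny corpus.

    Add-one smoothing is used purely to illustrate how smoothing
    redistributes probability mass to unseen bigrams.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>"}
    V = len(vocab)

    def prob(w, v):
        return (bigrams[(v, w)] + 1) / (unigrams[v] + V)
    return prob

p = add_one_bigram_lm([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(p("cat", "the"))   # seen bigram
print(p("dog", "cat"))   # unseen bigram still gets non-zero probability
```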
Other language modeling approaches
Like factored LMs, exponential language models (Della Pietra et al., 1992; Rosenfeld, 1996; Berger et al., 1996; Chen, 2009) have the intrinsic ability to incorporate arbitrary features. This flexibility makes it possible to address some of the weaknesses of n-gram based models by including, for example, longer context dependencies and morphological or syntactic information. In addition, this approach can benefit from methods that select the most useful features to include in the model. As a main drawback, both the estimation and the use of exponential models are computationally challenging.
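To make the form of such models concrete, the sketch below computes P(w | h) proportional to exp(sum_i lambda_i f_i(h, w)) over a small vocabulary. The two feature functions and the weights are invented for illustration; real exponential LMs use many features and learn the weights from data (e.g., by maximum entropy training).

```python
import math

# Hypothetical binary feature functions over (history, word) pairs.
def f_bigram(history, word):
    return 1.0 if history and (history[-1], word) == ("the", "cat") else 0.0

def f_suffix_s(history, word):
    return 1.0 if word.endswith("s") else 0.0

FEATURES = [f_bigram, f_suffix_s]
LAMBDAS  = [1.5, 0.3]   # invented weights; normally estimated from data

def exponential_lm_prob(word, history, vocab):
    """P(word | history) = exp(sum_i lambda_i f_i(history, word)) / Z(history)."""
    def score(w):
        return math.exp(sum(l * f(history, w) for l, f in zip(LAMBDAS, FEATURES)))
    return score(word) / sum(score(w) for w in vocab)

vocab = ["cat", "cats", "sat", "the"]
print(exponential_lm_prob("cat", ("the",), vocab))
```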
Differently from the approaches presented so far, decision tree based language models (Bahl et al., 1989) impose an underlying structure: the history space is partitioned at each node of a tree by asking arbitrary questions about the history. During the growing procedure, the question maximizing the likelihood of the training data is selected at each node. The data falling into the leaves are used to compute the model probabilities. Due to the greedy nature of the growing algorithm, finding the optimal tree is difficult, which has limited the success of the approach. In Xu and Jelinek (2004), the authors proposed combining randomly grown decision trees as a way to avoid local optima, reporting gains in perplexity and speech recognition performance over a baseline n-gram LM.
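The sketch below illustrates one greedy growing step: among a set of candidate questions about the history, pick the one that maximizes the training-data likelihood under the resulting leaf distributions. The candidate questions, the toy data, and the relative-frequency leaf model are assumptions made for illustration, not the algorithm of Bahl et al. (1989).

```python
import math
from collections import Counter

def leaf_log_likelihood(samples):
    """Log-likelihood of next words under the leaf's relative-frequency model."""
    counts = Counter(w for _, w in samples)
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())

def best_question(samples, questions):
    """Greedy step: pick the history question maximizing training likelihood."""
    best = None
    for q in questions:
        yes = [s for s in samples if q(s[0])]
        no  = [s for s in samples if not q(s[0])]
        if not yes or not no:
            continue
        gain = leaf_log_likelihood(yes) + leaf_log_likelihood(no)
        if best is None or gain > best[0]:
            best = (gain, q)
    return best

# Toy (history, next_word) samples and two invented questions about the history.
samples = [(("the",), "cat"), (("the",), "dog"), (("a",), "cat"), (("saw",), "the")]
questions = [lambda h: h[-1] in {"the", "a"},   # is the previous word a determiner?
             lambda h: len(h[-1]) > 2]          # is the previous word longer than 2 chars?
print(best_question(samples, questions))
```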
Table of contents:
Acknowledgments
Abstract
Résumé
List of Figures
List of Tables
List of Abbreviations
I. Context of this work
1. Introduction
1.1. The statistical framework
1.2. Cost efficient training data production
1.3. Matching the target data
1.4. Scientific goals
1.5. Outline
2. Automatic speech recognition
2.1. Statistical speech recognition
2.2. Acoustic modeling
2.2.1. Hidden Markov models
2.2.2. Parameter estimation
2.2.3. Model structure
2.2.4. Feature extraction
2.2.5. Hybrid MLP/HMM acoustic models
2.3. Language modeling
2.3.1. Definition
2.3.2. N-gram based language models
2.3.3. Other language modeling approaches
2.4. Decoding
2.4.1. Word recognizer
2.4.2. Evaluation measures
3. Baseline Automatic Speech Recognition System
3.1. The LIMSI broadcast recognition system
3.1.1. Acoustic modeling
3.1.2. Language modeling
3.1.3. Decoding
3.2. The European Portuguese system
3.2.1. Corpora
3.2.2. Pronunciation modeling
3.2.3. Early system
3.2.4. Improved system
3.3. The English system
3.3.1. Corpora
3.3.2. System overview
3.4. Summary
II. Unsupervised model training
4. Unsupervised HMM/GMM-based acoustic model training
4.1. Introduction
4.1.1. Maximum likelihood estimation
4.1.2. Maximum likelihood estimation without audio transcriptions
4.2. Approximating with the 1-best hypothesis
4.2.1. The influence of recognition errors
4.2.2. Generalizing the unsupervised training algorithm
4.2.3. Confidence measures
4.2.4. Weighting by confidence measures
4.2.5. Filtering by confidence measures
4.3. Approximating with multiple decoding hypotheses
4.3.1. The lattice-based training approach
4.3.2. Influence of the decoding parameters
4.3.3. Comparing lattice-based and 1-best training approaches
4.3.4. Comparing lattice-based, 1-best weighting and 1-best filtering approaches
4.4. Other training strategies
4.5. Summary
5. Unsupervised multi-layer perceptron training
5.1. Introduction
5.2. Multi-layer perceptron neural networks
5.2.1. MLP training
5.2.2. Hybrid HMM/MLP architecture
5.2.3. MLP for feature extraction
5.3. Unsupervised bottleneck MLP training
5.3.1. Experimental setup
5.3.2. Comparing PLP and MLP based acoustic models
5.3.3. Comparing unsupervised and supervised MLP and HMM training approaches
5.3.4. Filtering by confidence measures
5.3.5. Using additional untranscribed acoustic data
5.3.6. Comparing unsupervised and cross-lingual MLPs
5.4. Summary
6. Unsupervised language model training
6.1. Introduction
6.2. Unsupervised backoff n-gram language model training
6.2.1. Filtering by confidence measures
6.2.2. Weighting by confidence measures
6.2.3. Using multiple decoding hypotheses
6.2.4. Kneser-Ney smoothing with fractional counts
6.2.5. Experiments
6.3. Unsupervised neural network language model adaptation
6.3.1. Structured output layer neural network language model
6.3.2. Experiments
6.4. Summary
III. Model combination
7. Acoustic model interpolation
7.1. Introduction
7.2. Acoustic model adaptation
7.2.1. Maximum a posteriori
7.2.2. Maximum likelihood linear regression
7.2.3. Constrained maximum likelihood linear regression
7.2.4. Adaptive training approaches
7.3. Interpolating models
7.3.1. Interpolation of Gaussian mixture models
7.3.2. Data weighting
7.3.3. Estimation of the interpolation coefficients
7.4. Gaussian mixture model reduction
7.4.1. Gaussian mixture model reduction strategies
7.4.2. Constrained maximum likelihood estimation
7.5. Summary
8. Experiments with acoustic model interpolation
8.1. Introduction
8.2. Gaussian mixture reduction algorithm
8.2.1. Reduction of HMM state mixtures
8.2.2. Estimating universal background models
8.3. European Portuguese broadcast data recognition
8.3.1. Impact of interpolation coefficients
8.3.2. Comparing interpolation, data weighting, pooling and MAP adaptation
8.4. Multiple accented English data recognition
8.4.1. Experimental setup
8.4.2. Accent-adaptation via model interpolation
8.4.3. On-the-fly acoustic model interpolation
8.4.4. Leaving target accent out
8.5. Summary
IV. Conclusions
Publications by the author
Bibliography