Encoder-decoder
A standard RNN can map an input sequence to an output sequence of the same length, or to a fixed-size vector (the hidden state at the last time step). In other sequence-to-sequence tasks, such as speech recognition and machine translation, the input and output sequences may have different lengths. The encoder-decoder architecture is designed to solve these tasks. The traditional encoder-decoder architecture [34] is shown in Figure 2.6. An encoder RNN processes the input sequence and outputs a context vector c, which represents a summary of the input sequence; usually, c is the final hidden state of the encoder RNN. A decoder RNN then generates the output sequence conditioned on the context c. In [2], an attention mechanism is introduced into the encoder-decoder so that a different context vector can be used at each time step.
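The idea can be sketched in plain NumPy with vanilla RNN cells (toy dimensions and untrained random weights, chosen only for illustration; a real system would use a deep-learning framework and learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x, h, Wx, Wh, b):
    # one vanilla RNN step: h_t = tanh(Wx x_t + Wh h_{t-1} + b)
    return np.tanh(Wx @ x + Wh @ h + b)

def encode(xs, Wx, Wh, b, hidden):
    # run the encoder over the whole input sequence;
    # the final hidden state serves as the context vector c
    h = np.zeros(hidden)
    for x in xs:
        h = rnn_step(x, h, Wx, Wh, b)
    return h

def decode(c, steps, Wc, Wh, b, Wo):
    # generate `steps` outputs conditioned on the context c
    # (here c is fed as the input at every decoder step)
    h = np.zeros_like(c)
    ys = []
    for _ in range(steps):
        h = rnn_step(c, h, Wc, Wh, b)
        ys.append(Wo @ h)
    return np.stack(ys)

d_in, d_h, d_out = 4, 8, 3
Wx = rng.normal(size=(d_h, d_in)); Wh1 = rng.normal(size=(d_h, d_h)); b1 = np.zeros(d_h)
Wc = rng.normal(size=(d_h, d_h)); Wh2 = rng.normal(size=(d_h, d_h)); b2 = np.zeros(d_h)
Wo = rng.normal(size=(d_out, d_h))

xs = rng.normal(size=(5, d_in))            # input sequence of length 5
c = encode(xs, Wx, Wh1, b1, d_h)           # fixed-size summary of the input
ys = decode(c, steps=7, Wc=Wc, Wh=Wh2, b=b2, Wo=Wo)
print(ys.shape)                            # (7, 3): output length differs from input length
```

The point of the sketch is structural: the only link between the two RNNs is the fixed-size vector c, which is exactly the bottleneck that attention [2] later relaxes.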
Loss function and optimization
In machine learning tasks, the loss function, or objective function, measures the inaccuracy of the predictions. These tasks can thus be cast as optimization problems that seek to minimize a loss function. Commonly used loss functions include mean squared error, binary cross-entropy, and categorical cross-entropy.
Gradient descent is a common method to solve optimization problems, especially when the objective function is convex. However, in neural network models we do not use plain gradient descent directly. The main reason is that the training set is so large that computing the full gradient is expensive. In addition, the objective functions are typically non-convex, so the optimization may converge to a local optimum. Stochastic Gradient Descent (SGD), also known as incremental gradient descent, is widely used in neural network models; it is a stochastic approximation of gradient descent that seeks minima iteratively with a learning rate. The training set is divided into several batches, and each iteration performs a gradient step using a single batch. SGD can diverge or converge slowly if the learning rate is set inappropriately. Many more advanced alternatives exist, for example Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam; a brief introduction to these methods can be found in [35].
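As a minimal illustration of minibatch SGD, consider a toy least-squares problem in NumPy (all sizes and hyper-parameter values here are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy linear-regression data: y = X @ w_true + noise
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

def sgd(X, y, lr=0.1, batch_size=20, epochs=50):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)              # reshuffle the training set
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # MSE gradient computed on one batch only, not the full set
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad                    # gradient step scaled by the learning rate
    return w

w = sgd(X, y)
print(np.round(w, 2))
```

With a learning rate that is too large the iterates diverge; too small, and many more epochs are needed — which is the sensitivity to the learning rate mentioned above, and the motivation for adaptive methods such as Adagrad or Adam.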
Probabilistic speaker model
The probabilistic speaker model aims at factorizing speech features into factors related to the speaker and factors related to other sources of variation. A classical probabilistic speaker model is the Gaussian Mixture Model-Universal Background Model (GMM-UBM).
The UBM is a model that represents general, speaker-independent feature characteristics. The UBM is usually represented by a GMM trained on a large amount of data [36]. The speaker model is derived from the UBM by Maximum a Posteriori (MAP) adaptation [37]. GMM-UBM was extended to a low-rank formulation, leading to the Joint Factor Analysis (JFA) model, which decomposes the speech signal into speaker-independent, speaker-dependent, channel-dependent, and residual components [38]. The i-vector model [39] is a simplified version of JFA and became the state of the art in the early 2010s. The speaker-dependent and channel-dependent factors are replaced by a single total variability factor: s = m + Tw, where s is the utterance-dependent GMM mean supervector, m is the speaker-independent UBM supervector, T is a low-rank total variability matrix, and w is the i-vector.
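The low-rank structure of s = m + Tw can be illustrated with a small NumPy sketch (all dimensions below are hypothetical, chosen only for illustration; real systems often use UBMs with thousands of components and i-vectors of a few hundred dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical sizes: C Gaussian components, F-dimensional features,
# i-vector dimension R
C, F, R = 64, 20, 100
CF = C * F                      # supervector dimension (stacked component means)

m = rng.normal(size=CF)         # speaker-independent UBM mean supervector
T = rng.normal(size=(CF, R))    # low-rank total variability matrix
w = rng.normal(size=R)          # i-vector for one utterance

s = m + T @ w                   # s = m + Tw: utterance-dependent supervector
print(s.shape, w.shape)
```

The long supervector s is thus explained by the short vector w, which is why the i-vector can serve as a compact utterance representation for downstream scoring and clustering.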
Voice Activity Detection (VAD)
Voice Activity Detection (VAD) is the task of labeling speech and non-speech segments in an audio stream. Non-speech segments may include silence, music, laughter, and other background noises. VAD is a fundamental component of almost all speech processing tasks, such as speech enhancement, speaker recognition, and speech recognition [51]. In speaker diarization, VAD has a significant impact in two ways. First, missed and false alarm speech segments contribute directly to diarization evaluation metrics such as the diarization error rate (DER), so poor VAD performance increases DER. Second, in the clustering step, missed speech segments reduce the data available for each speaker, while false alarm segments introduce impurities into the speaker clusters; a poor VAD system therefore also increases the clustering error. Early speaker diarization systems attempted to perform VAD within the clustering step, treating non-speech segments as an extra cluster. However, it was observed that using VAD as a preprocessing step leads to better results [17].
References [17; 51] review different traditional approaches to the VAD task. These approaches fall into two categories: rule-based and model-based. In recent years, neural network approaches have also been successfully applied to VAD.
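As an example of a simple rule-based approach, a frame-energy threshold can serve as a crude VAD (the frame sizes and threshold below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def energy_vad(signal, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-35):
    # rule-based VAD: mark a frame as speech when its log energy
    # exceeds a fixed threshold relative to the loudest frame
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([np.sum(signal[i * hop : i * hop + frame] ** 2)
                       for i in range(n_frames)])
    log_e = 10 * np.log10(energy + 1e-12)
    return log_e > (log_e.max() + threshold_db)   # boolean speech mask per frame

sr = 16000
silence = 0.001 * rng.normal(size=sr)                 # 1 s of near-silence
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s "speech-like" tone
labels = energy_vad(np.concatenate([silence, tone, silence]), sr)
print(round(float(labels.mean()), 2))                 # fraction of frames labeled speech
```

Such energy rules are fragile under noisy or music-heavy conditions, which is precisely what motivates the model-based and neural approaches surveyed in the references above.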
Table of contents
1 Introduction
1.1 Motivations
1.2 Objectives
1.3 Overview of the Thesis
2 State of the Art
2.1 Feature extraction
2.1.1 Short-term features
2.1.2 Dynamic features
2.1.3 Prosodic features
2.2 Modeling
2.2.1 Gaussian Mixture Models (GMM)
2.2.2 Hidden Markov Models (HMM)
2.2.3 Neural networks
2.2.3.1 Multilayer Perceptron (MLP)
2.2.3.2 Convolutional Neural Network (CNN)
2.2.3.3 Recurrent Neural Network (RNN)
2.2.3.4 Encoder-decoder
2.2.3.5 Loss function and optimization
2.2.4 Speaker Modeling
2.2.4.1 Probabilistic speaker model
2.2.4.2 Neural network based speaker model
2.3 Voice Activity Detection (VAD)
2.3.1 Rule-based approaches
2.3.2 Model-based approaches
2.4 Speaker change detection (SCD)
2.5 Clustering
2.5.1 Offline clustering
2.5.1.1 Hierarchical clustering
2.5.1.2 K-means
2.5.1.3 Spectral clustering
2.5.1.4 Affinity Propagation (AP)
2.5.2 Online clustering
2.6 Re-segmentation
2.7 Datasets
2.7.1 REPERE & ETAPE
2.7.2 CALLHOME
2.8 Evaluation metrics
2.8.1 VAD
2.8.2 SCD
2.8.2.1 Recall and precision
2.8.2.2 Coverage and purity
2.8.3 Clustering
2.8.3.1 Confusion
2.8.3.2 Coverage and purity
2.8.4 Diarization error rate (DER)
3 Neural Segmentation
3.1 Introduction
3.2 Definition
3.3 Voice activity detection (VAD)
3.3.1 Training on sub-sequence
3.3.2 Prediction
3.3.3 Implementation details
3.3.4 Results and discussion
3.4 Speaker change detection (SCD)
3.4.1 Class imbalance
3.4.2 Prediction
3.4.3 Implementation details
3.4.4 Experimental results
3.4.5 Discussion
3.4.5.1 Do we need to detect all speaker change points?
3.4.5.2 Fixing class imbalance
3.4.5.3 "The Unreasonable Effectiveness of LSTMs"
3.5 Re-segmentation
3.5.1 Implementation details
3.5.2 Results
3.6 Conclusion
4 Clustering Speaker Embeddings
4.1 Introduction
4.2 Speaker embedding
4.2.1 Speaker embedding systems
4.2.2 Embeddings for fixed-length segments
4.2.3 Embedding system with speaker change detection
4.2.4 Embedding system for experiments
4.3 Clustering by affinity propagation
4.3.1 Implementation details
4.3.2 Results and discussions
4.3.3 Discussions
4.4 Improved similarity matrix
4.4.1 Bi-LSTM similarity measurement
4.4.2 Implementation details
4.4.2.1 Initial segmentation
4.4.2.2 Embedding systems
4.4.2.3 Network architecture
4.4.2.4 Spectral clustering
4.4.2.5 Baseline
4.4.2.6 Dataset
4.4.3 Evaluation metrics
4.4.4 Training and testing process
4.4.5 Results
4.4.6 Discussions
4.5 Conclusion
5 End-to-End Sequential Clustering
5.1 Introduction
5.2 Hyper-parameters optimization
5.2.1 Hyper-parameters
5.2.2 Separate vs. joint optimization
5.2.3 Results
5.2.4 Analysis
5.3 Neural sequential clustering
5.3.1 Motivations
5.3.2 Principle
5.3.3 Loss function
5.3.4 Model architectures
5.3.4.1 Stacked RNNs
5.3.4.2 Encoder-decoder
5.3.5 Simulated data
5.3.5.1 Label generation y
5.3.5.2 Embedding generation (x)
5.3.6 Baselines
5.3.7 Implementation details
5.3.7.1 Data
5.3.7.2 Stacked RNNs
5.3.7.3 Encoder-decoder architecture
5.3.7.4 Training and testing
5.3.7.5 Hyper-parameters tuning for baselines
5.3.8 Results
5.3.9 Discussions
5.3.9.1 What does the encoder do?
5.3.9.2 Neural sequential clustering on long sequences
5.3.9.3 Sequential clustering with stacked unidirectional RNNs
5.4 Conclusion
6 Conclusions and Perspectives
6.1 Conclusions
6.2 Perspectives
6.2.1 Sequential clustering in real diarization scenarios
6.2.2 Overlapped speech detection
6.2.3 Online diarization system
6.2.4 End-to-end diarization system
References