Learning the Dirichlet Process parsimonious mixtures using Gibbs sampling


The finite Gaussian mixture model (GMM)

One of the most widely used distributions for generating the observed data, which has shown great performance in cluster analysis (Banfield and Raftery, 1993; Celeux and Govaert, 1995; Day, 1969; Fraley and Raftery, 1998a; Ghahramani and Hinton, 1997; Marriott, 1975; McLachlan et al., 2003; McNicholas and Murphy, 2008; Scott and Symons, 1981), is the normal distribution. Each component of this mixture model has a Gaussian density, parametrized by the mean vector µk and the covariance matrix Σk, and defined by:

p_k(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right).

The Gaussian density pk(xi|θk) can be denoted as N (µk, Σk) or N (xi|µk, Σk), where θk = (µk, Σk). Thus, the multivariate Gaussian mixture model, given as

p(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),   (2.4)

is parametrized by the parameter vector θ = (π1, . . . , πK, µ1, . . . , µK, Σ1, . . . , ΣK). The generative process for the Gaussian mixture model can be stated by the same two steps as the generative process for the general finite mixture model (Equation (2.2)). However, in the GMM case, for each component k, the observation xi is generated independently from a multivariate Gaussian with the corresponding parameters θk = {µk, Σk}. This is summarized as:

z_i \sim \mathrm{Mult}(\pi_1, \ldots, \pi_K),   (2.5)
x_i \mid \mu_{z_i}, \Sigma_{z_i} \sim \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}).
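To make this concrete, here is a minimal sketch of evaluating the mixture density of Equation (2.4) as a weighted sum of component densities. The helper name gmm_density and the small usage values are introduced here only for illustration; SciPy's multivariate normal density is assumed available.

```python
from scipy.stats import multivariate_normal

def gmm_density(x, pi, means, covs):
    """Evaluate p(x | theta) = sum_k pi_k N(x | mu_k, Sigma_k), as in Equation (2.4)."""
    return sum(pi_k * multivariate_normal.pdf(x, mean=mu_k, cov=Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pi, means, covs))

# Usage: a two-component GMM in R^2 evaluated at a single point (illustrative values).
pi = [0.6, 0.4]
means = [[0.0, 0.0], [1.0, 1.0]]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.1], [0.1, 0.5]]]
print(gmm_density([0.5, 0.5], pi, means, covs))
```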
As for the general mixture model, Figure 2.2 shows the probabilistic graphical model for the finite multivariate GMM.
An example of a three-component multivariate GMM in R² with the following model parameters: π = (0.5, 0.3, 0.2), µ1 = (0.22, 0.45), µ2 = (0.5, 0.5), µ3 = (0.77, 0.55), and

\Sigma_1 = \begin{pmatrix} 0.018 & 0.01 \\ 0.01 & 0.011 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 0.011 & -0.01 \\ -0.01 & 0.018 \end{pmatrix}, \quad \Sigma_3 = \Sigma_1,

is shown in Figure 2.3.
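For illustration, the two-step generative process of Equation (2.5) can be simulated with the parameters above. This is only a sketch (the random seed and sample size are arbitrary choices made here), not the code that produced Figure 2.3.

```python
import numpy as np

# Parameters of the three-component GMM example in R^2 given above.
pi = np.array([0.5, 0.3, 0.2])
means = [np.array([0.22, 0.45]), np.array([0.50, 0.50]), np.array([0.77, 0.55])]
Sigma1 = np.array([[0.018, 0.01], [0.01, 0.011]])
Sigma2 = np.array([[0.011, -0.01], [-0.01, 0.018]])
covs = [Sigma1, Sigma2, Sigma1]  # Sigma3 = Sigma1

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
n = 500
z = rng.choice(len(pi), size=n, p=pi)                                   # z_i ~ Mult(pi_1, ..., pi_K)
X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])   # x_i ~ N(mu_{z_i}, Sigma_{z_i})
```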
When modeling multivariate data, mixture models may suffer from the curse of dimensionality, which causes difficulties for high-dimensional data. We refer the reader, for example, to the discussions of the curse of dimensionality in mixture modeling and model-based clustering in Bouveyron (2006) and Bouveyron and Brunet-Saumard (2014); we also discuss it further in the following subsection.

Dimensionality reduction and Parsimonious mixture models

One of the most important issues in modeling and clustering high-dimensional data is the curse of dimensionality. This is due to the fact that, in model-based clustering, an increase in the data dimension generally results in an increase in the dimension of the parameter space. For example, for a multivariate Gaussian mixture model with K components fitted to d-dimensional data, the number of free parameters to estimate is given by:

ν(θ) = ν(π) + ν(µ) + ν(Σ),   (2.6)
where ν(π) = (K − 1), ν(µ) = Kd and ν(Σ) = Kd(d + 1)/2 represent, respectively, the number of mixing proportions, of mean vector parameters, and of distinct values of the symmetric covariance matrices. One can see from Equation (2.6) that the number of parameters to estimate for the GMM is quadratic in d, meaning that higher-dimensional data generate a larger number of model parameters to estimate. Another issue for Gaussian mixture model estimation arises when the number of observations n is smaller than the dimension d, which produces singular covariance matrices and makes model-based clustering useless. Fortunately, model-based clustering can deal with this curse of dimensionality problem through several approaches known in the literature: dimensionality reduction, regularization methods, and parsimonious mixture models. We discuss them in the next subsections.
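As a quick check of this count, the short sketch below (the function name gmm_free_parameters is introduced here only for illustration) computes ν(θ) for a full-covariance GMM and shows how fast it grows with d:

```python
def gmm_free_parameters(K, d):
    """Number of free parameters nu(theta) of a full-covariance GMM, Equation (2.6)."""
    nu_pi = K - 1                     # mixing proportions
    nu_mu = K * d                     # mean vectors
    nu_sigma = K * d * (d + 1) // 2   # symmetric covariance matrices
    return nu_pi + nu_mu + nu_sigma

# Quadratic growth in d: K = 3 components in d = 10 dimensions already
# require 2 + 30 + 165 = 197 free parameters.
print(gmm_free_parameters(3, 10))  # 197
```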

Dimensionality reduction

A first solution is to select useful characteristics from the original data that are sufficient to represent it as well as possible, that is, without significant loss of information; for examples in clustering, one can cite Hall et al. (2005) and Murtagh (2009). In this formulation of dimensionality reduction, different linear and non-linear dimensionality reduction techniques have been proposed to optimize the representation space. One of the most popular approaches, Principal Component Analysis (PCA), is a linear method first introduced by Hotelling (1933) and Pearson (1901); its probabilistic version, Probabilistic PCA (PPCA), was introduced by Tipping and Bishop (1999). We can also cite other linear dimensionality reduction methods such as Independent Component Analysis (ICA) (Hérault et al., 1985) and Factor Analysis (FA) (Spearman, 1904), or nonlinear dimensionality reduction methods such as Kernel Principal Component Analysis (Schölkopf et al., 1999), Relevance Feature Vector Machines (Tipping, 2001), etc.
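As a minimal illustration of such a linear projection, the sketch below implements a plain SVD-based PCA (not tied to any specific implementation discussed in this work): the data are centred and projected onto their first q principal directions. The function name pca_project and the example dimensions are assumptions made here for illustration.

```python
import numpy as np

def pca_project(X, q):
    """Project the n x d data matrix X onto its first q principal components."""
    Xc = X - X.mean(axis=0)                            # centre the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are principal directions
    return Xc @ Vt[:q].T                               # n x q reduced representation

# Example: reduce 50-dimensional data to 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X_reduced = pca_project(X, q=2)
print(X_reduced.shape)  # (200, 2)
```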

Regularization methods

Another way to deal with the problem of high dimensionality is regularization. For the GMM, for example, the curse of dimensionality issue is mainly related to the covariance matrices Σk, which need to be inverted. This can be tackled with a numerical treatment, namely the regularization methods, which consist in adding a numerical term to the covariance matrix before it is inverted. For example, one simple way is to add a positive term to its diagonal, as follows:

Σ̂k ← Σ̂k + σk I.
This is the ridge regularization, often used in Linear Discriminant Analysis (LDA). To generalize the ridge regularization, the identity matrix can be replaced by some regularization matrix (Hastie et al., 1995). We do not focus on regularization methods here; the reader can refer to the paper of Mkhadri et al. (1997) for more details on the different regularization methods.
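A minimal sketch of this ridge-type regularization is given below; the shrinkage amount sigma and the example data are values chosen here for illustration, not ones prescribed by the methods above.

```python
import numpy as np

def regularize_covariance(Sigma_hat, sigma):
    """Ridge regularization: add a positive term sigma to the diagonal of Sigma_hat."""
    return Sigma_hat + sigma * np.eye(Sigma_hat.shape[0])

# Example: with n < d the empirical covariance is singular; after
# regularization it can be safely inverted.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 10))             # n = 5 observations, d = 10 dimensions
Sigma_hat = np.cov(X, rowvar=False)      # rank at most n - 1, hence singular
Sigma_reg = regularize_covariance(Sigma_hat, sigma=1e-2)
precision = np.linalg.inv(Sigma_reg)     # inversion now succeeds
```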


Parsimonious mixture models

Another way to tackle the curse of dimensionality is to use parsimonious mixture models (Banfield and Raftery, 1993; Bensmail, 1995; Bensmail and Celeux, 1996; Celeux and Govaert, 1995; Fraley and Raftery, 1998b, 2002, 2005, 2007a,b), where the main idea is to reduce the number of parameters to estimate in the mixture by parametrizing the component covariance matrices. In this work we focus on these multivariate parsimonious Gaussian mixture models for modeling and clustering high-dimensional data.
Constrained Gaussian Mixture Models. A traditional way to obtain parsimonious Gaussian models, that is, to reduce the number of parameters to estimate, is to impose constraints on the covariance matrices. The most frequently used constraints for Gaussian mixture models are listed as follows:
1. the Full-GMM, which is the GMM itself, with a full covariance matrix Σk for each component k = 1, . . . , K;
2. the Com-GMM, which assumes that all the components share a common covariance matrix: Σk = Σ, ∀k = 1, . . . , K;
3. the Diag-GMM, in which all the components have diagonal covariance matrices: Σk = diag(σ²k1, . . . , σ²kd);
4. the Com-Diag-GMM, which has a common diagonal covariance matrix for all the components k = 1, . . . , K: Σk = Σ = diag(σ²1, . . . , σ²d);
5. the Sphe-GMM, which supposes spherical covariances for all the components k = 1, . . . , K: Σk = σ²k I;
6. the Com-Sphe-GMM, which is a spherical model with a common covariance for all the components k = 1, . . . , K, that is: Σk = Σ = σ²I.
The number of mixture parameters related to the covariance matrices, for these six constrained GMMs, is summarized in Table 2.1.
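As a complementary sketch (Table 2.1 itself is not reproduced here; the per-model counts below are the standard ones implied by the covariance structures listed above), these numbers can be computed as follows:

```python
def covariance_parameter_count(model, K, d):
    """Number of free covariance parameters for the six constrained GMMs."""
    counts = {
        "Full-GMM":      K * d * (d + 1) // 2,  # one full covariance per component
        "Com-GMM":       d * (d + 1) // 2,      # one shared full covariance
        "Diag-GMM":      K * d,                 # one diagonal covariance per component
        "Com-Diag-GMM":  d,                     # one shared diagonal covariance
        "Sphe-GMM":      K,                     # one variance per component
        "Com-Sphe-GMM":  1,                     # a single shared variance
    }
    return counts[model]

# Example: K = 3 components in d = 4 dimensions.
for m in ["Full-GMM", "Com-GMM", "Diag-GMM", "Com-Diag-GMM", "Sphe-GMM", "Com-Sphe-GMM"]:
    print(m, covariance_parameter_count(m, K=3, d=4))
# Full-GMM 30, Com-GMM 10, Diag-GMM 12, Com-Diag-GMM 4, Sphe-GMM 3, Com-Sphe-GMM 1
```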

Table of contents:

Notations
1 Introduction 
2 Mixture model-based clustering 
2.1 Introduction
2.2 The finite mixture model
2.3 The finite Gaussian mixture model (GMM)
2.4 Dimensionality reduction and Parsimonious mixture models
2.4.1 Dimensionality reduction
2.4.2 Regularization methods
2.4.3 Parsimonious mixture models
2.5 Maximum likelihood (ML) fitting of finite mixture models
2.5.1 ML fitting via the EM algorithm
2.5.2 Illustration of ML fitting of a GMM
2.5.3 ML fitting of the parsimonious GMMs
2.5.4 Illustration: ML fitting of parsimonious GMMs
2.6 Model selection and comparison in finite mixture models
2.6.1 Model selection via information criteria
2.6.2 Model selection for parsimonious GMMs
2.6.3 Illustration: Model selection and comparison via information criteria
2.7 Conclusion
3 Bayesian mixture models for model-based clustering 
3.1 Introduction
3.2 The Bayesian finite mixture model
3.3 The Bayesian Gaussian mixture model
3.4 Bayesian parsimonious GMMs
3.5 Bayesian inference of the finite mixture model
3.5.1 Maximum a posteriori (MAP) estimation for mixtures
3.5.2 Bayesian inference of the GMMs
3.5.3 MAP estimation via the EM algorithm
3.5.4 Bayesian inference of the parsimonious GMMs via the EM algorithm
3.5.5 Markov Chain Monte Carlo (MCMC) inference
3.5.6 Bayesian inference of GMMs via Gibbs sampling
3.5.7 Illustration: Bayesian inference of the GMM via Gibbs sampling
3.5.8 Bayesian inference of parsimonious GMMs via Gibbs sampling
3.5.9 Bayesian model selection and comparison using Bayes Factors
3.5.10 Experimental study
3.6 Conclusion
4 Dirichlet Process Parsimonious Mixtures (DPPM) 
4.1 Introduction
4.2 Bayesian non-parametric mixtures
4.2.1 Dirichlet Processes
4.2.2 Polya Urn representation
4.2.3 Chinese Restaurant Process (CRP)
4.2.4 Stick-Breaking Construction
4.2.5 Dirichlet Process Mixture Models
4.2.6 Infinite Gaussian Mixture Model and the CRP
4.2.7 Learning the Dirichlet Process models
4.3 Chinese Restaurant Process parsimonious mixture models
4.4 Learning the Dirichlet Process parsimonious mixtures using Gibbs sampling
4.5 Conclusion
5 Application on simulated data sets and real-world data sets
5.1 Introduction
5.2 Simulation study
5.2.1 Varying the clusters shapes, orientations, volumes and separation
5.2.2 Obtained results
5.2.3 Stability with respect to the hyperparameter values
5.3 Applications on benchmarks
5.3.1 Clustering of the Old Faithful Geyser data set
5.3.2 Clustering of the Crabs data set
5.3.3 Clustering of the Diabetes data set
5.3.4 Clustering of the Iris data set
5.4 Scaled application on real-world bioacoustic data
5.5 Conclusion
6 Bayesian non-parametric Markovian perspectives
6.1 Introduction
6.2 Hierarchical Dirichlet Process Hidden Markov Model (HDPHMM)
6.3 Scaled application on a real-world bioacoustic data
6.4 Conclusion
7 Conclusion and perspectives 
7.1 Conclusions
7.2 Future works
A Appendix A 
A.1 Prior and posterior distributions for the model parameters
A.1.1 Hyperparameters values
A.1.2 Spherical models
A.1.3 Diagonal models
A.1.4 General models
B Appendix B 
B.1 Multinomial distribution
B.2 Normal-Inverse Wishart distribution
B.3 Dirichlet distribution
List of figures
List of tables
List of algorithms
List of my publications
