Several strategies of model selection
Model selection via penalization
Deterministic penalties
Given a family $\mathcal{M}_n$ of models and the associated family of estimators $(\hat{s}_m)_{m \in \mathcal{M}_n}$, model selection by penalized criterion consists in choosing the best model as the minimizer of a criterion crit(·). This quantity is defined by $\mathrm{crit}(m) = \gamma_n(\hat{s}_m) + \mathrm{pen}(m)$, where $\gamma_n(\hat{s}_m)$ assesses how well $\hat{s}_m$ fits the data, while the penalty pen(·) accounts for the model complexity, and possibly for the complexity of the collection as well (see [22]).
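To fix ideas, here is a minimal Python sketch of this strategy. The names `models`, `gamma_n` and `pen` are hypothetical placeholders (an iterable of model indices, the empirical risk of $\hat{s}_m$, and the penalty); the criterion itself only fixes the argmin structure.

```python
def select_model(models, gamma_n, pen):
    """Choose the model minimizing crit(m) = gamma_n(m) + pen(m).

    `models`, `gamma_n` and `pen` are placeholders to be supplied by
    the user: an iterable of model indices, a function returning the
    empirical risk of the estimator on model m, and a function
    returning the penalty pen(m).
    """
    crit = {m: gamma_n(m) + pen(m) for m in models}
    return min(crit, key=crit.get)
```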
This idea dates back to the early 70s with the seminal papers of Akaike about FPE ([3]) and AIC ([4]), and of Mallows about $C_p$ ([68]). As illustrated by Mallows' heuristics in the regression framework, these approaches intend to design an unbiased estimator of the risk of $\hat{s}_m$ (see [69]), which we then minimize over $\mathcal{M}_n$. To name but a few, other examples of penalized criteria are the Bayesian Information Criterion (BIC) introduced by Schwarz [76], the Minimum Description Length (MDL) of Rissanen [73] and GIC$_{\lambda_n}$ proposed by Nishii [70]. Whereas FPE, $C_p$ and GIC$_{\lambda_n}$ are defined with the quadratic loss, AIC, BIC and MDL are derived in the log-likelihood context. With the exception of MDL and GIC$_{\lambda_n}$, Baraud et al. [9] point out that the above penalties result from an asymptotic heuristics and may strongly depend on the sample size as well as on the type of model collection at hand.
This is the reason why a lot of work has been devoted to deriving penalties in the non-asymptotic setting (see [12] for an introduction and [69] for a wide study of such penalties in various frameworks). As for density estimation, we mention Barron and Cover [13] (with MDL), Birgé and Massart [20] for the quadratic loss and Castellan [35] for the Kullback-Leibler divergence. In the regression framework, Birgé and Massart [21] provide penalties that are generalizations of the $C_p$ criterion in a Gaussian setting when the variance is known. Baraud [8] extends these penalties to the case of unknown variance, which is also one of the purposes of Baraud et al. [9], where the authors derive new penalties in a wide range of settings, from variable selection to change-points detection for instance.
REMARK: All these penalized criteria are justified in a homoscedastic framework, owing to the high level of technicality of the heteroscedastic setting.
Deterministic penalties are derived from bounds on the uniform deviation of $\ell(s, \hat{s}_m)$ around its expectation over the whole collection of models. This approach relies on concentration inequalities such as Talagrand's [83] or Bousquet's version thereof [25]. We refer to Massart [69] for a review of this topic and to Boucheron et al. [24] for a link with moment inequalities.
Unlike the usual bias-variance tradeoff, the need for a uniform control of the fluctuations of $\ell(s, \hat{s}_m)$ over the whole collection of models raises the issue of collection complexity when the collection is rich. Thus, depending on whether or not many models share the same dimension, we have to distinguish between two types of penalties.
• $C_p$-like penalties are valid when the collection of models is not too rich and may be expressed as $\mathrm{pen}(m) = C D_m$, where $C > 1$ denotes an unknown universal constant. Justifications may be found in Baraud [8] for regression and in Castellan [35] for density estimation, for instance.
• As for richer collections of models, Birgé and Massart [20] in density estimation and Barron [12] in regression have shown that the appropriate penalty structure is of the type $\mathrm{pen}(m) = c_1 D_m + c_2 D_m \log N$, where $c_1$ and $c_2$ are universal positive constants, and $N$ denotes, for instance, the total number of variables in the unordered variable selection problem. From a theoretical viewpoint, they argued that the log-term is unavoidable (see Birgé and Massart [21] and Donoho and Johnstone [40]) and should be understood as the price to pay for ignoring the oracle. (Both penalty shapes are sketched in code below.)
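The two penalty shapes can be made concrete in a few lines of Python. The constants `C`, `c1` and `c2` are unknown in theory, so the default values below are purely illustrative placeholders.

```python
import numpy as np

def pen_cp_like(D_m, C=2.0):
    """C_p-like penalty, suited to collections with few models per
    dimension; C > 1 is an unknown constant (2.0 is a placeholder)."""
    return C * D_m

def pen_rich(D_m, N, c1=1.0, c2=1.0):
    """Penalty suited to rich collections, e.g. unordered selection
    among N variables; c1, c2 are unknown positive constants
    (the values here are placeholders)."""
    return c1 * D_m + c2 * D_m * np.log(N)
```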
Note that concentration inequalities provide tight bounds only up to constants, which requires further work in order to get the best possible values for these constants. Lebarbier [62] performed an intensive simulation study and found that the optimal constants in the Gaussian change-points setup are $c_1 = 5$ and $c_2 = 2$. Actually, these constants may only be determined up to a multiplicative factor. In a recent paper, Birgé and Massart [22] discuss this point and describe a strategy named the “slope heuristics”, which is theoretically proven to yield an adaptive choice of this unknown multiplicative constant. For not too rich collections of models, Arlot and Massart [6] provide a justification of this heuristics in the random-design regression framework with heteroscedastic noise, without any distributional assumption on the residuals.
REMARK: Lavielle [61] suggests using a $C_p$-like penalty whatever the collection complexity. Since this penalty is determined up to an unknown constant, he developed another heuristics which provides an adaptive choice for this constant as well. This procedure exhibits rather good performance, at least in change-points detection [71]. From a theoretical viewpoint, it relates to the above Birgé and Massart criterion in that, when the dimension $D^*$ of the oracle model satisfies $D^* \ll n$ in the change-points detection framework, the penalty may be written as $\mathrm{pen}(m) = c_1 D_m + c_2 D_m \log\left(n/D_m\right) \approx D_m \left(c_1 + c_2 \log n\right) = C D_m$, where $C$ denotes a constant. The conclusion we may draw is that this heuristics should work as long as $D^* \ll n$, but may encounter some trouble otherwise.
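As a rough illustration of how such an unknown multiplicative constant may be calibrated in practice, here is a minimal Python sketch of the slope heuristics of Birgé and Massart [22], in its regression-on-large-models form. The window `dim_frac` is an arbitrary choice of this sketch, and real implementations (e.g. the dimension-jump variant) are more careful.

```python
import numpy as np

def slope_heuristics(dims, emp_risks, dim_frac=0.5):
    """Minimal sketch of the slope heuristics.

    On large models, the empirical risk gamma_n(hat s_m) is expected to
    decrease linearly in D_m with slope -C_min, where C_min defines the
    minimal penalty; the calibrated constant is then C_opt = 2 * C_min.
    `dims` holds the D_m values and `emp_risks` the gamma_n(hat s_m)
    values; `dim_frac` (fraction of largest dimensions used in the fit)
    is a tuning choice of this sketch, not part of the heuristics.
    """
    dims = np.asarray(dims, dtype=float)
    emp_risks = np.asarray(emp_risks, dtype=float)
    large = dims >= np.quantile(dims, dim_frac)   # regression window
    # Least-squares fit of gamma_n ~ a - C_min * D_m on the large models.
    A = np.column_stack([np.ones(large.sum()), -dims[large]])
    sol, *_ = np.linalg.lstsq(A, emp_risks[large], rcond=None)
    C_opt = 2.0 * sol[1]                          # twice the minimal penalty
    crit = emp_risks + C_opt * dims
    return dims[int(np.argmin(crit))], C_opt
```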
Random penalties
Since they are derived from concentration inequalities, penalized criteria based on deterministic penalties may be seen as upper bounds of the ideal criterion $\ell(s, \hat{s}_m)$. Arlot [7] introduces the ideal penalty, defined for any $m \in \mathcal{M}_n$ by $\ell(s, \hat{s}_m) = P_n\gamma(\hat{s}_m) + \mathrm{pen}_{\mathrm{id}}(m)$ with $\mathrm{pen}_{\mathrm{id}}(m) := (P - P_n)\,\gamma(\hat{s}_m)$.
Hence we conclude that any (meaningful) deterministic penalty is an upper bound of the ideal penalty (up to some unknown multiplicative constants).
Although concentration inequalities are sharp tools, we may instead try to estimate the risk directly. To this end, another strategy is to use an approximation of $\mathrm{pen}_{\mathrm{id}}(m)$ rather than an upper bound of it, which results in random penalties.
Efron [44, 45] introduced the bootstrap, which provides an approximation of the true distribution, and applied this resampling algorithm to model selection ([46]). Fromont [50] formalized this idea and successfully developed bootstrap penalties in classification. Arlot [7, 5] generalized the idea to other resampling schemes.
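A hedged sketch of such a resampling penalty is given below, in the spirit of bootstrap penalties: the original sample plays the role of $P$ and each bootstrap resample the role of $P_n$. The callables `fit` and `loss`, the replicate number `B`, and the omission of any calibration constant are all assumptions of this sketch, not part of the cited procedures.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_penalty(X, y, fit, loss, B=100):
    """Sketch of a bootstrap estimate of pen_id(m) = (P - P_n) gamma(hat s_m).

    We train on a bootstrap resample and contrast the average loss on
    the full sample (playing P) with the average loss on the resample
    (playing P_n).  `fit(X, y)` returns a fitted predictor and
    `loss(pred, X, y)` an average loss; both are user-supplied
    placeholders, as is B.
    """
    n = len(y)
    draws = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # bootstrap resample
        pred = fit(X[idx], y[idx])            # estimator on the resample
        draws.append(loss(pred, X, y) - loss(pred, X[idx], y[idx]))
    return float(np.mean(draws))
```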
Despite the scarcity of theoretical results about resampling algorithms, due to the high technicality of the proofs, random penalties are useful and really attractive, since they do not depend on any distributional assumption. Whereas some deterministic penalties are only justified in a Gaussian framework for instance, resampling-based ones work regardless of the distribution of the data and of whether the noise is homoscedastic or heteroscedastic. In particular, when the model collection is not too rich, Arlot [7] (Chap. 4) showed that some penalties designed and justified in the homoscedastic case are suboptimal with respect to random penalties under heteroscedasticity.
However, a major drawback of these penalties is that they may turn out to be computationally expensive. Explicit formulas may be derived in some particular settings, such as exchangeable weights combined with regressograms (see [7], Chap. 6). Nevertheless, in more general situations such penalties can still be hard to compute. There is therefore a tradeoff between the gain we may expect from the use of resampling penalties and the time we are willing to spend on their computation.
Interest of overpenalization
The first step in model selection via penalization was made by Mallows [68] and Akaike [3]. Their strategy relies on the unbiased estimation of the risk associated with each model. Indeed, since we look for the minimizer of $\ell(s, \hat{s}_m)$ over $\mathcal{M}_n$, a plausible solution is to minimize any reliable estimator of this quantity. Although this strategy is effective when applied within each model $S_m$, it is no longer so as soon as the whole collection is concerned. Actually, unbiased estimation of the risk is not a sufficient condition to recover the best model, as we can check by considering Mallows' $C_p$ in the homoscedastic change-points detection framework with known variance $\sigma^2$: the criterion $C_p(m) = \gamma_n(\hat{s}_m) + 2\sigma^2 D_m / n$ is, up to an additive constant independent of $m$, an unbiased estimator of the risk $R(m) = \ell(s, s_m) + \sigma^2 D_m / n$, where $s_m$ denotes the best approximation of $s$ within $S_m$. Yet, the following inequality holds:
$$\mathbb{E}\Big[\inf_{m \in \mathcal{M}_n} C_p(m)\Big] \;\le\; \inf_{m \in \mathcal{M}_n} \mathbb{E}\big[C_p(m)\big] \;=\; \inf_{m \in \mathcal{M}_n} \mathbb{E}\big[\ell(s, \hat{s}_m)\big]$$ (up to the same additive constant), where the discrepancy between the two sides widens with the collection complexity. This does not prove anything by itself, since the resulting $\hat{m}$ could still behave very well. However, it highlights that minimizing an unbiased estimator of the risk does not necessarily provide the minimizer of the risk itself (nor even an approximation of it). For instance, Birgé and Massart [22] argue that $C_p$ is not suited to some situations because of high collection complexity.
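This gap is easy to observe numerically. The following toy Monte Carlo sketch uses an illustrative setting of our own choosing (a pure-noise Gaussian vector and one candidate model per coordinate, so that many models share the same dimension), not the thesis's change-points framework, and shows $\mathbb{E}[\inf_m C_p(m)]$ falling strictly below $\inf_m \mathbb{E}[C_p(m)]$.

```python
import numpy as np

rng = np.random.default_rng(0)

def optimism_demo(n=200, n_rep=2000):
    """Toy Monte Carlo sketch of E[inf_m Cp(m)] <= inf_m E[Cp(m)].

    Assumed setting: y ~ N(0, I_n), true signal s = 0, sigma = 1, and
    one model per coordinate, m_j = span(e_j), so all n models have
    dimension D_m = 1 and Cp(m_j) = (||y||^2 - y_j^2)/n + 2/n.
    """
    cps = np.empty((n_rep, n))
    for r in range(n_rep):
        y = rng.standard_normal(n)
        cps[r] = (y @ y - y**2) / n + 2.0 / n
    print("E[inf_m Cp(m)] ~", cps.min(axis=1).mean())   # systematically smaller
    print("inf_m E[Cp(m)] ~", cps.mean(axis=0).min())

optimism_demo()
```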
“Overpenalizing” (resp. “underpenalizing”) means using a penalty that is larger (resp. smaller) with high probability than $\mathrm{pen}_{\mathrm{id}}$. Note that both over- and under-penalization induce a bias of the resulting criterion with respect to the ideal one. However, underpenalization may lead to the choice of a model larger than the best one, and hence to very poor performance [22], which we would like to avoid.
The ideal penalty introduced by Arlot [7] is a random variable that fluctuates around its expectation. Concentration inequalities may provide, for each model $m$ and with high probability, an inequality of the following type: $$\mathbb{E}\big[\mathrm{pen}_{\mathrm{id}}(m)\big] - \varphi(m) \;\le\; \mathrm{pen}_{\mathrm{id}}(m) \;\le\; \mathbb{E}\big[\mathrm{pen}_{\mathrm{id}}(m)\big] + \psi(m),$$ where $\varphi$ and $\psi$ are both positive functions, increasing with the complexity of model $m$. Thus, overpenalizing consists in designing a penalty whose expectation upper bounds $\mathbb{E}\big[\mathrm{pen}_{\mathrm{id}}(m)\big] + \psi(m)$ uniformly over $\mathcal{M}_n$. Hence, the richer the collection of models, the larger the required amount of overpenalization.
About a possible reconciliation between estimation and selection
In his work, Shao [78] summarizes results about AIC and BIC, which are both optimal, but in apparently different ways. Indeed, whereas AIC is asymptotically efficient, BIC selects the true model (provided it belongs to the considered model collection) with probability tending to 1 as $n \to +\infty$ (consistency). We may then wonder whether it is possible to design a new criterion that would combine these two optimality properties. Several attempts have been made to design such a model selection criterion, for instance from MDL. Thus, under some conditions, Barron et al. [14] show that the latter behaves alternatively like BIC or AIC, depending on whether the data come from a parametric or a nonparametric model. Besides, in a Bayesian framework, Hansen and Yu [53, 54] propose a modification of MDL named gMDL, which relies on an F-statistic and appears as a compromise between BIC and an AIC-like penalty. By mixing AIC-based and BIC-based estimators, Yang [93] provides a predictor capturing the behaviour of the best of them.
Although some optimality properties may be shared by both AIC and BIC (see [78] for a review of various penalized criteria and their asymptotic properties), it turns out that these are not the right properties to look for. Indeed, Brown et al. [33] emphasize that asymptotic efficiency may be misleading, being a pointwise optimality criterion that strongly depends on the estimated function. We should rather prefer the adaptivity property satisfied by AIC for instance, which provides information about the uniform ability of a procedure to yield a reliable estimation.
Table of contents:
1 Introduction
1.1 Model selection via penalization
1.1.1 Model selection strategy
1.1.2 Efficiency and consistency
1.1.3 Complexity of the collection of models
1.2 Resampling and cross-validation
1.2.1 Risk estimation
1.2.2 Choice of p
1.2.3 Closed-form expressions
1.3 Optimal risk estimation
1.3.1 Bias-variance tradeoff
1.3.2 Multiple testing
1.4 Leave-p-out risk estimator as a penalized criterion
1.4.1 Random penalty
1.4.2 Oracle inequality and adaptivity
1.5 Change-points detection
2 Model selection
2.1 Model selection set-up
2.1.1 The model choice
2.1.2 Model collection
2.1.3 Estimation and selection
2.2 Several strategies of model selection
2.2.1 Model selection via penalization
2.2.2 Statistical tests
2.2.3 Bayes factor and posterior predictive loss
2.2.4 Model selection and multiple testing
2.3 Aggregation as an alternative to model selection
2.3.1 Aggregation for adaptation
2.3.2 Aggregation for improvement
3 Cross-validation
3.1 CV and resampling
3.1.1 Historical viewpoint
3.1.2 Resampling
3.2 What is CV?
3.2.1 CV heuristics
3.2.2 How to split the data?
3.3 Closed-form expressions
3.3.1 Preliminary calculations
3.3.2 Closed-form Lpo estimators
3.4 CV and penalized criteria
3.4.1 Moment calculations
3.4.2 CV as a random penalty
3.5 Proofs
3.5.1 Closed-form Lpo estimator
3.5.2 Moments calculations
3.5.3 Lpo penalty
4 Optimal risk estimation via cross-validation in density estimation
4.1 Abstract
4.2 Introduction
4.3 Closed-form expressions for the Lpo risk estimator
4.3.1 Framework
4.3.2 Cross-validation estimators
4.3.3 Explicit expression for the Lpo risk estimator
4.3.4 Closed formula of the bias and the variance for histograms
4.4 Reliability of estimators
4.4.1 Exact calculation of the gap between variances of Lpo and V-fold estimators
4.4.2 Choice of the parameter p
4.4.3 Adaptive selection procedure for histograms
4.5 Simulations and discussion
4.5.1 Influence of V on the choice of the optimal bandwidth
4.5.2 Density estimation by regular histogram
4.5.3 Multiple testing context and non-regular histograms
4.5.4 Density estimation by irregular histograms
4.5.5 Discussion
4.6 Appendix
4.6.1 Sketch of proof of Theorem 4.3.2
4.6.2 Proof of Proposition 4.3.1
4.6.3 Proof of Proposition 4.3.2
4.6.4 Proof of Proposition 4.3.3
4.6.5 Sketch of proof of Proposition 4.4.1
4.6.6 Proof of Theorem 4.4.1
5 Multiple testing
5.1 Abstract
5.2 Introduction
5.3 Estimation of the proportion of true null hypotheses
5.3.1 Mixture model
5.3.2 A leave-p-out based density estimator
5.3.3 Estimation procedure of π0
5.4 Asymptotic results
5.4.1 Pointwise convergence of the Lpo risk estimator
5.4.2 Consistency of $\hat{\pi}_0$
5.4.3 Asymptotic optimality of the plug-in MTP
5.5 Simulations and Discussion
5.5.1 Comparison in the usual framework (μ = 1)
5.5.2 Comparison in the U-shape case
5.5.3 Power
5.5.4 Discussion
5.6 Appendix
6 Model selection by cross-validation in density estimation
6.1 Introduction
6.2 Polynomial complexity
6.2.1 Overpenalization of the Lpo risk
6.2.2 Oracle inequality
6.2.3 Adaptivity result
6.3 Exponential complexity
6.3.1 Limitations of the CV approach
6.3.2 Oracle inequality
6.4 Proofs
6.4.1 Polynomial complexity
6.4.2 Exponential complexity
6.5 p as a regularization parameter
6.5.1 Overfitting in the polynomial framework
6.5.2 Simulations
6.6 Two-step algorithm
6.6.1 Description
6.6.2 Simulations
6.7 Discussion
7 Change-points detection via resampling
7.1 Introduction
7.2 Overview of the problem
7.2.1 Model selection view on change-points detection
7.2.2 Purpose and strategy
7.2.3 Non-asymptotic results
7.2.4 Overview of simulations
7.3 Resampling to take into account collection complexity
7.3.1 Theoretical considerations
7.3.2 Simulations
7.4 Resampling to choose the best model for each dimension
7.4.1 Theoretical considerations
7.4.2 Simulation setting
7.4.3 Study of the first step
7.4.4 Performances of 1*2VF
7.5 Open problems about complexity measures
7.6 Application to CGH microarray data
7.6.1 Biological context
7.6.2 Comparison with other algorithms
7.6.3 Results
7.7 Conclusion
7.7.1 Main results
7.7.2 Prospects
7.8 Appendix
7.8.1 Complexity measure
7.8.2 Best model choice for each dimension