Concentration Inequalities and Applications to Empirical Risk Minimization 


Learning from Survey Training Samples

This subsection is a summary of Chapter 2. We place ourselves in the context of binary classification, when the observations used to train a classifier are drawn by means of a sampling/survey scheme and exhibit a complex dependence structure. We consider $(X_1, Y_1), \ldots, (X_n, Y_n)$, a sample of independent copies of $(X, Y)$ observed on a finite population $\mathcal{I}_n := \{1, \ldots, n\}$. We call a survey sample of (possibly random) size $N \leq n$ of the population $\mathcal{I}_n$ any subset $s := \{i_1, \ldots, i_{N(s)}\} \in \mathcal{P}(\mathcal{I}_n)$ with cardinality $N =: N(s)$ less than $n$. A sampling scheme is defined by a probability distribution $R_n$ on the set of all possible samples $s \in \mathcal{P}(\mathcal{I}_n)$, conditionally on the observations $\mathcal{D}_n = \{(X_i, Y_i) : i \in \mathcal{I}_n\}$. The probability that the unit $i$ belongs to a random sample $S$ drawn from the conditional distribution $R_n$ is called the first order inclusion probability and is denoted by $\pi_i = \mathbb{P}_{R_n}\{i \in S\}$. We set $\boldsymbol{\pi}_n = (\pi_1, \ldots, \pi_n)$. An observed sample $S$ is fully determined by the r.v. $\epsilon_n = (\epsilon_1, \ldots, \epsilon_n)$, where $\epsilon_i = \mathbb{I}\{i \in S\}$ for $1 \leq i \leq n$. Most available results (see Boucheron et al. (2005c) for instance) deal with the case where the full dataset $\mathcal{D}_n$ is at our disposal. However, this is not the case here, as we only observe a subset of the observations. Therefore, these results are not directly applicable to our problem, in particular because of the dependence structure induced by the sampling scheme. Nevertheless, we show that the theory of ERM can be extended to the case where statistical learning is based on observations obtained via survey samples.
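To fix ideas, the quantity at stake is the Horvitz-Thompson version of the empirical risk, in which each sampled unit is reweighted by the inverse of its first order inclusion probability. Below is a minimal Python sketch of this estimator; the function name, the Poisson sampling plan used in the toy example and the random classifier outputs are illustrative assumptions of ours, not code from the thesis.

```python
import numpy as np

def ht_empirical_risk(predictions, labels, sampled, pi):
    """Horvitz-Thompson estimate of the classification risk.

    predictions, labels : arrays of length n over the full population
    sampled : boolean mask (the epsilon_i's), True iff unit i is in S
    pi      : first order inclusion probabilities (all pi_i > 0)

    Each sampled unit's 0-1 loss is weighted by 1/pi_i, which makes the
    estimator unbiased conditionally on the observations.
    """
    n = len(labels)
    errors = (predictions != labels).astype(float)
    return np.sum(sampled * errors / pi) / n

# Toy usage: Poisson sampling, where unit i enters S with probability pi_i.
rng = np.random.default_rng(0)
n = 1000
pi = rng.uniform(0.2, 0.8, size=n)        # first order inclusion probabilities
labels = rng.integers(0, 2, size=n)
predictions = rng.integers(0, 2, size=n)  # some classifier's outputs
sampled = rng.uniform(size=n) < pi        # epsilon_i = I{i in S}
print(ht_empirical_risk(predictions, labels, sampled, pi))
```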

A Non Uniform Sampling Strategy for SGD

In order to speed up the learning process, we introduce a specific variant of the SGD algorithm with an adaptive sampling scheme, in the sense that the sampling distribution may vary with the iteration $t$, depending on the past iterations. We consider non uniform sampling with replacement. We first identify a good sampling distribution, namely the one minimizing the variance of the gradient estimator. When drawing a sample $S$ with replacement according to probabilities $p_1, \ldots, p_n$, the variance of the resulting importance-weighted gradient estimator depends on the chosen distribution. To achieve the best estimation of the gradient (i.e. to minimize this variance) at parameter $\theta$, conditionally upon the observations, it is therefore natural to sample observation $Z_i$ with probability $p_i^*(\theta) = \|\nabla \ell(Z_i, \theta)\| / \sum_{j=1}^{n} \|\nabla \ell(Z_j, \theta)\|$. Unfortunately, a practical implementation of the above sampling scheme is not pertinent, because it requires evaluating all the gradients in order to compute the norms $\|\nabla \ell(Z_1, \theta_t)\|, \ldots, \|\nabla \ell(Z_n, \theta_t)\|$ at each iteration, which is precisely what we are trying to avoid when using SGD. We therefore propose a sampling scheme approximating $p_t^* := (p_i^*(\theta_t))_{i=1}^{n}$ without requiring any additional gradient evaluations, by reusing old values of the gradients. More precisely, the main idea is to replace each unknown gradient norm $\|\nabla \ell(Z_i, \theta_t)\|$ by the (possibly outdated) norm $g_{t,i} = \|\nabla \ell(Z_i, \theta_k)\|$ computed at the former instant $k = k(i, t)$, corresponding to the last time $k \leq t$ at which $Z_i$ was picked. More formally, we define the random sequence $g_t = (g_{t,1}, \ldots, g_{t,n})$ accordingly.
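For concreteness, here is a minimal Python sketch of this adaptive scheme under simplifying assumptions: a least-squares loss, a sampling distribution mixed with the uniform one for numerical stability, and function and variable names of our own choosing. It is a sketch of the idea, not the thesis's AS-SGD implementation, which in particular relies on dedicated data structures for computationally efficient sampling.

```python
import numpy as np

def adaptive_sampling_sgd(X, y, steps=1000, lr=0.05, rng=None):
    """SGD with non uniform sampling with replacement: unit i is drawn with
    probability roughly proportional to g_{t,i}, the (possibly outdated)
    gradient norm recorded the last time i was picked.

    Assumes the least-squares loss l(Z_i, theta) = (x_i^T theta - y_i)^2 / 2,
    whose gradient is (x_i^T theta - y_i) x_i.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    theta = np.zeros(d)
    g = np.ones(n)  # outdated gradient norms g_{t,i}, optimistic initialization
    for _ in range(steps):
        # Mix with the uniform distribution (our safeguard, keeping the
        # importance weights 1/(n p_i) bounded by 2).
        p = 0.5 * g / g.sum() + 0.5 / n
        i = rng.choice(n, p=p)
        grad = (X[i] @ theta - y[i]) * X[i]
        # The weight 1/(n p_i) makes the sampled gradient an unbiased
        # estimate of the full empirical gradient.
        theta -= lr * grad / (n * p[i])
        g[i] = max(np.linalg.norm(grad), 1e-12)  # refresh g_{t,i} for unit i
    return theta

# Toy usage on synthetic linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + 0.1 * rng.normal(size=200)
theta_hat = adaptive_sampling_sgd(X, y, rng=rng)
print(np.linalg.norm(theta_hat - theta_star))
```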

Sampling Schemes and Horvitz-Thompson Estimation

Let $n \geq 1$. In the standard superpopulation framework we consider, $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a sample of independent copies of $(X, Y)$ observed on a finite population $\mathcal{I}_n := \{1, \ldots, n\}$. We call a survey sample of (possibly random) size $N \leq n$ of the population $\mathcal{I}_n$ any subset $s := \{i_1, \ldots, i_{N(s)}\} \in \mathcal{P}(\mathcal{I}_n)$ with cardinality $N =: N(s)$ less than $n$. A sampling design is determined by a conditional probability distribution $R_n$ on the set of all possible samples $s \in \mathcal{P}(\mathcal{I}_n)$ given the original data $\mathcal{D}_n = \{(X_i, Y_i) : i \in \mathcal{I}_n\}$. For any $i \in \{1, \ldots, n\}$, the first order inclusion probability $\pi_i = \mathbb{P}_{R_n}\{i \in S\}$ is the probability that the unit $i$ belongs to a random sample $S$ drawn from the conditional distribution $R_n$. We set $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$.
The second order inclusion probabilities are denoted by $\pi_{i,j} = \mathbb{P}_{R_n}\{(i, j) \in S^2\}$, for any $i \neq j$ in $\{1, \ldots, n\}^2$. The information related to the observed sample $S \subset \{1, \ldots, n\}$ is fully enclosed in the r.v. $\epsilon_n = (\epsilon_1, \ldots, \epsilon_n)$, where $\epsilon_i = \mathbb{I}\{i \in S\}$ for $1 \leq i \leq n$. The one-dimensional marginal conditional distributions of the sampling scheme $\epsilon_n$ given $\mathcal{D}_n$ are the Bernoulli distributions $\mathcal{B}(\pi_i) = \pi_i \delta_1 + (1 - \pi_i) \delta_0$, $1 \leq i \leq n$, and the covariance matrix $\Gamma_n$ of the r.v. $\epsilon_n$ has entries given by $\Gamma_n(i, j) = \pi_{i,j} - \pi_i \pi_j$, with $\pi_{i,i} = \pi_i$ by convention, for $1 \leq i, j \leq n$. Observe that, equipped with the notations above, $\sum_{1 \leq i \leq n} \epsilon_i = N(S)$. One may refer to Cochran (1977) and Deville (1987) for accounts of survey sampling techniques. Notice also that, in many applications, the inclusion probabilities are built using some extra information, typically by means of auxiliary random variables $W_1, \ldots, W_n$ defined on $(\Omega, \mathcal{A}, \mathbb{P})$ and taking their values in some measurable space $\mathcal{W}$: $\forall i \in \{1, \ldots, n\}$, $\pi_i = N h(W_i) / \sum_{1 \leq j \leq n} h(W_j)$, where $N \max_{1 \leq i \leq n} h(W_i) \leq \sum_{1 \leq i \leq n} h(W_i)$ almost surely and $h : \mathcal{W} \to (0, +\infty)$ is a measurable link function. The $(X_i, Y_i, W_i)$'s are generally supposed to be i.i.d. copies of a generic r.v. $(X, Y, W)$. See Särndal & Swensson (2003) for more details. For simplicity, the $\pi_i$'s are supposed to be deterministic in the subsequent analysis, which boils down to carrying out the study conditionally upon the $W_i$'s in the aforementioned example.
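As a small illustration of this last construction, the following Python sketch builds the first order inclusion probabilities from auxiliary variables through a link function and checks the almost-sure condition that guarantees $\pi_i \leq 1$. The particular link function and the names used here are illustrative assumptions of ours, not taken from the thesis.

```python
import numpy as np

def inclusion_probabilities(W, h, N):
    """First order inclusion probabilities pi_i = N h(W_i) / sum_j h(W_j).

    Well defined (every pi_i <= 1) as long as
    N * max_i h(W_i) <= sum_j h(W_j).
    """
    hw = h(W)
    if N * hw.max() > hw.sum():
        raise ValueError("N max h(W_i) exceeds sum h(W_j): some pi_i > 1")
    return N * hw / hw.sum()

# Toy usage: n = 100 units, a positive link function, expected size N = 10.
rng = np.random.default_rng(0)
W = rng.uniform(size=100)                       # auxiliary variables
pi = inclusion_probabilities(W, lambda w: 1.0 + w, N=10)
# The pi_i's sum to N, the expected sample size under e.g. Poisson sampling.
print(pi.sum())
```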


Table of contents:

List of Contributions
List of Figures
List of Tables
1 Introduction 
1.1 Motivation
1.2 Learning from Survey Training Samples
1.3 Sampling Strategies for Stochastic Gradient Descent (SGD)
1.3.1 A Non Uniform Sampling Strategy for SGD
1.3.2 Horvitz-Thompson Gradient Descent (HTSGD) and Applications to M-Estimation
1.3.3 Stochastic Gradient Descent Algorithms based on Incomplete U-Statistics
1.4 Fast Learning Rates for Graph Reconstruction
I Learning from Survey Training Samples 
2 Learning from Survey Training Samples 
2.1 Introduction
2.2 Background and Preliminaries
2.2.1 Binary Classification – Empirical Risk Minimization Theory
2.2.2 Sampling Schemes and Horvitz-Thompson Estimation
2.3 Main Results
2.3.1 Horvitz-Thompson Empirical Risk Minimization in the Rejective Case
2.3.2 Extensions to More General Sampling Schemes
2.4 Illustrative Numerical Experiments
2.5 Conclusion
2.6 Technical Proofs
2.6.1 Proof of Theorem 2
2.6.2 Bernstein’s Inequality for Sums of Negatively Associated Random Variables
2.7 On Biased HT Risk Minimization
2.8 Sampling Training Data – Technical Details
2.8.1 Further Details on the Rejective Scheme
2.8.2 Examples of Sampling Plan with Negatively Associated Random Variables
II Sampling Strategies for Stochastic Gradient Descent 
3 Adaptive Sampling Scheme for Incremental Optimization using Stochastic Gradient Descent Algorithm 
3.1 Introduction
3.2 Non Uniform Sampling (NUS) – State of the Art
3.3 Adaptive Sampling SGD (AS-SGD)
3.3.1 The Algorithmic Principle
3.3.2 Ideal Sampling Distribution
3.3.3 A Practical Sampling Distribution – Our Proposal
3.3.4 Computationally Efficient Sampling
3.4 Performance Analysis
3.4.1 Preliminary Results
3.4.2 Main Results
3.5 Numerical Experiments
3.6 Conclusion
3.7 Algorithms for Efficient NUS
3.8 Technical Proofs
3.8.1 Proof of Lemma 3.1
3.8.2 Proof of Lemma 3.3
3.8.3 Proof of Theorem 3.4
4 Horvitz-Thompson Stochastic Gradient Descent: Application to M-Estimation 
4.1 Introduction
4.2 Theoretical Background and Preliminaries
4.2.1 Iterative M-Estimation and SGD Methods
4.2.2 Survey Sampling and Horvitz-Thompson Estimation
4.3 The Horvitz-Thompson Gradient Descent
4.4 Conditional Asymptotic Analysis – Main Results
4.4.1 Poisson Schemes with Unequal Inclusion Probabilities
4.4.2 Limit Theorems – Conditional Consistency and Asymptotic Normality
4.4.3 Asymptotic Covariance Optimization in the Poisson Case
4.4.4 Extensions to More General Poisson Survey Designs
4.5 Unconditional Asymptotic Analysis
4.6 Illustrative Numerical Experiments
4.6.1 Linear logistic regression
4.6.2 The Symmetric Model
4.7 Conclusion
4.8 Technical Proofs
4.8.1 Proof of Theorem 4.3
4.8.2 Proof of Theorem 4.4
4.8.3 Proof of Proposition 4.6
4.8.4 Proof of Proposition 4.7
4.8.5 Proof of Proposition 4.8
4.8.6 An Intermediary Result
4.8.7 Proof of Theorem 4.9
4.8.8 Rate Bound Analysis
5 Stochastic Gradient Descent based on incomplete U-Statistics 
5.1 Background and Problem Setup
5.1.1 U-statistics: Definition and Examples
5.2 SGD Implementation based on Incomplete U-Statistics
5.2.1 Monte-Carlo Estimation of the Empirical Gradient
5.3 Generalization Bounds
5.4 Numerical Experiments
5.5 Conclusion and Perspectives
5.6 Technical Proofs
5.6.1 Proof of Proposition 5.4
5.6.2 Proof of Theorem 5.5
5.6.3 Proof of Theorem B.15
III Fast Learning Rates for Graph Reconstruction 
6 On Graph Reconstruction via Empirical Risk Minimization 
6.1 Introduction
6.2 Background and Preliminaries
6.2.1 A Probabilistic Setup for Preferential Attachment
6.2.2 Related Results on Empirical Risk Minimization
6.3 Empirical Reconstruction is Always Fast!
6.4 Scaling-up Empirical Risk Minimization
6.4.1 Extensions to Alternative Sampling Schemes
6.5 Numerical Experiments
6.5.1 Synthetic Graph
6.5.2 Real Network
6.6 Conclusion
6.7 Technical Proofs
6.7.1 Proof of Lemma 6.4
6.7.2 Proof of Lemma 6.5
6.7.3 Proof of Lemma 6.10
6.7.4 Proof of Lemma 6.11
6.7.5 Proof of Theorem 6.1
6.7.6 Proof of Theorem 6.6
6.7.7 Proof of Theorem 6.7
6.7.8 Proof of Proposition 6.8
7 Conclusion, Limitations & Perspectives
A Concentration Inequalities and Applications to Empirical Risk Minimization 
A.1 Vapnik-Chervonenkis's Inequality and the Method of Bounded Differences
A.2 Concentration Inequalities for Empirical Risk Minimization
A.2.1 Inequality for Bounded Random Variables (Azuma-Hoeffding and McDiarmid)
A.2.2 Bernstein-type Inequality (with Variance Term)
IV Summary of the Contributions in French
B Summary of the Contributions in French 
B.1 Motivation
B.2 Learning from Survey Data
B.3 Sampling Strategies for the Stochastic Gradient Descent Algorithm
B.3.1 A Non Uniform Sampling Strategy for SGD
B.3.2 Horvitz-Thompson Gradient Descent (HTSGD) and Applications to M-Estimation
B.3.3 SGD for the Minimization of U-Statistics
B.4 Fast Rates for Graph Reconstruction
Bibliography 
