Two probabilistic views on rewards maximization in bandit models 

Discounted and Finite-Horizon Gittins indices

Gittins’ theorem, stated in the seminal paper [Gittins, 1979], says that an index policy, based on what were later called Gittins indices, is a solution to several sequential allocation problems when the objective is to maximize the expected sum of discounted rewards. This larger class of problems can be interpreted as planning problems in Markov Decision Processes with a particular structure, and is presented in the book [Gittins et al., 2011]. Here we introduce Gittins indices only in the context of (Bayesian) multi-armed bandit models with independent arms.
A possible definition of the Gittins indices (like the one given by [Berry and Fristedt, 1985]) relies on the introduction of a calibration problem for each arm, that we call C_λ. This calibration problem is sometimes called a ‘one-armed bandit problem’. Let θ ∼ π and, conditionally on θ, let (X_t) be an i.i.d. sequence with distribution ν_θ: this is a one-armed bandit. For λ ∈ ℝ, one considers the following game, denoted by C_λ. At each time t, an agent can choose between receiving a (known) reward λ or drawing the unknown arm and receiving a random reward drawn from ν_θ. His goal is to maximize his rewards with respect to one of the criteria previously considered: either the sum of rewards up to horizon T, or the sum of discounted rewards with a discount factor γ. This game can naturally be expressed as a planning problem in an MDP, and by writing the associated dynamic programming equation, one can show that the optimal policy is a stopping policy: the unknown arm is played until some stopping time, after which the reward λ is chosen until the end of the game. The larger the value of λ, the less incentive the player has to play the unknown arm. There exists a critical value λ* such that for larger values of λ, the optimal strategy in C_λ never draws the unknown arm (and always chooses λ). This critical value, which represents the price worth paying for playing the arm, is the Gittins index.
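As an illustration of this calibration problem, here is a minimal sketch (not taken from the thesis) that solves the finite-horizon version of C_λ for a single Bernoulli arm with a Beta prior, by backward induction on the dynamic programming equation, and recovers the Finite-Horizon Gittins index as the critical value λ* by bisection. The Beta-Bernoulli setting and all function names are illustrative assumptions.

```python
# A minimal sketch of the finite-horizon calibration problem C_lambda for a
# single Bernoulli arm with a Beta(a, b) prior. The value of the game is
# computed by backward induction, and the FH-Gittins index is the critical
# lambda at which stopping and playing are equally attractive (bisection).
from functools import lru_cache


def value_of_game(a: int, b: int, r: int, lam: float) -> float:
    """Optimal value of C_lambda with r rounds left and posterior Beta(a, b)."""

    @lru_cache(maxsize=None)
    def v(a_, b_, r_):
        if r_ == 0:
            return 0.0
        p = a_ / (a_ + b_)  # posterior mean of the unknown arm
        # Once the known reward lam is preferred, the posterior never changes,
        # so stopping is worth lam for each of the r_ remaining rounds.
        stop = lam * r_
        # Playing: reward 1 w.p. p (posterior -> Beta(a+1, b)), else reward 0.
        play = p * (1.0 + v(a_ + 1, b_, r_ - 1)) + (1.0 - p) * v(a_, b_ + 1, r_ - 1)
        return max(stop, play)

    return v(a, b, r)


def fh_gittins_index(a: int, b: int, horizon: int, tol: float = 1e-6) -> float:
    """Critical lambda* above which never playing the arm is optimal in C_lambda."""
    lo, hi = 0.0, 1.0  # Bernoulli rewards, so the index lies in [0, 1]
    while hi - lo > tol:
        lam = (lo + hi) / 2.0
        if value_of_game(a, b, horizon, lam) > lam * horizon + 1e-12:
            lo = lam  # playing the arm is still strictly better: index > lam
        else:
            hi = lam
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Index of an arm with a uniform prior Beta(1, 1) and 20 remaining rounds:
    # it exceeds the posterior mean 0.5, reflecting the value of exploration.
    print(fh_gittins_index(1, 1, horizon=20))
```

The stopping structure of the optimal policy appears directly in the recursion: once the known reward λ is preferred, the posterior no longer evolves, so stopping is worth exactly λ times the number of remaining rounds.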
In the discounted one-armed bandit, the optimal policy plays the unknown arm as long as the current posterior distribution π_t on the parameter θ is such that G_γ(π_t) > λ, where G_γ(π) is the Gittins index, defined in the following way.
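For completeness, the standard expression of the discounted Gittins index (consistent with [Gittins et al., 2011], though not necessarily the exact display used in the thesis) is a supremum over stopping times τ ≥ 1 adapted to the filtration generated by the rewards:

```latex
% Discounted Gittins index of an arm with current posterior \pi and
% discount factor \gamma, as a supremum over stopping times \tau \ge 1.
\[
  G_\gamma(\pi) \;=\; \sup_{\tau \ge 1}\;
  \frac{\mathbb{E}_\pi\!\left[\sum_{t=1}^{\tau} \gamma^{t-1} X_t\right]}
       {\mathbb{E}_\pi\!\left[\sum_{t=1}^{\tau} \gamma^{t-1}\right]} .
\]
```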

Approximation of the FH-Gittins indices

[Chang and Lai, 1987] develop approximations of the classical (discounted) Gittins indices for Gaussian bandit models with a Gaussian prior distribution, when the discount factor is close to 1. To do so, they approximate the solution of each calibration problem C_λ by the solution of an optimal stopping problem for a Brownian motion with drift, with some Gaussian prior distribution on the drift. They mention that a similar approach can be adopted to approximate the Finite-Horizon Gittins indices in the Gaussian case, which we make explicit here.

Asymptotically optimal algorithms with respect to the Bayes risk

We saw that in the Bayesian framework, for a given prior distribution, the bandit problem with finite horizon T has an exact solution, which we denote by ADP (since it is the solution of a Dynamic Programming equation). However, we did not obtain an expression for the Bayes risk of this optimal solution, BR_0(T; ADP). In the particular case of exponential bandit models, [Lai, 1987] provides a prior-dependent, asymptotic lower bound on the Bayes risk of any bandit algorithm, as well as an algorithm matching this bound. Hence, the Bayes risk of the Bayesian optimal strategy in such bandit models must grow at the rate (log T)² specified by the lower bound of [Lai, 1987]. In this section, we see that the Bayes risk of the KL-UCB algorithm, discussed in Section 1.2.3, almost matches this lower bound (Proposition 15), and we discuss possible variants of this algorithm, one of them, KL-UCB-H+, being asymptotically optimal with respect to the Bayes risk.

Theorem 1.14 below is a rewriting of Theorem 3 of [Lai, 1987] in the particular case of a product prior. It holds under an extra assumption on the prior distribution: Assumption 1, already used in the previous section. The prior distribution on each arm must be supported on some interval ]θ⁻, θ⁺[ of natural parameters, on which the variance of the arms, b̈(θ), is bounded (see (1.24)). Recall that ]μ⁻, μ⁺[ = ḃ(]θ⁻, θ⁺[). For Gaussian distributions with known variance, one can choose ]θ⁻, θ⁺[ = ]μ⁻, μ⁺[ = ℝ, whereas for Bernoulli distributions, this assumption is equivalent to considering that all the means belong to an interval of the form ]p, 1 − p[.
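To make the index concrete, here is a minimal sketch (an illustration, not the thesis's code) of the KL-UCB index for Bernoulli arms: the upper confidence bound is the largest mean q whose Kullback-Leibler divergence from the empirical mean stays below an exploration rate, computed by bisection. The KL-UCB-H+ variant mentioned above is commonly presented as replacing the exploration rate log(t) by the horizon-dependent rate log(T/N_a(t)); this is how it is sketched in the usage example, and is stated here as an assumption about its standard presentation rather than a quotation of the thesis.

```python
# A minimal sketch of the KL-UCB index for Bernoulli rewards. The index of
# arm a at time t is the largest q such that
#     N_a(t) * kl(mu_hat_a(t), q) <= exploration_rate,
# where kl is the Bernoulli Kullback-Leibler divergence. KL-UCB uses
# exploration_rate = log(t); the KL-UCB-H+ variant is commonly described
# as using log(T / N_a(t)) instead (an assumption, see the lead-in above).
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mu_hat: float, n_pulls: int, exploration: float, tol: float = 1e-6) -> float:
    """Largest q in [mu_hat, 1] with n_pulls * kl(mu_hat, q) <= exploration,
    found by bisection (kl(mu_hat, .) is increasing on [mu_hat, 1])."""
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        q = (lo + hi) / 2.0
        if n_pulls * kl_bernoulli(mu_hat, q) <= exploration:
            lo = q
        else:
            hi = q
    return lo


if __name__ == "__main__":
    # An arm observed 10 times with empirical mean 0.3, at time t = 100,
    # for a horizon T = 1000:
    print(kl_ucb_index(0.3, 10, exploration=math.log(100)))        # KL-UCB rate log(t)
    print(kl_ucb_index(0.3, 10, exploration=math.log(1000 / 10)))  # KL-UCB-H+-style rate log(T / N_a)
```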

Table of contents:

Introduction and presentation of the results (in French)
1 Presentation of the bandit problems studied
1.1 Reward maximization: performance measure and objectives
1.2 Best arm identification: performance measure and objectives
2 Bayesian algorithms for reward maximization
2.1 Two probabilistic approaches to a bandit problem
2.2 The Bayes-UCB and Thompson Sampling algorithms
2.3 Bayesian algorithms for more general models
3 Towards optimal frequentist algorithms for best arm identification
3.1 A lower bound on the complexity in the fixed-confidence setting
3.2 Two algorithms: KL-LUCB and KL-Racing
3.3 Characterization of the complexity for two-armed bandit models
4 Organization of the document
1 Two probabilistic views on rewards maximization in bandit models 
1.1 Introduction
1.2 The frequentist approach
1.2.1 Lower bounds on the regret
1.2.2 Examples of bandit models and associated tools to build bandit algorithms
1.2.3 Asymptotically optimal algorithms
1.3 The Bayesian approach
1.3.1 Some examples of Bayesian bandit models
1.3.2 Discounted and Finite-Horizon Gittins indices
1.3.3 Index policies using Gittins indices
1.3.4 Approximation of the FH-Gittins indices
1.3.5 Asymptotically optimal algorithms with respect to the Bayes risk
1.4 Numerical study and conclusions
1.5 Elements of proof
1.5.1 Changes of distribution: proof of Lemma 1.3
1.5.2 On Gittins’ theorem: proof of Theorem 1.13
1.5.3 Proofs of Bayes risk bounds
2 Bayes-UCB 
2.1 Introduction
2.2 The Bayes-UCB algorithm
2.3 Analysis of the Bayes-UCB algorithm for binary rewards
2.3.1 Asymptotic optimality and links with frequentist algorithms
2.3.2 Bayes-UCB beyond Bernoulli distributions
2.3.3 Finite-time analysis
2.4 Numerical experiments
2.4.1 Binary bandits
2.4.2 Gaussian rewards with unknown means and variances
2.4.3 Sparse linear bandits
2.5 Elements of proof
2.5.1 Proof of Lemma 2.2
2.5.2 Proof of Lemma 2.6
2.5.3 Proof of Lemma 2.7
3 Thompson Sampling 
3.1 Introduction
3.2 Finite-time analysis of Thompson Sampling for binary bandits
3.2.1 Sketch of Analysis
3.2.2 Proof of Theorem 3.4
3.2.3 Proof of Proposition 3.2: Exploiting the randomized nature of Thompson Sampling
3.3 Thompson Sampling for Exponential families
3.3.1 Thompson Sampling with Jeffreys’ prior for general one-parameter canonical exponential families
3.3.2 Main result and sketch of the proof
3.4 Numerical experiments and discussion
3.4.1 Regret of Thompson Sampling
3.4.2 Bayes risk of Thompson Sampling
3.4.3 Thompson Sampling in more general frameworks
3.5 Elements of proof
3.5.1 Proof of Lemma 3.8
3.5.2 Proof of Lemma 3.9
4 Bayesian algorithms for linear contextual bandits 
4.1 Introduction
4.2 Bayesian and frequentist confidence regions
4.3 The Bayes-UCB algorithm and a generalization
4.3.1 The algorithms
4.3.2 Bayesian analysis of Bayes-UCB and Bayes-LinUCB
4.3.3 Comparison with other optimistic algorithms
4.4 Thompson Sampling
4.4.1 The algorithm
4.4.2 A Bayesian analysis of Thompson Sampling
4.4.3 A frequentist analysis of Thompson Sampling
4.5 Numerical experiments
4.6 Elements of proof
4.6.1 Proof of Lemma 4.3
4.6.2 Proof of Theorem 4.6
5 Refined frequentist tools for best arm identification 
5.1 Introduction
5.2 Algorithms: KL-LUCB and KL-Racing
5.2.1 Two classes of algorithms based on confidence intervals
5.2.2 Analysis of KL-Racing and KL-LUCB
5.2.3 Numerical experiments
5.2.4 Proofs of the theorems of Section 5.2
5.3 Generic lower bound on the complexity in the fixed-confidence setting
5.4 The complexity of A/B Testing
5.4.1 Lower bounds on the two complexities
5.4.2 The Gaussian Case
5.4.3 The Bernoulli Case
5.4.4 Numerical experiments
5.4.5 Proof of Theorem 5.15 and Theorem 5.16
5.5 Conclusions and future work
5.6 Elements of proof
5.6.1 A useful technical lemma
5.6.2 Proof of Lemma 5.4
5.6.3 Proof of Proposition 5.18
5.6.4 Proof of Lemma 5.21
Conclusions and perspectives
Appendix 
