Semi-bandit feedback
Here we formalize the CMAB problem. The outcome of arm $i \in [n]$ in round $t$ is denoted $X_{i,t} \in \mathbb{R}$. The random vectors $(X_t)_{t\ge 1}$ in $\mathbb{R}^n$ are i.i.d., with an unknown distribution $\mathbb{P}_X$ of mean $\mu^* \in \mathbb{R}^n$. Nevertheless, unless otherwise stated, $X_i$ and $X_j$ for two distinct arms $i \ne j$ can be arbitrarily correlated. The action (also called super-arm) selected by the agent at round $t$, i.e., the set of arms selected, is denoted $A_t$. The set of possible super-arms is called the action space and is denoted $\mathcal{A}$. It is a fixed subset of $\mathcal{P}([n])$ such that each of its elements $A$ contains at most $m$ arms. After selecting the action $A_t$ in round $t$, the agent receives as feedback $e_{A_t} \odot X_t$, where $e_A \in \{0,1\}^n$ denotes the indicator vector of $A$ and $\odot$ the component-wise product. Although we consider a more general framework in the thesis, in order to simplify the exposition we restrict ourselves here to the case of a linear reward function: the reward at round $t$ is therefore $e_{A_t}^\top X_t$. We also assume that $\mathcal{A}$ is such that linear functions can be optimized over it efficiently (in time polynomial in $n$). The objective is to identify a policy that maximizes the expected cumulative reward over the $T$ rounds, where the expectation is taken over the randomness of the outcomes and the possible randomness of the policy followed by the agent (an action can be chosen at random). Equivalently, we aim at designing a policy $\pi$ that minimizes the regret, defined as
$$R_T(\pi) \triangleq \max_{A\in\mathcal{A}} \mathbb{E}\left[\sum_{t=1}^{T}(e_A - e_{A_t})^\top X_t\right].$$
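As a concrete picture of this interaction protocol, here is a minimal Python sketch of the semi-bandit loop with Bernoulli outcomes and a deliberately naive uniform-random policy; the toy action space, the means, and the Bernoulli choice are illustrative assumptions, and the quantity printed is the pseudo-regret of that naive policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 6, 10_000
mu_star = rng.uniform(0.1, 0.9, size=n)                  # unknown means (toy values)
action_space = [(0, 1), (2, 3), (4, 5), (0, 4), (1, 3)]  # toy A, each |A| <= m = 2
best_mean = max(mu_star[list(A)].sum() for A in action_space)

total_reward, cum_mean = 0.0, 0.0
for t in range(T):
    X_t = rng.binomial(1, mu_star)                       # i.i.d. outcome vector X_t
    A_t = action_space[rng.integers(len(action_space))]  # placeholder: uniform-random policy
    feedback = X_t[list(A_t)]                            # semi-bandit feedback: chosen arms only
    total_reward += feedback.sum()                       # linear reward e_{A_t}^T X_t
    cum_mean += mu_star[list(A_t)].sum()

pseudo_regret = T * best_mean - cum_mean                 # R_T of this naive policy
print(round(pseudo_regret, 1))
```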
Contrary to the classical bandit problem, there is a notion of similarity between actions: two "close" actions, i.e., actions having many arms in common, tend to have a "close" reward. This structure can be exploited in the learning process in order to counterbalance the combinatorial size of the action space. This results in the design of efficient super-arm selection policies, based on the principle of optimism in the face of uncertainty, whose regret is of the order of $\mathcal{O}(n m \log(T)/D)$ (Kveton, Wen, Ashkan, and Szepesvari, 2015b), where $D > 0$ is the minimum difference between the mean reward of an optimal super-arm and that of a sub-optimal super-arm. These policies asymptotically reach a lower bound: considering the action space $\mathcal{A} = \{A_1, \dots, A_{n/m}\}$ with $A_k = \{(k-1)m+1, \dots, km\}$, and setting, for all $k$, $X_{(k-1)m+1} = \dots = X_{km}$, drawn according to a Gaussian of variance 1, we reduce the problem to a classical bandit problem, allowing us to apply the bound (1.3) with $\sigma_i^2 = m^2$. The best-known policy of this kind is cucb (Combinatorial Upper Confidence Bound), which is heavily inspired by ucb. Indeed, it plays at round $t$ the action
$$A_t \in \operatorname*{arg\,max}_{A\in\mathcal{A}} \sum_{i\in A}\left(\overline{\mu}_{i,t-1} + \sqrt{\frac{2\log(t)}{N_{i,t-1}}}\right),$$
where $\overline{\mu}_{i,t-1}$ is the empirical mean of the outcomes of arm $i$ and $N_{i,t-1}$ is the number of times arm $i$ has been observed before round $t$.
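Below is a minimal sketch of such a cucb-style policy; the brute-force maximization over an explicitly enumerated action space stands in for the efficient linear-optimization oracle assumed above, and the initialization phase (playing any action containing a never-observed arm) is one simple convention among several.

```python
import numpy as np

def cucb(action_space, sample_outcome, n, T, seed=0):
    """Sketch of cucb: one optimistic index per arm, then exact maximization of the
    resulting linear objective over the action space (here by brute force; in general
    this step would be delegated to a problem-specific linear oracle)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n)             # N_{i,t-1}: number of observations of arm i
    means = np.zeros(n)              # empirical means of the arms
    chosen = []
    for t in range(1, T + 1):
        untried = [A for A in action_space if any(counts[i] == 0 for i in A)]
        if untried:                  # initialization: observe every arm at least once
            A_t = untried[0]
        else:
            index = means + np.sqrt(2.0 * np.log(t) / counts)   # per-arm optimistic index
            A_t = max(action_space, key=lambda A: index[list(A)].sum())
        X_t = sample_outcome(rng)    # full outcome vector; only selected arms are used
        for i in A_t:                # semi-bandit update of the selected arms
            counts[i] += 1
            means[i] += (X_t[i] - means[i]) / counts[i]
        chosen.append(A_t)
    return chosen
```

Plugged into the simulation loop above in place of the uniform-random choice, this optimistic index makes the cumulative regret grow logarithmically in $T$ rather than linearly.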
A brief look at non-linear rewards and budget constraints
Generally, it is assumed that the expected reward when the agent chooses an action $A$ is of the form $r(A; \mu^*)$. If this is the case, the function $r$ is called the reward function. It is linear when $r(A; \mu^*) = e_A^\top \mu^*$. It is assumed here, for simplicity, that for a fixed vector $\mu$, the function $A \mapsto r(A; \mu)$ can be maximized exactly and efficiently over $\mathcal{A}$. We have already mentioned that this thesis does not focus solely on linear reward functions. Nevertheless, most of the reward functions we will encounter behave similarly to linear rewards. To be more precise, we will see that there is frequently an $\ell_1$-norm control on the deviation of the reward when the parameter we want to learn (here the mean) varies. For example, a common assumption considered by Wang and Chen (2017) is
$$\left|r(A; \mu) - r(A; \mu')\right| \le \left\|e_A \odot (\mu - \mu')\right\|_1. \qquad (1.4)$$
Under this assumption, the optimistic approaches above remain valid (i.e., the bounds on the regret are of the same order) by considering, in the policies, the optimization problem
$$\max_{A\in\mathcal{A},\ \mu\in\mathcal{C}_t} r(A; \mu)$$
in order to select $A_t$, where $\mathcal{C}_t$ denotes a confidence region on the mean built from the observations collected up to round $t$. One difficulty with non-linear rewards in optimistic approaches is that the quantity $\max_{\mu\in\mathcal{C}_t} r(A; \mu)$ may not be simple to compute (and even less to optimize over $A \in \mathcal{A}$). Nevertheless, there are cases where this is possible. For example, if the function $\mu \mapsto r(A; \mu)$ is monotone (component-wise), then, for a confidence region of $\ell_\infty$ type, we can directly plug the vector of optimistic estimates in place of the true mean in order to select the action $A_t$ (Chen, Wang, and Yuan, 2013; Chen, Wang, and Yuan, 2016), which makes the resulting policy computationally efficient.
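To illustrate the monotone case, the sketch below plugs a vector of per-arm upper confidence bounds into a non-linear reward function before calling the (brute-force) oracle; the confidence radius, the clipping to $[0,1]$, and the example reward $r(A;\mu) = 1 - \prod_{i\in A}(1-\mu_i)$ are illustrative assumptions, not the exact choices made in the thesis.

```python
import numpy as np

def optimistic_monotone_step(action_space, means, counts, t, reward_fn):
    """One round of the plug-in rule for a component-wise non-decreasing r(A; .):
    with an l_infinity confidence region around the empirical means, the inner
    maximum over mu in C_t is attained at the vector of per-arm upper bounds."""
    radius = np.sqrt(1.5 * np.log(t) / np.maximum(counts, 1))   # illustrative radius
    mu_plus = np.clip(means + radius, 0.0, 1.0)                 # optimistic estimates
    return max(action_space, key=lambda A: reward_fn(A, mu_plus))

def disjunctive_reward(A, mu):
    """Example of a monotone, non-linear reward: P(at least one arm in A succeeds)."""
    return 1.0 - np.prod([1.0 - mu[i] for i in A])
```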
A particularly interesting case of a non-linear reward function arises when we consider a budgeted combinatorial bandit problem (Xia, Qin, et al., 2016). To briefly describe this framework, there is no longer a time horizon $T$, but rather a budget $B$, and every action taken in a round incurs a cost. We will see that this framework can essentially be solved in the same way as the standard problem. A notable difference is that the objective function to be optimized within each round is no longer a reward function, but a ratio between a reward function and a cost function. This type of problem is therefore at the origin of many non-linear objective functions in CMAB problems, even when the reward and cost functions are initially linear.
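As a rough picture of the budgeted setting, the toy loop below replaces the fixed horizon $T$ by a budget $B$ and stops once the cumulative random cost exceeds it; the Bernoulli rewards and costs are illustrative, and the names mu_plus and nu_minus in the closing comment denote hypothetical optimistic and pessimistic estimates, not quantities defined in this extract.

```python
import numpy as np

def budgeted_run(action_space, mu, nu, B, select, seed=0):
    """Toy budgeted loop: play until the budget B is exhausted; mu are mean rewards,
    nu are mean costs of the arms (a tiny constant keeps every round's cost positive)."""
    rng = np.random.default_rng(seed)
    total_reward, spent, t = 0.0, 0.0, 0
    while spent < B:
        t += 1
        A_t = select(t)
        idx = list(A_t)
        total_reward += rng.binomial(1, mu[idx]).sum()   # linear reward on chosen arms
        spent += rng.binomial(1, nu[idx]).sum() + 1e-9   # linear cost, kept positive
    return total_reward

# The per-round objective an optimistic budgeted policy targets is then a ratio such as
#   argmax_{A in A}  r(A; mu_plus) / c(A; nu_minus),
# which is non-linear in the estimates even when r and c themselves are linear.
```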
Probabilistically triggered arms
In the typical CMAB problem, the set of triggered arms and the action played by the agent coincide. Specifically, at each round, the agent selects an action to play, which triggers a set of arms, and the outcomes of these arms are then observed. An interesting generalization, called probabilistically triggered arms (which we abbreviate as CMAB-T in this thesis), was introduced by Chen, Wang, and Yuan (2016) and Wang and Chen (2017). The idea is that the action selected by the agent is no longer identified with a set of observed arms, but belongs to an external space. After an action is selected, some arms are triggered probabilistically. In particular, this framework has been used to address problems such as cascading bandits (the selection of web pages to display for a search engine query (Kveton, Szepesvari, et al., 2015; Kveton, Wen, Ashkan, and Szepesvari, 2015a)) or online influence maximization (the selection of influencers in a social network (Chen, Wang, and Yuan, 2016; Wen, Kveton, Valko, et al., 2017; Wang and Chen, 2017)). A very convenient assumption under which the above regret bound still holds (for $\ell_\infty$-type confidence regions) is to consider the same smoothness relation as (1.4), but weighting each term (corresponding to an arm) in the norm by the probability of observing the corresponding arm when choosing the action. This assumption is satisfied in particular in cascading bandits and online influence maximization (Wang and Chen, 2017).
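As a concrete instance of probabilistic triggering, the sketch below simulates one round of a toy cascading model: an item's outcome is observed only if no earlier item in the list was clicked, so its triggering probability is exactly the weight that would multiply its term in the weighted version of (1.4). The click probabilities and the ranked list are illustrative.

```python
import numpy as np

def cascading_round(ranked_list, mu, rng):
    """Toy CMAB-T round in a cascading model: the action is an ordered list of items,
    and an item is triggered (its outcome observed) only if no earlier item was clicked,
    so arms further down the list have a smaller probability of being observed."""
    feedback = {}
    for i in ranked_list:
        click = int(rng.binomial(1, mu[i]))
        feedback[i] = click            # this arm was triggered and observed
        if click:                      # the user stops at the first click
            break
    return feedback                    # arms absent from the dict were never triggered

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.1, 0.4])   # toy click probabilities
print(cascading_round([1, 3, 0], mu, rng))
```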
Challenges of the thesis and contributions
After the above overview, several questions arise, and trying to answer them constitutes the objectives of this thesis. In general, the focus is on improving the efficiency of policies, both from a computational and from a statistical point of view. Specifically, we will be particularly interested in the following questions:
- Can we hope to implement an efficient version of the escb and ols-ucb policies? Are there alternatives that are computationally as well as statistically efficient (in the case of independent, or multivariate sub-Gaussian, distributions)?
- Is it possible to combine the $\ell_2$ analysis with the budgeted or the probabilistically triggered arms frameworks? What kind of bound on the regret can we get? Can such an analysis help improve existing policies (and their regret) for known problems, such as online influence maximization?
- Can we optimize the analysis of Degenne and Perchet (2016b)? For example, can we relax the assumption that the sub-Gaussianity matrix $G$ is known? Is there an alternative to the family of multivariate sub-Gaussian random variables? Is it possible to make the interpolation behavior in the regret explicit, for example by replacing $G_{ii}(1 - \gamma + \gamma m)$ by the smaller quantity $\max_{A\in\mathcal{A}:\, i\in A} \sum_{j\in A} G_{ij}$?
We will now present the contents of the thesis, chapter by chapter, focusing on the aspects related to the issues raised above.
Chapter 2, Multi-Armed Bandits. We first review the basic results of the classical multi-armed bandit problem. Some of them will be useful later in the combinatorial context.
Chapter 3, General Framework for Semi-Bandit Feedback. We formalize here the framework of combinatorial stochastic semi-bandits with probabilistically triggered arms (which, as recalled above, encompasses the usual framework). We then give a multitude of applications of this framework that can be found in the literature. Finally, we provide general technical results that will be useful throughout the thesis to prove upper bounds on policy regrets. More precisely, the theorems stated depend on the type of error we want to control (which is nothing else than the type of bonus we use). For the $\ell_\infty$ analysis (i.e., based on an $\ell_\infty$-type confidence region, or equivalently on an $\ell_1$-type bonus), we have a theorem that slightly improves (and also simplifies) the $\ell_\infty$ analysis of Wang and Chen (2017). As for the $\ell_2$ analysis, we provide several new results, extending the work of Degenne and Perchet (2016b) and surpassing the $\ell_\infty$ analysis (gaining a factor $m$, up to a polylogarithmic factor). In doing so, we manage to combine the $\ell_2$ analysis with the probabilistically triggered arms framework. This chapter contains some unpublished results, and some improvements of results published in:
- Pierre Perrault, Jennifer Healey, Zheng Wen, Michal Valko (2020a) “Budgeted Online Influence Maximization”, in 37th International Conference on Machine Learning (ICML).
- Pierre Perrault, Vianney Perchet, Michal Valko (2020) “Covariance-adapting algorithm for semi-bandits with application to sparse rewards”, in 33rd Conference on Learning Theory (COLT).
- Pierre Perrault, Etienne Boursier, Vianney Perchet, Michal Valko (2020) “Statistical Efficiency of Thompson Sampling for Combinatorial Semi-Bandits”, in 34th Conference on Neural Information Processing Systems (NeurIPS).
Table of contents :
Abstract
Acknowledgements – Remerciements
1 Introduction
1.1 Notations
1.2 Présentation du contenu de la thèse (en Français)
1.3 Presentation of the thesis content (in English)
2 Multi-Armed Bandits
2.1 The stochastic multi-armed bandits problem
2.2 Some real world applications
2.3 Regret lower bounds
2.4 Policies
2.5 Regret upper bound analysis
2.6 Some extensions
3 General Framework for Semi-Bandit Feedback
3.1 The stochastic combinatorial semi-bandits problem
3.2 Real world applications of CMAB
3.3 Real world applications of CMAB-T
3.4 General technical results: toward proving regret upper bounds
4 An Example of CMAB-T Problem: Sequential Search-and-Stop
4.1 Problem formulation and motivation
4.2 Offline oracle
4.3 Online search-and-stop
4.4 Experiments and discussion
4.5 Missing proofs
5 The Structure of Uncertainty
5.1 Laplace’s method
5.2 Covering argument
5.3 Efficiency of algorithms
5.4 The budgeted setting
5.5 Experiments and discussion
6 Budgeted Online Influence Maximization
6.1 Problem formulation
6.2 $\ell_1$-based approach
6.3 $\ell_2$-based approach
6.4 Experiments and discussion
6.5 Missing proofs
7 Covariance-Adapting Policy
7.1 An alternative to sub-Gaussian outcomes
7.2 Covariance-dependent regret (lower) bound
7.3 Sparse outcomes
7.4 Experiments and discussion
7.5 Appendix
8 Statistical and Computational Efficiency of Thompson Sampling
8.1 A trade-off between optimality and computational efficiency?
8.2 The independent case: Thompson sampling with beta prior
8.3 The general case: Thompson sampling with Gaussian prior
8.4 Experiments and discussion
8.5 Missing proofs
9 Conclusion and Perspectives
9.1 Conclusion
9.2 Perspectives