Motivation for multi-armed bandits
Clinical trials. The multi-armed bandit problem was initially motivated by clinical trials [Thompson, 1933], where patients infected with a disease are treated by a set of drugs (one at a time). The effects of the drugs on infected patients are unknown at the beginning, and the goal of the experiment is to find the best drug for the disease while curing as many patients as possible in the process. An action, or an arm, in this case is a drug, and a reward is whether we treated the patient successfully or not.
Packet routing. Consider a network represented by a set of vertices connected by edges, where we want to send packets from a source to a destination. Every edge in the network is associated with an unknown delay that depends on the traffic. In every trial, a packet is sent along a chosen route from the source to the destination and the total delay of the packet is observed. The goal in this problem is to minimize the total delay of sending the packets. This problem was tackled by many papers, including [Takimoto and Warmuth, 2003, McMahan and Blum, 2004, Awerbuch and Kleinberg, 2004, György et al., 2007], and several extensions of bandits were proposed to capture the problem more precisely. We discuss these extensions later in this section.

Recommender systems. For many people, recommender systems [Jannach et al., 2010] are an integral part of their lives: watching movies, listening to music, looking for a dinner recipe, finding a good book or restaurant. All of these situations can be formalized as a bandit problem where a recommender system suggests an item to a user and receives feedback (whether the user likes the item or not). Using this interaction, the system can learn user preferences in order to improve recommendations in the future.

Internet advertising. Consider a simple problem where an advertiser can show you one ad from a set of possible ads [Pandey et al., 2007, Schwartz, 2013, Babaioff et al., 2014]. Every time you see an ad, you can decide whether you want to click on it or not, and the goal of the advertiser is to show you ads in order to maximize the number of clicks.

Sometimes the reward of an action is simple and does not change much over time (treatment for a disease), and sometimes rewards can change dramatically over time (the user suddenly starts to like some movie genre). Therefore, the multi-armed bandit problem is studied under several feedback models. The most common are bandits with stochastic rewards, where the rewards for an action come from a fixed distribution, and adversarial rewards, where the rewards are chosen by an adversary without any statistical assumptions. We introduce these feedback models together with the setting and the goal of the learner in the following two sections.
Stochastic bandits
This problem was originally formulated by Robbins [1952]. The learner faces a set of $N$ actions $\mathcal{A} = [N] \stackrel{\text{def}}{=} \{1, \dots, N\}$. Each action $i$ is associated with an unknown probability distribution $\nu_i$ on $[0, 1]$ with mean $\mu_i$. At each time step $t \in [T]$, where $T \in \mathbb{N}$, the learner selects one action $a_t \in \mathcal{A}$ and receives a reward $r_{t,a_t}$ associated with the arm $a_t$. The goal of the learner is to maximize the expected reward he accumulates during the game, i.e., the sum of all the expected rewards $\sum_{t=1}^{T} \mathbb{E}[r_{t,a_t}] = \sum_{t=1}^{T} \mu_{a_t}$. Knowing the reward distributions, the learner could always play the action $a^\star \stackrel{\text{def}}{=} \arg\max_{i \in [N]} \mu_i$ with the highest expected reward. In order to analyze the performance of the learner, we compare his performance to the (optimal) strategy playing the best action in every time step. This performance measure is usually called the cumulative (pseudo-)regret, denoted by $R_T$ and defined as
$$ R_T = T \max_{i \in [N]} \mu_i - \mathbb{E}\left[ \sum_{t=1}^{T} \mu_{a_t} \right], $$
where the expectation is taken with respect to the randomness of the environment as well as with respect to the (possibly randomized) choices of the learner. Note that we measure the performance of the learner in terms of cumulative regret instead of cumulative reward, even though maximizing both of them leads to the same goal. Figure 1.1 summarizes the stochastic multi-armed bandit game.
1: Input:
2: Known set of actions [N]
3: Possibly known time horizon T
4: Unknown probability distributions $\nu_1, \dots, \nu_N$ such that $\mathbb{E}[\nu_i] = \mu_i$, $\forall i \in [N]$
5: for t = 1 to T do
6: The learner chooses an action $a_t \in [N]$
7: The learner receives a reward $r_{t,a_t}$
8: end for
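To make the protocol above concrete, here is a minimal sketch, not taken from the thesis, of the stochastic bandit game with Bernoulli reward distributions played by the classical UCB1 index policy; the arm means and the horizon in the usage line are illustrative values.

```python
import numpy as np

def ucb1_bernoulli(mus, T, rng=None):
    """Play T rounds of the stochastic bandit game of Figure 1.1 with
    Bernoulli reward distributions of means `mus`, using the classical
    UCB1 index. Returns the cumulative pseudo-regret R_T."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(mus)
    counts = np.zeros(N)        # number of pulls of each arm
    means = np.zeros(N)         # empirical mean reward of each arm
    regret = 0.0
    for t in range(1, T + 1):
        if t <= N:
            a = t - 1           # pull every arm once to initialize
        else:
            ucb = means + np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        r = float(rng.random() < mus[a])          # Bernoulli reward draw
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]    # running average update
        regret += max(mus) - mus[a]               # pseudo-regret increment
    return regret

# Example: 3 arms, 10,000 rounds
print(ucb1_bernoulli([0.3, 0.5, 0.7], 10_000))
```

The returned quantity is exactly the cumulative pseudo-regret $R_T$ defined above, accumulated over the $T$ rounds of the game.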
Extensions of multi-armed bandits
The formalism of multi-armed bandits can easily be used in the problems we mentioned before and in many other problems. However, actions in the multi-armed bandit problem are assumed to be independent and thus provide no information about each other. On the other hand, real-world problems often come with some structure. Using this structure, one might be able to design an algorithm which learns faster. For example, in packet routing, we usually observe delays on individual segments of the route. Moreover, the learner also has some information about the paths which share a sub-path with the chosen path. In recommender systems, users usually rate similar items similarly, and in internet advertising, users might be interested in a specific type of product such as electronics, clothes, etc. Therefore, several extensions of the multi-armed bandit problem have been studied in the past. We present several extensions in the following sections, focusing mainly on those related to the results presented in this thesis.
Full-information problem
Some problems come with much richer feedback than bandits. A good example is trading on a stock market, where all stock prices are fully observable after each trading period. This can be formalized as a full-information problem (sometimes also called the problem of prediction with expert advice) [Vovk, 1990, Littlestone and Warmuth, 1994, Freund and Schapire, 1997, Cesa-Bianchi et al., 1997]. It is a sequential decision-making problem where, similarly to bandits, the learner picks an action and obtains the reward of the selected action. However, the main difference is that the learner observes the losses associated with all potential decisions, regardless of his choice. Even though the full-information problem has been studied independently of the multi-armed bandit problem, the two problems share many similarities. The standard algorithm for the problem is called Hedge, with the optimal theoretical bound of $\widetilde{\mathcal{O}}(\sqrt{T})$, where $\widetilde{\mathcal{O}}$ is a variant of the $\mathcal{O}$ notation ignoring logarithmic factors. Using the additional information removes the $\sqrt{N}$ factor from the optimal regret bound of the bandit case, which is $\widetilde{\mathcal{O}}(\sqrt{NT})$.
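As an illustration of the full-information setting, the following is a minimal sketch of the exponential-weights update behind Hedge, under the assumption that the whole $T \times N$ matrix of losses in $[0,1]$ is available as a NumPy array; the learning-rate tuning is a standard choice, but the exact constant is not claimed to match any particular reference.

```python
import numpy as np

def hedge(loss_matrix, eta=None):
    """Run Hedge on a T x N matrix of losses in [0, 1]. Each round the
    learner samples an action from the exponential-weights distribution
    and then observes the losses of *all* actions (full information)."""
    T, N = loss_matrix.shape
    eta = np.sqrt(2.0 * np.log(N) / T) if eta is None else eta  # standard tuning
    cum_loss = np.zeros(N)         # cumulative loss of each action
    learner_loss = 0.0
    rng = np.random.default_rng()
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))   # stabilized weights
        p = w / w.sum()                                  # playing distribution
        a = rng.choice(N, p=p)
        learner_loss += loss_matrix[t, a]
        cum_loss += loss_matrix[t]                       # full-information update
    return learner_loss, cum_loss.min()                  # learner vs. best action

# Example: 1,000 rounds, 5 actions with i.i.d. uniform losses
print(hedge(np.random.default_rng(0).random((1_000, 5))))
```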
Linear and contextual multi-armed bandits
In linear bandits [Auer, 2002, Li et al., 2010, Agrawal and Goyal, 2013], every arm is associated with a $D$-dimensional vector (a point in $\mathbb{R}^D$) and the reward function is an unknown linear function on $\mathbb{R}^D$. The problem can also be seen as learning an unknown $D$-dimensional vector $\theta$ such that the reward corresponding to an action is $x^{\mathsf{T}}\theta$, where $x$ is the vector corresponding to the action. Contextual bandits bring a very similar assumption on the rewards. Every action is associated with a (possibly changing) vector, and the reward corresponding to an action is obtained by applying a function, unknown to the learner, to this vector. Usually the function is linear, but not necessarily. For these settings, Auer [2002] proposed the SupLinRel algorithm and showed that it attains $\widetilde{\mathcal{O}}(\sqrt{DT})$ regret, which matches the lower bound of Dani et al. [2008]. However, the first practical and empirically successful algorithm was LinUCB [Li et al., 2010]. Later, Chu et al. [2011] analyzed SupLinUCB, which is the LinUCB equivalent of SupLinRel. They showed that SupLinUCB also attains $\widetilde{\mathcal{O}}(\sqrt{DT})$ regret. Abbasi-Yadkori et al. [2011] proposed OFUL for linear bandits, which attains $\widetilde{\mathcal{O}}(D\sqrt{T})$ regret. Using their analysis, it is possible to show that LinUCB attains $\widetilde{\mathcal{O}}(D\sqrt{T})$ regret as well (Remark 6). Whether LinUCB matches the $\Omega(\sqrt{DT})$ lower bound for this setting is still an open problem.
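The following is a minimal, illustrative sketch of a LinUCB-style index policy: a ridge estimate of $\theta$ plus an exploration bonus proportional to the confidence width of each arm. The class name, the regularization parameter `lam`, and the width parameter `alpha` are our own illustrative choices, not the exact algorithm or constants analyzed in the cited papers.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB-style index policy for D-dimensional arm features."""
    def __init__(self, D, alpha=1.0, lam=1.0):
        self.alpha = alpha                  # width of the confidence bonus
        self.A = lam * np.eye(D)            # regularized Gram matrix
        self.b = np.zeros(D)                # sum of reward-weighted features

    def select(self, X):
        """X: array of shape (num_arms, D); returns the index of the arm
        maximizing estimated reward plus exploration bonus."""
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b                        # ridge estimate of theta
        widths = np.sqrt(np.einsum('ij,jk,ik->i', X, A_inv, X))
        return int(np.argmax(X @ theta_hat + self.alpha * widths))

    def update(self, x, r):
        """Update statistics with the played feature vector x and reward r."""
        self.A += np.outer(x, x)
        self.b += r * x

# Example usage with 10 arms in D = 4 dimensions (toy data)
rng = np.random.default_rng(0)
theta_true = rng.normal(size=4)
policy = LinUCB(D=4, alpha=1.0)
for _ in range(1_000):
    X = rng.normal(size=(10, 4))                 # fresh arm features each round
    a = policy.select(X)
    r = X[a] @ theta_true + 0.1 * rng.normal()   # noisy linear reward
    policy.update(X[a], r)
```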
Apart from the above approaches, an older approach to the problem is Thompson Sampling [Thompson, 1933]. Even though algorithms based on Thompson Sampling are empirically very successful [Chapelle and Li, 2011], it took a long time to provide strong theoretical guarantees. Thompson Sampling for linear bandits was analyzed only recently: Agrawal and Goyal [2013] introduced a new martingale technique which enabled them to show an $\widetilde{\mathcal{O}}(D\sqrt{T})$ regret bound for LinearTS. Abernethy et al. [2008] and Bubeck et al. [2012] studied a more difficult adversarial setting of linear bandits, where the reward function is time-dependent. However, it is also an open problem whether this approach has an upper bound on the regret that scales with $\sqrt{D}$ instead of $D$.
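For comparison, here is a sketch of one round of a LinearTS-style policy: sample a parameter vector from a Gaussian centered at the ridge estimate, with covariance proportional to the inverse Gram matrix, and play the arm that is best for the sample. The variance scaling `v` is an assumed tuning parameter; the precise posterior used in the cited analysis may differ.

```python
import numpy as np

def linear_ts_step(X, A, b, v=1.0, rng=None):
    """One round of a LinearTS-style policy.
    X: (num_arms, D) arm features; A: (D, D) Gram matrix; b: (D,) statistics.
    Samples theta from a Gaussian posterior and returns the chosen arm index."""
    rng = np.random.default_rng() if rng is None else rng
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                                  # posterior mean (ridge estimate)
    theta_sample = rng.multivariate_normal(theta_hat, v**2 * A_inv)
    return int(np.argmax(X @ theta_sample))
```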
Combinatorial multi-armed bandits
A constraint on the number of arms played by the learner in the multi-armed bandit problem can present an issue in some applications, e.g. packet routing, where an action consists of picking several connections in the network forming a path. Combinatorial multi-armed bandits [Koolen et al., 2010, Cesa-Bianchi and Lugosi, 2012, Audibert et al., 2014] deal with this issue. It is a sequential problem where, in each time step $t$, the environment assigns a loss value to each of $N$ components, and the task of the learner is to choose one of the actions while trying to minimize the loss he incurs. Unlike in the basic multi-armed bandit problem, where the actions are individual components, in the combinatorial multi-armed bandit problem an action can consist of several components. Usually, the action set $\mathcal{S}$ can be expressed as a subset of $\{0, 1\}^N$, and playing an action $v \in \mathcal{S}$ results in incurring the losses of the components corresponding to 1's in $v$.
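A tiny illustration of this action set, with a toy network and paths invented for the example: each action is a binary incidence vector over $N$ links, and the loss it incurs in a round is the sum of the per-link losses of its selected components.

```python
import numpy as np

# Toy combinatorial action set: 3 paths over N = 5 network links,
# each action encoded as a binary incidence vector in {0, 1}^N.
actions = np.array([
    [1, 1, 0, 0, 0],   # path using links 0 and 1
    [1, 0, 1, 1, 0],   # path using links 0, 2 and 3
    [0, 0, 0, 1, 1],   # path using links 3 and 4
])

losses = np.array([0.2, 0.7, 0.1, 0.4, 0.3])   # per-link losses for this round

# Playing action v incurs the total loss of the components selected by v.
incurred = actions @ losses
print(incurred)                 # loss of each candidate action this round
print(int(incurred.argmin()))   # best action in hindsight for this round
```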
Bandits with additional information
Real-world problems are usually complex, with a rich structure. Therefore, the simple multi-armed bandit problem is usually not sufficient to capture the nature of the problem. On the other hand, extensions of the multi-armed bandit problem have proved capable of capturing many different problem structures. However, there are still problems not captured by these extensions, or algorithms that simply fail to capture the nature of the problem. For example, using a basic bandit algorithm to build a movie recommendation system comes with a problem: the algorithm needs to recommend every movie at least once, which goes against the nature of the problem. We address this and several other issues in this thesis.
In the rest of this chapter, we provide a brief overview of the bandit extensions studied in this thesis. We focus mainly on problems with structure, usually represented by an underlying graph with actions as nodes. First, we look at the situation where the rewards of connected actions are correlated and thus observing the reward of an action may give us some approximation of other correlated rewards. In this extension, we aim for algorithms which perform well already after a small number of steps, addressing the previously mentioned problem of recommender systems. Second, we explore the problem where the learner observes some additional information on top of the reward of the action he plays. This additional information is in the form of a graph, where playing an action reveals the rewards (possibly perturbed by noise) of its neighbors.
Spectral bandits and smooth graph functions
The first problem we study is called spectral bandits. It is a new problem which is motivated by a range of practical applications involving graphs. One application is targeted advertisement in social networks. Here, the graph is a social network and our goal is to discover a part of the network that is interested in a given product. Interests of people in a social network tend to change smoothly [McPherson et al., 2001], because friends tend to have similar preferences. Therefore, we take advantage of this structure and formulate this problem as learning a smooth preference function on a graph.
Another motivation for this approach is recommender systems [Jannach et al., 2010]. In content-based recommendation [Chau et al., 2011], the user is recommended items that are similar to the items that the user rated highly in the past. The assumption is that users prefer similar items similarly. The similarity of the items can be measured, for instance, by a nearest-neighbor graph [Billsus et al., 2000], where each item is a node and its neighbors are the most similar items. Our goal is to design algorithms which can leverage the fact that the reward function is smooth in many applications, and which provide strong theoretical guarantees and empirical performance. In particular, we aim for algorithms that can perform well already after a few time steps.

A smooth graph function is a function on a graph that returns similar values on neighboring nodes. This concept arises frequently in manifold and semi-supervised learning [Zhu, 2008], and reflects the fact that the outcomes on neighboring nodes tend to be similar. It is well known [Belkin et al., 2006, 2004] that a smooth graph function can be expressed as a linear combination of the eigenvectors of the graph Laplacian with the smallest eigenvalues. Therefore, the problem of learning such a function can be cast as a regression problem on these eigenvectors. We bring this concept to bandits. In particular, in Chapter 2 we study a bandit problem where the arms are the nodes of a graph and the expected payoff of pulling an arm is a smooth function on this graph. We call this problem spectral bandits.
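To illustrate the spectral viewpoint, here is a short sketch with a toy path graph: build the graph Laplacian, take the eigenvectors with the smallest eigenvalues, and form a smooth function on the nodes as their linear combination; the number of eigenvectors and the coefficients are arbitrary illustrative values.

```python
import numpy as np

# Toy graph: a path of 6 nodes, adjacency matrix W and Laplacian L = D - W.
N = 6
W = np.zeros((N, N))
for i in range(N - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

# Eigenvectors of L sorted by eigenvalue; small eigenvalues <-> smooth vectors.
eigvals, eigvecs = np.linalg.eigh(L)

# A smooth graph function: linear combination of the k eigenvectors with the
# smallest eigenvalues (k and the coefficients are illustrative choices).
k = 2
alpha = np.array([1.0, 0.5])
f = eigvecs[:, :k] @ alpha

# Smoothness measured by the Laplacian quadratic form f^T L f (small = smooth).
print(f)
print(float(f @ L @ f))
```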
Table of contents:
Chapter 1 Introduction
1 Basic multi-armed bandits
1.1 Motivation for multi-armed bandits
1.2 Stochastic bandits
1.3 Adversarial bandits
2 Extensions of multi-armed bandits
2.1 Full-information problem
2.2 Linear and contextual multi-armed bandits
2.3 Combinatorial multi-armed bandits
3 Bandits with additional information
3.1 Spectral bandits and smooth graph functions
3.2 Bandits with side observations
Chapter 2 Spectral bandits for smooth graph functions
1 Introduction
2 Spectral bandit setting
2.1 Related work
3 Spectral bandits
3.1 Smooth graph functions
3.2 Effective dimension
3.3 Lower bound
4 Algorithms
4.1 SpectralUCB algorithm and theoretical guarantees
4.2 SpectralTS algorithm and theoretical guarantees
4.3 SpectralEliminator algorithm and theoretical guarantees
4.4 Scalability and computational complexity
5 Analysis
5.1 Preliminaries
5.2 Confidence ellipsoid
5.3 Effective dimension
5.4 Regret bound of SpectralUCB
5.5 Regret bound of SpectralTS
5.6 Regret bound of SpectralEliminator
6 Experiments
6.1 Artificial datasets
6.2 Effect of smoothness on regret
6.3 Computational complexity improvements
6.4 MovieLens experiments
6.5 Flixster experiments
6.6 Experiment design modifications
Chapter 3 Bandits with side observations
1 Framework of bandits with side observations
1.1 Existing algorithms and results
1.2 Exploration in Exp3-based algorithms
1.3 Implicit exploration and Exp3 algorithm
1.4 Exp3-based algorithms
2 Adversarial bandits with adversarial side observations
2.1 Side-observation setting with adversarial graphs
2.2 Efficient learning by implicit exploration
2.3 Exp3-IX algorithm and theoretical guarantees
3 Adversarial bandits with stochastic side observations
3.1 Side-observation setting with stochastic graphs
3.2 Exp3-Res algorithm and theoretical guarantees
3.3 Experiments
4 Adversarial bandits with noisy side observations
4.1 Side-observation setting with weighted graphs
4.2 Exp3-IXt algorithm and theoretical guarantees
4.3 Effective independence number
4.4 Exp3-WIX algorithm and theoretical guarantees
4.5 Experiments
5 Combinatorial semi-bandits with adversarial side observations
5.1 Introduction
5.2 Combinatorial side-observation setting with adversarial graphs
5.3 Implicit exploration by geometric resampling and FPL-IX algorithm
5.4 Performance guarantees for FPL-IX
6 Analysis
6.1 Regret bound of Exp3-IX
6.2 Regret bound of Exp3-Res
6.3 Regret bound of Exp3-IXt
6.4 Regret bound of Exp3-WIX
6.5 Regret bound of FPL-IX
Chapter 4 Summary and future work
Bibliography