Get Complete Project Material File(s) Now! »
Uncertain MDPs: between discrete and continuous MDPs
In this thesis, we will have to deal with MDPs with unknown r and p but for which we know some condence sets. A convenient way to describe an uncertain MDP is through the notions of \bounded-parameter MDPs » and \extended MDPs ». Bounded-parameterMDP. A bounded-parameter MDP is a collection of MDPs {with identical state-action spaces{ specied by condence bounds on the parameters (rewards and transition probabilities) representing the uncertainty about the true values. Formally, a bounded parameter MDP M is usually characterized by some compact sets Br(s, a) [0, rmax] and Bp(s, a) S (see Def. 2.1): M= n M = hS,A, r, pi : r(s, a) 2 Br(s, a), p(·|s, a) 2 Bp(s, a), 8(s, a) 2 S × A o . (2.15) Bounded-parameter MDPs were rst introduced by Givan et al. (2000) in the innite horizon discounted setting, and later used by Tewari and Bartlett (2007a) in the undiscounted setting.
The bounded parameter MDP will typically be constructed so as to include the true MDP with high probability (w.h.p.). ExtendedMDP. As pointed out by Jaksch et al. (2010, Section 3.1.1), any bounded parameter MDP can be equivalently represented by an \extended MDP ». The idea is to combine all MDPs into a single MDP with identical state space S but with an extended compact action space A+. The extended MDP corresponding to the bounded parameter MDP M dened in Eq. 2.15 is formally dened as M+ = hS,A+, r+, p+i where for all s 2 S: A+s := [ a2As {a} × Br(s, a) × Bp(s, a) 8a+ = (a, r, p) 2 A+s ,
On-lineReinforcement Learning in the infinite horizon undiscounted setting
In the previous section, we used the formalism of MDPs to describe an agent interacting with its environment. Depending on the chosen optimality criterion, we showed how to compute a (near-)optimal policy when the parameters of the MDP are fully known. In this section, we will address the case when all or part of the MDP is unknown and needs to be learned by the agent. We restrict attention to the innite horizon undiscounted setting which will be the main focus of this thesis. Although it is not always the most appropriate setting (e.g., when there is a pre-dened horizon or discount factor), it is perhaps the most general (in the limit, see Sec. 2.2) among all the settings presented in Sec. 2.1. It is also the most challenging to analyse.
The learning problem
We consider the learning problem where S, A and rmax are known, while rewards r and transition probabilities p are unknown and need to be estimated on-line i.e., in a sequential fashion. The planning algorithms presented in Sec. 2.1 cannot be used directly to compute an optimal policy and samples of r and p need to be collected rst. Rather than focusing on learning a (near-)optimal policy (e.g., with the best possible accuracygiven an horizon T), we will be interested in maximizing the cumulative reward PT t=1 rt collected up to time T. As T grows to innity, maximizing PT t=1 rt amounts to learning a gain-optimal policy since in the limit the series eventually grows as Tg (Puterman, 1994, Chapter 8), which is the best asymptotic growth rate achievable. But in the meanwhile, the learning agent needs to eciently trade-o the exploration needed to collect information about the dynamics and reward, and the exploitation of the experience gathered so far to gain as much reward as possible. In order to quantitatively assess the exploration-exploitation performance we use the concept of regret which compares the rewards accumulated by the agent and an optimal policy i.e., μ| 1 v T − PT t=1 rt. To simplify this denition, we observe that v T = LT 0 = LT h + LT 0 − LT h = Tge + h + LT 0 − LT h.
(Near) Optimal algorithms
A common strategy to eciently balance exploration and exploitation in RL is to apply the optimism in face of uncertainty (OFU) principle: the agent maintains optimistic estimates of the MDP parameters and, at each step, executes the policy with highest optimistic \value » (e.g., gain, discounted value function, etc.). In this section, we will review some of the existing RL algorithms relying on OFU.
An alternative approach is posterior sampling (Thompson, 1933b), which maintains a Bayesian distribution over MDPs (i.e., dynamics and expected reward) and, at each step, samples an MDP and executes the corresponding optimal policy (e.g., Osband et al., 2013; Abbasi-Yadkori and Szepesvari, 2015; Osband and Roy, 2017; Ouyang et al., 2017a). Unfortunately, so far all existing posterior sampling algorithms only provide guarantees on the Bayesian regret. A notable exception is the work of (Agrawal and Jia, 2017) which successfully combines posterior sampling with OFU to obtain guarantees on the frequentist regret. However, their algorithm requires to sample multiple times the posterior distribution over MDPs so as to obtain empirical high-probability condence bounds, which somehow resembles what OFU methods do in a computationally more ecient way.
Table of contents :
1 Introduction
1.1 Topic of the thesis
1.2 Motivations
1.3 Scientic approach
1.4 Open research questions of the literature
1.5 Outline of the thesis
2 Statistical analysis of the exploration-exploitation dilemma in RL
2.1 Markov Decision Processes
2.1.1 Denitions
2.1.2 Finite horizon problems
2.1.3 Innite horizon problems
2.1.4 Stochastic shortest path
2.1.5 Uncertain MDPs: between discrete and continuous MDPs
2.2 On-line Reinforcement Learning in the innite horizon undiscounted setting .
2.2.1 The learning problem
2.2.2 Theoretical benchmarks
2.2.3 (Near) Optimal algorithms
3 Improved exploration-exploitation with Bernstein bounds
3.1 Upper Condence Reinforcement Learning with Bernstein bounds
3.1.1 Detailed algorithm and notations
3.1.2 Extended value iteration
3.1.3 Linear Programming for extended value iteration
3.2 Gain-optimism in ucrlb
3.2.1 A new argument: optimistic Bellman operator
3.2.2 Proof of optimism with concentration inequalities
3.3 Bounding the optimistic bias of ucrlb: diameter and renements
3.3.1 Diameter
3.3.2 Renement of the diameter: travel-budget
3.4 Regret guarantees for ucrlb
3.5 First regret proof of ucrlb
3.5.1 Splitting into episodes
3.5.2 Plugging the optimistic Bellman optimality equation
3.5.3 Bounding the transition probabilities
3.5.4 Bounding the rewards
3.5.5 Bounding the number of episodes
3.5.6 Summing over episodes
3.5.7 Completing the regret bound of Thm. 3.4
3.6 Improved regret analysis for ucrlb using variance reduction methods
3.6.1 Bounding the sum of variances
3.6.2 Completing the regret bound of Thm. 3.5
3.7 Comparison between upper and lower-bounds
3.8 Conclusion
4 Exploration{exploitation in MDPs with innite diameter
4.1 Introduction
4.1.1 Motivations
4.1.2 Previous work
4.2 Truncated Upper-Condence RL (tucrl)
4.2.1 Formalisation of the problem
4.2.2 Algorithm
4.3 Analysis of tucrl
4.3.1 Optimistic gain and bias
4.3.2 Regret guarantees
4.3.3 Regret proofs
4.4 Experiments
4.5 Learning limitations with innite diameter
4.6 Conclusion
5 Exploration{exploitation with prior knowledge on the optimal bias span
5.1 Introduction
5.1.1 Bias span versus travel-budget
5.1.2 Exploration bonus
5.2 Span-constrained exploration{exploitation in RL: regal.c and relaxations .
5.2.1 The approach of regal.c
5.2.2 A rst relaxation of regal.c
5.3 The Optimization Problem
5.4 Planning with scopt
5.4.1 Span-constrained value and policy operators
5.4.2 Convergence and Optimality Guarantees
5.5 Learning with scal
5.5.1 Learning algorithm
5.5.2 Analysis of scal
5.6 Numerical Experiments
5.6.1 Toy MDP
5.6.2 Knight Quest
5.7 scal+: scal with exploration bonus
5.7.1 The algorithm
5.7.2 Optimistic Exploration Bonus
5.7.3 Regret Analysis of scal+
5.8 scal*: scal with tighter optimism
5.8.1 Combining the condence sets of scal with the exploration bonus of scal+
5.8.2 Implementation and performance
5.9 Conclusion
6 Hierarchical exploration{exploitations with options
6.1 Introduction
6.2 The option framework
6.2.1 Formal denition of options
6.2.2 Semi-Markov Decision Processes
6.2.3 Markov options as absorbing Markov Chains
6.2.4 MDP with options as an SMDP
6.3 Learning in Semi-Markov Decision Processes
6.3.1 The learning problem
6.3.2 sucrl: Semi-Markov Upper Condence RL
6.3.3 Regret guarantees of sucrl
6.3.4 Regret analysis of SUCRL
6.3.5 Minimax lower bound for SMDPs
6.3.6 Analyzing the impact of options on the learning process
6.4 Learning in MDPs with Options without prior knowledge
6.4.1 From absorbing to irreducible Markov Chains
6.4.2 Optimistic bilevel Bellman operator
6.4.3 fsucrl: sucrl with Irreducible Markov Chains
6.5 Numerical Experiments
6.5.1 Simple grid world.
6.5.2 Four -room maze.
6.6 Conclusion
A Appendix of Chapter 3
A.1 Bias and travel-budget
A.1.1 Proof of Thm. 3.2
A.1.2 Proof of Thm. 3.3
A.2 Concentration bounds using a martingale argument
A.2.1 Proofs of Lem. 3.1 and 3.4
A.2.2 Proofs of Lem. 3.2 and 3.7
A.2.3 Proofs of Lem. 3.3 and 3.10
A.2.4 Proofs of Lem. 3.9
A.3 Proofs of Lem. 3.5 and 3.8 (Cauchy-Schwartz)
A.4 Proof of Lem. 3.6
B Appendix of Chap. 4
B.1 Number of episodes
B.2 Proof of Thm. 4.2
C Appendix of Chap. 5
C.1 Projection on a semi-ball (proof of Lem. 5.4)
C.2 Aperiodicity transformation (proof of Lem. 5.3)
C.3 Operator of scal* (proof of Lem. 5.13)
C.4 Perturbation of scal* operator (proof of Lem. 5.14)
D Appendix of Chapter 6
D.1 Sub-exponential options (proof of Lem.6.1)
D.2 Comparison of the MDP-sample complexity and the SMDP-sample complexity
D.2.1 Counter-example 1
D.2.2 Counter-example 2
D.2.3 Conclusion