Turning Supervised Learning Into Multi-Turn Interactive Tasks 

Learning to interact with the environment: Markov Decision Process formalism

Modeling the environment is done using the Markov Decision Process (MDP) paradigm (Puterman, 2014). At each timestep $t$, the agent receives a state $s \in \mathcal{S}$ and chooses an action $a \in \mathcal{A}$ according to a policy $\pi \in \Pi$. The environment computes a new state $s'$, which depends on a transition kernel $P$, and returns a reward $r$ given by the reward function $R : \mathcal{S} \times \mathcal{A} \to [0, 1]$ (some environments consider larger reward ranges, but $[0, 1]$ is standard).
Ultimately, reinforcement learning algorithms seek to find $\pi^*$, the optimal policy. To compare agents and define what optimal means, one must consider a numerical criterion: the higher its value, the better the agent. The criterion considered in this manuscript is the discounted cumulative reward, whose objective function can be written as:

$$\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad (1.1)$$

$\gamma \in [0, 1]$ being the discount factor and $\mathbb{E}_{\pi}$ the expectation over trajectories when following policy $\pi$. The discounted cumulative reward allows tackling infinite-horizon MDPs and tuning the horizon through the parameter $\gamma$: large $\gamma$ (close to 1) leads to very long horizons, while small $\gamma$ generates short-sighted agents. Thus the MDP is defined as the tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$. To measure the policy's quality, we define the value function $V^{\pi} : \mathcal{S} \to \mathbb{R}$ and the state-action value function $Q^{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. They assess the average quantity of reward received when following the policy and are essential to the reinforcement learning paradigm:

$$\forall s \in \mathcal{S}, \quad V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right], \qquad (1.2)$$

$$\forall s \in \mathcal{S},\ \forall a \in \mathcal{A}, \quad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]. \qquad (1.3)$$

Note that, when the policy is clear from context, $\pi$ will be dropped and $V^{\pi}$ and $Q^{\pi}$ will simply be written $V$ and $Q$. The optimal value function $V^*$ (resp. state-action value function $Q^*$) is defined as:

$$\forall s \in \mathcal{S}, \quad V^*(s) = \max_{\pi} V^{\pi}(s), \qquad \forall s \in \mathcal{S},\ \forall a \in \mathcal{A}, \quad Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a).$$
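To make the discounted criterion concrete, the following minimal sketch estimates $V^{\pi}(s)$ by Monte Carlo rollouts. It assumes a hypothetical environment interface (`env.reset(to=state)` and `env.step(a) -> (s, r, done)`) and a `policy` callable; none of these names come from the manuscript.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one trajectory (Eq. 1.1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def monte_carlo_value(env, policy, state, gamma=0.99, n_rollouts=100, horizon=200):
    """Estimate V^pi(state) by averaging discounted returns over rollouts.

    `env` and `policy` are placeholders: env.reset(to=state) is a hypothetical
    way of starting a rollout from a given state, and env.step(a) is assumed to
    return (next_state, reward, done).
    """
    returns = []
    for _ in range(n_rollouts):
        s, rewards = env.reset(to=state), []
        for _ in range(horizon):          # truncate the infinite horizon
            a = policy(s)                 # sample a ~ pi(.|s)
            s, r, done = env.step(a)      # transition kernel P and reward R
            rewards.append(r)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```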

Tackling Large Action Spaces

When using function approximation, the bigger the action space, the more accurate the approximation needs to be (Thrun et al., 1993). As a result, model training is unstable and rarely used actions are overestimated (Bahdanau et al., 2016; Zaremba et al., 2016). This problem poses a significant challenge in many of the cases described earlier: fine discretization, vectorial spaces, or environments where the number of actions is enormous, such as recommender systems, language models, and industrial plants. We introduce three concepts to analyze the complexity of action spaces: minimality, action similarity, and entanglement.

Space Minimality and Contextual Ineffectiveness

A minimal action space is defined as a space that restricts its actions to the ones used by the optimal policy $\pi^*$. Piot et al., 2017 elaborate on a similar notion called the associated set-policy. The associated set-policy of $\pi^*$ indicates, for each state, the set of actions that might be chosen by $\pi^*$. A minimal action space would be the union of the set-policy of $\pi^*$ (all the actions required to follow $\pi^*$). For example, the pause button will never be used in 99% of games, thus ignoring this action is beneficial. Some actions might be more subtly useless, such as a steering angle never used in a driving game, but overall this type of action is not overrepresented in the current setup. Thus, we introduce the Local Action Space (LAS) $\mathcal{A}_s$, defining the set of available actions in state $s$. A LAS can also be non-minimal if $\pi^*$ does not use certain actions in $s$. Finding every local action space essentially means discovering the optimal policy, but quickly discovering some LAS and generalizing to unseen states might alleviate sample complexity.
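As a minimal illustration (not taken from the manuscript), a local action space can be implemented as a per-state boolean mask applied to the Q-values before the argmax; the action names below are hypothetical.

```python
import numpy as np

# Hypothetical global action space; the LAS A_s is a boolean mask over it.
ACTIONS = ["left", "right", "forward", "pick_up", "open", "pause"]

def masked_greedy_action(q_values, local_action_mask):
    """Greedy action restricted to the Local Action Space A_s.

    q_values: array of shape (|A|,); local_action_mask: boolean array of shape (|A|,).
    """
    masked_q = np.where(local_action_mask, q_values, -np.inf)  # forbid actions outside A_s
    return int(np.argmax(masked_q))

# Example: in a state with nothing to pick up or open, A_s excludes those actions.
q = np.random.randn(len(ACTIONS))
mask = np.array([True, True, True, False, False, False])
print(ACTIONS[masked_greedy_action(q, mask)])
```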
LAS allows the definition of Contextual Ineffectiveness (CI): actions that are non-effective in certain contexts, for example in games where some interactions are triggered only in front of objects ('pick up' or 'open'). Discovering which actions are relevant in which context can drastically increase sample efficiency by easing exploration. A follow-up section (Section 2.1.3.4) describes in which environments CI happens, and a contribution (Section 3.2) addresses contextual effectiveness, using it as an exploration signal.
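A simple, hedged way to operationalise CI (an illustration under my own assumptions, not the method of Section 3.2) is to flag an action as ineffective in a state when it leaves the observation unchanged:

```python
import numpy as np

def is_contextually_ineffective(transition_fn, state, action):
    """Return True if `action` does not modify `state`.

    `transition_fn(state, action) -> next_state` is a hypothetical deterministic
    simulator interface; in practice this signal would be gathered from collected
    transitions rather than by querying the environment on demand.
    """
    next_state = transition_fn(state, action)
    return np.array_equal(next_state, state)
```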
More generally, when considering a class of MDPs, an action set can be minimal for certain instances but not for others. For example, multi-task agents might use a set of actions to solve one task (opening doors) and leave aside the inventory, while another task does not involve doors and requires picking up objects. Thus, taking the goal into account when considering contextual ineffectiveness should be of interest for future work.

Facilitating learning by leveraging action similarity

If many actions trigger similar outcomes, an agent wastes time trying all of them independently. Transferring knowledge between related actions is key to increasing sample efficiency. Action similarity and contextual ineffectiveness are not completely orthogonal: if actions A and B are similar and A is non-effective in particular contexts, it would be appropriate to transfer this knowledge to B. For example, in text-based games, if the action « pick-up sword » is not working, « pick-up katana » is likely to fail too. The following section on action embeddings addresses this problem.

 Action Embeddings

Intuition from NLP and Vision Embedding learning is a subset of representation learning (Bengio et al., 2013): it converts discrete variables into a continuous domain. For example, words in NLP are one-hot encoded; however, a one-hot representation cannot encode semantic proximity between words. Thus, embeddings have mostly been used in NLP to organize words in a space where similarity and composition can be easily computed (Mikolov et al., 2013). Words like « crown » and « hat » should be close in this space, and « atomic » should be quite far. How to build such a space from data? A simple word embedding method could take the following form: two words are converted from one-hot encoding to a continuous vector using an MLP, and both vectors are brought closer using a mean-squared error loss if the two words appear in the same sentence. In this example, the space is constrained directly, whereas Word2Vec (Mikolov et al., 2013) constrains the space through a downstream completion task, and backpropagating through the network trains the embeddings. Computing embeddings for inputs is now standard for many NLP tasks (Pennington et al., 2014; Brown et al., 2020), while computer vision methods (Akata et al., 2015) propose to embed labels (or classes) in a classification task. For example, « baboon » and « gibbon » should have a closer pixel distribution than « car »; this type of information should ease learning and generalization to new classes. RL methods could benefit from the same general ideas. For example, in a dialogue system (Chandramohan et al., 2010; Gao et al., 2018; Ferreira et al., 2013), each action is a dialogue act (or a sentence). Algorithms could benefit from knowing that some actions are similar: « Greetings! » and « Hello » lead to closer outcomes than « Calm down, take a pill. »
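The toy co-occurrence scheme sketched above could look like the following PyTorch snippet; the vocabulary size, network sizes, and training loop are illustrative assumptions, not an implementation from the manuscript.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 32

# MLP mapping word indices (equivalent to one-hot vectors) to continuous embeddings.
encoder = nn.Sequential(nn.Embedding(vocab_size, 64), nn.ReLU(), nn.Linear(64, embed_dim))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(word_a, word_b):
    """Pull together the embeddings of two words that co-occur in a sentence.

    Note: used alone, this naive objective collapses all embeddings to one point;
    in practice negative (non-co-occurring) pairs or a downstream task (as in
    Word2Vec) would also constrain the space.
    """
    za = encoder(torch.tensor([word_a]))
    zb = encoder(torch.tensor([word_b]))
    loss = mse(za, zb)            # bring the two vectors closer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```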
Using pre-computed embeddings in RL van Hasselt et al., 2009 describe a method that uses a continuous domain to tackle a discrete action space. In mountain-car, instead of using a set of discrete forces, they compute the policy in the continuous domain and then discretize the action. This enables generalization between similar actions, which reduces sample complexity. However, the action space consists of a unidimensional steering force, which is already almost continuous and smooth: (1) finding the closest action is easy, and (2) the continuous domain already encodes similarity. Many domains require a different approach, as action similarity cannot be easily computed and finding the closest discrete action can take time.
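As a hedged sketch of this continuous-to-discrete scheme (the force grid and the policy are illustrative placeholders, not the original mountain-car setup):

```python
import numpy as np

# Illustrative discrete action set: a grid of admissible forces.
DISCRETE_FORCES = np.linspace(-1.0, 1.0, num=11)

def select_discrete_action(continuous_policy, state):
    """Act in the continuous domain, then snap to the closest discrete force.

    `continuous_policy(state) -> float` is a placeholder for the learned policy.
    """
    proto_action = continuous_policy(state)                       # continuous proposal
    idx = int(np.argmin(np.abs(DISCRETE_FORCES - proto_action)))  # nearest discrete force
    return idx, DISCRETE_FORCES[idx]
```

In higher-dimensional embedding spaces, this snapping step becomes a nearest-neighbor search over action embeddings, which is precisely the generalization described next.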
Dulac-Arnold et al., 2015 generalize van Hasselt et al., 2009's method to higher-dimensional domains, allowing any pre-defined embeddings. To do so, the policy outputs a continuous action and a nearest-neighbor algorithm (Fix et al., 1989) finds the closest discrete action. To pre-compute action embeddings, Tessler et al., 2019 use predefined word embeddings (such as Word2Vec) and compose actions using an Orthogonal Matching Pursuit algorithm.
The following section proposes to improve upon pre-computed embeddings or to design algorithms that can learn them concurrently with the policy.
Learning Action Embeddings Tennenholtz et al., 2019 propose Act2Vec, an embedding method using expert demonstrations. Similar to Word2Vec (Mikolov et al., 2013), they encode actions with respect to their surroundings: actions that appear in the same context should have a similar function and thus should be close in the representation space. They show that action embeddings encode a notion of similarity between actions, and that clusters represent high-level semantic concepts. They also propose to go beyond 1-step action embeddings and embed sequences of actions (see Fig. 2.3).
Chandak et al., 2019 learn an inverse model (predicting the action taken between $s_t$ and $s_{t+1}$), using supervised learning and a few collected trajectories, to build an action representation. This constrains the embedding to contain information about the transition. Chen et al., 2019 refine this approach by using a probabilistic transition model. Finally, Pritz et al., 2020 build an action representation coupled with a state representation, learned alongside the policy. They also show that embeddings trained alongside the policy transfer quickly to new action domains. We expect to see more work in this direction, as action representations increase efficiency, generalization, and adaptation to new actions.
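As an illustration of the inverse-model idea, here is a minimal PyTorch sketch under my own assumptions (sizes and architecture are placeholders, not the exact model of Chandak et al., 2019):

```python
import torch
import torch.nn as nn

n_actions, state_dim, action_embed_dim = 20, 16, 8

action_embeddings = nn.Embedding(n_actions, action_embed_dim)    # learned action representation
inverse_model = nn.Sequential(                                    # predicts which action was taken
    nn.Linear(2 * state_dim, 64), nn.ReLU(), nn.Linear(64, action_embed_dim))

def inverse_model_loss(s_t, s_tp1, a_t):
    """Classify the true action from (s_t, s_{t+1}) by similarity to every action embedding.

    s_t, s_tp1: float tensors of shape (batch, state_dim); a_t: long tensor of shape (batch,).
    """
    pred = inverse_model(torch.cat([s_t, s_tp1], dim=-1))         # (batch, embed_dim)
    logits = pred @ action_embeddings.weight.t()                  # similarity to each action
    return nn.functional.cross_entropy(logits, a_t)
```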

Discussion on Contextual Ineffectiveness

We previously described how to handle action similarity; we now look to reduce such large action spaces by detecting and discarding useless actions, as defined in Section 2.1.3.1.
Formally, we defined Contextual Ineffectiveness (CI) as actions that, in some situations, do not modify the state. Thus, learning to quickly ignore this action set is key to increasing the sample efficiency of RL algorithms. The subsequent Section 2.3 and Section 3.2 are framed under this paradigm.
Availability known before acting Some environments already remove actions that are not required by the context, so the action set can vary at every time step. For example, some dialogue systems propose a limited set of utterances depending on the conversation's stage. To cope with this challenge, He et al., 2016b; He et al., 2016a propose a Q-Learning approach: they compute a state embedding and an embedding for each available action, then take the dot product between the two. Since the number of actions is small, action selection and bootstrapping using an argmax are quick.
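A hedged sketch of such a dot-product Q-function over a varying action set (the encoders, feature sizes, and action featurization are my own placeholders, not the exact models of He et al.):

```python
import torch
import torch.nn as nn

state_dim, action_dim, embed_dim = 32, 24, 64

state_encoder = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU())
action_encoder = nn.Sequential(nn.Linear(action_dim, embed_dim), nn.ReLU())

def q_values(state, available_actions):
    """Q(s, a) as a dot product between the state embedding and each available action.

    `available_actions` has shape (n_available, action_dim) and may change every step.
    """
    s = state_encoder(state)                # (embed_dim,)
    a = action_encoder(available_actions)   # (n_available, embed_dim)
    return a @ s                            # (n_available,) one Q-value per available action

def greedy_action(state, available_actions):
    return int(torch.argmax(q_values(state, available_actions)))
```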

Table of contents:

1 Deep Reinforcement Learning Cookbook 
1.1 Machine Learning and Deep Learning Background
1.1.1 Machine Learning Ingredients
1.1.2 Example of applications
1.2 Deep Learning
1.2.1 What tools are necessary to follow this thesis?
1.2.2 A brief history of Deep Learning
1.3 Reinforcement Learning
1.3.1 Learning to interact with the environment: Markov Decision Process formalism
1.3.2 How to reinforcement learn?
1.4 Deep Reinforcement Learning (DRL)
1.4.1 Deep Q-Learning (DQN)
1.4.2 Dealing with Partial Observability
1.4.3 Scaling up
1.4.4 Exploration in Reinforcement Learning
1.5 Chapter Conclusion
2 Of Actions And Constraints. 
2.1 Action Space Zoo
2.1.1 Domains example
2.1.2 Discrete vs Continuous Domains
2.1.3 Tackling Large Action Spaces
2.1.4 Working with Hybrid Action Spaces
2.1.5 More Exotic Action Spaces
2.1.6 Conclusion
2.2 Deep Garcia and Fernandez
2.2.1 Worst Case Min Max formulation
2.2.2 Risk-Sensitive Criterion (CVaR)
2.2.3 Constrained Criterion
2.2.4 Conclusion
2.3 First contribution (IJCNN’19): I’m sorry Dave, I’m Afraid I Can’t Do That – Deep Q-learning from Forbidden Actions
2.4 Method
2.4.1 Feedback Signal and MDP-F
2.4.2 Frontier loss
2.5 Experiments
2.5.1 MiniGrid Environment
2.5.2 TextWorld Environment
2.5.3 Model and architecture
2.6 Results
2.7 Conclusion
3 Exploring using Action Relevance 
3.1 Press E to Explore
3.1.1 Undirected Methods in Exploration
3.1.2 Directed Methods in Exploration
3.2 Second Contribution (IJCAI’21): Don’t Do What Doesn’t Matter, Intrinsic Motivation from Action Usefulness
3.3 Notation Adjustment
3.4 Don’t Do What Doesn’t Matter!
3.4.1 Intuition
3.4.2 Method
3.4.3 Decay illustration
3.5 Experimental settings
3.5.1 MiniGrid Environment
3.5.2 Experimental Setting
3.6 Experimental Results
3.6.1 Base environment
3.6.2 Intrinsic exploration behavior
3.6.3 Intrinsic Motivation Pitfalls
3.7 Conclusion
4 Abstraction and Goals 
4.1 Introduction
4.1.1 Vision and Language Navigation
4.2 Goal-Conditioned Reinforcement Learning
4.2.1 Background And Notation
4.2.2 Hindsight Experience Replay
4.2.3 HER variants
4.3 Third Contribution (SCCI’20): HIGhER: Improving instruction following with Hindsight Generation for Experience Replay
4.4 Hindsight Generation for Experience Replay
4.5 Related Methods
4.5.1 Conditioned Language Policy
4.5.2 IRL for instruction following
4.6 Experiments
4.6.1 Experimental Setting
4.6.2 Instruction Generator Analysis
4.6.3 HIGhER for Instruction Following
4.6.4 Discussion
4.6.5 Language Learned by the Instruction Generator
4.6.6 Complementary Experiment
4.6.7 Limitations
4.7 Conclusion
5 Turning Supervised Learning Into Multi-Turn Interactive Tasks 
5.1 Active Learning and Reinforcement Learning
5.2 Sequential Representation Learning
5.3 Fourth Contribution (INTERSPEECH’20): Machine of Few Words – Interactive Speaker Recognition with Reinforcement Learning
5.4 Interactive Speaker Recognition Game
5.4.1 Game Rules
5.4.2 Game notation
5.4.3 Modelling the Speaker Recognition Module
5.5 Speaker Recognition as an RL Problem
5.5.1 Markov Decision Process
5.5.2 Enquirer Optimization Process
5.6 Experimental Protocol
5.6.1 Dataset
5.6.2 Audio Processing
5.6.3 Speaker Recognition Neural Modules
5.7 Experiments
5.7.1 Guesser Evaluation
5.7.2 Enquirer Evaluation
5.8 Conclusions and Future Directions
List of Acronyms
List of Symbols
