JFPDA PFIA 2021 Journées Francophones sur la Planification, la Décision et l'Apprentissage pour la conduite de systèmes
JFPDA — Journées Francophones sur la Planification, la Décision et l'Apprentissage pour la conduite de systèmes — PFIA 2021. Photo credit: Flickr/xlibber
Table of Contents

François Schwarzentruber
Editorial . . . . 4

Program Committee . . . . 5

Ahmed Akakzia, Cédric Colas, Pierre-Yves Oudeyer, Mohamed Chetouani, Olivier Sigaud
Grounding Language to Autonomously-Acquired Skills via Goal Generation . . . . 6

Aurélien Delage, Olivier Buffet, Jilles Dibangoye
HSVI pour zs-POSG usant de propriétés de convexité, concavité, et Lipschitz-continuité . . . . 21

Sergej Scheck, Alexandre Niveau, Bruno Zanuttini
Explicit Representations of Persistency for Propositional Action Theories . . . . 35

Yang You, Vincent Thomas, Francis Colas, Olivier Buffet
Résolution de Dec-POMDP à horizon infini à l'aide de contrôleurs à états finis dans JESP . . . . 43

Sébastien Gamblin, Alexandre Niveau, Maroua Bouzid
Vérification symbolique de modèles pour la logique épistémique dynamique probabiliste . . . . 59

Arthur Queffelec, Ocan Sankur, François Schwarzentruber
Planning for Connected Agents in a Partially Known Environment . . . . 71
Editorial

Journées Francophones sur la Planification, la Décision et l'Apprentissage pour la conduite de systèmes

The Journées Francophones Planification, Décision et Apprentissage (JFPDA) aim to bring together the French-speaking community of researchers working on reinforcement learning, control theory, dynamic programming and, more generally, on the fields related to sequential decision making under uncertainty and to planning. The JFPDA conference is supported by the Collège Représentation et Raisonnement of AFIA. We thank all the members of the program committee for their reviewing work.

François Schwarzentruber
Program Committee

Chair
— François Schwarzentruber (ENS Rennes, IRISA)

Members
— Olivier Buffet (LORIA, INRIA)
— Alain Dutech (LORIA, INRIA)
— Humbert Fiorino (LIG, Grenoble)
— Andreas Herzig (CNRS, IRIT, Université de Toulouse)
— Jérôme Lang (CNRS, LAMSADE, Université Paris-Dauphine)
— Frédéric Maris (Université Toulouse 3 Paul Sabatier, IRIT)
— Laetitia Matignon (LIRIS, CNRS)
— Alexandre Niveau (GREYC, Université de Normandie)
— Damien Pellier (LIG, Grenoble)
— Sophie Pinchinat (IRISA, Rennes)
— Cédric Pralet (ONERA, Toulouse)
— Philippe Preux (Université de Lille)
— Emmanuel Rachelson (ISAE-SUPAERO)
— Régis Sabbadin (INRA)
— Abdallah Saffidine (The University of New South Wales)
— Olivier Sigaud (ISIR, UPMC)
— Florent Teichteil-Königsbuch (Airbus Central Research & Technology)
— Vincent Thomas (LORIA, Nancy)
— Paul Weng (UM-SJTU Joint Institute)
— Bruno Zanuttini (GREYC, Normandie Univ.; UNICAEN, CNRS, ENSICAEN)
Grounding Language to Autonomously-Acquired Skills via Goal Generation

Ahmed Akakzia*1, Cédric Colas*2, Pierre-Yves Oudeyer2, Mohamed Chetouani1, Olivier Sigaud1
1 Sorbonne Université   2 INRIA
ahmed.akakzia@isir.upmc.fr
* Equal contribution.

Résumé
L'objectif principal de cet article est de construire des agents artificiels capables d'apprendre aussi bien en autonomie que sous l'assistance d'un tuteur humain. Pour apprendre en autonomie, ces agents doivent être capables de générer et poursuivre leurs propres buts ainsi que d'apprendre à partir de leurs propres signaux de récompense. Pour apprendre sous assistance externe, ils doivent interagir avec un tuteur et apprendre à suivre des instructions basées sur du langage naturel. Nous proposons une nouvelle architecture d'apprentissage par renforcement conditionné sur du langage : Language-Goal-Behavior (LGB). Contrairement aux approches classiques, LGB découple l'apprentissage sensorimoteur et l'ancrage du langage grâce à une couche sémantique intermédiaire. Nous présentons DECSTR, une instance particulière de LGB. DECSTR est intrinsèquement motivé et doté d'une représentation spatiale basée sur des prédicats que les enfants préverbaux maîtrisent. Nous comparons LGB à l'approche bout-à-bout où la politique est directement basée sur le langage (LC-RL) et à l'approche non sémantique où la politique est conditionnée sur des buts représentant des positions 3D. Nous montrons que LGB permet de satisfaire les instructions langagières, d'apprendre des comportements plus diversifiés, de changer de stratégie en cas d'échec et de faciliter l'ancrage du langage.

Mots Clef
Intelligence Artificielle, Apprentissage par Renforcement, Ancrage du Langage.

Abstract
We are interested in the autonomous acquisition of repertoires of skills. Language-conditioned reinforcement learning (LC-RL) approaches are great tools in this quest, as they allow to express abstract goals as sets of constraints on the states. However, most LC-RL agents are not autonomous and cannot learn without external instructions and feedback. Besides, their direct language condition cannot account for the goal-directed behavior of pre-verbal infants and strongly limits the expression of behavioral diversity for a given language input. To resolve these issues, we propose a new conceptual approach to language-conditioned RL: the Language-Goal-Behavior architecture (LGB). LGB decouples skill learning and language grounding via an intermediate semantic representation of the world. To showcase the properties of LGB, we present a specific implementation called DECSTR. DECSTR is an intrinsically motivated learning agent endowed with an innate semantic representation describing spatial relations between physical objects. In a first stage (G→B), it freely explores its environment and targets self-generated semantic configurations. In a second stage (L→G), it trains a language-conditioned goal generator to generate semantic goals that match the constraints expressed in language-based inputs. We showcase the additional properties of LGB w.r.t. both an end-to-end LC-RL approach and a similar approach leveraging non-semantic, continuous intermediate representations. Intermediate semantic representations help satisfy language commands in a diversity of ways, enable strategy switching after a failure and facilitate language grounding.

Keywords
Artificial Intelligence, Deep Reinforcement Learning, Language Grounding.

1 Introduction

Developmental psychology investigates the interactions between learning and developmental processes that support the slow but extraordinary transition from the behavior of infants to the sophisticated intelligence of human adults (Piaget, 1977; Smith & Gasser, 2005). Inspired by this line of thought, the central endeavour of developmental robotics consists in shaping a set of machine learning processes able to generate a similar growth of capabilities in robots (Weng et al., 2001; Lungarella et al., 2003). In this broad context, we are more specifically interested in designing learning agents able to: 1) explore open-ended environments and grow repertoires of skills in a self-supervised way and 2) learn from a tutor via language commands.
The design of intrinsically motivated agents marked a major step towards these goals. The Intrinsically Motivated Goal Exploration Processes family (IMGEPs), for example, describes embodied agents that interact with their environment at the sensorimotor level and are endowed with the ability to represent and set their own goals, rewarding themselves over completion (Forestier et al., 2017). Recently, goal-conditioned reinforcement learning (GC-RL) appeared like a viable way to implement IMGEPs and target the open-ended and self-supervised acquisition of diverse skills.

[Figure 1: A standard language-conditioned RL architecture (left: Language → Behavior, LB) and our proposed LGB architecture (right), with its LG phase (language grounding: language → semantic goals), GB phase (skill learning: semantic goals → behavior), LB phase (instruction following: language → behavior) and LGB phase (instruction following: language → semantic goals → behavior).]

Goal-conditioned RL approaches train goal-conditioned policies to target multiple goals (Kaelbling, 1993; Schaul et al., 2015). While most GC-RL approaches express goals as target features (e.g. target block positions (Andrychowicz et al., 2017), agent positions in a maze (Schaul et al., 2015) or target images (Nair et al., 2018)), recent approaches started to use language to express goals, as language can express sets of constraints on the state space (e.g. open the red door) in a more abstract and interpretable way (Luketina et al., 2019).

However, most GC-RL approaches – and language-based ones (LC-RL) in particular – are not intrinsically motivated and receive external instructions and rewards. The IMAGINE approach is one of the rare examples of intrinsically motivated LC-RL approaches (Colas et al., 2020). In any case, the language condition suffers from three drawbacks. 1) It couples skill learning and language grounding. Thus, it cannot account for goal-directed behaviors in pre-verbal infants (Mandler, 1999). 2) Direct conditioning limits the behavioral diversity associated to language input: a single instruction leads to a low diversity of behaviors only resulting from the stochasticity of the policy or the environment. 3) This lack of behavioral diversity prevents agents from switching strategy after a failure.

To circumvent these three limitations, one can decouple skill learning and language grounding via an intermediate innate semantic representation. On one hand, agents can learn skills by targeting configurations from the semantic representation space. On the other hand, they can learn to generate valid semantic configurations matching the constraints expressed by language instructions. This generation can be the backbone of behavioral diversity: a given sentence might correspond to a whole set of matching configurations. This is what we propose in this work.

Contributions. We propose a novel conceptual RL architecture, named LGB for Language-Goal-Behavior and pictured in Figure 1 (right). This LGB architecture enables an agent to decouple the intrinsically motivated acquisition of a repertoire of skills (Goals → Behavior) from language grounding (Language → Goals), via the use of semantic goal representation. To our knowledge, the LGB architecture is the only one to combine the following four features:
• It is intrinsically motivated: it selects its own (semantic) goals and generates its own rewards,
• It decouples skill learning from language grounding, accounting for infants learning,
• It can exhibit a diversity of behaviors for any given instruction,
• It can switch strategy in case of failures.

Besides, we introduce an instance of LGB, named DECSTR for DEep sets and Curriculum with SemanTic goal Representations. Using DECSTR, we showcase the advantages of the conceptual decoupling idea. In the skill learning phase, the DECSTR agent evolves in a manipulation environment and leverages semantic representations based on predicates describing spatial relations between physical objects. These predicates are known to be used by infants from a very young age (Mandler, 2012). DECSTR autonomously learns to discover and master all reachable configurations in its semantic representation space. In the language grounding phase, we train a Conditional Variational Auto-Encoder (C-VAE) to generate semantic goals from language instructions. Finally, we can evaluate the agent in an instruction-following phase by composing the first two phases. The experimental section investigates three questions: how does DECSTR perform in the three phases? How does it compare to end-to-end LC-RL approaches? Do we need intermediate representations to be semantic? Code and videos can be found at https://sites.google.com/view/decstr/.

2 Related Work

Standard language-conditioned RL. Most approaches from the LC-RL literature define instruction following agents that receive external instructions and rewards (Hermann et al., 2017; Chan et al., 2019; Bahdanau et al., 2018; Cideron et al., 2019; Jiang et al., 2019; Fu et al., 2019), except the IMAGINE approach which introduced intrinsically motivated agents able to set their own goals and to imagine new ones (Colas et al., 2020). In both cases, the language condition prevents the decoupling of language acquisition and skill learning, true behavioral diversity and efficient strategy switching behaviors. Our approach is different, as we can decouple language acquisition from skill learning. The language-conditioned goal generation allows behavioral diversity and strategy switching behaviors.
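To make the contrast just discussed more concrete, the sketch below illustrates the difference between a directly language-conditioned policy (LB) and the decoupled interface of Figure 1 (LGB). All class and method names here are hypothetical placeholders meant to mirror the figure, not the released DECSTR code.

```python
# Hypothetical interfaces contrasting the two architectures of Figure 1.

class LanguageConditionedPolicy:
    """LB: one network maps (state, instruction) directly to actions,
    entangling language grounding and control."""
    def act(self, state, instruction_embedding):
        ...  # pi(a | s, instruction)

class LGBAgent:
    """LGB: language is first mapped to semantic goals, then a goal-conditioned
    policy produces behavior; the two modules can be trained separately."""
    def __init__(self, goal_generator, goal_policy):
        self.goal_generator = goal_generator  # L -> G, trained on (c_i, d, c_f) triplets
        self.goal_policy = goal_policy        # G -> B, trained on intrinsic goals

    def act(self, state, current_config, instruction):
        # One of possibly many goals compatible with the instruction is sampled,
        # which is the source of behavioral diversity and strategy switching.
        goal = self.goal_generator.sample(instruction, current_config)
        return self.goal_policy.act(state, current_config, goal)
```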
Goal-conditioned RL with target coordinates for block manipulation. Our proposed implementation of LGB, called DECSTR, evolves in a block manipulation domain. Stacking blocks is one of the earliest benchmarks in artificial intelligence (e.g. Sussman (1973); Tate (1975)) and has led to many simulation and robotics studies (Deisenroth et al., 2011; Xu et al., 2018; Colas et al., 2019a). Recently, Lanier et al. (2019) and Li et al. (2019) demonstrated impressive results by stacking up to 4 and 6 blocks respectively. However, these approaches are not intrinsically motivated, involve hand-defined curriculum strategies and express goals as specific target block positions. In contrast, the DECSTR agent is intrinsically motivated, builds its own curriculum and uses semantic goal representations (symbolic or language-based) based on spatial relations between blocks.

Decoupling language acquisition and skill learning. Several works investigate the use of semantic representations to associate meanings and skills (Alomari et al., 2017; Tellex et al., 2011; Kulick et al., 2013). While the two first use semantic representations as an intermediate layer between language and skills, the third one does not use language. While DECSTR acquires skills autonomously, previous approaches all use skills that are either manually generated (Alomari et al., 2017), hand-engineered (Tellex et al., 2011) or obtained via optimal control methods (Kulick et al., 2013). Closer to us, Lynch & Sermanet (2020) also decouple skill learning from language acquisition in a goal-conditioned imitation learning paradigm by mapping both language goals and image goals to a shared representation space. However, this approach is not intrinsically motivated as it relies on a dataset of human tele-operated strategies. The deterministic merging of representations also limits the emergence of behavioral diversity and efficient strategy-switching behaviors.

3 Methods

This section presents our proposed Language-Goal-Behavior architecture (LGB) represented in Figure 1 (Section 3.1) and a particular instance of the LGB architecture called DECSTR. We first present the environment it is set in [3.2], then describe the implementations of the three modules composing any LGB architecture: 1) the semantic representation [3.3]; 2) the intrinsically motivated goal-conditioned algorithm [3.4] and 3) the language-conditioned goal generator [3.5]. We finally present how the three phases described in Figure 1 are evaluated [3.6].

3.1 The Language-Goal-Behavior Architecture

The LGB architecture is composed of three main modules. First, the semantic representation defines the behavioral and goal spaces of the agent. Second, the intrinsically motivated GC-RL algorithm is in charge of the skill learning phase. Third, the language-conditioned goal generator is in charge of the language grounding phase. Both phases can be combined in the instruction following phase. The three phases are respectively called G→B for Goal → Behavior, L→G for Language → Goal and L→G→B for Language → Goal → Behavior, see Figure 1 and Appendix 6. Instances of the LGB architecture should demonstrate the four properties listed in the introduction: 1) be intrinsically motivated; 2) decouple skill learning and language grounding (by design); 3) favor behavioral diversity; 4) allow strategy switching. We argue that any LGB algorithm should fulfill the following constraints. For LGB to be intrinsically motivated (1), the algorithm needs to integrate the generation and selection of semantic goals and to generate its own rewards. For LGB to demonstrate behavioral diversity and strategy switching (3, 4), the language-conditioned goal generator must efficiently model the distribution of semantic goals satisfying the constraints expressed by any language input.

3.2 Environment

[Figure 2: Example configurations. Top-right: (111000100).]

The DECSTR agent evolves in the Fetch Manipulate environment: a robotic manipulation domain based on MuJoCo (Todorov et al., 2012) and derived from the Fetch tasks (Plappert et al., 2018), see Figure 2. Actions are 4-dimensional: 3D gripper velocities and grasping velocity. Observations include the Cartesian and angular positions and velocities of the gripper and the three blocks. Inspired by the framework of Zone of Proximal Development that describes how parents organize the learning environment of their children (Vygotsky, 1978), we let a social partner facilitate DECSTR's exploration by providing non-trivial initial configurations. After a first period of autonomous exploration, the social partner initializes the scene with stacks of 2 blocks 21% of times, stacks of 3 blocks 9% of times, and a block is initially put in the agent's gripper 50% of times. This help is not provided during offline evaluations.
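As a concrete illustration of the biased initializations just described, a minimal sketch of the social partner's scene reset is given below. The reported probabilities come from Section 3.2; whether the three events are mutually exclusive outcomes of one categorical draw, the residual "flat" case and the function names are all assumptions, not the exact environment API.

```python
import random

# Assumed categorical distribution over initializations (residual mass -> flat layout).
INIT_DISTRIBUTION = [
    ("block_in_gripper", 0.50),
    ("stack_of_2",       0.21),
    ("stack_of_3",       0.09),
    ("flat_default",     0.20),
]

def sample_initial_scene(biased: bool) -> str:
    """Return a scene-initialization type for the Fetch Manipulate environment.

    With biased=True, the (simulated) social partner provides non-trivial initial
    configurations; otherwise the blocks are simply placed flat on the table,
    as during autonomous exploration and offline evaluations.
    """
    if not biased:
        return "flat_default"
    u, acc = random.random(), 0.0
    for name, prob in INIT_DISTRIBUTION:
        acc += prob
        if u < acc:
            return name
    return "flat_default"
```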
3.3 Semantic Representation

Semantic predicates define the behavioral space. Defining the list of semantic predicates is defining the dimensions of the behavioral space explored by the agent. It replaces the traditional definition of goal spaces and their associated reward functions. We believe it is for the best, as it does not require the engineer to fully predict all possible behaviors within that space, to know which behaviors can be achieved and which ones cannot, nor to define reward functions for each of them.

Semantic predicates in DECSTR. We assume the DECSTR agent to have access to innate semantic representations based on a list of predicates describing spatial relations between pairs of objects in the scene. We consider two of the spatial predicates infants demonstrate early in their development (Mandler, 2012): the close and the above binary predicates. These predicates are applied to all permutations of object pairs for the 3 objects we consider: 6 permutations for the above predicate and 3 combinations for the close predicate due to its order-invariance. A semantic configuration is the concatenation of the evaluations of these 9 predicates and represents spatial relations between objects in the scene. In the resulting semantic configuration space {0, 1}^9, the agent can reach 35 physically valid configurations, including stacks of 2 or 3 blocks and pyramids, see examples in Figure 2. The binary reward function directly derives from the semantic mapping: the agent rewards itself when its current configuration c_p matches the goal configuration (c_p = g). Appendix 7 provides formal definitions and properties of predicates and semantic configurations.

3.4 Intrinsically Motivated Goal-Conditioned Reinforcement Learning

This section describes the implementation of the intrinsically motivated goal-conditioned RL module in DECSTR. It is powered by the Soft Actor-Critic algorithm (SAC) (Haarnoja et al., 2018) that takes as input the current state, the current semantic configuration and the goal configuration, for both the critic and the policy. We use Hindsight Experience Replay (HER) to facilitate transfer between goals (Andrychowicz et al., 2017). DECSTR samples goals via its curriculum strategy, collects experience in the environment, then performs policy updates via SAC. This section describes two particularities of our RL implementation: the self-generated goal selection curriculum and the object-centered network architectures. Implementation details and hyperparameters can be found in Appendix 8.

Goal selection and curriculum learning. The DECSTR agent can only select goals among the set of semantic configurations it already experienced. We use an automatic curriculum strategy (Portelas et al., 2020) inspired from the CURIOUS algorithm (Colas et al., 2019a). The DECSTR agent tracks aggregated estimations of its competence (C) and learning progress (LP). Its selection of goals to target during data collection and goals to learn about during policy updates (via HER) is biased towards goals associated with high absolute LP and low C.

Automatic bucket generation. To facilitate robust estimation, LP is usually estimated on sets of goals with similar difficulty or similar dynamics (Forestier et al., 2017; Colas et al., 2019a). While previous works leveraged expert-defined goal buckets, we cluster goals based on their time of discovery, as the time of discovery is a good proxy for goal difficulty: easier goals are discovered earlier. Buckets are initially empty (no known configurations). When an episode ends in a new configuration, the N_b = 5 buckets are updated. Buckets are filled equally and the first buckets contain the configurations discovered earlier. Thus goals change buckets as new goals are discovered.

Tracking competence, learning progress and sampling probabilities. Regularly, the DECSTR agent evaluates itself on goal configurations sampled uniformly from the set of known ones. For each bucket, it tracks the recent history of past successes and failures when targeting the corresponding goals (last W = 1800 self-evaluations). C is estimated as the success rate over the most recent half of that history: C = C_recent. LP is estimated as the difference between C_recent and the success rate evaluated over the first half of the history (C_earlier). This is a crude estimation of the derivative of the C curve w.r.t. time: LP = C_recent − C_earlier. The sampling probability P_i for bucket i is:

P_i = (1 − C_i) · |LP_i| / Σ_j ((1 − C_j) · |LP_j|).

In addition to the usual LP bias (Colas et al., 2019a), this formula favors lower C when LP is similar. The absolute value ensures resampling buckets whose performance decreased (e.g. forgetting).

Object-centered architecture. Instead of fully-connected or recurrent networks, DECSTR uses for the policy and critic an object-centered architecture similar to the ones used in Colas et al. (2020); Karch et al. (2020), adapted from Deep-Sets (Zaheer et al., 2017). For each pair of objects, a shared network independently encodes the concatenation of body and object features and current and target semantic configurations, see Appendix Figure 6. This shared network ensures efficient transfer of skills between pairs of objects. A second inductive bias leverages the symmetry of the behavior required to achieve above(o_i, o_j) and above(o_j, o_i). To ensure automatic transfer between the two, we present half of the features (e.g. those based on pairs (o_i, o_j) where i < j) with goals containing one side of the symmetry (all above(o_i, o_j) for i < j) and the other half with the goals containing the other side (all above(o_j, o_i) for i < j). As a result, the above(o_i, o_j) predicates fall into the same slot of the shared network inputs as their symmetric counterparts above(o_j, o_i), only with different permutations of object pairs. Goals are now of size 6: 3 close and 3 above predicates, corresponding to one side of the above symmetry. Skill transfer between symmetric predicates is automatically ensured. Appendix 8.1 further describes these inductive biases and our modular architecture.
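The competence, learning-progress and bucket-sampling computations of Section 3.4 can be sketched as follows. The window size W = 1800 and the formula for P_i come from the text above; the data structures and method names are illustrative, not the DECSTR code.

```python
from collections import deque

W = 1800  # per-bucket self-evaluation history length (Section 3.4)

class BucketTracker:
    """Tracks competence C, learning progress LP and sampling probabilities
    for discovery-ordered goal buckets (a sketch under assumptions)."""

    def __init__(self, n_buckets=5):
        self.histories = [deque(maxlen=W) for _ in range(n_buckets)]

    def record(self, bucket_id, success):
        self.histories[bucket_id].append(float(success))

    def competence_and_lp(self, bucket_id):
        h = list(self.histories[bucket_id])
        if len(h) < 2:
            return 0.0, 0.0
        half = len(h) // 2
        c_earlier = sum(h[:half]) / half
        c_recent = sum(h[half:]) / (len(h) - half)
        return c_recent, c_recent - c_earlier  # C = C_recent, LP = C_recent - C_earlier

    def sampling_probabilities(self):
        # P_i proportional to (1 - C_i) * |LP_i|, normalized over buckets.
        scores = []
        for i in range(len(self.histories)):
            c, lp = self.competence_and_lp(i)
            scores.append((1.0 - c) * abs(lp))
        total = sum(scores)
        if total == 0.0:  # before any progress signal, fall back to uniform sampling
            return [1.0 / len(scores)] * len(scores)
        return [s / total for s in scores]
```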
3.5 Language-Conditioned Goal Generation

The language-conditioned goal generation module (LGG) is a generative model of semantic representations conditioned by language inputs. It is trained to generate semantic configurations matching the agent's initial configuration and the description of a change in one object-pair relation. A training dataset is collected via interactions between a DECSTR agent trained in phase G→B and a social partner. DECSTR generates semantic goals and pursues them. For each trajectory, the social partner provides a description d of one change in object relations from the initial configuration c_i to the final one c_f. The set of possible descriptions contains 102 sentences, each describing, in a simplified language, a positive or negative shift for one of the 9 predicates (e.g. get red above green). This leads to a dataset D of 5000 triplets: (c_i, d, c_f). From this dataset, the LGG is learned using a conditional Variational Auto-Encoder (C-VAE) (Sohn et al., 2015). Inspired by the context-conditioned goal generator from Nair et al. (2019), we add an extra condition on the language instruction to improve control on goal generation. The conditioning instruction is encoded by a recurrent network that is jointly trained with the VAE via a mixture of Kullback-Leibler and cross-entropy losses. Appendix 8.2 provides the list of sentences and implementation details. By repeatedly sampling the LGG, a set of goals is built for any language input. This enables skill diversity and strategy switching: if the agent fails, it can sample another valid goal to fulfill the instruction, effectively switching strategy. This also enables goal combination using logical functions of instructions: and is an intersection, or is a union and not is the complement within the known set of goals.

3.6 Evaluation of the three LGB phases

Skill learning phase G→B: DECSTR explores its semantic representation space, discovers achievable configurations and learns to reach them. Goal-specific performance is evaluated offline across learning as the success rate (SR) over 20 repetitions for each goal. The global performance SR is measured across either the set of 35 goals or discovery-organized buckets of goals, see Section 3.4.

Language grounding phase L→G: DECSTR trains the LGG to generate goals matching constraints expressed via language inputs. From a given initial configuration and a given instruction, the LGG should generate all compatible final configurations (goals) and just these. This is the source of behavioral diversity and strategy switching behaviors. To evaluate the LGG, we construct a synthetic, oracle dataset O of triplets (c_i, d, C_f(c_i, d)), where C_f(c_i, d) is the set of all final configurations compatible with (c_i, d). On average, C_f in O contains 16.7 configurations, while the training dataset D only contains 3.4 (20%). We are interested in two metrics: 1) the Precision is the probability that a goal sampled from the LGG belongs to C_f (true positive / all positive); 2) the Recall is the percentage of elements from C_f that were found by sampling the LGG 100 times (true positive / all true). These metrics are computed on 5 different subsets of the oracle dataset, each calling for a different type of generalization (see the full lists of instructions in Appendix 8.2):
1. Pairs found in D, except pairs removed to form the following test sets. This calls for the extrapolation of known initialization-effect pairs (c_i, d) to new final configurations c_f (D contains only 20% of C_f on average).
2. Pairs that were removed from D, calling for a recombination of known effects d on known c_i.
3. Pairs for which the c_i was entirely removed from D. This calls for the transfer of known effects d on unknown c_i.
4. Pairs for which the d was entirely removed from D. This calls for generalization in the language space, to generalize unknown effects d from related descriptions and transpose this to known c_i.
5. Pairs for which both the c_i and the d were entirely removed from D. This calls for the generalizations 3 and 4 combined.

Instruction following phase L→G→B: DECSTR is instructed to modify an object relation by one of the 102 sentences. Conditioned on its current configuration and instruction, it samples a compatible goal from the LGG, then pursues it with its goal-conditioned policy. We consider three evaluation settings: 1) performing a single instruction; 2) performing a sequence of instructions without failure; 3) performing a logical combination of instructions. The transition setup measures the success rate of the agent when asked to perform the 102 instructions 5 times each, resetting the environment each time. In the expression setup, the agent is evaluated on 500 randomly generated logical functions of sentences, see the generation mechanism in Appendix 8.2. In both setups, we evaluate the performance in 1-shot (SR_1) and 5-shot (SR_5) settings. In the 5-shot setting, the agent can perform strategy switching, to sample new goals when previous attempts failed (without reset). In the sequence setup, the agent must execute 20 sequences of random instructions without reset (5-shot). We also test behavioral diversity. We ask DECSTR to follow each of the 102 instructions 50 times and report the number of different achieved configurations.
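The way goal sets are built by sampling the LGG, combined through logical operators and used for 5-shot strategy switching (Sections 3.5–3.6) can be sketched as follows. The `lgg`, `policy` and `env` objects are placeholder interfaces, and the expression encoding is an assumption, not the format used in the released code.

```python
def goal_set(lgg, instruction, c_init, n_samples=100):
    """Approximate the set of semantic goals compatible with an instruction
    by repeatedly sampling the language-conditioned goal generator."""
    return {tuple(lgg.sample(instruction, c_init)) for _ in range(n_samples)}

def combine(expr, lgg, c_init, known_goals):
    """Logical combinations of instructions over goal sets:
    'and' = intersection, 'or' = union, 'not' = complement in the known goal set."""
    op, args = expr[0], expr[1:]
    if op == "instr":
        return goal_set(lgg, args[0], c_init)
    if op == "and":
        return combine(args[0], lgg, c_init, known_goals) & combine(args[1], lgg, c_init, known_goals)
    if op == "or":
        return combine(args[0], lgg, c_init, known_goals) | combine(args[1], lgg, c_init, known_goals)
    if op == "not":
        return known_goals - combine(args[0], lgg, c_init, known_goals)
    raise ValueError(op)

def follow_instruction(env, policy, lgg, instruction, max_attempts=5):
    """5-shot evaluation with strategy switching: after a failed attempt, resample
    another compatible goal from the new configuration (no environment reset)."""
    c_current = env.current_configuration()
    for _ in range(max_attempts):
        g = lgg.sample(instruction, c_current)
        c_final = env.rollout(policy, goal=g)
        if tuple(c_final) == tuple(g):
            return True
        c_current = c_final  # switch strategy from wherever the failed attempt ended
    return False
```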
4 Experiments

Our experimental section investigates three questions: [4.1]: How does DECSTR perform in the three phases? [4.2]: How does it compare to end-to-end language-conditioned approaches? [4.3]: Do we need intermediate representations to be semantic?

4.1 How does DECSTR perform in the three phases?

This section presents the performance of the DECSTR agent in the skill learning, language grounding, and instruction following phases.

Skill learning phase G→B: Figures 3, 4, 5 show that DECSTR successfully masters all reachable configurations in its semantic representation space. Figure 3 shows the evolution of SR computed per bucket. Buckets are learned in increasing order, which confirms that the time of discovery is a good proxy for difficulty. Figure 4 reports C, LP and sampling probabilities P computed online using self-evaluations for an example agent. The agent leverages these estimations to select its goals: first focusing on the easy goals from bucket 1, it moves on towards harder and harder buckets as easier ones are mastered (low LP, high C). Figure 5 presents the results of ablation studies. Each condition removes one component of DECSTR: 1) Flat replaces our object-centered modular architectures by flat ones; 2) w/o Curr. replaces our automatic curriculum strategy by a uniform goal selection; 3) w/o Sym. does not use the symmetry inductive bias; 4) in w/o SP, the social partner does not provide non-trivial initial configurations. In the Expert buckets condition, the curriculum strategy is applied on expert-defined buckets, see Appendix 9.1. The full version of LGB performs on par with the Expert buckets oracle and significantly outperforms all its ablations. Appendix 10.3 presents more examples of learning trajectories, and dissects the evolution of bucket compositions along training.

[Figure 3: Skill Learning: SR per bucket (success rate vs. episodes ×10³; buckets 1–5 and global).]
[Figure 4: C, LP and P estimated by a DECSTR agent (per bucket B1–B5, vs. episodes ×10³).]
[Figure 5: Ablation study (conditions: DECSTR, Exp. Buckets, Flat, w/o Curr., w/o Asym., w/o ZPD). Medians and interquartile ranges over 10 seeds for DECSTR and 5 seeds for others in (a) and (c). Stars indicate significant differences to DECSTR as reported by Welch's t-tests with α = 0.05 (Colas et al., 2019b).]

Language grounding phase L→G: The LGG demonstrates the 5 types of generalization from Table 1. From known configurations, agents can generate more goals than they observed in training data (1, 2). They can do so from new initial configurations (3). They can generalize to new sentences (4) and even to combinations of new sentences and initial configurations (5). These results assert that DECSTR generalizes well in a variety of contexts and shows good behavioral diversity.

Table 1: L→G phase. Metrics are averaged over 10 seeds, stdev < 0.06 and 0.07 respectively.
Metrics     Test 1  Test 2  Test 3  Test 4  Test 5
Precision   0.97    0.93    0.98    0.99    0.98
Recall      0.93    0.94    0.95    0.90    0.92

Instruction following phase L→G→B: Table 2 presents the 1-shot and 5-shot results in the transition and expression setups. In the sequence setup, DECSTR succeeds in L = 14.9 ± 5.7 successive instructions (mean ± stdev over 10 seeds). These results confirm efficient language grounding. DECSTR can follow instructions or sequences of instructions and generalize to their logical combinations. Strategy switching improves performance (SR_5 > SR_1). DECSTR also demonstrates strong behavioral diversity: when asked over 10 seeds to repeat 50 times the same instruction, it achieves at least 7.8 different configurations, 15.6 on average and up to 23 depending on the instruction.

Table 2: L→G→B phase. Mean ± stdev over 10 seeds.
Metr.   Transition     Expression
SR_1    0.89 ± 0.05    0.74 ± 0.08
SR_5    0.99 ± 0.01    0.94 ± 0.06
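The precision and recall reported in Tables 1 and 3 can be computed as sketched below, following the definitions of Section 3.6 (100 samples per initialization-description pair). The structure of the oracle dataset is an assumption.

```python
def evaluate_lgg(lgg, oracle, n_samples=100):
    """Precision / recall of a goal generator against an oracle dataset of
    (c_i, d, compatible) triplets, where `compatible` is the set of all final
    configurations compatible with (c_i, d). A sketch; data format is assumed.

    precision: fraction of sampled goals that belong to the compatible set,
    recall:    fraction of the compatible set recovered by n_samples samples.
    """
    precisions, recalls = [], []
    for c_i, d, compatible in oracle:
        samples = [tuple(lgg.sample(d, c_i)) for _ in range(n_samples)]
        n_valid = sum(s in compatible for s in samples)
        precisions.append(n_valid / len(samples))
        recalls.append(len(set(samples) & compatible) / len(compatible))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```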
4.2 Do we need an intermediate representation?

This section investigates the need for an intermediate semantic representation. To this end, we introduce an end-to-end LC-RL baseline directly mapping Language to Behavior (L→B) and compare its performance with DECSTR in the instruction following phase (L→G→B).

The LB baseline. To limit the introduction of confounding factors and under-tuning concerns, we base this implementation on the DECSTR code and incorporate defining features of IMAGINE, a state-of-the-art language-conditioned RL agent (Colas et al., 2020). We keep the same HER mechanism, object-centered architectures and RL algorithm as DECSTR. We just replace the semantic goal space by the 102 language instructions. This baseline can be seen as an oracle version of the IMAGINE algorithm where the reward function is assumed perfect, but without the imagination mechanism.

Comparison in the instruction following phase L→B vs L→G→B: After training the LB baseline for 14K episodes, we compare its performance to DECSTR's in the instruction-following setup. In the transition evaluation setup, LB achieves SR_1 = 0.76 ± 0.001: it always manages to move blocks close to or far from each other, but consistently fails to stack them. Adding more attempts does not help: SR_5 = 0.76 ± 0.001. The LB baseline cannot be evaluated in the expression setup because it does not manipulate goal sets. Because it cannot stack blocks, LB only succeeds in 3.01 ± 0.43 random instructions in a row, against 14.9 for DECSTR (sequence setup). We then evaluate LB's diversity on the set of instructions it succeeds in. When asked to repeat 50 times the same instruction, it achieves at least 3.0 different configurations, 4.2 on average and up to 5.2 depending on the instruction, against 7.8, 17.1 and 23 on the same set of instructions for DECSTR. We did not observe strategy-switching behaviors in LB, because it either always succeeds (close/far instructions) or always fails (stacks).

Conclusion. The introduction of an intermediate semantic representation helps DECSTR decouple skill learning from language grounding which, in turn, facilitates instruction-following when compared to the end-to-end language-conditioned learning of LB. This leads to improved scores in the transition and sequence setups. The direct language-conditioning of LB prevents the generalization to logical combinations and leads to a reduced diversity in the set of mastered instructions. Decoupling thus brings significant benefits to LGB architectures.

4.3 Do we need a semantic intermediate representation?

This section investigates the need for the intermediate representation to be semantic. To this end, we introduce the LGB-C baseline that leverages continuous goal representations in place of semantic ones. We compare them on the two first phases.

The LGB-C baseline. The LGB-C baseline uses continuous goals expressing target block coordinates in place of semantic goals. The skill learning phase is thus equivalent to traditional goal-conditioned RL setups in block manipulation tasks (Andrychowicz et al., 2017; Colas et al., 2019a; Li et al., 2019; Lanier et al., 2019). Starting from the DECSTR algorithm, LGB-C adds a translation module that samples a set of target block coordinates matching the targeted semantic configuration, which is then used as the goal input to the policy. In addition, we integrate defining features of the state-of-the-art approach from Lanier et al. (2019): non-binary rewards (+1 for each well placed block) and multi-criteria HER, see details in Appendix 9.2.

Comparison in skill learning phase G→B: The LGB-C baseline successfully learns to discover and master all 35 semantic configurations by placing the three blocks at randomly-sampled target coordinates corresponding to these configurations. It does so faster than DECSTR: 708·10³ episodes to reach SR = 95%, against 1238·10³ for DECSTR, see Appendix Figure 8. This can be explained by the denser learning signals it gets from using HER on continuous targets instead of discrete ones. In this phase, however, the agent only learns one parameterized skill: to place blocks at their target position. It cannot build a repertoire of semantic skills because it cannot discriminate between different block configurations. Looking at the sum of the distances travelled by the blocks or the completion time, we find that DECSTR performs opportunistic goal reaching: it finds simpler configurations of the blocks which satisfy its semantic goals compared to LGB-C. Blocks move less (∆dist = 26 ± 5 cm), and goals are reached faster (∆steps = 13 ± 4, mean ± std across goals with p-values > 1.3·10⁻⁵ and 3.2·10⁻¹⁹ respectively).

Comparison in language grounding phase L→G: We train the LGG to generate continuous target coordinates conditioned on language inputs with a mean-squared loss and evaluate it in the same setup as DECSTR's LGG, see Table 3. Although it maintains reasonable precision on the first two testing sets, the LGG achieves low recall – i.e. diversity – on all sets. The lack of semantic representations of skills might explain the difficulty of training a language-conditioned goal generator.

Table 3: LGB-C performance in the L→G phase. Mean over 10 seeds. Stdev < 0.003 and 0.008 respectively.
Metrics     Test 1  Test 2  Test 3  Test 4  Test 5
Precision   0.66    0.78    0.39    0.0     0.0
Recall      0.05    0.02    0.06    0.0     0.0
Conclusion. The skill learning phase of the LGB-C baseline is competitive with the one of DECSTR. However, the poor performance in the language grounding phase prevents this baseline from performing instruction following. For this reason, and because semantic representations enable agents to perform opportunistic goal reaching and to acquire repertoires of semantic skills, we believe the semantic representation is an essential part of the LGB architecture.

5 Discussion and Conclusion

This paper contributes LGB, a new conceptual RL architecture which introduces an intermediate semantic representation to decouple sensorimotor learning from language grounding. To demonstrate its benefits, we present DECSTR, a learning agent that discovers and masters all reachable configurations in a manipulation domain from a set of relational spatial primitives, before undertaking an efficient language grounding phase. This was made possible by the use of object-centered inductive biases, a new form of automatic curriculum learning and a novel language-conditioned goal generation module. Note that our main contribution is in the conceptual approach, DECSTR being only an instance to showcase its benefits. We believe that this approach could benefit from any improvement in GC-RL (for skill learning) or generative models (for language grounding).

Semantic representations. Results have shown that using predicate-based representations was sufficient for DECSTR to efficiently learn abstract goals in an opportunistic manner. The proposed semantic configurations showcase promising properties: 1) they reduce the complexity of block manipulation, where the most effective works rely on a heavy hand-crafted curriculum (Li et al., 2019; Lanier et al., 2019) and a specific curiosity mechanism (Li et al., 2019); 2) they facilitate the grounding of language into skills and 3) they enable decoupling skill learning from language grounding, as observed in infants (Piaget, 1977). The set of semantic predicates is, of course, domain-dependent as it characterizes the space of behaviors that the agent can explore. However, we believe it is easier and requires less domain knowledge to define the set of predicates, i.e. the dimensions of the space of potential goals, than it is to craft a list of goals and their associated reward functions.

A new approach to language grounding. The approach proposed here is the first simultaneously enabling to decouple skill learning from language grounding and fostering a diversity of possible behaviors for given instructions. Indeed, while an instruction following agent trained on goals like put red close_to green would just push the red block towards the green one, our agent can generate many matching goal configurations. It could build a pyramid, make a blue-green-red pile or target a dozen other compatible configurations. This enables it to switch strategy, to find alternative approaches to satisfy a same instruction when first attempts failed. Our goal generation module can also generalize to new sentences or transpose instructed transformations to unknown initial configurations. Finally, with the goal generation module, the agent can deal with any logical expression made of instructions by combining generated goal sets. It would be of interest to simultaneously perform language grounding and skill learning, which would result in "overlapping waves" of sensorimotor and linguistic development (Siegler, 1998).

Semantic configurations of variable size. Considering a constant number of blocks and, thus, fixed-size configuration spaces is a current limit of DECSTR. Future implementations of LGB may handle inputs of variable sizes by leveraging Graph Neural Networks as in Li et al. (2019). Corresponding semantic configurations could be represented as a set of vectors, each encoding information about a predicate and the objects it applies to. These representations could be handled by Deep Sets (Zaheer et al., 2017). This would allow targeting partial sets of predicates that would not need to characterize all relations between all objects, facilitating scalability.

Conclusion. In this work, we have shown that introducing abstract goals based on relational predicates that are well understood by humans can serve as a pivotal representation between skill learning and interaction with a user through language. Here, the role of the social partner was limited to: 1) helping the agent to experience non-trivial configurations and 2) describing the agent's behavior in a simplified language. In the future, we intend to study more intertwined skill learning and language grounding phases, making it possible for the social partner to teach the agent during skill acquisition.

Acknowledgments

This work was performed using HPC resources from GENCI-IDRIS (Grant 20XX-AP010611667), the MeSU platform at Sorbonne-Université and the PlaFRIM experimental testbed. Cédric Colas is partly funded by the French Ministère des Armées - Direction Générale de l'Armement.

References

Muhannad Alomari, Paul Duckworth, David C Hogg, and Anthony G Cohn. Natural language acquisition and grounding for embodied robotic systems. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. arXiv preprint arXiv:1707.01495, 2017.
Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian Hosseini, Pushmeet Kohli, and Edward Grefenstette. Learning to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946, 2018.
Harris Chan, Yuhuai Wu, Jamie Kiros, Sanja Fidler, and Jimmy Ba. ACTRCE: Augmenting experience via teacher's advice for multi-goal reinforcement learning. arXiv preprint arXiv:1902.04546, 2019.
Geoffrey Cideron, Mathieu Seurin, Florian Strub, and Olivier Pietquin. Self-educated language agent with hindsight experience replay for instruction following. arXiv preprint arXiv:1910.09451, 2019.
Cédric Colas, Pierre-Yves Oudeyer, Olivier Sigaud, Pierre Fournier, and Mohamed Chetouani. CURIOUS: Intrinsically motivated multi-task, multi-goal reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1331–1340, 2019a.
Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. A hitchhiker's guide to statistical comparisons of reinforcement learning algorithms. arXiv preprint arXiv:1904.06979, 2019b.
Cédric Colas, Tristan Karch, Nicolas Lair, Jean-Michel Dussoux, Clément Moulin-Frier, Peter Ford Dominey, and Pierre-Yves Oudeyer. Language as a cognitive tool to imagine goals in curiosity-driven exploration. arXiv preprint arXiv:2002.09253, 2020.
Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox. Learning to control a low-cost manipulator using data-efficient reinforcement learning. Robotics: Science and Systems VII, pp. 57–64, 2011.
Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.
Justin Fu, Anoop Korattikara, Sergey Levine, and Sergio Guadarrama. From language to goals: Inverse reinforcement learning for vision-based instruction following. arXiv preprint arXiv:1902.07742, 2019.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, et al. Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551, 2017.
Yiding Jiang, Shixiang Shane Gu, Kevin P Murphy, and Chelsea Finn. Language as an abstraction for hierarchical deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 9414–9426, 2019.
Leslie Pack Kaelbling. Learning to achieve goals. In International Joint Conference on Artificial Intelligence, pp. 1094–1099, 1993.
Tristan Karch, Cédric Colas, Laetitia Teodorescu, Clément Moulin-Frier, and Pierre-Yves Oudeyer. Deep sets for generalization in RL. arXiv preprint arXiv:2003.09443, 2020.
Johannes Kulick, Marc Toussaint, Tobias Lang, and Manuel Lopes. Active learning for teaching a robot grounded relational symbols. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
John B. Lanier, Stephen McAleer, and Pierre Baldi. Curiosity-driven multi-criteria hindsight experience replay. CoRR, abs/1906.03710, 2019. URL http://arxiv.org/abs/1906.03710.
Richard Li, Allan Jabri, Trevor Darrell, and Pulkit Agrawal. Towards practical multi-object manipulation using relational reinforcement learning. arXiv preprint arXiv:1912.11032, 2019.
Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. A survey of reinforcement learning informed by natural language. arXiv preprint arXiv:1906.03926, 2019.
Max Lungarella, Giorgio Metta, Rolf Pfeifer, and Giulio Sandini. Developmental robotics: a survey. Connection Science, 15(4):151–190, 2003.
Corey Lynch and Pierre Sermanet. Grounding language in play. arXiv preprint arXiv:2005.07648, 2020.
Jean M. Mandler. Preverbal representation and language. Language and space, pp. 365, 1999.
Jean M Mandler. On the spatial foundations of the conceptual system and its enrichment. Cognitive science, 36(3):421–451, 2012.
Ashvin Nair, Shikhar Bahl, Alexander Khazatsky, Vitchyr Pong, Glen Berseth, and Sergey Levine. Contextual imagined goals for self-supervised robotic learning. arXiv preprint arXiv:1910.11670, 2019.
Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200, 2018.
Jean Piaget. The development of thought: Equilibration of cognitive structures. Viking, 1977. (Trans. A. Rosin).
Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
Rémy Portelas, Cédric Colas, Lilian Weng, Katja Hofmann, and Pierre-Yves Oudeyer. Automatic curriculum learning for deep RL: A short survey. arXiv preprint arXiv:2003.04664, 2020.
Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.
Robert S Siegler. Emerging minds: The process of change in children's thinking. Oxford University Press, 1998.
Linda Smith and Michael Gasser. The development of embodied cognition: Six lessons from babies. Artificial life, 11(1-2):13–29, 2005.
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.
Gerald J. Sussman. A computational model of skill acquisition. Technical report, MIT Technical Report AI TR-297, 1973.
Austin Tate. Interacting goals and their use. In International Joint Conference on Artificial Intelligence, volume 10, pp. 215–218, 1975.
Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine, 32(4):64–76, 2011.
Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.
L. S. Vygotsky. Tool and Symbol in Child Development. In Mind in Society, chapter Tool and Symbol in Child Development, pp. 19–30. Harvard University Press, 1978. ISBN 0674576292. doi: 10.2307/j.ctvjf9vz4.6.
J. Weng, J. McClelland, A. Pentland, O. Sporns, I. Stockman, M. Sur, and E. Thelen. Autonomous mental development by robots and animals. Science, 291(5504):599–600, 2001.
Danfei Xu, Suraj Nair, Yuke Zhu, Julian Gao, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Neural task programming: Learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE, 2018.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.

Supplementary Material

6 LGB pseudo-code

Algorithms 1 and 2 present the high-level pseudo-code of any algorithm following the LGB architecture for each of the three phases.

Algorithm 1: LGB architecture, G→B phase
  ▷ Goal → Behavior phase
  1: Require: Env E
  2: Initialize policy Π, goal sampler G_s, buffer B
  3: loop
  4:   g ← G_s.sample()
  5:   (s, a, s′, g, c_p, c′_p)_traj ← E.rollout(g)
  6:   G_s.update(c^T_p)
  7:   B.update((s, a, s′, g, c_p, c′_p)_traj)
  8:   Π.update(B)
  9: return Π, G_s

Algorithm 2: LGB architecture, L→G and L→G→B phases
  ▷ Language → Goal phase
  1: Require: Π, E, G_s, social partner SP
  2: Initialize language goal generator LGG
  3: dataset ← SP.interact(E, Π, G_s)
  4: LGG.update(dataset)
  5: return LGG
  ▷ Language → Behavior phase
  6: Require: E, Π, LGG, SP
  7: loop
  8:   instr. ← SP.listen()
  9:   loop   ▷ strategy switching loop
  10:    g ← LGG.sample(instr., c_0)
  11:    c^T_p ← E.rollout(g)
  12:    if g == c^T_p then break

7 Semantic predicates and application to Fetch Manipulate

In this paper, we restrict the semantic representations to the use of the close and above binary predicates applied to M = 3 objects. The resulting semantic configurations are formed by:

F = [c(o1, o2), c(o1, o3), c(o2, o3), a(o1, o2), a(o2, o1), a(o1, o3), a(o3, o1), a(o2, o3), a(o3, o2)],

where c() and a() refer to the close and above predicates respectively and (o1, o2, o3) are the red, green and blue blocks respectively.
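Given the predicate list F above, a minimal sketch of the semantic mapping from block positions to the 9-bit configuration and of the associated binary reward (Section 3.3) is shown below. The 0.09 m closeness threshold and the height margin for above are assumed values: the paper only states that close relies on the Euclidean distance and above on the sign of the z-coordinate difference (see the next paragraph).

```python
import numpy as np

CLOSE_THRESHOLD = 0.09   # metres; assumed, the paper does not give the threshold
ABOVE_MARGIN = 0.03      # metres; assumed margin on the z-difference test

def close(p_i, p_j):
    # Symmetric predicate based on the Euclidean distance between block centres.
    return float(np.linalg.norm(np.asarray(p_i) - np.asarray(p_j)) < CLOSE_THRESHOLD)

def above(p_i, p_j):
    # Asymmetric predicate: the paper evaluates the sign of the z-coordinate
    # difference; a small margin is added here for numerical robustness (assumption).
    return float(p_i[2] - p_j[2] > ABOVE_MARGIN)

def semantic_configuration(p1, p2, p3):
    """Return F = [c(o1,o2), c(o1,o3), c(o2,o3),
    a(o1,o2), a(o2,o1), a(o1,o3), a(o3,o1), a(o2,o3), a(o3,o2)]."""
    return np.array([
        close(p1, p2), close(p1, p3), close(p2, p3),
        above(p1, p2), above(p2, p1),
        above(p1, p3), above(p3, p1),
        above(p2, p3), above(p3, p2),
    ])

def reward(current_config, goal_config):
    # Binary self-generated reward: 1 when the current configuration matches the goal.
    return float(np.array_equal(current_config, goal_config))
```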
Symmetry and asymmetry of the close and above predicates. We consider objects o1 and o2.
• close is symmetric: "o1 is close to o2" ⇔ "o2 is close to o1". The corresponding semantic mapping function is based on the Euclidean distance, which is symmetric.
• above is asymmetric: "o1 is above o2" ⇒ not "o2 is above o1". The corresponding semantic mapping function evaluates the sign of the difference of the object Z-axis coordinates.

8 The DECSTR algorithm

8.1 Intrinsically Motivated Goal-Conditioned RL

Overview. Algorithm 3 presents the pseudo-code of the sensorimotor learning phase (G→B) of DECSTR. It alternates between two steps:
• Data acquisition. A DECSTR agent has no prior on the set of reachable semantic configurations. Its first goal is sampled uniformly from the semantic configuration space. Using this goal, it starts interacting with its environment, generating trajectories of sensory states s, actions a and configurations c_p. The last configuration c^T_p achieved in the episode after T time steps is considered stable and is added to the set of reachable configurations. As it interacts with the environment, the agent explores the configuration space, discovers reachable configurations and selects new targets.
• Internal model updates. A DECSTR agent updates two models: its curriculum strategy and its policy. The curriculum strategy can be seen as an active goal sampler. It biases the selection of goals to target and goals to learn about. The policy is the module controlling the agent's behavior and is updated via RL.

Algorithm 3: DECSTR, sensorimotor phase G→B
  1: Require: env E, # buckets N_b, # episodes before biased init. n_unb, self-evaluation probability p_self_eval, noise function σ()
  2: Initialize: policy Π, buffer B, goal sampler G_s, bucket sampling probabilities p_b, language module LGG
  3: loop
  4:   self_eval ← random() < p_self_eval   ▷ if True, evaluate competence
  5:   g ← G_s.sample(self_eval, p_b)
  6:   biased_init ← epoch < n_unb   ▷ bias initialization only after n_unb epochs
  7:   s_0, c_0 ← E.reset(biased_init)   ▷ c_0: initial semantic configuration
  8:   for t = 1 : T do
  9:     a_t ← policy(s_t, c_t, g)
  10:    if not self_eval then
  11:      a_t ← a_t + σ()
  12:    s_{t+1}, c_{t+1} ← E.step(a_t)
  13:  episode ← (s, c, a, s′, c′)
  14:  G_s.update(c_T)
  15:  B.update(episode)
  16:  g ← G_s.sample(p_b)
  17:  batch ← B.sample(g)
  18:  Π.update(batch)
  19:  if self_eval then
  20:    p_b ← G_s.update_LP()

Policy updates with a goal-conditioned Soft Actor-Critic. Readers familiar with Markov Decision Processes and the use of the SAC and HER algorithms can skip this paragraph. We want the DECSTR agent to explore a semantic configuration space and master reachable configurations in it. We frame this problem as a goal-conditioned MDP (Schaul et al., 2015): M = (S, G_p, A, T, R, γ), where the state space S is the usual sensory space augmented with the configuration space C_p, the goal space G_p is equal to the configuration space (G_p = C_p), A is the action space, T: S × A × S → [0, 1] is the unknown transition probability, R: S × A → {0, 1} is a sparse reward function and γ ∈ [0, 1] is the discount factor.

Policy updates are performed with Soft Actor-Critic (SAC) (Haarnoja et al., 2018), a state-of-the-art off-policy actor-critic algorithm. We also use Hindsight Experience Replay (HER) (Andrychowicz et al., 2017). This mechanism enables agents to learn from failures by reinterpreting past trajectories in the light of goals different from the ones originally targeted. HER was designed for continuous goal spaces, but can be directly transposed to discrete goals (Colas et al., 2019a). In our setting, we simply replace the originally targeted goal configuration by the currently achieved configuration in the transitions fed to SAC. We also use our automatic curriculum strategy: the LP-C-based probabilities are used to sample goals to learn about. When a goal g is sampled, we search the experience buffer for the collection of episodes that ended in the configuration c_p = g. From these episodes, we sample a transition uniformly. The HER mechanism substitutes the original goal with one of the configurations achieved later in the trajectory. This substitute g has high chances of being the sampled one. At least, it is a configuration on the path towards this goal, as it is sampled from a trajectory leading to it. The HER mechanism is thus biased towards goals sampled by the agent.

Object-Centered Inductive Biases. In the proposed Fetch Manipulate environment, the three blocks share the same set of attributes (position, velocity, color identifier). Thus, it is natural to encode a relational inductive bias in our architecture. The behavior with respect to a pair of objects should be independent from the position of the objects in the inputs. The architecture used for the policy is depicted in Figure 6. A shared network (NN_shared) encodes the concatenation of:
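The page is truncated at this point. As a hedged illustration of the object-centered architecture just described (a shared network applied to each object pair, following Deep-Sets), here is a minimal PyTorch sketch. The layer sizes, the aggregation by summation and the exact content of the per-pair input vector are assumptions based on the description in Section 3.4, not the exact DECSTR network.

```python
import torch
import torch.nn as nn

class PairSharedEncoder(nn.Module):
    """Deep-Sets-style object-centered policy encoder (sketch under assumptions).

    The same MLP encodes, for every ordered pair of objects, the concatenation of
    body features, the pair's object features and the current and goal semantic
    configurations; pair embeddings are then aggregated by a permutation-invariant
    sum before the action head."""

    def __init__(self, body_dim, obj_dim, config_dim, hidden=256, action_dim=4):
        super().__init__()
        pair_input = body_dim + 2 * obj_dim + 2 * config_dim
        self.shared = nn.Sequential(
            nn.Linear(pair_input, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # 4-D action in [-1, 1]
        )

    def forward(self, body, objects, current_config, goal_config):
        # body: (B, body_dim); objects: list of (B, obj_dim) tensors; configs: (B, config_dim)
        embeddings = []
        for i in range(len(objects)):
            for j in range(len(objects)):
                if i == j:
                    continue
                x = torch.cat([body, objects[i], objects[j],
                               current_config, goal_config], dim=-1)
                embeddings.append(self.shared(x))
        aggregated = torch.stack(embeddings, dim=0).sum(dim=0)  # permutation-invariant
        return self.head(aggregated)
```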