Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts

1Autonomous Learning Robots (ALR), Karlsruhe Institute of Technology (KIT) 2FZI Research Center for Information Technology (FZI)
This paper was published at ICML 2024 and presented at the ARLET workshop 2024.
Figure 1: During inference, the Mixture of Experts (MoE) policy observes a context $\boldsymbol{c}$ and selects an expert to execute the corresponding skill. During training, the MoE model selects the context $\boldsymbol{c}_T$ it prefers through sampling from the per-expert energy-based context distribution. This preferred context sampling enables automatic curriculum learning.

Code Available Soon

We will finalize and release the code soon. Stay tuned!

Abstract

Reinforcement learning (RL) is a powerful approach for acquiring a well-performing policy. However, learning diverse skills is challenging in RL due to the commonly used Gaussian policy parameterization. We propose Diverse Skill Learning (Di-SkilL), an RL method for learning diverse skills using a Mixture of Experts, where each expert formalizes a skill as a contextual motion primitive. Di-SkilL optimizes each expert and its associated context distribution according to a maximum entropy objective that incentivizes learning diverse skills in similar contexts. The per-expert context distributions enable automatic curriculum learning, allowing each expert to focus on its best-performing sub-region of the context space. To overcome hard discontinuities and multi-modalities without any prior knowledge of the environment's unknown context probability space, we leverage energy-based models to represent the per-expert context distributions and show how to train them efficiently using the standard policy gradient objective. We show on challenging robot simulation tasks that Di-SkilL learns diverse and performant skills.

Diverse Skill Learning

Automatic Curriculum Learning using Energy-Based per-Expert Context Distribution

Figure 1 depicts the sampling procedure of Di-SkilL. During inference, the agent observes contexts $\boldsymbol{c}$ from the environment's unknown context distribution $p(\boldsymbol{c})$. The agent calculates the gating probabilities $\pi(o|\boldsymbol{c})$ for each context and samples an expert $o$, resulting in the $(o, \boldsymbol{c})$ samples marked in blue. During training, we first sample a batch of contexts $\boldsymbol{c}$ from $p(\boldsymbol{c})$, which is used to calculate the per-expert context distribution $\pi(\boldsymbol{c}|o)$ for each expert $o = 1, \dots, K$. The distribution $\pi(\boldsymbol{c}|o)$ assigns higher probability to contexts preferred by the expert $\pi(\boldsymbol{\theta}|\boldsymbol{c}, o)$. To enable curriculum learning, we provide each expert with contexts sampled from its corresponding $\pi(\boldsymbol{c}|o)$, resulting in the $(o, \boldsymbol{c}_T)$ samples marked in orange. In both cases, the chosen expert $\pi(\boldsymbol{\theta}|\boldsymbol{c}, o)$ samples motion primitive parameters $\boldsymbol{\theta}$ for each context, resulting in a trajectory $\tau$ that is subsequently executed on the environment. Before execution, the corresponding context, e.g., the goal position of a box, needs to be set in the environment. This is illustrated by the dashed arrows, with the context in blue for inference and orange for training.
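The two phases can be summarized in a few lines. Below is a minimal sketch, assuming PyTorch networks for the gating and the per-expert energies; `gating`, `experts`, and `energy_net` are hypothetical placeholders, not the paper's implementation. Normalizing the energies over a sampled batch of contexts is one simple way to draw from an energy-based model without computing its global normalizer.

```python
import torch

def inference_step(contexts, gating, experts):
    """Inference: for observed contexts c ~ p(c), sample an expert o from the
    gating pi(o|c) and let it sample motion primitive parameters theta."""
    probs = gating(contexts)                                   # [B, K] gating probabilities pi(o|c)
    o = torch.multinomial(probs, num_samples=1).squeeze(-1)    # one expert index per context
    theta = torch.stack([experts[idx].sample(c) for idx, c in zip(o.tolist(), contexts)])
    return o, theta

def preferred_contexts(context_batch, energy_net, n_samples):
    """Training: draw preferred contexts c_T for one expert by sampling from a
    self-normalized, energy-based pi(c|o) defined over a batch of contexts
    c ~ p(c); higher energy means the expert is more likely to train on c."""
    logits = energy_net(context_batch).squeeze(-1)   # [B] unnormalized log-probabilities
    probs = torch.softmax(logits, dim=0)             # discretized pi(c|o) over the batch
    idx = torch.multinomial(probs, n_samples, replacement=True)
    return context_batch[idx]                        # preferred contexts c_T
```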

Specializing and Overlapping Context Distributions lead to Diverse Skills

Figure 2: (a) High-probability regions of the individual per-expert context distributions, where a color represents an expert $o$. (b) Number of active experts for context regions.


The maximum-entropy objective allows learning diverse skills for the same or similar tasks defined by the contexts. For this, the per-expert context distributions need to specialize in a sub-region of the context space (a), but at the same time overlapping regions are necessary to learn diverse skills for similar tasks (b). Both properties are ensured by the decomposed objective (see the paper for details).
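As a rough schematic (not the paper's exact notation or derivation), a maximum-entropy mixture-of-experts objective has the form

$$\max_{\pi}\;\mathbb{E}_{\boldsymbol{c}\sim p(\boldsymbol{c})}\Big[\mathbb{E}_{o\sim\pi(o|\boldsymbol{c})}\,\mathbb{E}_{\boldsymbol{\theta}\sim\pi(\boldsymbol{\theta}|\boldsymbol{c},o)}\big[R(\boldsymbol{c},\boldsymbol{\theta})\big] + \alpha\,\mathcal{H}\big[\pi(\boldsymbol{\theta},o\,|\,\boldsymbol{c})\big]\Big],$$

and the entropy chain rule $\mathcal{H}[\pi(\boldsymbol{\theta},o|\boldsymbol{c})] = \mathcal{H}[\pi(o|\boldsymbol{c})] + \mathbb{E}_{o\sim\pi(o|\boldsymbol{c})}\,\mathcal{H}[\pi(\boldsymbol{\theta}|\boldsymbol{c},o)]$ splits the joint entropy into a gating term and per-expert terms: the per-expert terms encourage each expert to specialize, while the gating entropy keeps several experts active in overlapping context regions. The exact decomposed objective used by Di-SkilL is derived in the paper.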

Tasks

5-Link Reacher

The 5-Link Reacher task is an extension of the classical 2-Link Reacher task from OpenAI Gym. The reacher has to reach goal positions with its tip that cover all quadrants of the context space. A major challenge in this task is the time-sparse reward, which provides a reward signal only at the end of the episode.
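To make the time-sparse reward concrete, the sketch below shows one simple way such a reward can be realized as a gymnasium-style wrapper that withholds all reward until the last step; this is a hypothetical illustration, not the environment implementation used in the paper. A variant that scores only the final state (e.g., the tip-to-goal distance at the last step) is equally time-sparse.

```python
import gymnasium as gym

class TimeSparseReward(gym.Wrapper):
    """Accumulate the dense per-step reward internally and pay it out only
    once, at the final step of the episode; all earlier steps return zero."""

    def reset(self, **kwargs):
        self._accumulated = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._accumulated += reward
        if terminated or truncated:
            return obs, self._accumulated, terminated, truncated, info
        return obs, 0.0, terminated, truncated, info
```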

The following video shows diverse reaching skills learned by Di-SkilL. The skills were sampled during inference time from the gating distribution.


Box Pushing with Obstacles

In the Box Pushing with Obstacles task, a 7-DoF robot has to push a box to a target position and orientation while avoiding an obstacle. The 5-dimensional context consists of the box's target position and orientation and the obstacle's position. The task is additionally challenging due to its time-sparse reward structure.

The following video shows diverse pushing skills learned by Di-SkilL. The skills were sampled during inference time from the gating distribution.


Hopper Jump

The Hopper from OpenAI Gym is tasked to jump as high as possible while landing at a goal position marked by the green and red dots. This task has a non-Markovian reward structure, which makes learning skills with step-based approaches infeasible.
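The return here depends on the whole trajectory, e.g., the maximum height reached and where the hopper finally lands, so it cannot be written as a sum of per-step Markovian rewards. A hypothetical sketch of such an episodic return (the state keys and weights are made up for illustration):

```python
import numpy as np

def hopper_jump_return(trajectory, goal_pos, w_height=1.0, w_land=1.0):
    """Hypothetical episodic (non-Markovian) return: it requires the full
    trajectory to find the maximum jump height, plus the final landing
    distance to the goal, so it cannot be expressed as a per-step reward."""
    heights = np.array([state["torso_height"] for state in trajectory])
    final_foot_pos = np.asarray(trajectory[-1]["foot_xy"])
    max_height = heights.max()
    landing_dist = np.linalg.norm(final_foot_pos - np.asarray(goal_pos))
    return w_height * max_height - w_land * landing_dist
```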

The following videos show the behaviors of Di-SkilL's individual experts. We sampled contexts from each per-expert context distribution and executed the corresponding expert. The videos show that each expert learns a different skill. The first expert (left) builds momentum for the jump by using the first joint and stabilizes by landing on its foot.
The second expert (middle) builds momentum for the jump by using the first joint and stabilizes by landing on the hopper's "head". This expert is responsible for landing positions that are further away from the initial position.
The third expert (right) builds momentum for the jump by using the first joint and stabilizes by landing on the hopper's "head". This expert is responsible for landing positions that are close to the initial position.


Table Tennis

In the table tennis task, a 7-degree-of-freedom (DoF) robot has to learn fast and precise motions to smash the ball to a desired position on the opponent's side. The 5-dimensional context consists of the incoming ball's landing position, the desired landing position on the opponent's side, and the ball's initial velocity. The table tennis environment requires good exploratory behavior and has a non-Markovian reward structure, making it infeasible for step-based approaches to learn useful skills.

The videos below show diverse striking skills learned by Di-SkilL. For each video, the ball's target landing position on the opponent's side is fixed, while the incoming ball's landing position and initial velocity are varied. The shown skills correspond to executing experts sampled from the gating distribution during inference.


Robot Mini Golf

The 7-DoF robot is tasked to hit the ball in an environment with two obstacles, where the blue obstacle is static and the green one is reset in each episode. The ball has to pass through the narrow goal on the other side of the table for the trial to count as a success. This environment has a non-Markovian reward structure, which makes learning difficult.

The following video shows diverse skills where the goal is fixed and the ball's and the obstacle's initial positions are varied. The experts are sampled from the gating distribution during inference.


BibTeX

@inproceedings{celik2024acquiring,
  title={Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts},
  author={Onur Celik and Aleksandar Taranovic and Gerhard Neumann},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=9ZkUFSwlUH}
}