PointPatchRL - Masked Reconstruction Improves Reinforcement Learning on Point Clouds

Balázs Gyenes1,2, Nikolai Franke1, Philipp Becker1,2, Gerhard Neumann1,2
1Autonomous Learning Robots (ALR), Karlsruhe Institute of Technology (KIT) 2HIDSS4Health - Helmholtz Information and Data Science School for Health

Abstract

PointPatchRL is a method for Reinforcement Learning on point clouds that harnesses their 3D structure to extract task-relevant geometric information from the scene and learn complex manipulation tasks purely from rewards.

While images are a convenient format for perceiving the environment for RL, they often complicate extracting important geometric details, especially with varying geometries or deformable objects. In contrast, point clouds naturally represent this geometry and easily integrate positional and color data from multiple camera views. However, while deep learning on point clouds has seen many recent successes, RL on point clouds is under-researched, with typically only the simplest encoder architectures considered in the literature.

We introduce PointPatchRL (PPRL), a method for RL on point clouds that builds on the common paradigm of dividing point clouds into overlapping patches, tokenizing them, and processing the tokens with transformers. PPRL provides significant improvements compared with other point-cloud architectures previously used for RL. We then complement PPRL with masked reconstruction for representation learning and show that our method outperforms strong model-free and model-based baselines on image observations in complex manipulation tasks containing deformable objects and variations in target object geometry.

Why Point Clouds?

[Figure: point clouds of two faucets with very different geometries]

Easier to extract task-relevant geometry

Disentangle occluded objects from their occluders

[Figure: a faucet seen from two different perspectives, and the point cloud formed by combining both views]

Combine multiple camera views
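
Because every camera's depth readings can be expressed in a shared world frame, fusing views reduces to a rigid transform plus concatenation. The sketch below illustrates this with numpy; the function name `merge_views` and the camera-to-world extrinsics format are illustrative assumptions, not part of the paper's codebase.

```python
import numpy as np

def merge_views(clouds, extrinsics):
    """Merge per-camera point clouds into one world-frame cloud.

    clouds:     list of (n_i, 3) arrays in each camera's coordinate frame
    extrinsics: list of (4, 4) homogeneous camera-to-world transforms
    """
    world = []
    for pts, T in zip(clouds, extrinsics):
        # lift to homogeneous coordinates, apply the rigid transform
        homo = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        world.append((homo @ T.T)[:, :3])
    # a single cloud containing the points from all views
    return np.concatenate(world, axis=0)
```

With calibrated (or simulated) cameras, the extrinsics are known, so the combined cloud comes essentially for free at observation time.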

Method

PointPatchRL

PointPatchRL is a powerful point cloud encoder in its own right, thanks to the well-established patching-and-tokenizing paradigm. As a simple drop-in replacement for any other point cloud encoder, PointPatchRL increases sample efficiency compared to commonly used architectures such as PointNet. No reconstruction loss needs to be added to the learning pipeline if not desired!

PointPatchRL is a powerful point cloud encoder in its own right
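
The patching step can be made concrete with a minimal numpy sketch: sample patch centers with farthest point sampling, group each center's k nearest neighbors into a patch, and normalize each patch to its center so a tokenizer sees only local geometry. Function names and hyperparameters here are illustrative, not the paper's implementation.

```python
import numpy as np

def farthest_point_sample(points, n_centers, seed=0):
    """Greedily pick n_centers indices that cover the cloud."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    centers = [int(rng.integers(n))]
    dist = np.full(n, np.inf)
    for _ in range(n_centers - 1):
        # distance of every point to its nearest chosen center so far
        dist = np.minimum(dist, np.linalg.norm(points - points[centers[-1]], axis=1))
        centers.append(int(np.argmax(dist)))
    return np.array(centers)

def make_patches(points, n_patches=8, patch_size=32):
    """Divide a (n, 3) cloud into overlapping local patches."""
    center_idx = farthest_point_sample(points, n_patches)
    centers = points[center_idx]
    # kNN grouping: each patch is the patch_size nearest points to its center
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)
    knn_idx = np.argsort(d, axis=1)[:, :patch_size]
    patches = points[knn_idx]                    # (n_patches, patch_size, 3)
    # normalize each patch to its center: the tokenizer sees local shape only
    return patches - centers[:, None, :], centers
```

Each normalized patch would then be embedded into a token (e.g. by a small point network), and the resulting token sequence processed by a transformer.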

PointPatchRL + Aux

We build on the strong baseline provided by PointPatchRL by adding the auto-regressive masked reconstruction loss used in PointGPT. This yields even greater sample efficiency on complex manipulation tasks with multiple (potentially moving) cameras. PointPatchRL + Aux outperforms both point cloud-based and image-based baselines.

PointPatchRL + Aux is superior to point cloud-based and image-based baselines
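
Since points within a patch carry no canonical ordering, reconstruction quality is typically scored with a set-to-set metric such as Chamfer distance rather than a pointwise error. A minimal numpy sketch of that objective is below; the function names are illustrative, and in the actual auto-regressive setup each patch would be predicted from the tokens preceding it in the sequence.

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between point sets (n, 3) and (m, 3)."""
    d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    # each predicted point to its nearest target, and vice versa
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def reconstruction_loss(predicted_patches, true_patches):
    """Average Chamfer distance over per-patch reconstructions."""
    return float(np.mean([chamfer_distance(p, t)
                          for p, t in zip(predicted_patches, true_patches)]))
```

The loss is zero exactly when every predicted patch matches its target as a set, which is the property that lets it supervise unordered point predictions.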

Policy Videos

Our method outperforms point cloud-based and image-based baselines on 6 challenging manipulation tasks. The tasks contain either deformable objects or geometric variations across a set of rigid objects.

ThreadInHole
DeflectSpheres
PushChair
OpenCabinetDrawer
OpenCabinetDoor
TurnFaucet

Diverse Policies

Agents trained with PPRL + Aux adapt to varying geometries, including handle size and orientation, and whether the door opens to the left or right. The policy coordinates the movements of the gripper and the base and generalizes over varying object geometry.

BibTeX

@inproceedings{gyenes2024pointpatchrl,
  title={PointPatch{RL} - Masked Reconstruction Improves Reinforcement Learning on Point Clouds},
  author={Bal{\'a}zs Gyenes and Nikolai Franke and Philipp Becker and Gerhard Neumann},
  booktitle={8th Annual Conference on Robot Learning},
  year={2024},
  url={https://openreview.net/forum?id=3jNEz3kUSl}
}