Perceiver-Actor

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Mohit Shridhar, Lucas Manuelli, Dieter Fox
CoRL 2022

PerAct is an end-to-end behavior cloning agent that learns to perform a wide variety of language-conditioned manipulation tasks. PerAct uses a Transformer that exploits the 3D structure of voxel patches to learn policies with just a few demonstrations per task.

The best entry point for understanding PerAct is this Colab Tutorial. If you just want to apply PerAct to your problem, start with the notebook; otherwise, this repo is mostly for reproducing the RLBench results from the paper.

For the latest updates, see: peract.github.io

Guides

Hotfix 🔥

  • Training Speed-Up and Storage Reduction: Ishika found that switching from fp32 to fp16 for storing pickle files dramatically speeds up training and significantly reduces storage usage. Check out her modifications to YARR here; a rough sketch of the idea is shown below.
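
For intuition, here is a minimal, hypothetical sketch of the fp16 trick (the real change lives in the linked YARR fork; the helper names here are made up for illustration):

import pickle
import numpy as np

def save_transition_fp16(transition, path):
    # Hypothetical helper: downcast float arrays to fp16 before pickling to roughly halve storage.
    compact = {k: (v.astype(np.float16)
                   if isinstance(v, np.ndarray) and v.dtype in (np.float32, np.float64) else v)
               for k, v in transition.items()}
    with open(path, 'wb') as f:
        pickle.dump(compact, f)

def load_transition_fp32(path):
    # Hypothetical helper: upcast back to fp32 when reading for training.
    with open(path, 'rb') as f:
        data = pickle.load(f)
    return {k: (v.astype(np.float32)
                if isinstance(v, np.ndarray) and v.dtype == np.float16 else v)
            for k, v in data.items()}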

Installation

Prerequisites

PerAct is built off the ARM repository by James et al. The prerequisites are the same as for ARM.

1. Environment

# setup a virtualenv with whichever package manager you prefer
virtualenv -p $(which python3.8) --system-site-packages peract_env  
source peract_env/bin/activate
pip install --upgrade pip

2. PyRep and Coppelia Simulator

Follow instructions from the official PyRep repo; reproduced here for convenience:

PyRep requires version 4.1 of CoppeliaSim. Download it from the Coppelia Robotics website for your Ubuntu version.

Once you have downloaded CoppeliaSim, you can pull PyRep from git:

cd <install_dir>
git clone https://github.com/stepjam/PyRep.git
cd PyRep

Add the following to your ~/.bashrc file (NOTE: replace the 'EDIT ME' placeholder in the first line with your actual install path):

export COPPELIASIM_ROOT=<EDIT ME>/PATH/TO/COPPELIASIM/INSTALL/DIR
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT

Remember to source your bashrc (source ~/.bashrc) or zshrc (source ~/.zshrc) after this.

Warning: CoppeliaSim might cause conflicts with ROS workspaces.

Finally install the python library:

pip install -r requirements.txt
pip install .

You should be good to go! You could try running one of the examples in the examples/ folder.

If you encounter errors, please use the PyRep issue tracker.

3. RLBench

PerAct uses my RLBench fork.

cd <install_dir>
git clone -b peract https://github.com/MohitShridhar/RLBench.git # note: 'peract' branch

cd RLBench
pip install -r requirements.txt
python setup.py develop

For running in headless mode, task setups, and other issues, please refer to the official repo.

4. YARR

PerAct uses my YARR fork.

cd <install_dir>
git clone -b peract https://github.com/MohitShridhar/YARR.git # note: 'peract' branch

cd YARR
pip install -r requirements.txt
python setup.py develop

PerAct Repo

Clone:

cd <install_dir>
git clone https://github.com/peract/peract.git

Install:

cd peract
pip install git+https://github.com/openai/CLIP.git
pip install -r requirements.txt

export PERACT_ROOT=$(pwd)  # mostly used as a reference point for tutorials
python setup.py develop

Note: You might need to install versions of torch==1.7.1 and torchvision==0.8.2 that are compatible with your CUDA version and hardware. Later versions should also work (in theory).
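
A quick, illustrative sanity check that the installed builds actually see your GPU:

# Print the installed versions and whether CUDA is visible to PyTorch.
import torch, torchvision
print(torch.__version__, torchvision.__version__)   # e.g. 1.7.1 and 0.8.2
print(torch.version.cuda, torch.cuda.is_available())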

Quickstart

A quick tutorial on evaluating a pre-trained multi-task agent.

Download a pre-trained PerAct checkpoint trained with 100 demos per task (18 tasks in total):

cd $PERACT_ROOT
sh scripts/quickstart_download.sh

Generate a small val set of 10 episodes for open_drawer inside $PERACT_ROOT/data:

cd <install_dir>/RLBench/tools
python dataset_generator.py --tasks=open_drawer \
                            --save_path=$PERACT_ROOT/data/val \
                            --image_size=128,128 \
                            --renderer=opengl \
                            --episodes_per_task=10 \
                            --processes=1 \
                            --all_variations=True

This will take a few minutes to finish.

Evaluate the pre-trained PerAct agent:

cd $PERACT_ROOT
CUDA_VISIBLE_DEVICES=0 python eval.py \
    rlbench.tasks=[open_drawer] \
    rlbench.task_name='multi' \
    rlbench.demo_path=$PERACT_ROOT/data/val \
    framework.gpu=0 \
    framework.logdir=$PERACT_ROOT/ckpts/ \
    framework.start_seed=0 \
    framework.eval_envs=1 \
    framework.eval_from_eps_number=0 \
    framework.eval_episodes=10 \
    framework.csv_logging=True \
    framework.tensorboard_logging=True \
    framework.eval_type='last' \
    rlbench.headless=False

If you are on a headless machine, turn off the visualization by setting rlbench.headless=True.

You can evaluate the same agent on other tasks. First generate a validation dataset like above (or download a pre-generated dataset) and then run eval.py.

Note: The downloaded checkpoint is not necessarily the best one for a given task; it's simply the last checkpoint from training.

Download

Pre-Generated Datasets

We provide pre-generated RLBench demonstrations for the train (100 episodes), validation (25 episodes), and test (25 episodes) splits used in the paper. If you use these datasets directly, you don't need to run tools/dataset_generator.py from RLBench. Using these datasets also helps with reproducibility, since each scene is randomly sampled in dataset_generator.py.

Is there one big zip file with all splits and tasks instead of individual files? No. My gDrive account will get rate-limited if everyone directly downloads huge files. I recommend downloading through rclone with the Google API Console enabled. The full dataset of zip files is ~116GB.

Pre-Trained Checkpoints

  • ID: seed0
  • Num Tasks: 18
  • Training Demos: 100 episodes per task (each task includes all variations)
  • Training Iterations: 600k
  • Voxel Size: 100x100x100
  • Cameras: front, left_shoulder, right_shoulder, wrist
  • Latents: 2048
  • Self-Attention Layers: 6
  • Voxel Feature Dim: 64
  • Data Augmentation: 45 deg yaw perturbations

  • ID: seed5
  • Num Tasks: 18
  • Training Demos: 100 episodes per task (each task includes all variations)
  • Training Iterations: 600k
  • Voxel Size: 100x100x100
  • Cameras: front, left_shoulder, right_shoulder, wrist
  • Latents: 512
  • Self-Attention Layers: 6
  • Voxel Feature Dim: 64

See quickstart guide on how to evaluate these checkpoints. Make sure framework.start_seed is set to the correct ID.

Data Generation

Data generation is pretty similar to the ARM setup, except you use --all_variations=True to sample all task variations:

cd <install_dir>/RLBench/tools
python dataset_generator.py --tasks=open_drawer \
                            --save_path=$PERACT_ROOT/data/train \
                            --image_size=128,128 \
                            --renderer=opengl \
                            --episodes_per_task=100 \
                            --processes=1 \
                            --all_variations=True

You can run these in parallel for multiple tasks. Here is the list of 18 tasks used in the paper (in the same order as the results in Table 1):

open_drawer
slide_block_to_color_target
sweep_to_dustpan_of_size
meat_off_grill
turn_tap
put_item_in_drawer
close_jar
reach_and_drag
stack_blocks
light_bulb_in
put_money_in_safe
place_wine_at_rack_location
put_groceries_in_cupboard
place_shape_in_shape_sorter
push_buttons
insert_onto_square_peg
stack_cups
place_cups

You can probably train PerAct on more RLBench tasks. These 18 tasks were hand-selected for their diversity in task variations and language instructions.

Warning: Each scene generated with dataset_generator.py uses a different random seed to configure objects and states in the scene. This means you will get very different train, val, and test sets from the pre-generated ones. This should be fine for PerAct, but you will likely see small differences in evaluation performance. It's recommended to use the pre-generated datasets for reproducibility. Using larger test sets will also help.

Training and Evaluation

The following is a guide for training everything from scratch. Everything follows a 4-phase workflow:

  1. Generate train, val, and test datasets with dataset_generator.py, or download the pre-generated datasets.
  2. Train the agent with train.py, saving checkpoints every 10K iterations.
  3. Run validation with eval.py and framework.eval_type=missing to find the best checkpoint on the val tasks; results are saved in eval_data.csv.
  4. Evaluate the best checkpoint from eval_data.csv on the test tasks with eval.py and framework.eval_type=best; final results are saved in test_data.csv.

Make sure you have a train, val, and test set with sufficient demos for the tasks you want to train and evaluate on.

Training

Train a PERACT_BC agent with 100 demos per task for 600K iterations with 8 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py \
    method=PERACT_BC \
    rlbench.tasks=[close_jar,insert_onto_square_peg,light_bulb_in,meat_off_grill,open_drawer,place_cups,place_shape_in_shape_sorter,push_buttons,put_groceries_in_cupboard,put_item_in_drawer,put_money_in_safe,reach_and_drag,stack_blocks,stack_cups,turn_tap,place_wine_at_rack_location,slide_block_to_color_target,sweep_to_dustpan_of_size] \
    rlbench.task_name='multi_18T' \
    rlbench.cameras=[front,left_shoulder,right_shoulder,wrist] \
    rlbench.demos=100 \
    rlbench.demo_path=$PERACT_ROOT/data/train \
    replay.batch_size=1 \
    replay.path=/tmp/replay \
    replay.max_parallel_processes=32 \
    method.voxel_sizes=[100] \
    method.voxel_patch_size=5 \
    method.voxel_patch_stride=5 \
    method.num_latents=2048 \
    method.transform_augmentation.apply_se3=True \
    method.transform_augmentation.aug_rpy=[0.0,0.0,45.0] \
    method.pos_encoding_with_lang=True \
    framework.training_iterations=600000 \
    framework.num_weights_to_keep=60 \
    framework.start_seed=0 \
    framework.log_freq=1000 \
    framework.save_freq=10000 \
    framework.logdir=$PERACT_ROOT/logs/ \
    framework.csv_logging=True \
    framework.tensorboard_logging=True \
    ddp.num_devices=8

Make sure there is enough disk-space for replay.path and framework.logdir. Adjust replay.max_parallel_processes to fill the replay buffer in parallel based on your resources. You can also train on fewer GPUs, but training will take a long time to converge.

To get started, you should probably train on a small number of rlbench.tasks.

Use tensorboard to monitor training progress with logs inside framework.logdir.
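
To make the voxel-related hyperparameters above concrete, here is a rough, illustrative PyTorch sketch (not PerAct's actual implementation; the input channel count and latent dimension below are made up): a 100^3 voxel grid split into 5x5x5 patches with stride 5 gives 20^3 = 8000 patch tokens of dim 64 (the "Voxel Feature Dim"), and a fixed set of 2048 latents cross-attends over them, Perceiver-style.

import torch
import torch.nn as nn

voxel_size, patch_size, stride = 100, 5, 5       # method.voxel_sizes, voxel_patch_size, voxel_patch_stride
in_channels, feat_dim = 10, 64                   # in_channels is made up; feat_dim matches 'Voxel Feature Dim: 64'
num_latents, latent_dim = 2048, 512              # num_latents matches method.num_latents; latent_dim is made up

voxel_grid = torch.randn(1, in_channels, voxel_size, voxel_size, voxel_size)

# Patchify with a strided 3D conv: (1, 64, 20, 20, 20) -> 8000 tokens of dim 64.
patchify = nn.Conv3d(in_channels, feat_dim, kernel_size=patch_size, stride=stride)
tokens = patchify(voxel_grid).flatten(2).permute(2, 0, 1)     # (8000, 1, 64)

# Perceiver-style cross-attention: a small set of latents attends over all patch tokens.
latents = torch.randn(num_latents, 1, latent_dim)             # (2048, 1, 512)
cross_attn = nn.MultiheadAttention(latent_dim, num_heads=1, kdim=feat_dim, vdim=feat_dim)
out, _ = cross_attn(latents, tokens, tokens)
print(tokens.shape, out.shape)    # torch.Size([8000, 1, 64]) torch.Size([2048, 1, 512])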

Validation

Evaluate PERACT_BC seed0 on 18 val tasks sequentially (slow!):

CUDA_VISIBLE_DEVICES=0 python eval.py \
    rlbench.tasks=[close_jar,insert_onto_square_peg,light_bulb_in,meat_off_grill,open_drawer,place_cups,place_shape_in_shape_sorter,push_buttons,put_groceries_in_cupboard,put_item_in_drawer,put_money_in_safe,reach_and_drag,stack_blocks,stack_cups,turn_tap,place_wine_at_rack_location,slide_block_to_color_target,sweep_to_dustpan_of_size] \
    rlbench.task_name='multi_18T' \
    rlbench.demo_path=$PERACT_ROOT/data/val \
    framework.logdir=$PERACT_ROOT/logs/ \
    framework.csv_logging=True \
    framework.tensorboard_logging=True \
    framework.eval_envs=4 \
    framework.start_seed=0 \
    framework.eval_from_eps_number=0 \
    framework.eval_episodes=25 \
    framework.eval_type='missing' \
    rlbench.headless=True

This script will slowly go through each 10K-interval checkpoint and save success rates in eval_data.csv. To evaluate checkpoints in parallel, use framework.eval_envs to start multiple processes.
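
If you want to pick the best checkpoint programmatically rather than eyeballing the CSV, something like the sketch below works. The column names are assumptions, so check them against your actual eval_data.csv header:

import pandas as pd

df = pd.read_csv('eval_data.csv')
# Assumed layout: one row per evaluated checkpoint, a 'step' column, and one success-rate column per task.
task_cols = [c for c in df.columns if c != 'step']
df['mean_success'] = df[task_cols].mean(axis=1)
best = df.loc[df['mean_success'].idxmax()]
print('best checkpoint:', int(best['step']), 'mean success:', best['mean_success'])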

Testing

CUDA_VISIBLE_DEVICES=0 python eval.py \
    rlbench.tasks=[close_jar,insert_onto_square_peg,light_bulb_in,meat_off_grill,open_drawer,place_cups,place_shape_in_shape_sorter,push_buttons,put_groceries_in_cupboard,put_item_in_drawer,put_money_in_safe,reach_and_drag,stack_blocks,stack_cups,turn_tap,place_wine_at_rack_location,slide_block_to_color_target,sweep_to_dustpan_of_size] \
    rlbench.task_name='multi_18T' \
    rlbench.demo_path=$PERACT_ROOT/data/test \
    framework.logdir=$PERACT_ROOT/logs/ \
    framework.csv_logging=True \
    framework.tensorboard_logging=True \
    framework.eval_envs=1 \
    framework.start_seed=0 \
    framework.eval_from_eps_number=0 \
    framework.eval_episodes=25 \
    framework.eval_type='best' \
    rlbench.headless=True

The final results will be saved in test_data.csv.

Baselines and Ablations

All agents reported in the paper are here along with their respective config files:

Code Name            Paper Name
PERACT_BC            PerAct
C2FARM_LINGUNET_BC   C2FARM-BC
VIT_BC_LANG          Image-BC (VIT)
BC_LANG              Image-BC (CNN)

PerAct ablations are set with:

method.no_skip_connection: False
method.no_perceiver: False
method.no_language: False
method.keypoint_method: 'heuristic'

Gotchas

OpenGL Errors

GL errors are probably being caused by the PyRender voxel visualizer. See this issue for reference. You might have to set the following environment variables depending on your setup:

export DISPLAY=:0
export MESA_GL_VERSION_OVERRIDE=4.1
export PYOPENGL_PLATFORM=egl

Unpickling Error

If you see _pickle.UnpicklingError: invalid load key, '\x9e', one of the replay pickle files probably got corrupted when the training script was interrupted. Try deleting the files in replay.path and restarting training.
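
For example, to clear a stale replay cache before restarting (adjust the path to whatever you set replay.path to; /tmp/replay is the value used in the training command above):

import shutil
# Delete cached replay pickles; they are re-created when training restarts.
shutil.rmtree('/tmp/replay', ignore_errors=True)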

Recording Videos

To save high-resolution videos of agent executions, set cinematic_recorder.enabled=True with eval.py:

cd $PERACT_ROOT
CUDA_VISIBLE_DEVICES=0 python eval.py \
    rlbench.tasks=[open_drawer] \
    rlbench.task_name='multi' \
    rlbench.demo_path=$PERACT_ROOT/data/val \
    framework.gpu=0 \
    framework.logdir=$PERACT_ROOT/ckpts/ \
    framework.start_seed=0 \
    framework.eval_envs=1 \
    framework.eval_from_eps_number=0 \
    framework.eval_episodes=3 \
    framework.csv_logging=True \
    framework.tensorboard_logging=True \
    framework.eval_type='last' \
    rlbench.headless=True \
    cinematic_recorder.enabled=True

Videos will be saved at $PERACT_ROOT/ckpts/multi/PERACT_BC/seed0/videos/open_drawer_w600000_s0_succ.mp4.

Note: Rendering at high resolution is super slow and will take a long time to finish.

Disclaimers and Limitations

  • Code quality level: Desperate grad student.
  • Why isn't your code more modular?: My code, like this project, is end-to-end.
  • Small test set: The test set should be larger than just 25 episodes. If you parallelize the evaluation, you can easily evaluate on larger test sets and do multiple runs with different seeds.
  • Parallelization: A lot of things (data generation, evaluation) are slow because everything is done serially. Parallelizing these processes will save you a lot of time.
  • Impossible tasks: Some tasks like push_buttons are not solvable by PerAct since it doesn't have any memory.
  • Switch from DP to DDP: For the paper submission, I was using PyTorch DataParallel for multi-gpu training. For this code release, I switched to DistributedDataParallel. Hopefully, I didn't introduce any new bugs.
  • Collision avoidance: All simulated evaluations use V-REP's internal motion planner with collision avoidance. For real-world experiments, you have to set up MoveIt to use the voxel grid for avoiding occupied voxels.
  • YARR Modifications: My changes to the YARR repo are a total mess. Sorry :(
  • LAMB Optimizer: The LAMB implementation has some issues but still works 🤷. Maybe use FusedLAMB instead.
  • Other limitations: See Appendix L of the paper for more details.

FAQ

How much training data do I need for real-world tasks?

It depends on the complexity of the task. With 10-20 demonstrations, the agent should start to do something useful, but it will often make mistakes like picking the wrong object. For robustness you probably need 50-100 demonstrations. A good way to gauge how much data you might need is to set up a simulated version of the problem and evaluate agents trained with 10, 100, and 250 demonstrations.

How long should I train the agent for? When will I start seeing good evaluation performance?

This depends on the number, complexity, and diversity of tasks, and also how much compute you have. Take a look at this checkpoint folder containing train_data.csv, eval_data.csv and test_data.csv. These log files should give you a sense of what the training losses look like and what evaluation performances to expect. All multi-task agents in the paper were trained for 600K iterations, and single-task agents were trained for 40K iterations, all with 8-GPU setups.

Why doesn't the agent follow my language instruction?

This usually means either there is some sort of bias in the dataset that the agent is exploiting (e.g. always 'blue blocks'), or you don't have enough training data. Also make sure that the task is doable: if a referred attribute is barely legible in the voxel grid, it's going to be hard for the agent to figure out what you mean.

How to pick the best checkpoint for real-robot tasks?

Ideally, you should create a validation set with held-out instances and then choose the checkpoint with the lowest translation and rotation errors. You can also reuse the training instances but swap the language instructions with unseen goals. That said, all real-world experiments in the paper simply used the last checkpoint.
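
For reference, one simple way to compute translation and rotation errors between a predicted and a ground-truth gripper pose (positions in metres, orientations as unit quaternions). This is an illustrative helper, not the repo's validation code:

import numpy as np

def pose_errors(pred_pos, pred_quat, gt_pos, gt_quat):
    # Translation error: Euclidean distance between predicted and ground-truth positions.
    trans_err = np.linalg.norm(np.asarray(pred_pos) - np.asarray(gt_pos))
    # Rotation error: angle between the two quaternions (abs dot handles the q / -q double cover).
    q1 = np.asarray(pred_quat, dtype=float); q1 /= np.linalg.norm(q1)
    q2 = np.asarray(gt_quat, dtype=float);   q2 /= np.linalg.norm(q2)
    rot_err_deg = np.degrees(2.0 * np.arccos(np.clip(abs(np.dot(q1, q2)), -1.0, 1.0)))
    return trans_err, rot_err_deg

# Example: 2 cm translation error and a ~10 degree rotation about z.
print(pose_errors([0.50, 0.10, 0.30], [0, 0, 0.0872, 0.9962],
                  [0.52, 0.10, 0.30], [0, 0, 0.0, 1.0]))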

Can you replace the motion-planner with a learnable module?

Yes, see C2FARM+LPR by James et al.

Why do I need to generate a val and test set?

Two reasons: (1) One-to-one comparisons between two agents. We can take an episode from the test dataset, and use its random seed to spawn the exact same objects and object pose configurations every time. (2) Checking if the task is actually solvable, at least by an expert. We don't want to evaluate on unsolvable task instances. See issue3 for reference.
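
The seeding idea itself is simple. As a toy analogy (RLBench handles this internally), fixing the seed makes the sampled scene configuration repeatable, so two agents can be evaluated on the exact same episode:

import numpy as np

def sample_scene(seed):
    # Stand-in for sampling object poses for an episode; not RLBench's actual sampling code.
    rng = np.random.default_rng(seed)
    return rng.uniform(-0.25, 0.25, size=3)   # e.g. an object's (x, y, z) offset

print(sample_scene(seed=42))
print(sample_scene(seed=42))   # identical: same seed -> same scene configuration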

Why are duplicate keyframes loaded into the replay buffer?

This is a design choice in ARM (by James et al). I am guessing the keyframes get added several times because they indicate important "phase transitions" between trajectory bottlenecks, and having several copies makes them more likely to be sampled. See issue6.
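
As a toy illustration of the effect (not the actual ARM/YARR replay code): uniform sampling from a buffer that contains a keyframe k times draws it roughly k times as often as any single ordinary step.

import random

buffer = ['step_0', 'step_1', 'keyframe', 'keyframe', 'keyframe', 'step_2']
counts = {name: 0 for name in set(buffer)}
for _ in range(60000):
    counts[random.choice(buffer)] += 1
print(counts)   # 'keyframe' is drawn ~3x as often as any single non-keyframe step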

The training is too slow and the replay pickle files take up too much space. What should I do about this?

Ishika found that switching from fp32 to fp16 for storing pickle files dramatically speeds up training and significantly reduces storage usage. Check out her modifications to YARR here.

Will you release your real-robot code for data-collection and execution?

Check out franka_htc_teleop.zip for real-robot code: peract_demo_interface.py is for collecting data, and peract_agent_interface.py is for executing trained models. The real-robot datasets are here. See issue18 for more details on the setup, and issue2 for additional real-world details.

Docker Guide

Coming soon...

Notebooks

  • Colab Tutorial: This tutorial is a good starting point for understanding the data-loading and training pipeline.
  • Dataset Visualizer: Coming soon ... see Colab for now.
  • Q-Prediction Visualizer: Coming soon ... see Colab for now.
  • Results Notebook: Coming soon ...

Hardware Requirements

PerAct agents for the paper were trained with 8 P100 cards with 16GB of memory each. You can use fewer GPUs, but training will take a long time to converge.

Tested with:

  • GPU - NVIDIA P100
  • CPU - Intel Xeon (Quad Core)
  • RAM - 32GB
  • OS - Ubuntu 16.04, 18.04

For inference, a single GPU is sufficient.

Acknowledgements

This repository uses code from the following open-source projects:

ARM

Original: https://github.com/stepjam/ARM
License: ARM License
Changes: Data loading was modified for PerAct. Voxelization code was modified for DDP training.

PerceiverIO

Original: https://github.com/lucidrains/perceiver-pytorch
License: MIT
Changes: PerceiverIO adapted for 6-DoF manipulation.

ViT

Original: https://github.com/lucidrains/vit-pytorch
License: MIT
Changes: ViT adapted for baseline.

LAMB Optimizer

Original: https://github.com/cybertronai/pytorch-lamb
License: MIT
Changes: None.

OpenAI CLIP

Original: https://github.com/openai/CLIP
License: MIT
Changes: Minor modifications to extract token and sentence features.

Thanks for open-sourcing!

Licenses

Release Notes

Update 23-Nov-2022

  • I ditched PyTorch Lightning and implemented multi-GPU training directly with PyTorch DDP. I could have introduced some bugs during this transition and from refactoring the repo in general.

Update 31-Oct-2022:

  • I have pushed my changes to RLBench and YARR. The data generation is pretty similar to ARM, except you run dataset_generator.py with --all_variations=True. You should be able to use these generated datasets with the Colab code.
  • For the paper, I was using PyTorch DataParallel to train on multiple GPUs. This made the code very messy and brittle. I am currently stuck cleaning this up with DDP and PyTorch Lightning. So the code release might be a bit delayed. Apologies.

Citations

PerAct

@inproceedings{shridhar2022peract,
  title     = {Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation},
  author    = {Shridhar, Mohit and Manuelli, Lucas and Fox, Dieter},
  booktitle = {Proceedings of the 6th Conference on Robot Learning (CoRL)},
  year      = {2022},
}

C2FARM

@inproceedings{james2022coarse,
  title={Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation},
  author={James, Stephen and Wada, Kentaro and Laidlow, Tristan and Davison, Andrew J},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13739--13748},
  year={2022}
}

PerceiverIO

@article{jaegle2021perceiver,
  title={Perceiver io: A general architecture for structured inputs \& outputs},
  author={Jaegle, Andrew and Borgeaud, Sebastian and Alayrac, Jean-Baptiste and Doersch, Carl and Ionescu, Catalin and Ding, David and Koppula, Skanda and Zoran, Daniel and Brock, Andrew and Shelhamer, Evan and others},
  journal={arXiv preprint arXiv:2107.14795},
  year={2021}
}

RLBench

@article{james2020rlbench,
  title={Rlbench: The robot learning benchmark \& learning environment},
  author={James, Stephen and Ma, Zicong and Arrojo, David Rovick and Davison, Andrew J},
  journal={IEEE Robotics and Automation Letters},
  volume={5},
  number={2},
  pages={3019--3026},
  year={2020},
  publisher={IEEE}
}

Questions or Issues?

Please file an issue with the issue tracker.

About

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

License: Apache License 2.0

