RLDS Dataset Conversion

This repo demonstrates how to convert an existing dataset into RLDS format for X-embodiment experiment integration. It provides an example for converting a dummy dataset to RLDS. To convert your own dataset, fork this repo and modify the example code for your dataset following the steps below.

Installation

First create a conda environment using the provided environment.yml file (use environment_ubuntu.yml or environment_macos.yml depending on the operating system you're using):

conda env create -f environment_ubuntu.yml

Then activate the environment using:

conda activate rlds_env

If you want to manually create an environment, the key packages to install are tensorflow, tensorflow_datasets, tensorflow_hub, matplotlib, plotly and wandb.

Run Example RLDS Dataset Creation

Before modifying the code to convert your own dataset, run the provided example dataset creation script to ensure everything is installed correctly. Run the following lines to create some dummy data and convert it to RLDS.

pip3 install -e .
cd example_dataset
python3 create_example_data.py
tfds build

This should create a new dataset in ~/tensorflow_datasets/example_dataset. Please verify that the example conversion worked before moving on.
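
As a quick sanity check, the output directory can also be inspected from Python. This is a minimal sketch using only the standard library. The helper name `find_tfrecords` is hypothetical, and the sketch assumes the default `~/tensorflow_datasets` output location mentioned above and that the generated shard file names contain `.tfrecord`.

```python
from pathlib import Path


def find_tfrecords(dataset_name, data_dir=None):
    """Return all files under <data_dir>/<dataset_name> whose name contains '.tfrecord'."""
    # Assumption: the default output location stated above is used unless overridden.
    root = Path(data_dir) if data_dir else Path.home() / "tensorflow_datasets"
    dataset_root = root / dataset_name
    if not dataset_root.is_dir():
        return []
    return sorted(p for p in dataset_root.rglob("*.tfrecord*") if p.is_file())


if __name__ == "__main__":
    shards = find_tfrecords("example_dataset")
    print(f"found {len(shards)} tfrecord file(s)")
    for shard in shards:
        print(" ", shard)
```

An empty result suggests the build did not finish or wrote its output somewhere else.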

Converting your Own Dataset to RLDS

Now we can modify the provided example to convert your own data. Follow the steps below:

Rename Dataset: Rename the dataset folder from example_dataset to the name of your dataset (e.g. robo_net_v2). Also rename example_dataset_dataset_builder.py by replacing example_dataset with your dataset's name (e.g. robo_net_v2_dataset_builder.py), and change the class name ExampleDataset in the same file to match your dataset's name, using camel case instead of underscores (e.g. RoboNetV2).
Modify Features: Modify the data fields you plan to store in the dataset. You can find them in the _info() method of the ExampleDataset class. Please add all data fields your raw data contains, i.e. add features for additional cameras, audio, tactile sensing etc. If your type of feature is not demonstrated in the example (e.g. audio), you can find a list of all supported feature types here. You can store step-wise info like camera images and actions in 'steps', and episode-wise info like collector_id in episode_metadata. Please don't remove any of the existing features in the example (except for wrist_image and state), since they are required for RLDS compliance. Please add detailed documentation of what each feature consists of (e.g. the dimensions of the action space). Note that we store language_instruction in every step, even though it is episode-wide information, for easier downstream usage. If your dataset does not define language instructions, you can fill in a dummy string like pick up something.
Modify Dataset Splits: The function _split_paths() determines the splits of the generated dataset (e.g. training, validation etc.). If your dataset defines a train vs validation split, please provide the corresponding file paths, e.g. by pointing to the corresponding folders (like in the example). If your dataset does not define splits, remove the val split and only include the train split.
Modify Dataset Conversion Code: Next, modify the function _generate_examples(). Here, your own raw data should be loaded, filled into the episode steps and then yielded as a packaged example. Your iterator can yield multiple examples for each input file path. Note that the value of the first return argument, episode_path in the example, is only used as a sample ID in the dataset. It can be set to any value that is connected to the particular stored episode, or to any other random value, as long as no ID is used twice.
Provide Dataset Description: Next, add a bibtex citation for your dataset in CITATIONS.bib and add a short description of your dataset in README.md inside the dataset folder. You can also provide a link to the dataset website. Please add a few example trajectory images from the dataset for visualization.
Add Appropriate License: Please add an appropriate license to the repository. Most common is the CC BY 4.0 license -- you can copy it from here.
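
The conversion step described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the repo's actual _generate_examples(): the helper names `parse_episode` and `generate_examples`, the raw-data layout, the reward convention and the exact feature keys are assumptions, so match them to whatever you declared in _info(). It shows the per-step flags, the repeated language_instruction, and a guard against duplicate sample IDs.

```python
def parse_episode(episode_path, raw_steps, instruction="pick up something"):
    """Package one raw episode as an (id, sample) pair.

    raw_steps is a hypothetical list of dicts with 'image', 'state' and 'action'.
    """
    last = len(raw_steps) - 1
    steps = []
    for i, raw in enumerate(raw_steps):
        steps.append({
            "observation": {"image": raw["image"], "state": raw["state"]},
            "action": raw["action"],
            "discount": 1.0,
            # Assumption: sparse reward on the final step of a demonstration.
            "reward": float(i == last),
            "is_first": i == 0,
            "is_last": i == last,
            "is_terminal": i == last,
            # Episode-wide information, repeated in every step.
            "language_instruction": instruction,
        })
    # Assumption: the source file path is kept as episode-level metadata.
    sample = {"steps": steps, "episode_metadata": {"file_path": episode_path}}
    # The first element is only used as a sample ID.
    return episode_path, sample


def generate_examples(raw_episodes):
    """Yield (id, sample) pairs and fail loudly if an ID repeats."""
    seen = set()
    for episode_path, raw_steps in raw_episodes:
        sample_id, sample = parse_episode(episode_path, raw_steps)
        if sample_id in seen:
            raise ValueError(f"duplicate sample ID: {sample_id}")
        seen.add(sample_id)
        yield sample_id, sample
```

If one input file holds several episodes, appending an index to the path (for example `f"{episode_path}_{k}"`) is one simple way to keep the IDs unique.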

That's it! You're all set to run dataset conversion. Before starting the processing, install your dataset package: change example_dataset to the name of your dataset in setup.py and run pip install -e . from the base directory. Then, make sure that no GPUs are used during data processing (export CUDA_VISIBLE_DEVICES=) and inside the dataset directory, run:

tfds build --overwrite

The command line output should finish with a summary of the generated dataset (including size and number of samples). Please verify that this output looks as expected and that you can find the generated tfrecord files in ~/tensorflow_datasets/<name_of_your_dataset>.
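
If you prefer to script the conversion, the GPU-free invocation above can be wrapped in a few lines of Python. This is only a sketch: the helper names `cpu_only_env` and `run_conversion` are hypothetical, and it assumes the tfds command is on your PATH inside the activated environment.

```python
import os
import subprocess


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all GPUs hidden."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # same effect as `export CUDA_VISIBLE_DEVICES=`
    return env


def run_conversion(dataset_dir):
    """Run `tfds build --overwrite` inside dataset_dir with GPUs hidden."""
    subprocess.run(
        ["tfds", "build", "--overwrite"],
        cwd=dataset_dir,
        env=cpu_only_env(),
        check=True,  # raise CalledProcessError if the build fails
    )
```

Calling `run_conversion("robo_net_v2")` from the base directory would then correspond to the manual steps above, with robo_net_v2 standing in for your dataset folder.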

Parallelizing Data Processing

By default, dataset conversion uses 10 parallel workers. If you are parsing a large dataset, you can speed up conversion by increasing N_WORKERS in the dataset class. Try to use slightly fewer workers than the number of cores in your machine (run htop in your command line if you don't know how many cores your machine has).
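
The core count can also be read from Python. The sketch below is one possible way to pick a value for N_WORKERS; the helper name and the choice to leave two cores free are assumptions, not a recommendation from this repo.

```python
import os


def suggest_n_workers(n_cores=None, reserve=2):
    """Suggest a worker count that leaves `reserve` cores free (at least 1 worker)."""
    if n_cores is None:
        n_cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, n_cores - reserve)


if __name__ == "__main__":
    print("suggested N_WORKERS:", suggest_n_workers())
```

Note that `os.cpu_count()` reports logical cores, so on machines with hyper-threading the result may be higher than the number of physical cores.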

The dataset value MAX_PATHS_IN_MEMORY controls how many file paths are processed in parallel before the results get written to disk sequentially. As a rule of thumb, a higher value makes dataset conversion faster, but setting it too high will overflow the memory of your machine. Setting it to at least 10-20x the number of workers is usually a good default. You can monitor htop during conversion and reduce the value if your machine runs out of memory.
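
One way to picture the role of MAX_PATHS_IN_MEMORY: paths are handled in chunks of that size, each chunk is parsed in parallel, and its results are written out before the next chunk starts. The sketch below illustrates that pattern only. The function names are hypothetical and this is not the repo's implementation.

```python
def chunked(paths, max_paths_in_memory):
    """Split paths into consecutive chunks of at most max_paths_in_memory items."""
    if max_paths_in_memory < 1:
        raise ValueError("max_paths_in_memory must be >= 1")
    for start in range(0, len(paths), max_paths_in_memory):
        yield paths[start:start + max_paths_in_memory]


def convert_in_chunks(paths, parse_fn, write_fn, max_paths_in_memory, map_fn=map):
    """Parse one chunk at a time, then write its results sequentially.

    map_fn can be swapped for a worker pool's map, e.g. multiprocessing.Pool(n).map.
    """
    for chunk in chunked(paths, max_paths_in_memory):
        results = list(map_fn(parse_fn, chunk))  # only one chunk's results in memory
        for result in results:
            write_fn(result)


N_WORKERS = 10
MAX_PATHS_IN_MEMORY = 15 * N_WORKERS  # illustrative value inside the 10-20x range
```

Larger chunks keep the workers busy for longer between writes, which is why a higher value tends to be faster until memory becomes the limit.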

Visualize Converted Dataset

To verify that the data is converted correctly, please run the data visualization script from the base directory:

python3 visualize_dataset.py <name_of_your_dataset>

This will display a few random episodes from the dataset with language commands and visualize action and state histograms per dimension. Note, if you are running on a headless server you can modify WANDB_ENTITY at the top of visualize_dataset.py and add your own WandB entity -- then the script will log all visualizations to WandB.
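
For a quick text-only check on a headless machine, per-dimension statistics can complement the histograms. The helper below is a hypothetical sketch, not part of visualize_dataset.py. It assumes the actions have already been collected into a list of equal-length numeric sequences.

```python
def per_dimension_stats(actions):
    """Return a list of (min, mean, max) tuples, one per action dimension."""
    if not actions:
        return []
    n_dims = len(actions[0])
    if any(len(a) != n_dims for a in actions):
        raise ValueError("all actions must have the same number of dimensions")
    stats = []
    for dim in zip(*actions):  # transpose: one tuple of values per dimension
        stats.append((min(dim), sum(dim) / len(dim), max(dim)))
    return stats


if __name__ == "__main__":
    demo = [[0.0, 1.0], [2.0, 3.0]]
    for i, (lo, mean, hi) in enumerate(per_dimension_stats(demo)):
        print(f"dim {i}: min={lo} mean={mean} max={hi}")
```

A dimension whose minimum and maximum are identical, or whose range is far outside what the robot can execute, is usually worth a second look in the conversion code.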

Add Transform for Target Spec

For X-embodiment training we are using specific inputs / outputs for the model: input is a single RGB camera, output is an 8-dimensional action, consisting of end-effector position and orientation, gripper open/close and an episode termination action.

The final step in adding your dataset to the training mix is to provide a transform function that maps a step from your original dataset above to the required training spec. Please follow the two simple steps below:

Modify Step Transform: Modify the function transform_step() in example_transform/transform.py. The function takes in a step from your dataset above and is supposed to map it to the desired output spec. The file contains a detailed description of the desired output spec.
Test Transform: We provide a script to verify that the resulting transformed dataset outputs match the desired output spec. Please run the following command: python3 test_dataset_transform.py <name_of_your_dataset>
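
To make the mapping concrete, here is a rough sketch of what a step transform could look like. The authoritative output spec is the one described in example_transform/transform.py. The raw field names below (ee_position, ee_orientation, gripper_closed), the 3 + 3 + 1 + 1 split of the 8 action dimensions, and the use of is_terminal as the termination action are all assumptions made for illustration.

```python
def transform_step(step):
    """Map one raw step to the target spec: one RGB image and an 8-dim action."""
    # Hypothetical raw action fields; replace with the ones your dataset defines.
    position = list(step["action"]["ee_position"])        # 3 values (assumed)
    orientation = list(step["action"]["ee_orientation"])  # 3 values (assumed)
    gripper = [float(step["action"]["gripper_closed"])]   # 1 value
    terminate = [float(step["is_terminal"])]              # 1 value
    action = position + orientation + gripper + terminate
    if len(action) != 8:
        raise ValueError(f"expected an 8-dimensional action, got {len(action)}")
    return {
        # Keep only the single RGB camera required by the target spec.
        "observation": {"image": step["observation"]["image"]},
        "action": action,
    }
```

The explicit length check mirrors what the provided test script is meant to catch, so a mismatch surfaces as soon as the transform is written.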

If the test passes successfully, you are ready to upload your dataset!

Upload Your Data

We provide a Google Cloud bucket that you can upload your data to. First, install gsutil, the Google Cloud command-line tool. You can follow the installation instructions here.

Next, authenticate your Google account with:

gcloud auth login

This will open a browser window that allows you to log into your Google account (if you're on a headless server, you can add the --no-launch-browser flag). Ideally, use the email address that you used to communicate with Karl, since he will automatically grant permission to the bucket for this email address. If you want to upload data with a different email address / Google account, please shoot Karl a quick email asking him to grant permissions to that Google account!

After logging in with a Google account that has access permissions, you can upload your data with the following command:

gsutil -m cp -r ~/tensorflow_datasets/<name_of_your_dataset> gs://xembodiment_data

This will upload all data using multiple threads. If your internet connection gets interrupted at any time during the upload, you can just rerun the command and it will resume the upload where it was interrupted. You can verify that the upload was successful by inspecting the bucket here.

The last step is to commit all changes to this repo and send Karl the link to the repo.

Thanks a lot for contributing your data! :)
