idsia-robotics / leds-as-pretext


Self-Supervised Learning of Visual Robot Localization Using Prediction of LEDs States as a Pretext Task

Mirko Nava, Nicola Armas, Antonio Paolillo, Jerome Guzzi, Luca Maria Gambardella, and Alessandro Giusti

Dalle Molle Institute for Artificial Intelligence, USI-SUPSI, Lugano (Switzerland)

Abstract

We propose a novel self-supervised approach to learn CNNs that perform visual localization of a robot in an image using very small labeled training datasets. Self-supervision is obtained by jointly learning a pretext task, i.e., predicting the state of the LEDs of the target robot. This pretext task is compelling because: a) it indirectly forces the model to learn to locate the target robot in the image in order to determine its LED states; b) it can be trained on large datasets collected in any environment with no external supervision or tracking infrastructure. We instantiate the general approach to a concrete task: visual relative localization of nano-quadrotors. Experimental results on a challenging dataset show that the approach is very effective; compared to a baseline that does not use the proposed pretext task, it reduces the mean absolute localization error by as much as 78% (43 to 9 pixels on x; 28 to 6 pixels on y).

LEDs as Pretext approach

Figure 1: *Overview of our approach. The model is trained to predict: the drone position in the current frame, by minimizing the end loss (L<sub>end</sub>) defined on T<sub>l</sub> (bottom); and the current state of the four drone LEDs, by minimizing the pretext loss (L<sub>pretext</sub>) defined on T<sub>l</sub> and T<sub>u</sub> (top).*
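
As a rough illustration of the two objectives in Figure 1, the sketch below shows one possible way to combine them: a shared CNN backbone with a position head (end task) and a four-output LED-state head (pretext task), where the end loss is computed only on labeled frames and the pretext loss on all frames. The architecture, loss functions, and weighting factor here are assumptions for the sketch, not the authors' implementation; refer to the codebase below for the actual model.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's implementation): shared backbone, two heads.
class LedPretextNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.position_head = nn.Linear(32, 2)  # (x, y) position in pixels
        self.led_head = nn.Linear(32, 4)       # logits for the 4 LED states

    def forward(self, image):
        features = self.backbone(image)
        return self.position_head(features), self.led_head(features)


def joint_loss(pos_pred, pos_true, led_pred, led_true, labeled_mask, lam=1.0):
    """End loss on labeled frames only; pretext loss on all frames.
    `lam` is an assumed weighting factor for the pretext term."""
    # LED-state pretext loss: binary cross-entropy over the 4 LEDs (float targets).
    pretext = nn.functional.binary_cross_entropy_with_logits(led_pred, led_true)
    if labeled_mask.any():
        # End loss: position error on the labeled subset of the batch.
        end = nn.functional.l1_loss(pos_pred[labeled_mask], pos_true[labeled_mask])
    else:
        end = pos_pred.new_zeros(())
    return end + lam * pretext
```

During training, batches can mix labeled and unlabeled frames, with `labeled_mask` marking which samples contribute to the end loss.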

LEDs as Pretext performance

Figure 2: *On the left, comparison of approaches in terms of MAE (lower is better) and R² score (higher is better) for the x and y variables. On the right, comparison of baseline (red), LEDs as a Pretext (green), and Upper Bound (blue) models trained with varying amounts of labels. MAE improvement refers to the percentage reduction in MAE between baseline and our LED-P approach. Results obtained by averaging the performance on the x and y variables.*
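
For reference, the metrics reported in Figure 2 can be computed as in the plain NumPy sketch below (function names are ours, not from the codebase):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error, in the same units as the targets (here, pixels).
    return np.mean(np.abs(y_true - y_pred))

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - residual variance / total variance.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def mae_improvement(mae_baseline, mae_ours):
    # Percentage reduction in MAE of our model relative to the baseline.
    return 100.0 * (mae_baseline - mae_ours) / mae_baseline
```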

Video

Self-Supervised Learning of Visual Robot Localization Using Prediction of LEDs States as a Pretext Task

Code

The codebase is available here.

Dataset

The entire dataset is available here as a zipped HDF5 file containing separate groups for labeled and unlabeled training, validation, and testing sets.
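
Once downloaded and unzipped, the file can be inspected with h5py; the group and dataset names below are placeholders, since the exact layout should be read from the file itself:

```python
import h5py

# "dataset.h5" is a placeholder filename for the unzipped HDF5 file.
with h5py.File("dataset.h5", "r") as f:
    # Print every group and dataset path (labeled/unlabeled train, validation, test).
    f.visit(print)
    # Example access, assuming a group "train_labeled" with an "images" dataset:
    # images = f["train_labeled"]["images"][:]
```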