RAIVNLab / muscleTorch

K Ehsani, D Gordon, T Nguyen, R Mottaghi, A Farhadi

(Project Page) (PDF) (Slides) (Video)

Abstract

Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach, where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in their daily lives. Our experiments show that our self-supervised representation, which encodes interaction and attention cues, outperforms a visual-only state-of-the-art method, MoCo, on a variety of target tasks:

  1. Scene classification (semantic)
  2. Action recognition (temporal)
  3. Depth estimation (geometric)
  4. Dynamics prediction (physics)
  5. Walkable surface estimation (affordance)

Installation

  1. Clone the repository using the command:
git clone https://github.com/ehsanik/muscleTorch
cd muscleTorch
  2. Install requirements:
pip3 install -r requirements.txt
  3. Download the images from here and extract them to HumanDataset/images.
  4. Download the sensor data from here and extract it to HumanDataset/annotation_h5.
  5. To reproduce the numbers in the paper, download the pretrained weights from here and extract them to HumanDataset/saved_weights.

Dataset

We introduce a new dataset of human interactions for our representation learning framework. We record egocentric videos from a GoPro camera attached to the subjects' forehead. We simultaneously capture body movements, as well as the gaze. We use Tobii Pro2 eye-tracking to track the center of the gaze in the camera frame. We record the body part movements using BNO055 Inertial Measurement Units (IMUs) in 10 different locations (torso, neck, 2 triceps, 2 forearms, 2 thighs, and 2 legs).

The structure of the dataset is as follows:

HumanDataset
├── images
│   └── <video_stamp>
│       └── images_<video_stamp>_<INDEX>.jpg
├── annotation_h5
│   ├── [test/train]_<feature_name>.h5
│   ├── [test/train]_image_name.json
│   ├── [test/train]_h5pyind_2_frameind.json
│   └── [test/train]_timestamp.json
└── saved_weights
    ├── trained_representations
    │   └── <Learned_Representations>.pytar
    └── trained_end_tasks
        ├── Action_Recognition
        ├── Depth_Estimation
        ├── Dynamic_Prediction
        ├── Scene_Classification
        └── Walkable_Surface_Estimation
            └── <Trained_End_Tasks_Weights>.pytar
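
After extracting the three downloads, a quick check can confirm that the layout matches the tree above. The following is a minimal sketch and not part of the repository: the helper names `missing_dirs` and `annotation_files` are hypothetical, and the feature name `gaze_points` used in the demo is an assumption that should be replaced with whatever `.h5` files the download actually contains.

```python
from pathlib import Path

# Top-level folders named in the dataset tree above.
EXPECTED_DIRS = ("images", "annotation_h5", "saved_weights")


def missing_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


def annotation_files(root, split, feature_names):
    """Build annotation paths following the [test/train]_<name> pattern."""
    if split not in ("train", "test"):
        raise ValueError(f"split must be 'train' or 'test', got {split!r}")
    ann = Path(root) / "annotation_h5"
    # One .h5 file per feature, plus the three JSON index files per split.
    paths = [ann / f"{split}_{name}.h5" for name in feature_names]
    paths += [
        ann / f"{split}_{stem}.json"
        for stem in ("image_name", "h5pyind_2_frameind", "timestamp")
    ]
    return paths


if __name__ == "__main__":
    root = "HumanDataset"
    print("missing directories:", missing_dirs(root))
    # "gaze_points" is an assumed feature name, used only for illustration.
    for path in annotation_files(root, "train", ["gaze_points"]):
        print(path, "found" if path.exists() else "NOT FOUND")
```

The functions only build and test paths; they do not open the `.h5` files, whose internal structure is not described here.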

Training

To train your own model:

python3 main.py --gpu-ids 0 --arch MoCoGazeIMUModel --input_length 5 --sequence_length 5 --output_length 5 \
--dataset HumanContrastiveCombinedDataset --workers 20 --num_classes -1 --loss MoCoGazeIMULoss \
--num_imus 6 --imu_names neck body llegu rlegu larmu rarmu \
--input_feature_type gaze_points move_label --base-lr 0.0005 --dropout 0.5 --data PATHTODATA/human_data 

See scripts/training_representation.sh for additional training scripts.
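
When launching several runs, it can help to assemble the command above in Python so that `--num_imus` always matches the length of `--imu_names`. This is a sketch under assumptions: `build_train_command` is a hypothetical helper that does not exist in the repository, and every flag name and default value is copied from the command above. Whether the model accepts other IMU subsets is not stated here.

```python
import shlex

DEFAULT_IMUS = ("neck", "body", "llegu", "rlegu", "larmu", "rarmu")


def build_train_command(data_path, imu_names=DEFAULT_IMUS, gpu_id=0):
    """Return the README training command as an argument list."""
    imu_names = list(imu_names)
    return [
        "python3", "main.py",
        "--gpu-ids", str(gpu_id),
        "--arch", "MoCoGazeIMUModel",
        "--input_length", "5",
        "--sequence_length", "5",
        "--output_length", "5",
        "--dataset", "HumanContrastiveCombinedDataset",
        "--workers", "20",
        "--num_classes", "-1",
        "--loss", "MoCoGazeIMULoss",
        # Derive the count from the list so the two flags cannot disagree.
        "--num_imus", str(len(imu_names)),
        "--imu_names", *imu_names,
        "--input_feature_type", "gaze_points", "move_label",
        "--base-lr", "0.0005",
        "--dropout", "0.5",
        "--data", str(data_path),
    ]


if __name__ == "__main__":
    cmd = build_train_command("PATHTODATA/human_data")
    # Print a shell-safe string; cmd can also be passed to subprocess.run.
    print(shlex.join(cmd))
```

Returning a list rather than a single string avoids shell-quoting problems if the data path contains spaces.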

End-task fine-tuning and testing

To test using the pretrained model and reproduce the results in the paper, refer to scripts/end_task_representation.sh.

Citation

If you find this project useful in your research, please consider citing:

   @article{ehsani2020learning,
     title={Learning Visual Representation from Human Interactions},
     author={Ehsani, Kiana and Gordon, Daniel and Nguyen, Thomas and Mottaghi, Roozbeh and Farhadi, Ali},
     journal={arXiv},
     year={2020}
   }
