Youtube Gesture Dataset

This repository contains scripts to build the Youtube Gesture Dataset. You can download YouTube videos and transcripts, divide the videos into scenes, and extract human poses. Please see the project page and paper for details.

[Project page] [Paper]

If you have any questions or comments, please feel free to contact me by email (youngwoo@etri.re.kr).

Environment

The scripts are tested on Ubuntu 16.04 LTS and Python 3.5.2.

Dependencies

  • OpenPose (v1.4) for pose estimation
  • PySceneDetect (v0.5) for video scene segmentation
  • OpenCV (v3.4) for reading videos
    • We use FFMPEG. Use the latest pip version of opencv-python, or build OpenCV with FFMPEG support.
  • Gentle (Jan. 2019 version) for transcript alignment
    • Download the source code from the Gentle GitHub repository and run ./install.sh. You can then import the gentle library by specifying its path; see run_gentle.py and the minimal import sketch after this list.
    • Add the -vn option to resample.py in Gentle as follows:
      cmd = [
          FFMPEG,
          '-loglevel', 'panic',
          '-y',
      ] + offset + [
          '-i', infile,
      ] + duration + [
      '-vn',  # ADDED ('-vn' disables video streams; see the ffmpeg documentation)
          '-ac', '1', '-ar', '8000',
          '-acodec', 'pcm_s16le',
          outfile
      ]
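As a minimal sketch of the path-based import mentioned above (it follows the usage in Gentle's own align.py; the paths and file names are placeholders, and run_gentle.py in this repository is the authoritative reference):

    import sys

    GENTLE_DIR = '/path/to/gentle'   # placeholder: your cloned Gentle directory
    sys.path.insert(0, GENTLE_DIR)   # make the gentle package importable by path

    import gentle

    # Align a transcript to the audio track of one video (sketch only).
    resources = gentle.Resources()
    transcript = open('videos/example_transcript.txt').read()  # placeholder transcript file

    with gentle.resampled('videos/example.mp4') as wav_file:   # ffmpeg resampling to 8 kHz mono
        aligner = gentle.ForcedAligner(resources, transcript)
        result = aligner.transcribe(wav_file)

    print(result.to_json())  # word-level alignment as JSON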

A step-by-step guide

  1. Set config

    • Update the paths and the YouTube developer key in config.py (the directories will be created if they do not exist).
    • Update the target channel ID. The scripts are tested with the TED and LaughFactory channels.
  2. Execute download_video.py

    • Download YouTube videos, metadata, and subtitles (./videos/*.mp4, *.json, *.vtt).
  3. Execute run_openpose.py

    • Run OpenPose to extract body, hand, and face skeletons for all videos (./skeleton/*.pickle).
  4. Execute run_scenedetect.py

    • Run PySceneDetect to divide videos into scene clips (./clip/*.csv).
  5. Execute run_gentle.py

    • Run Gentle for word-level alignments (./videos/*_align_results.json).
    • Skip this step if you use auto-generated subtitles (they already carry word-level timings). It is necessary for the manually transcribed TED Talks channel.
  6. Execute run_clip_filtering.py

    • Remove inappropriate clips.
    • Save clips with body skeletons (./clip/*.json).
  7. (optional) Execute review_filtered_clips.py

    • Review filtering results.
  8. Execute make_ted_dataset.py

    • Do some post-processing and split the data into train, validation, and test sets (./script/*.pickle). A sketch that runs these steps in sequence is shown after this list.
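The following is a minimal driver sketch for the guide above (an assumption: each script runs without extra command-line arguments; Step 1 is a manual edit of config.py and Step 7 is optional and interactive, so both are left out):

    import subprocess

    # Pipeline scripts in execution order (see the step-by-step guide above).
    PIPELINE = [
        'download_video.py',      # Step 2: videos, metadata, subtitles -> ./videos
        'run_openpose.py',        # Step 3: body/hand/face skeletons    -> ./skeleton
        'run_scenedetect.py',     # Step 4: scene clips                 -> ./clip
        'run_gentle.py',          # Step 5: word-level alignment (skip for auto-generated subtitles)
        'run_clip_filtering.py',  # Step 6: remove inappropriate clips, save skeleton clips
        'make_ted_dataset.py',    # Step 8: post-process and split train/val/test
    ]

    for script in PIPELINE:
        print('=== running ' + script + ' ===')
        subprocess.run(['python', script], check=True)  # stop at the first failing step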

Pre-built TED gesture dataset

Running the whole data collection pipeline is complex and takes several days, so we provide a pre-built dataset for the videos in the TED channel.

Number of videos                    1,766
Average length of videos            12.7 min
Shots of interest                   35,685 (20.2 per video on average)
Ratio of shots of interest          25% (35,685 / 144,302)
Total length of shots of interest   106.1 h
  • [ted_raw_poses.zip] [z01] [z02] [z03] [z04] [z05] (split zip files, Google Drive or OneDrive links, total 80.9 GB)
    The result of Step 3. It contains the extracted human poses for all frames.
  • [ted_shots_of_interest.zip, 13.3 GB]
    The result of Step 6. It contains the shot segmentation results ({video_id}.csv files) and the shots of interest ({video_id}.json files). The 'clip_info' elements in the JSON files hold start/end frame numbers and a boolean value indicating a shot of interest. The JSON files already contain the extracted human poses for the shots of interest, so you do not need to download ted_raw_poses.zip unless you need the human poses for all frames. A loading sketch follows this list.
  • [ted_gesture_dataset.zip, 1.1 GB]
    The result of Step 8. Train/validation/test sets of speech-motion pairs.
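A sketch of reading one shots-of-interest file from ted_shots_of_interest.zip; 'clip_info' is described above, but the per-clip key names used here are illustrative assumptions, so check a real {video_id}.json for the actual names:

    import json

    # Path is an example; each video has one JSON file.
    with open('clip/VIDEO_ID.json') as f:
        data = json.load(f)

    # Assumed key names for the start/end frames and the shot-of-interest flag.
    for clip in data['clip_info']:
        if clip.get('is_shot_of_interest'):   # boolean flag described above (name assumed)
            print('shot of interest: frames %d-%d'
                  % (clip['start_frame'], clip['end_frame']))  # names assumed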

Download videos and transcripts

We do not provide the videos and transcripts of the TED talks due to copyright issues. You should download the actual videos and transcripts yourself as follows:

  1. Download the [video_ids.txt] file, which contains the video IDs, and copy it into the ./videos directory.
  2. Run download_video.py. It downloads the videos and transcripts listed in video_ids.txt. Some videos may not match the extracted poses we provide if they have been re-uploaded, so please compare the number of frames, just in case (see the sketch below).
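A quick frame-count check might look like the sketch below (the assumption that the skeleton pickle from Step 3 holds one pose record per frame should be verified against run_openpose.py):

    import pickle
    import cv2

    video_id = 'VIDEO_ID'  # placeholder

    # Frames in the downloaded video.
    cap = cv2.VideoCapture('videos/%s.mp4' % video_id)
    n_video_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()

    # Pose records in the provided skeleton file
    # (assumption: one record per frame; verify against run_openpose.py).
    with open('skeleton/%s.pickle' % video_id, 'rb') as f:
        poses = pickle.load(f)

    if n_video_frames != len(poses):
        print('Frame count mismatch for %s: %d video frames vs %d pose records; '
              'the video may have been re-uploaded.'
              % (video_id, n_video_frames, len(poses)))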

Citation

If our code or dataset is helpful, please cite the following paper:

@INPROCEEDINGS{
  yoonICRA19,
  title={Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots},
  author={Yoon, Youngwoo and Ko, Woo-Ri and Jang, Minsu and Lee, Jaeyeon and Kim, Jaehong and Lee, Geehyuk},
  booktitle={Proc. of The International Conference on Robotics and Automation (ICRA)},
  year={2019}
}

Related Projects

Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity (SIGGRAPH Asia 2020), https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context

Acknowledgement

  • This work was supported by the ICT R&D program of MSIP/IITP. [2017-0-00162, Development of Human-care Robot Technology for Aging Society]
  • Thanks to Eun-Sol Cho and Jongwon Kim for contributions during their internships at ETRI.

About

This repository contains scripts to build Youtube Gesture Dataset.

https://sites.google.com/view/youngwoo-yoon/projects/co-speech-gesture-generation

License: BSD 3-Clause "New" or "Revised" License

