MichiganNLP / vlog_action_localization

Localizing narrated human activities in lifestyle vlogs.

WhenAct: Temporal Localization of Narrated Actions in Vlogs

This repository contains the dataset, WhenAct, and code for our ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) paper: When did it happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs

Task Description

Example instance

Given a video and its transcript, temporally localize in the video the human actions mentioned in the transcript.
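
Below is an illustrative sketch of a single instance, with hypothetical field names; the actual JSON layout used by this repository is described under "Data format".

# Illustrative only: one localization instance, with hypothetical field names.
# The values are taken from the example miniclip shown later in this README.
instance = {
    "miniclip": "10p0_10mini_2.mp4",                   # input video clip
    "narrated_action": "add in your butter lettuce",   # action mentioned in the transcript
}
# Expected output: the interval (in seconds) in which the action is visible.
prediction = {"start": 52.0, "end": 55.0}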

WhenAct Dataset

Example instance

Distinguishing between actions that are narrated by the vlogger but not visible in the video and actions that are both narrated and visible in the video (underlined); visible actions that represent the same activity are highlighted in the same color. The arrows show the temporal alignment between when a visible action is narrated and when it occurs in the video.

Annotation Process

  1. The extraction of actions from the transcripts and their annotation as visible / not visible in the video are described in detail in this action detection project.
  2. The visible actions are temporally annotated using this open source tool that we built.

WhenAct dataset statistics:

  • Video clips: 1,246
  • Video hours: 20
  • Transcript words: 302,316
  • Visible actions: 3,131
  • Non-visible actions: 10,249

The data is stored here (video URLs, action embeddings, I3D features, and others).

Data format

The temporal annotations of the visible actions are available at data/dict_all_annotations_ordered.json. The visibility annotations of the actions are available at data/miniclip_actions.json and you can read more about the process in the action detection project.

The visible actions are assigned a start and end time at which they are localized in the miniclip. This does not necessarily correspond to the time the actions are mentioned in the miniclip. The time at which an action is mentioned is extracted from the transcript, serves as the input for the Transcript Alignment method, and is available at data/mapped_actions_time_label.json.
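
A minimal loading sketch, assuming both files are plain JSON dictionaries keyed by miniclip name (the exact structure of data/mapped_actions_time_label.json may differ):

import json

# Temporal annotations of the visible actions:
# miniclip name -> list of [action name, start time, end time].
with open("data/dict_all_annotations_ordered.json") as f:
    localized_actions = json.load(f)

# Times at which actions are mentioned in the transcript,
# used as input by the Transcript Alignment method.
with open("data/mapped_actions_time_label.json") as f:
    mentioned_times = json.load(f)

print(len(localized_actions), "miniclips with localized visible actions")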

The miniclip name is formed by concatenating its YouTube channel, playlist, video, and miniclip indices (a parsing sketch follows the list below). For miniclip "10p0_10mini_2.mp4":

  • 10 = channel index
  • p0 = playlist index (0 or 1) in the channel
  • 10 = video index in the playlist
  • mini_2 = miniclip index in the video
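
A small parsing sketch for this naming convention (the helper below is ours, not part of the repository code):

import re

# Hypothetical helper: split a miniclip name such as "10p0_10mini_2.mp4"
# into its channel, playlist, video, and miniclip indices.
MINICLIP_RE = re.compile(r"^(\d+)p(\d)_(\d+)mini_(\d+)\.mp4$")

def parse_miniclip_name(name):
    match = MINICLIP_RE.match(name)
    if match is None:
        raise ValueError(f"Unexpected miniclip name: {name}")
    channel, playlist, video, miniclip = (int(g) for g in match.groups())
    return {"channel": channel, "playlist": playlist, "video": video, "miniclip": miniclip}

print(parse_miniclip_name("10p0_10mini_2.mp4"))
# {'channel': 10, 'playlist': 0, 'video': 10, 'miniclip': 2}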

Example format in JSON:

{
  "10p0_10mini_2.mp4": [
    ["go to first start by and shred up your kale", 26.0, 32.0],
    ["place this into a large bowl", 27.0, 31.0],
    ["break down the cell walls of the kale", 41.0, 51.0],
    ["give it a good massage with your hands", 41.0, 50.0],
    ["add in your butter lettuce", 52.0, 55.0]
  ]
}
  • the key is the miniclip name
  • each entry in the list is [action name, start time, end time]
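
Since the approach is duration-informed, here is a quick sketch of how action durations fall out of this format, using the example entry above:

# Iterate over the annotations and compute each visible action's duration.
annotations = {
    "10p0_10mini_2.mp4": [
        ["go to first start by and shred up your kale", 26.0, 32.0],
        ["place this into a large bowl", 27.0, 31.0],
        ["break down the cell walls of the kale", 41.0, 51.0],
        ["give it a good massage with your hands", 41.0, 50.0],
        ["add in your butter lettuce", 52.0, 55.0],
    ]
}

for miniclip, actions in annotations.items():
    for action, start, end in actions:
        print(f"{miniclip}: '{action}' lasts {end - start:.1f} s")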

Experiments

  1. Check args.py to set the arguments.
  2. Run main.py to run the entire pipeline: data creation, MPU model training, MPU model evaluation, and 2SEAL model evaluation.
    1. The 2SEAL evaluation also includes the SVM-based action duration classification method.
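
For reference, temporal localization predictions are commonly scored with the temporal intersection-over-union (IoU) between the predicted and annotated intervals; the function below is a generic sketch, not necessarily the exact metric implemented in main.py:

def temporal_iou(pred, gold):
    """Intersection over union of two [start, end] intervals, in seconds."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0

# A prediction of [50.0, 56.0] against the annotated interval [52.0, 55.0]:
print(temporal_iou([50.0, 56.0], [52.0, 55.0]))  # 0.5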

Citation information

If you use this dataset or any ideas based on the associated research article, please cite the following:

@article{10.1145/3495211,
  title = {When did it happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs},
  journal = {ACM Trans. Multimedia Comput. Commun. Appl.},
  publisher = {Association for Computing Machinery},
  year = {2021},
  issue_date = {January 2022},
  volume = {1},
  number = {1},
  issn = {1551-6857},
  doi = {10.1145/3495211},
  url = {https://doi.org/10.1145/3495211}
}

License: MIT License
