grill-lab / VILT

This repo contains a benchmark collection of tasks and multimodal video content

VILT: Video Instructions Linking for Complex Tasks

Table of Contents
  1. Overview
  2. Paper
  3. Benchmark Dataset
  4. Change Log
  5. Task
  6. Topics
  7. Video Corpus
  8. Judgments
  9. Evaluation
  10. Future Work

Colab demo showing indexing and evaluation: Open In Colab

Overview

VILT is a new benchmark collection of tasks and multimodal video content. For VILT, we target cooking information needs to assist the user in interactively accomplishing complex real-world tasks. Cooking is an example domain where complex real-world tasks require detailed instructions and guidance to support most users. For example, the instruction ‘Add the melted butter and stir until well combined’ would benefit from a more detailed instructional step on ‘How to melt butter’.

The video linking collection includes annotations for 10 (recipe) tasks, which the annotators chose from a random subset of the collection of 2,275 high-quality 'Wholefoods' recipes. There are linking annotations for 61 query steps that contain cooking techniques, chosen from the 189 total recipe steps across these tasks. As each query step results in approximately 10 videos to annotate, the collection consists of 831 linking judgments.

Paper

This work will be presented at IMuR 2022:

Correct citation:

@inproceedings{fischer2022vilt,
 title={VILT: Video Instructions Linking for Complex Tasks},
 author={Fischer, Sophie and Gemmell, Carlos and Mackie, Iain and Dalton, Jeffrey},
 booktitle={Proceedings of the 2nd International Workshop on Interactive Multimedia Retrieval (IMuR '22)},
 year={2022}
}

Benchmark Dataset

VILT provides 61 topics for video retrieval:

Colab demo showing indexing and evaluation: Open In Colab

Change Log

Major dataset changes that existing users should be aware of:

  • 4th August 2022: VILT v1 released.

Task

The VILT task is defined as follows:

We aim to link instructional videos to steps in a task T with multiple steps [S_1, ..., S_N]. Given a step S, we formulate a query Q. For each step query Q, we return a relevance-ranked list of video results [D_1, ..., D_M].

We illustrate the task with the example for tabbouleh salad. This recipe contains the following steps: 'Dice the tomatoes' (S_1), 'Chop parsley and mint leaves' (S_2), 'Cook the bulgur' (S_3) and 'Chop scallions to sprinkle over the salad' (S_4).

For each of the steps S_1-S_4, there is an underlying cooking technique that the user needs to be able to perform to complete the step successfully. For each of S_1-S_4, we formulate a query Q for which the system needs to retrieve a relevant video D.
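
To make the formulation concrete, the sketch below encodes the tabbouleh example as data. The class and function names (`Step`, `retrieve`) and the query wordings are illustrative assumptions, not part of the benchmark; any retrieval system can stand in for `retrieve`.

```python
# Illustrative sketch of the task setup; names and queries are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    text: str   # recipe step S, e.g. 'Dice the tomatoes'
    query: str  # query Q formulated for the underlying cooking technique

task = [
    Step('Dice the tomatoes', 'how to dice tomatoes'),
    Step('Chop parsley and mint leaves', 'how to chop parsley'),
    Step('Cook the bulgur', 'how to cook bulgur'),
    Step('Chop scallions to sprinkle over the salad', 'how to chop scallions'),
]

def retrieve(query: str, k: int = 10) -> List[str]:
    """Return a relevance-ranked list of video ids [D_1, ..., D_M] (placeholder)."""
    raise NotImplementedError

# One ranked video list per step query:
# rankings = {step.query: retrieve(step.query) for step in task}
```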

Topics

VILT provides 61 annotated topics, taken from high-quality seed recipes on 'Wholefoods.com'. The annotators selected topics with a genuine information need, i.e. steps that benefit from an additional instructional video to be completed.

Each topic contains the recipe title as well as the query type. We differentiate between execution steps, i.e. steps required to cook the recipe (s), and requirement steps, i.e. steps required to prepare the ingredients for the recipe as mise-en-place (r).
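
The snippet below shows how the two query types might be separated once the topics are loaded. The dictionary field names ('title', 'query', 'query_type') are assumptions for illustration; the actual topic file schema is documented in the repository and Colab demo.

```python
# Illustrative only: field names and values are assumed, not the official schema.
topics = [
    {'title': 'Tabbouleh Salad', 'query': 'how to dice tomatoes', 'query_type': 'r'},
    {'title': 'Tabbouleh Salad', 'query': 'how to cook bulgur', 'query_type': 's'},
]

# Requirement (mise-en-place) topics vs. execution topics.
requirement_topics = [t for t in topics if t['query_type'] == 'r']
execution_topics = [t for t in topics if t['query_type'] == 's']
```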

Video Corpus

We use Common Crawl and OAT to curate a corpus of 2,133 video metadata records with focused instructional content for detailed cooking skill instructions. The corpus can be downloaded here: link

The corpus is released in JSON Lines format with the following fields for each video (a loading sketch follows the list):

  • title: 'How to' video title
  • id: Unique identifier (the MD5 hash of the URL)
  • url: Location of the YouTube video (URL)
  • uploader: YouTube uploader
  • views: View count on YouTube
  • duration: Length of the video
  • description: Video description provided by the uploader
  • subtitles: Automatically generated subtitles of the video
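
As a minimal loading sketch, assuming the corpus has been saved locally as `corpus.jsonl` (the filename is an assumption):

```python
# Load the video metadata corpus from a local JSON Lines file.
import json

videos = []
with open('corpus.jsonl', encoding='utf-8') as f:
    for line in f:
        videos.append(json.loads(line))

# Each record carries the fields listed above, e.g.:
# videos[0]['id'], videos[0]['title'], videos[0]['url'],
# videos[0]['duration'], videos[0]['subtitles']
```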

Judgments

For VILT, we created 831 video document judgments (13.6 per topic):

Judgment   Documents
0          580
1          191
2          60
TOTAL      831
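
The judgments are shipped as TREC-style qrels (see Evaluation). As a sketch, assuming the standard four-column layout (query id, iteration, document id, grade) and a local filename `vilt.qrels` (an assumption), the distribution above can be re-counted as follows:

```python
# Tally relevance grades from a TREC-style qrels file.
from collections import Counter

grades = Counter()
with open('vilt.qrels', encoding='utf-8') as f:
    for line in f:
        qid, _iteration, doc_id, grade = line.split()
        grades[int(grade)] += 1

print(grades)  # expected: {0: 580, 1: 191, 2: 60}
```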

Evaluation

We provide TREC-style query-relevance files: link.

The official measures for the task include MRR, NDCG@10, MAP and P@1.
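
As one possible way to score a run against these qrels (the official Colab demo shows the intended pipeline), the sketch below uses pytrec_eval and assumes local files `vilt.qrels` and a TREC-format run file `run.txt`; both filenames are assumptions.

```python
# Sketch: compute MRR, NDCG@10, MAP and P@1 with pytrec_eval (pip install pytrec_eval).
import pytrec_eval

def read_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, doc_id, rel = line.split()
            qrels.setdefault(qid, {})[doc_id] = int(rel)
    return qrels

def read_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, doc_id, _rank, score, _tag = line.split()
            run.setdefault(qid, {})[doc_id] = float(score)
    return run

qrels = read_qrels('vilt.qrels')
run = read_run('run.txt')

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {'recip_rank', 'ndcg_cut.10', 'map', 'P.1'})
results = evaluator.evaluate(run)

# Average each measure over topics.
for measure in ('recip_rank', 'ndcg_cut_10', 'map', 'P_1'):
    mean = sum(r[measure] for r in results.values()) / len(results)
    print(measure, round(mean, 4))
```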

Future Work

We envision VILT to be an evolving collection, with additional judgments and tasks added in the future. Please suggest any future extensions or bug fixes on GitHub or via email (sophie.fischer@glasgow.ac.uk).
