vhzy / videoqa_dataset_visualization

Load and visualize different datasets in video question answering

Video question answering dataset visualization


Load and visualize different datasets in video question answering.

  1. Project structure
  2. Dataset structure
  3. Install
  4. Usage
  5. Dataset
  6. Result

Project structure

    |──results              # result image
    |──mainwindow.py        # ui file based on PyQt5
    |──mainwindow.ui        # can be modified with Qt Creator
    |──msvd_visualize.py    # msvd-qa dataset visualization
    |──msrvtt_visualize.py  # msrvtt-qa dataset visualization
    |──tgif_visualize.py    # tgif-qa dataset visualization
    |──ui.py                # GUI for visualizing the datasets above
    |──vqa.png              # material picture

Dataset structure

            |── ...
            |── ...
            |── ...


Install

  1. Clone the project
git clone https://github.com/Horizon2333/videoqa_dataset_visualization
cd videoqa_dataset_visualization
  2. Install dependencies
pip install -r requirements.txt


Usage

  1. Visualize an individual dataset with matplotlib:
python msvd_visualize.py --path {your msvd-qa dataset path such as F:/Dataset/MSVD-QA}
python msrvtt_visualize.py --path {your msrvtt-qa dataset path such as F:/Dataset/MSRVTT-QA}
python tgif_visualize.py --path {your tgif-qa dataset path such as F:/Dataset/tgif-qa}
  2. Use the GUI to visualize the different datasets:
python ui.py



Dataset

MSVD-QA

Dataset size: ~1.7 GB

MSVD official link, MSVD video link (with a better file-naming format), MSVD-QA annotation link


Annotation format: a list of dicts stored in a JSON file.

Loading annotation:

import json

with open("val_qa.json") as f:
    annotation = json.load(f)

An example annotation:

>>> annotation[0]
{'answer': 'someone', 'id': 30933, 'question': 'who pours liquid from a plastic container into a ziploc bag containing meat pieces?', 'video_id': 1201}

Get video name, question and answer:

video_name = 'vid' + str(annotation[index]['video_id']) + '.avi'  # video_id is an int, so convert it first
# video_name = 'vid1201.avi'
question = annotation[index]['question']                    
# question = 'who pours liquid from a plastic container into a ziploc bag containing meat pieces?'
answer = annotation[index]['answer']
# answer = 'someone'

We can then join the video name with the video directory and load the video with OpenCV.
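The loading step can be sketched as follows. This is a minimal sketch, not the code used in msvd_visualize.py: the helper names `build_video_path` and `read_frames`, the `video_dir` argument, and the `max_frames` limit are all hypothetical, and the directory that holds the videos depends on how you unpacked the dataset.

```python
import os


def build_video_path(video_dir, video_id, prefix="vid", ext=".avi"):
    """Build the file path for one video (hypothetical helper).

    The defaults follow the MSVD-QA naming shown above ('vid1201.avi').
    """
    return os.path.join(video_dir, f"{prefix}{video_id}{ext}")


def read_frames(video_path, max_frames=16):
    """Read up to max_frames frames from a video as RGB arrays (hypothetical helper)."""
    import cv2  # imported lazily so the path helper works without OpenCV installed

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"cannot open video: {video_path}")
    frames = []
    try:
        while len(frames) < max_frames:
            ok, frame = cap.read()
            if not ok:  # end of the video or a read error
                break
            # OpenCV decodes frames as BGR; matplotlib expects RGB
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    finally:
        cap.release()
    return frames


# "videos" is a placeholder directory, not the real dataset layout
video_path = build_video_path("videos", 1201)
```

Once the path points at a real file, `read_frames(video_path)` returns a list of frames that can be passed to matplotlib's `imshow`.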


MSRVTT-QA

Dataset size: ~6.3 GB

MSRVTT video and annotation link, MSRVTT-QA annotation link


The MSRVTT-QA annotation differs only slightly from the MSVD-QA annotation: each entry has an additional category_id field, and the video file names use the form video<id>.mp4.

Annotation format: a list of dicts stored in a JSON file.

Loading annotation:

import json

with open("val_qa.json") as f:
    annotation = json.load(f)

An example annotation:

>>> annotation[0]
{'answer': 'couch', 'category_id': 14, 'id': 158581, 'question': 'what are three people sitting on?', 'video_id': 6513}

Get video name, question and answer:

video_name = 'video' + str(annotation[index]['video_id']) + '.mp4'  # video_id is an int, so convert it first
# video_name = 'video6513.mp4'
question = annotation[index]['question']                    
# question = 'what are three people sitting on?'
answer = annotation[index]['answer']
# answer = 'couch'

We can then join the video name with the video directory and load the video with OpenCV.
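Since several questions can refer to the same video, it may help to group the annotation by video_id before visualizing. The sketch below is only an illustration: `group_by_video` is a hypothetical helper that is not part of this project, and the one-item list stands in for the loaded val_qa.json.

```python
from collections import defaultdict


def group_by_video(annotation):
    """Map each video_id to its list of (question, answer) pairs (hypothetical helper)."""
    qa_by_video = defaultdict(list)
    for item in annotation:
        qa_by_video[item["video_id"]].append((item["question"], item["answer"]))
    return dict(qa_by_video)


# Stand-in for the list loaded from val_qa.json; this is the example entry shown above
annotation = [
    {"answer": "couch", "category_id": 14, "id": 158581,
     "question": "what are three people sitting on?", "video_id": 6513},
]
qa_by_video = group_by_video(annotation)
print(qa_by_video[6513])
```

The same helper works for MSVD-QA, because it only reads the video_id, question and answer keys.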


tgif-qa

Dataset size: ~123 GB

tgif-qa gif and annotation link


The tgif-qa dataset has 4 different types of QA pairs, so the annotation format differs between types.

Annotation format: an array stored in a CSV file with the delimiter \t (tab-separated).


Type action:

Loading annotation:

import numpy as np

tgif_test_action_annotation = np.loadtxt("Test_action_question.csv", dtype=str, delimiter='\t')

The first row of the CSV holds the column names:

>>> tgif_test_action_annotation[0]
array(['gif_name', 'question', 'a1', 'a2', 'a3', 'a4', 'a5', 'answer', 'vid_id', 'key'], dtype='<U73')

The five candidate columns a1 to a5 show that action is a multi-choice task.

An example annotation:

>>> tgif_test_action_annotation[1]
array(['tumblr_nk172bbdPI1u1lr18o1_250',
       'What does the butterfly do 10 or more than 10 times ?',
       'stuff marshmallow', 'holds a phone towards face', 'fall over',
       'talk', 'flap wings', '4', 'ACTION4', '26'], dtype='<U73')

Get gif name, question and answer:

gif_name = tgif_test_action_annotation[index][0] + '.gif' 
# gif_name = 'tumblr_nk172bbdPI1u1lr18o1_250.gif'
question = tgif_test_action_annotation[index][1]                    
# question = 'What does the butterfly do 10 or more than 10 times ?'
multi_choice = tgif_test_action_annotation[index][2:7]
# multi_choice = array(['stuff marshmallow', 'holds a phone towards face', 'fall over', 'talk', 'flap wings'], dtype='<U73')
answer = tgif_test_action_annotation[index][7]
# answer = '4', a zero-based index into multi_choice, so the correct answer is 'flap wings'.
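Because the answer column stores an index rather than the answer text, a small helper can resolve it. This is a minimal sketch assuming the column order shown above (choices in columns 2 to 6, answer index in column 7); `resolve_choice` is a hypothetical name, not a function from this project.

```python
def resolve_choice(row):
    """Return the text of the correct choice for one action-type row (hypothetical helper)."""
    choices = list(row[2:7])      # columns a1..a5
    answer_index = int(row[7])    # stored as a string, zero-based
    return choices[answer_index]


# The example row shown above, written as a plain list
row = [
    "tumblr_nk172bbdPI1u1lr18o1_250",
    "What does the butterfly do 10 or more than 10 times ?",
    "stuff marshmallow", "holds a phone towards face", "fall over",
    "talk", "flap wings", "4", "ACTION4", "26",
]
print(resolve_choice(row))  # flap wings
```

The helper accepts a NumPy row as well as a plain list, since it only uses slicing and indexing.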

Type count:

Loading annotation:

import numpy as np

tgif_test_count_annotation = np.loadtxt("Test_count_question.csv", dtype=str, delimiter='\t')

The first row of the CSV holds the column names:

>>> tgif_test_count_annotation[0]
array(['gif_name', 'question', 'answer', 'vid_id', 'key'], dtype='<U97')

The single answer column, with no candidate columns, shows that count is an open-ended task.

An example annotation:

>>> tgif_test_count_annotation[1]
array(['tumblr_nezfs4uELd1u1a7cmo1_250',
       'How many times does the man adjust waistband ?', '3', 'COUNT12',
       '52'], dtype='<U97')

Get gif name, question and answer:

gif_name = tgif_test_count_annotation[index][0] + '.gif' 
# gif_name = 'tumblr_nezfs4uELd1u1a7cmo1_250.gif'
question = tgif_test_count_annotation[index][1]                    
# question = 'How many times does the man adjust waistband ?'
answer = tgif_test_count_annotation[index][2]
# answer = '3'

The frameqa type is like the count type (open-ended), and the transition type is like the action type (multi-choice).
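One way to handle all four types with the same code is to inspect the header row and check for candidate columns. The sketch below uses the standard csv module instead of np.loadtxt and reads from an in-memory string, so it runs without the dataset. `read_tsv` and `task_kind` are hypothetical helpers, and treating columns named a1, a2, ... as the sign of a multi-choice task is an assumption based on the two headers shown above.

```python
import csv
import io


def read_tsv(file_obj):
    """Read a tab-separated annotation file into (column names, list of row dicts)."""
    # QUOTE_NONE keeps any quote characters in the questions as literal text
    reader = csv.DictReader(file_obj, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    return reader.fieldnames, rows


def task_kind(columns):
    """Guess the task type from the header (assumption: a1, a2, ... mark multi-choice)."""
    choice_columns = [c for c in columns if c[:1] == "a" and c[1:].isdigit()]
    return "multi-choice" if choice_columns else "open-ended"


# In-memory stand-in for Test_count_question.csv, using the example row shown above
sample = (
    "gif_name\tquestion\tanswer\tvid_id\tkey\n"
    "tumblr_nezfs4uELd1u1a7cmo1_250\t"
    "How many times does the man adjust waistband ?\t3\tCOUNT12\t52\n"
)
columns, rows = read_tsv(io.StringIO(sample))
print(task_kind(columns), rows[0]["answer"])
```

With a real file, pass `open(path, newline="")` to `read_tsv` in place of the `io.StringIO` object.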


Result

Visualize MSVD-QA:


Visualize MSRVTT-QA:


Visualize tgif-qa:


Visualize GUI:


If there is something wrong with my code, or if you have any questions, please let me know. Thanks a lot!


License: MIT License