sampalomad / IKEA-Dataset

A dataset for multimodal machine translation

IKEA-Dataset

License: MIT

Citation:

If you use this dataset, you might want to cite this paper:

@inproceedings{zhou-etal-2018-visual,
    title = "A Visual Attention Grounding Neural Model for Multimodal Machine Translation",
    author = "Zhou, Mingyang  and
      Cheng, Runxiang  and
      Lee, Yong Jae  and
      Yu, Zhou",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D18-1400",
    doi = "10.18653/v1/D18-1400",
    pages = "3643--3653",
}

Introduction:

IKEA-Dataset is a dataset for multilingual-multimodal machine translation. It was introduced in the paper "A Visual Attention Grounding Neural Model for Multimodal Machine Translation".

IKEA-Dataset contains the textual and visual data of all products available on the IKEA and Under Armour websites in 2017. For each product sample, the textual data is the description of the product, while the visual data consists of images of the product. The descriptions are in bilingual pairs: English-French or English-German.

Each data sample in IKEA-Dataset is a bilingual pair of text descriptions and the images of a product.

Data Preprocessing:

This repository contains the raw data and two other versions that underwent different data-processing steps. The IKEA/data.en.*/data.norm.tok.lc folder contains data that was normalized, tokenized, and converted to lowercase, in that order. The IKEA/data.en.*/data.norm.tok.lc.bpe folder contains data that was normalized, tokenized, converted to lowercase, and byte-pair encoded, in that order.
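The order of the first three steps can be illustrated with a minimal sketch. The repository does not say which tools were used, so the Unicode normalization form, the regex tokenizer, and the function names below are all assumptions chosen for illustration; byte-pair encoding is left out because it depends on the learned code files.

```python
import re
import unicodedata
from typing import List

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and individual punctuation marks."""
    return _TOKEN_RE.findall(text)


def preprocess(text: str) -> str:
    """Normalize, then tokenize, then lowercase (the order stated above)."""
    tokens = tokenize(normalize(text))
    return " ".join(tokens).lower()


if __name__ == "__main__":
    print(preprocess("Wash  it with HOT water."))
```

The output of a real pipeline may differ in details such as how apostrophes and hyphens are split, so this sketch should not be expected to reproduce the files in data.norm.tok.lc exactly.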

Example:

[Image: sample]

Statistics:

The statistics below are calculated from the unprocessed data:

| Language pair | Language | Tokens | Minimum sample length | Maximum sample length | Average sample length | Standard deviation of sample length | Vocabulary size |
|---|---|---|---|---|---|---|---|
| English-German | English | 256355 | 6 | 343 | 71.40807799 | 46.33073895 | 6601 |
| English-German | German | 216892 | 6 | 324 | 60.41559889 | 39.14467817 | 10468 |
| English-French | English | 239966 | 6 | 334 | 72.25715146 | 47.24279926 | 6442 |
| English-French | French | 275251 | 6 | 469 | 82.88196326 | 54.72162651 | 7575 |
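Statistics of this kind can be recomputed with a short script. The sketch below assumes one sample per line, whitespace tokenization, and the population standard deviation; none of these choices is documented in the repository, so the results may not match the figures above exactly. The function name `corpus_stats` is hypothetical.

```python
import statistics
from typing import Dict, Iterable


def corpus_stats(samples: Iterable[str]) -> Dict[str, float]:
    """Compute token-level statistics over an iterable of text samples."""
    lengths = []
    vocab = set()
    for sample in samples:
        tokens = sample.split()  # assumption: tokens are whitespace-separated
        if not tokens:
            continue  # ignore empty lines
        lengths.append(len(tokens))
        vocab.update(tokens)
    if not lengths:
        raise ValueError("no non-empty samples found")
    return {
        "tokens": sum(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": statistics.fmean(lengths),
        # assumption: population (not sample) standard deviation
        "std_length": statistics.pstdev(lengths),
        "vocab_size": len(vocab),
    }


if __name__ == "__main__":
    for name, value in corpus_stats(["a b c", "a b", "d"]).items():
        print(name, value)
```

To apply it to a file, pass an open text file handle, since iterating over a handle yields one line at a time.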

These four histograms show the sentence-length distribution for each language in each language pair. The length of a sentence is the number of tokens it contains.

Characteristics:

  • Because every data sample is the description of a product from IKEA or Under Armour, a sample usually contains more than one sentence.
  • A description might contain information that cannot be shown in an image. For example, a description of an Under Armour product can contain the sentence “Don’t wash it with hot water”.
  • A product's text description in German or French might be shorter than its corresponding English version.

Data Format:

Folder:

  • IKEA/: data crawled and processed from IKEA and Under Armour.
  • IKEA/data.en.fr: English-French data.
  • IKEA/data.en.de: English-German data.
  • IKEA/data.en.*/data.raw: unprocessed original data compressed in .gz.
  • IKEA/data.en.*/data.norm.tok.lc: normalized, tokenized and lowercase-converted data.
  • IKEA/data.en.*/data.norm.tok.lc.bpe: normalized, tokenized, lowercase-converted, byte-pair-encoded (10000) data.
  • IKEA/data.en.*/data.image.bpe: image matrix for train.*, test.*, val.*.
  • IKEA/image/image.en.*: compressed images in jpg format for training, validation and testing.

Data Files:

  • train.*: 2600+ samples for FR, 2800+ samples for DE.
  • test.*: 330+ samples for FR, 360+ samples for DE.
  • val.*: 330+ samples for FR, 360+ samples for DE.
  • IKEA/image/image.en.*/*.[12].zip: each archive stores half of the images for training, validation and testing.
  • vocab.*: per-language vocabulary files extracted from *.norm.tok.lc.10000bpe.*.
  • *_file.code: language files for byte-pair encoding.
  • *.norm.tok.lc.10000bpe_ims.npy: the image matrix corresponding to train.*, test.*, val.*; each image is stored as a vector of size 2048.
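A minimal sketch of reading one split as aligned text pairs follows. It assumes that each text file holds one sample per line and that the source and target files are line-aligned, which the repository does not state explicitly. The function names are hypothetical, and no real file names are filled in here; pass whichever paths match your copy of the data.

```python
import gzip
from pathlib import Path
from typing import List, Tuple


def read_lines(path) -> List[str]:
    """Read a UTF-8 text file, transparently handling .gz compression."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(source_path, target_path) -> List[Tuple[int, str, str]]:
    """Return (index, source, target) triples from two line-aligned files."""
    source = read_lines(source_path)
    target = read_lines(target_path)
    if len(source) != len(target):
        raise ValueError(
            f"line counts differ: {len(source)} source vs {len(target)} target"
        )
    return [(i, s, t) for i, (s, t) in enumerate(zip(source, target))]
```

If row i of the matching *_ims.npy matrix corresponds to line i of the text files (again an assumption), the returned index can be used to look up the 2048-dimensional image vector for each pair.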

Usages:

The dataset can be used for both text-only and multimodal machine translation projects. To download it, open a terminal in the directory where you want to copy the data and enter:

$ git clone https://github.com/sampalomad/IKEA-Dataset.git
