IKEA-Dataset

Citation:

If you use this dataset, please cite the following paper:

@inproceedings{zhou-etal-2018-visual,
    title = "A Visual Attention Grounding Neural Model for Multimodal Machine Translation",
    author = "Zhou, Mingyang  and
      Cheng, Runxiang  and
      Lee, Yong Jae  and
      Yu, Zhou",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D18-1400",
    doi = "10.18653/v1/D18-1400",
    pages = "3643--3653",
}

Introduction:

IKEA-Dataset is a dataset for multilingual, multimodal machine translation. It was introduced in the paper "A Visual Attention Grounding Neural Model for Multimodal Machine Translation".

IKEA-Dataset contains the textual and visual data of all products available on the IKEA and Under Armour websites in 2017. For each product sample, the textual data is the product description and the visual data is the set of product images. The descriptions come in bilingual pairs: English-French or English-German.

Each data sample in IKEA-Dataset is a bilingual pair of text descriptions and the images of a product.

Data Preprocessing:

This repository contains the raw data and two other versions that underwent different preprocessing steps. The IKEA/data.en.*/data.norm.tok.lc folder contains data that was normalized, tokenized, and converted to lowercase, in that order. The IKEA/data.en.*/data.norm.tok.lc.bpe folder contains data that was normalized, tokenized, converted to lowercase, and byte-pair encoded, in that order.
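The order of the first three steps can be illustrated with a minimal Python sketch. This is not the tooling used to build the dataset: the function names, the NFKC normalization, and the regex tokenizer are assumptions chosen for illustration. The byte-pair-encoding step is omitted because it depends on the learned code files.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize and collapse whitespace (a stand-in for the real normalizer)."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> str:
    """Split punctuation from words with a simple regex (hypothetical tokenizer)."""
    return " ".join(re.findall(r"\w+|[^\w\s]", text))


def preprocess(text: str) -> str:
    """Apply the steps in the documented order: normalize, tokenize, lowercase."""
    return tokenize(normalize(text)).lower()


print(preprocess("Hello,  World!"))  # prints: hello , world !
```

Lowercasing comes last here to mirror the documented order; with a tokenizer that is case-sensitive, moving it earlier could change the result.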

Example:

Statistics:

The statistics below are calculated on the unprocessed data:

Language pair	Language	Tokens	Minimum sample length	Maximum sample length	Average sample length	Standard deviation of sample length	Vocabulary size
English-German	English	256355	6	343	71.40807799	46.33073895	6601
	German	216892	6	324	60.41559889	39.14467817	10468
English-French	English	239966	6	334	72.25715146	47.24279926	6442
	French	275251	6	469	82.88196326	54.72162651	7575

The four histograms show the sentence-length distribution for each language in each language pair. The length of a sentence is measured as the number of tokens it contains.
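Statistics of this kind can be recomputed with a short Python sketch. It assumes whitespace tokenization and a population standard deviation; the source does not state which tokenizer or estimator produced the published numbers, so small differences are possible. The function name length_stats is hypothetical.

```python
import statistics


def length_stats(samples):
    """Compute token-level statistics for a list of text samples.

    Assumes whitespace tokenization, which may differ from the tokenizer
    behind the published table.
    """
    tokenized = [s.split() for s in samples]
    lengths = [len(tokens) for tokens in tokenized]
    vocab = {tok for tokens in tokenized for tok in tokens}
    return {
        "tokens": sum(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": statistics.mean(lengths),
        # Population standard deviation; the sample version would be statistics.stdev.
        "std_len": statistics.pstdev(lengths),
        "vocab_size": len(vocab),
    }
```

The list of lengths built inside the function is also the input a histogram of sentence lengths would need.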

Characteristics:

Because all data samples are descriptions of different products from IKEA or Under Armour, a data sample usually contains more than one sentence.
A description might contain information that cannot be shown in an image. For example, a description of an Under Armour product can contain the sentence “Don’t wash it with hot water”.
A product's text description in German or French might be shorter than its corresponding English version.

Data Format:

Folder:

IKEA/: data crawled and processed from IKEA and Under Armour.
IKEA/data.en.fr: English-French data.
IKEA/data.en.de: English-German data.
IKEA/data.en.*/data.raw: unprocessed original data, compressed as .gz.
IKEA/data.en.*/data.norm.tok.lc: normalized, tokenized and lowercase-converted data.
IKEA/data.en.*/data.norm.tok.lc.bpe: normalized, tokenized, lowercase-converted, byte-pair-encoded (10000) data.
IKEA/data.en.*/data.image.bpe: image matrix for train.*, test.*, val.*.
IKEA/image/image.en.*: compressed images in jpg format for training, validation and testing.
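The .gz files under data.raw can be read directly from Python without unpacking them first. The sketch below assumes the files are UTF-8 text with one entry per line, which the source does not confirm; read_gz_lines is a hypothetical helper, and the path to a concrete file is left to the caller.

```python
import gzip


def read_gz_lines(path):
    """Return the lines of a gzip-compressed UTF-8 text file, without trailing newlines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```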

Data Files:

train.*: 2600+ samples for FR, 2800+ samples for DE.
test.*: 330+ samples for FR, 360+ samples for DE.
val.*: 330+ samples for FR, 360+ samples for DE.
IKEA/image/image.en.*/*.[12].zip: each archive stores half of the images for training, validation and testing.
vocab.*: per-language vocabulary file extracted from *.norm.tok.lc.10000bpe.*.
*_file.code: per-language code files for byte-pair encoding.
*.norm.tok.lc.10000bpe_ims.npy: image matrix corresponding to train.*, test.*, val.*; each image is stored as a vector of size 2048.
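A text split and its image matrix can be loaded together along these lines. This is a sketch under assumptions: the text file is taken to hold one sample per line, and row i of the matrix is taken to belong to line i; neither is stated in the source, so the function checks the shapes before returning. The name load_pairs and both path arguments are hypothetical.

```python
import numpy as np


def load_pairs(text_path, feature_path):
    """Load text samples and their image vectors, checking that they line up.

    Assumes one sample per line and a matrix of shape (num_samples, 2048).
    """
    with open(text_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    feats = np.load(feature_path)
    if feats.ndim != 2 or feats.shape[1] != 2048:
        raise ValueError(f"expected shape (N, 2048), got {feats.shape}")
    if feats.shape[0] != len(lines):
        raise ValueError(f"{len(lines)} text lines but {feats.shape[0]} image rows")
    return lines, feats
```

If either check fails, the row-to-line assumption does not hold for that pair of files and the layout should be inspected by hand.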

Usage:

The dataset can be used for both text-only and multimodal machine translation projects. To download it, open a terminal in the directory where you want to store the data and run:

$ git clone https://github.com/sampalomad/IKEA-Dataset.git
