diviz-mit / visuallydata

A large-scale curated dataset of Visual.ly infographics with metadata and additional crowdsourced annotations for research applications in computer vision and natural language processing.

Home Page: http://visdata.mit.edu


Visually29K: a large-scale curated infographics dataset

In this repository, we provide metadata, annotations, and processing scripts for tens of thousands of infographics, for computer vision and natural language research. This data can be used for applications such as category (topic) prediction, tagging, popularity prediction (likes and shares), text understanding and summarization, captioning, icon and object detection, and design summarization and retargeting. We provide starter code for a subset of these applications, along with metadata (including text detections, icon detections, and tag and category annotations) in different formats, to make the data easy to use and adaptable to different tasks.

This repo is associated with the following project page: http://visdata.mit.edu/ and the manuscripts: "Synthetically Trained Icon Proposals for Parsing and Summarizing Infographics" and "Understanding infographics through textual and visual tag prediction".

Infographics & metadata

Crowdsourced icon annotations

Icon dataset

¹ Links to large data files that could not be stored directly in this repository can be found in links_to_data_files.md

Starter files and scripts

  • howto.ipynb shows how to parse the metadata for the infographics. Note that we do not provide the infographics themselves, as they are the property of Visual.ly, but we do provide URLs for the infographics and a way to obtain them. We also provide the train/test splits that we used for category and tag prediction. The metadata contains attributes that we did not use for our prediction tasks, including popularity indicators (likes, shares, views) and designer-provided titles and captions. A minimal metadata-loading sketch follows this list.
  • plot_text_detections.ipynb plots detected and parsed text (via Google's OCR API) on top of the infographics, and demonstrates the different formats in which the parsed text data can be loaded. This text can be a rich resource for natural language processing tasks like captioning and summarization.
  • plot_icon_detections.ipynb loads our automatically computed icon detections and classifications for 63K infographics (note that for reporting purposes, only the results on the test set of the 29K subset of infographics are used). These detections and classifications can be used either as a baseline to improve upon or directly as input to new tasks like captioning, retargeting, or summarization. A sketch of overlaying detection boxes on an infographic follows this list.
  • plot_human_annotations.ipynb loads data for 1.4K infographics that we collected through crowdsourced annotation tasks on Amazon's Mechanical Turk. Specifically, we asked participants to annotate the locations of icons inside the infographics. Additionally, human_annotation_consistency.ipynb provides scripts for computing consistency between participants on this annotation task (an IoU-based consistency sketch follows this list). This data is meant to be used as ground truth for evaluating computational models.
  • save_tag_to_relevant_infographics.ipynb contains scripts to find and plot the infographics that match different text queries, for a demo retrieval application. Search engines typically use metadata to determine which images to serve for a search query; they do not look inside the image. In contrast, our automatically pre-computed detections let us find the infographics that contain matching text and icons. A toy retrieval sketch follows this list.
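
As a minimal sketch of the metadata-loading step, the snippet below assumes the metadata is a JSON file keyed by infographic id, with fields like url, category, and tags, and a plain-text train-split file; the actual file names and keys are shown in howto.ipynb and may differ.

```python
import json
import urllib.request
from pathlib import Path

# Hypothetical file names and keys -- see howto.ipynb for the real ones.
METADATA_FILE = "metadata.json"
TRAIN_SPLIT_FILE = "train_split.txt"

with open(METADATA_FILE) as f:
    metadata = json.load(f)  # assumed: {infographic_id: {"url": ..., "category": ..., "tags": [...]}}

train_ids = Path(TRAIN_SPLIT_FILE).read_text().split()

# Download the infographics from the provided URLs (they are not redistributed in this repo).
out_dir = Path("infographics")
out_dir.mkdir(exist_ok=True)
for infographic_id in train_ids[:10]:  # first few, as a smoke test
    entry = metadata[infographic_id]
    dest = out_dir / f"{infographic_id}.jpg"
    if not dest.exists():
        urllib.request.urlretrieve(entry["url"], dest)
    print(infographic_id, entry.get("category"), entry.get("tags"))
```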
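
The detection-plotting sketch below overlays labeled boxes on an infographic with matplotlib, assuming each detection is a dict with a pixel-coordinate bounding box and a label; the exact schema is demonstrated in plot_text_detections.ipynb and plot_icon_detections.ipynb.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def plot_detections(image_path, detections):
    """Draw labeled boxes over an infographic.

    `detections` is assumed to be a list of dicts like
    {"x": ..., "y": ..., "w": ..., "h": ..., "label": ...} in pixel coordinates;
    adapt the keys to the formats provided in the notebooks.
    """
    image = Image.open(image_path)
    fig, ax = plt.subplots(figsize=(8, 12))
    ax.imshow(image)
    for det in detections:
        rect = patches.Rectangle((det["x"], det["y"]), det["w"], det["h"],
                                 fill=False, edgecolor="red", linewidth=2)
        ax.add_patch(rect)
        ax.text(det["x"], det["y"] - 5, det["label"], color="red", fontsize=8)
    ax.axis("off")
    plt.show()
```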
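
As one simple measure of annotator consistency, each icon box drawn by one participant can be matched to its best-overlapping box from another participant. The sketch below is a minimal IoU-based version of this idea; the metric actually used in human_annotation_consistency.ipynb may differ.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h) in pixels."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union else 0.0

def mean_best_iou(boxes_a, boxes_b):
    """Match each box from one annotator to its best-overlapping box from another."""
    if not boxes_a or not boxes_b:
        return 0.0
    return sum(max(iou(a, b) for b in boxes_b) for a in boxes_a) / len(boxes_a)
```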
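
The toy retrieval sketch below illustrates the idea of looking inside the image: it assumes per-infographic sets of detected words and icon class names (as produced by the detection notebooks) and returns the infographics that match a query by exact token. The full scripts live in save_tag_to_relevant_infographics.ipynb.

```python
def matching_infographics(query, text_index, icon_index):
    """Return ids of infographics whose detected text or icon classes contain the query.

    `text_index` and `icon_index` are assumed to be dicts mapping infographic id ->
    set of lower-cased detected words / icon class names (exact-match for simplicity).
    """
    query = query.lower()
    hits = []
    for infographic_id in text_index.keys() | icon_index.keys():
        if (query in text_index.get(infographic_id, set())
                or query in icon_index.get(infographic_id, set())):
            hits.append(infographic_id)
    return hits

# Example: find infographics that mention or depict "coffee".
# print(matching_infographics("coffee", text_index, icon_index))
```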

Featured projects

  • featured_projects.md contains links to other repositories that use our Visually29K dataset, to seed new ideas and give students and researchers potential starting points for projects.

If you use the data or code in this git repo, please consider citing:

@inproceedings{visually2,
    author    = {Spandan Madan* and Zoya Bylinskii* and Matthew Tancik* and Adrià Recasens and Kimberli Zhong and Sami Alsheikh and Hanspeter Pfister and Aude Oliva and Fredo Durand},
    title     = {Synthetically Trained Icon Proposals for Parsing and Summarizing Infographics},
    booktitle = {arXiv preprint arXiv:1807.10441},
    url       = {https://arxiv.org/pdf/1807.10441},
    year      = {2018}
}
@inproceedings{visually1,
    author    = {Zoya Bylinskii* and Sami Alsheikh* and Spandan Madan* and Adrià Recasens* and Kimberli Zhong and Hanspeter Pfister and Fredo Durand and Aude Oliva},
    title     = {Understanding infographics through textual and visual tag prediction},
    booktitle = {arXiv preprint arXiv:1709.09215},
    url       = {https://arxiv.org/pdf/1709.09215},
    year      = {2017}
}


License: MIT License

