ujjwalkarn / multimodal-learning-hands-on-tutorial


Classifying Multimodal Data using Transformers

Motivation

The increasing prevalence of multimodal data in our society has created a growing need for machines to make sense of such data holistically. However, data scientists and machine learning engineers aspiring to work on such data often struggle to piece together knowledge from existing tutorials, which typically treat each modality separately. Drawing on our experience classifying multimodal municipal issue feedback in the Singapore government, we conduct a hands-on tutorial to help flatten the learning curve for practitioners who want to apply machine learning to multimodal data.

Dataset

Unfortunately, we are not able to conduct the tutorial using the municipal issue feedback data due to its sensitivity. Instead, we use a subset of the WebVision dataset, which consists of labelled images, together with their text descriptions, crawled from the web. We chose this dataset because its characteristics are similar to those of our municipal issue feedback data: the text descriptions correlate highly with the labels, but the associated images provide even better context.

Tutorial Outline

In this tutorial, we teach participants how to classify multimodal data consisting of both text and images using Transformers. It is targeted at an audience who have some familiarity with neural networks and are comfortable with writing code.

The outline of the tutorial is as follows:

  1. Sharing of Experience: Municipal issue feedback classification in the Singapore government
  2. Text Classification: Train a text classification model using BERT
  3. Text and Image Classification (v1): Train a dual-encoder text and image classification model using BERT and ResNet-50
  4. Text and Image Classification (v2): Train a joint-encoder text and image classification model using ALign BEfore Fuse (ALBEF)
  5. Question and Answer/Discussion

Running the Notebook

The tutorial will be conducted using Google Colab. We will be using the file multimodal_training.ipynb for the session. To run the notebook on Colab:

  1. Go to the GitHub option and search for dsaidgovsg/multimodal-learning-hands-on-tutorial
  2. Select the main branch
  3. Open multimodal_training.ipynb
  4. Follow the instructions in the cells

Running the Python Script (Optional)

The content in the notebook is meant to be a step-by-step guide that shows the differences between the model architectures. As a result, the code can be quite repetitive.

We have streamlined the code into a Python script which you can run from the terminal to train the models or to run predictions with the pretrained models.

Steps to run the scripts are as follows:

  1. If you have not already done so, clone this repo into your working directory: git clone https://github.com/dsaidgovsg/multimodal-learning-hands-on-tutorial.git
  2. Inside your working directory, run bash prepare_folders_and_download_files.sh. The script creates the folder structure and downloads the files used during the tutorial into these folders.
  3. Install the required libraries via pip install -r requirements.txt
  4. To run predictions on the test set using the downloaded pretrained models (trained for 20 iterations), run python3 multimodal_testing.py
  5. To do your own training and prediction, run python3 multimodal_training.py. To change the training parameters, edit the args dictionary in the main function (a sketch of such a dictionary is shown after this list).
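
For orientation, the args dictionary might look roughly like the sketch below; the keys and default values here are illustrative assumptions, and the actual dictionary in multimodal_training.py may differ.

    # Hypothetical sketch of the training parameters exposed through the
    # `args` dictionary in multimodal_training.py. The keys and defaults
    # below are illustrative assumptions; check the script for the real ones.
    args = {
        "num_classes": 10,        # number of labels in the dataset subset
        "batch_size": 32,
        "learning_rate": 2e-5,
        "num_iterations": 20,     # the downloaded pretrained models were trained for 20 iterations
        "max_text_length": 128,   # tokens kept per text description
        "output_dir": "models/",  # where checkpoints are written
    }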

Disclaimer

The following source files in this repo were copied from ALBEF's GitHub repo (click on a filename to go to its original location there):

  1. tokenization_bert.py
  2. vit.py
  3. xbert.py

We copied the files so that our code for training the ALBEF models can be run without having to download and copy source files from another site. We also made minor modifications so that the files are compatible with the latest version of Hugging Face Transformers. The rights and ownership of the code belong to Salesforce and ALBEF's author, Junnan Li.

Model Architectures

We will be using three different model architectures in the tutorial. Their architecture diagrams are shown below.

BERT

A text-encoder model which uses only the text to predict the label.
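
As a rough sketch (not the notebook's exact code), a BERT-only classifier of this kind can be set up with Hugging Face Transformers as shown below; the bert-base-uncased checkpoint, the number of labels, and the example sentence are placeholders.

    # Minimal sketch of a BERT-only text classifier using Hugging Face
    # Transformers. The checkpoint, label count and example text are
    # placeholders, not the tutorial's exact configuration.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=10
    )

    batch = tokenizer(
        ["a golden retriever catching a frisbee in the park"],
        padding=True, truncation=True, max_length=128, return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**batch).logits      # shape: (batch_size, num_labels)
    predicted_label = logits.argmax(dim=-1)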

BERT-ResNet

A dual encoder which comprises a separate text encoder (BERT) and an image encoder (ResNet-50).
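
A minimal sketch of such a dual encoder is shown below; fusing the two encoders by concatenating the pooled BERT embedding with the ResNet-50 features is an assumption made for illustration and may differ from the notebook's implementation.

    # Sketch of a dual-encoder classifier: BERT encodes the text, ResNet-50
    # encodes the image, and the pooled features are concatenated before a
    # linear classification head. Concatenation-based fusion is an assumption
    # for illustration, not necessarily the notebook's exact design.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet50
    from transformers import AutoModel

    class DualEncoderClassifier(nn.Module):
        def __init__(self, num_labels: int):
            super().__init__()
            self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
            self.image_encoder = resnet50(weights="IMAGENET1K_V1")
            self.image_encoder.fc = nn.Identity()    # expose the 2048-d pooled features
            self.classifier = nn.Linear(768 + 2048, num_labels)

        def forward(self, input_ids, attention_mask, pixel_values):
            text_feat = self.text_encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state[:, 0]                # [CLS] token embedding, 768-d
            image_feat = self.image_encoder(pixel_values)   # 2048-d pooled features
            fused = torch.cat([text_feat, image_feat], dim=-1)
            return self.classifier(fused)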

ALBEF

A joint text-image encoder which aligns the BERT text encoder's embeddings with those of the image encoder (a Vision Transformer) before fusing them through cross-attention.
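
Conceptually, the fusion step lets text tokens cross-attend to image patch embeddings. The toy layer below illustrates only that idea; it is not ALBEF's actual implementation, which lives in vit.py and xbert.py.

    # Toy illustration of cross-attention fusion: text token embeddings attend
    # to image patch embeddings. This only sketches the idea of a joint
    # encoder; it is NOT ALBEF's actual architecture (see vit.py and xbert.py).
    import torch
    import torch.nn as nn

    class CrossAttentionFusionLayer(nn.Module):
        def __init__(self, hidden_size: int = 768, num_heads: int = 12):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(hidden_size)

        def forward(self, text_embeds, image_embeds):
            # Text tokens are the queries; image patch embeddings are keys/values.
            attended, _ = self.cross_attn(text_embeds, image_embeds, image_embeds)
            return self.norm(text_embeds + attended)

    # Example shapes: a batch of 2 sequences of 16 text tokens and 197 image patches.
    text_embeds = torch.randn(2, 16, 768)
    image_embeds = torch.randn(2, 197, 768)
    fused = CrossAttentionFusionLayer()(text_embeds, image_embeds)   # (2, 16, 768)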

Presentation Slides

The slides for the KDD'22 hands-on tutorial session are here.
