lorenzolucchese / multisig

Signature-based multi-modal (image, video and audio) classifier.

MultiSig

MultiSig is a classification architecture for multi-modal data. Each data modality is tokenized via signature methods. A decoder then performs two classification tasks: predicting the class label and predicting the data type (modality). Sharing one encoder across modalities is especially useful in low-data settings where the modalities are unbalanced. MultiSig currently supports image (.jpg), video (.mp4) and audio (.wav) data. The signature tokenizations extend the ideas discussed in ImageSig (https://arxiv.org/abs/2205.06929).
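To make "tokenized via signature methods" more concrete, here is a minimal sketch of one way to turn an image into signature tokens. It is not the repository's implementation. It assumes each image row is read as a path over the columns, with a time channel added, and that a depth-2 signature is enough for illustration. The names `signature_depth2` and `tokenize_image_rows` are hypothetical.

```python
import numpy as np


def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as a (T, d) array.

    Returns a flat vector of length d + d*d: the level-1 terms (total
    increment) followed by the level-2 terms (iterated integrals).
    """
    path = np.asarray(path, dtype=float)
    increments = np.diff(path, axis=0)        # (T-1, d) segment increments
    level1 = increments.sum(axis=0)           # (d,)
    # Displacement from the starting point at the beginning of each segment.
    offsets = path[:-1] - path[0]             # (T-1, d)
    # Exact level-2 iterated integrals of a piecewise-linear path.
    level2 = offsets.T @ increments + 0.5 * (increments.T @ increments)
    return np.concatenate([level1, level2.ravel()])


def tokenize_image_rows(image):
    """Hypothetical tokenizer (an assumption): one signature token per row.

    `image` has shape (height, width, channels). Each row becomes a path in
    (channels + 1) dimensions, where the extra channel is time in [0, 1].
    """
    image = np.asarray(image, dtype=float)
    _, width, _ = image.shape
    time = np.linspace(0.0, 1.0, width)[:, None]   # (width, 1)
    return np.stack(
        [signature_depth2(np.hstack([time, row])) for row in image]
    )
```

With an RGB image of shape (8, 16, 3), each path has 4 dimensions, so `tokenize_image_rows` returns 8 tokens of length 4 + 16 = 20. Video and audio could be handled along the same lines by treating frames or waveform windows as paths, but that is also an assumption rather than something this README specifies.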

The architecture was tested on a (quite unbalanced) dataset with the following structure:

data
├── training_set
│   ├── bird (1000 .jpg / 15 .mp4 / 8 .wav)
│   ├── cat  (5000 .jpg / 65 .mp4 / 5 .wav)
│   └── dog  (5000 .jpg /  2 .mp4 / 4 .wav)
└── test_set
    ├── bird (1000 .jpg /  0 .mp4 / 3 .wav)
    ├── cat  (1000 .jpg /  0 .mp4 / 3 .wav)
    └── dog  (1000 .jpg /  0 .mp4 / 0 .wav)
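
The per-class, per-modality counts in a layout like the one above can be checked with a short script. This is a sketch that assumes each class folder directly contains the .jpg, .mp4 and .wav files; the helper `count_modalities` and the `MODALITY_BY_EXTENSION` mapping are hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import Path

# Assumed mapping from file extension to modality, based on the
# supported types listed in this README.
MODALITY_BY_EXTENSION = {".jpg": "image", ".mp4": "video", ".wav": "audio"}


def count_modalities(split_dir):
    """Count files per class and modality under e.g. data/training_set."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        counter = Counter()
        for file in class_dir.iterdir():
            modality = MODALITY_BY_EXTENSION.get(file.suffix.lower())
            if modality is not None:   # ignore unsupported file types
                counter[modality] += 1
        counts[class_dir.name] = dict(counter)
    return counts
```

Calling `count_modalities("data/training_set")` returns a dictionary keyed by class name, which makes the imbalance between modalities easy to inspect before training.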

This work was produced as part of a two-week industry mini-project in collaboration with DataSig, supervised by Dr Mohamed Ibrahim. Presentation.

License: BSD 3-Clause "New" or "Revised" License


Languages

Language: Python 100.0%