jbramos9 / DSP03_AY2223

An Undergraduate Capstone Project under the Digital Signal Processing Laboratory of the University of the Philippines Diliman

Multi-Stage Hybrid-CNN Transformer Model for Human Intent-Prediction

This repo contains all the models (including their variants) developed in the project Multi-Stage Hybrid-CNN Transformer Model for Human Intent-Prediction. As an overview, the Multi-Stage Hybrid-CNN Transformer Classifier System is composed of two key components: the Gazed Object Detector and the Intent Classifier.

  • The Gazed Object Detector is in the gazed_object_detectors folder, which contains the three different variations of the model.
  • The Intent Classifier is in the intent_classifier folder.
  • The Overall System for inference is in the multi-stage_human_intent_classifier_system folder.
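At inference time, the two stages are chained: the Gazed Object Detector produces per-object gaze estimates for a frame, and the Intent Classifier maps the detected gazed object to an intent label. The sketch below illustrates that flow only; every name, label, and probability here is an illustrative placeholder, not the actual API of this repo.

```python
import numpy as np

def detect_gazed_object(frame: np.ndarray) -> dict:
    """Stage 1 (placeholder): per-object gaze probabilities for one frame.

    The real detector (DETR/MGTR-based, in gazed_object_detectors/)
    would consume actual video frames; here we return fixed values.
    """
    return {"cup": 0.7, "phone": 0.2, "none": 0.1}

def classify_intent(gazed_object: str) -> str:
    """Stage 2 (placeholder): map the gazed object to an intent label."""
    intent_map = {"cup": "drink", "phone": "call", "none": "idle"}
    return intent_map[gazed_object]

def predict_intent(frame: np.ndarray) -> str:
    """Full multi-stage inference: most probable gazed object -> intent."""
    probs = detect_gazed_object(frame)
    gazed = max(probs, key=probs.get)  # keep only the argmax gaze
    return classify_intent(gazed)

if __name__ == "__main__":
    dummy_frame = np.zeros((224, 224, 3), dtype=np.uint8)
    print(predict_intent(dummy_frame))  # "drink" for the placeholder values
```

See the multi-stage_human_intent_classifier_system folder for the actual inference code.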

Each folder has its own README.md with guidance.

Dataset

The dataset that was used in this project can be accessed here. The generators and statistics for the train-test split are in the split folder.

Recommendations

  1. Add more video samples such that the gaze distribution is balanced ("None" or not looking at objects is currently overrepresented)
  2. Train the weights of the Gazed Object Detector from scratch to tailor the model to object-gaze classification
  3. Consider the probability of gaze for all objects in a given frame, instead of the most probable gaze, as input to the human intent classifier
  4. Explore other human pose estimation techniques (as inspired by the increase in performance from the additional head information used)
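Recommendation 3 amounts to replacing the single argmax gazed-object label with the full per-object gaze distribution as the classifier input, so near-ties between objects are not discarded. A minimal sketch of that feature change (object names, ordering, and probabilities are illustrative, not from the repo):

```python
import numpy as np

def argmax_gaze_feature(probs: dict) -> str:
    """Current approach: keep only the most probable gazed object."""
    return max(probs, key=probs.get)

def distribution_gaze_feature(probs: dict, object_order: list) -> np.ndarray:
    """Recommended approach: a fixed-order probability vector over all
    objects in the frame, preserving information the argmax throws away."""
    return np.array([probs.get(obj, 0.0) for obj in object_order])

# A near-tie between "cup" and "phone": the argmax hides it,
# the distribution keeps it visible to the intent classifier.
gaze_probs = {"cup": 0.45, "phone": 0.40, "none": 0.15}
print(argmax_gaze_feature(gaze_probs))                                # cup
print(distribution_gaze_feature(gaze_probs, ["cup", "phone", "none"]))
```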

Acknowledgement

We are extremely grateful for the work of DETR and MGTR, on which the gazed object detector was heavily based.
