
A curated collection of publications at the forefront of Egocentric Human Activity Recognition (HAR) and Action Anticipation through Deep Learning

DeepEgoHA-3R

Deep Egocentric Human Action Anticipation & Activity Recognition Research Publications

Welcome to the DeepEgoHA^3R repository! Explore the forefront of Egocentric Human Activity Recognition (HAR) and Action Anticipation through Deep Learning. This collection features cutting-edge publications in multimodal computer vision, a focus of ongoing Ph.D. research at the Vision Lab of the University of Thessaly, Lamia. Dive into advancements in egocentric vision, particularly HAR and action anticipation. Contribute to our ongoing research by adding publications or insights. Let's collaborate at the intersection of deep learning and egocentric vision. Stay tuned for updates as we collectively navigate this captivating exploration!

Inspiration and Credits

This repository drew inspiration from the Egocentric Vision repository, which provided valuable insights and ideas for exploring Egocentric Human Action Anticipation & Activity Recognition. We extend our gratitude to the contributors and researchers behind that repository for their work in advancing the field.

Sections

Surveys

Egocentric Vision

Action Anticipation

Deep Learning Methodologies and Architectures

General and Basic Methodologies

Pre-Trained Models

Optical Flow

Transformers

Low Light

Transfer Learning

Datasets

  • IndustReal - The IndustReal dataset contains 84 videos, demonstrating how 27 participants perform maintenance and assembly procedures on a construction-toy assembly set. WACV 2024. [paper]
  • ENIGMA-51 - ENIGMA-51 is a new egocentric dataset acquired in an industrial scenario by 19 subjects who followed instructions to complete the repair of electrical boards using industrial tools (e.g., an electric screwdriver) and equipment (e.g., an oscilloscope). The 51 egocentric video sequences are densely annotated with a rich set of labels that enable the systematic study of human behavior in the industrial domain. WACV 2024. [paper]
  • VidChapters-7M - VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. NeurIPS 2023. [paper]
  • POV-Surgery - POV-Surgery, a large-scale, synthetic, egocentric dataset focusing on pose estimation for hands with different surgical gloves and three orthopedic surgical instruments, namely scalpel, friem, and diskplacer. The dataset consists of 53 sequences and 88,329 frames, featuring high-resolution RGB-D video streams with activity annotations, accurate 3D and 2D annotations for hand-object pose, and 2D hand-object segmentation masks. MICCAI 2023. [paper]
  • ARGO1M - Action Recognition Generalisation dataset (ARGO1M) from videos and narrations from Ego4D. ARGO1M is the first to test action generalisation across both scenario and location shifts, and is the largest domain generalisation dataset across images and video. ICCV 2023. [paper]
  • EgoObjects - EgoObjects, a large-scale egocentric dataset for fine-grained object understanding. It contains over 9K videos collected by 250 participants from 50+ countries using 4 wearable devices, and over 650K object annotations from 368 object categories. ICCV 2023. [paper]
  • HoloAssist - HoloAssist: a large-scale egocentric human interaction dataset that spans 166 hours of data captured by 350 unique instructor-performer pairs, wearing mixed-reality headsets during collaborative tasks. ICCV 2023. [paper]
  • AssemblyHands - AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. CVPR 2023. [paper]
  • EpicSoundingObject - Epic Sounding Object dataset with sounding object annotations to benchmark the localization performance in egocentric videos. CVPR 2023. [paper]
  • VOST - Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks. CVPR 2023. [paper]
  • Aria Digital Twin - Aria Digital Twin (ADT) - an egocentric dataset captured using Aria glasses with extensive object, environment, and human-level ground truth. This ADT release contains 200 sequences of real-world activities conducted by Aria wearers. It supports very challenging research problems such as 3D object detection and tracking, scene reconstruction and understanding, sim-to-real learning, and human pose prediction, while also inspiring new machine perception tasks for augmented reality (AR) applications. 2023. [paper]
  • WEAR - The dataset comprises data from 18 participants performing a total of 18 different workout activities with untrimmed inertial (acceleration) and camera (egocentric video) data recorded at 10 different outside locations. 2023. [paper]
  • EPIC Fields - EPIC Fields, an augmentation of EPIC-KITCHENS with 3D camera information. Like other datasets for neural rendering, EPIC Fields removes the complex and expensive step of reconstructing cameras using photogrammetry, and allows researchers to focus on modelling problems. 2023. [paper]
  • EGOFALLS - The dataset comprises 10,948 video samples from 14 subjects, focusing on falls among the elderly; multimodal descriptors are extracted from the egocentric camera videos. 2023. [paper]
  • Exo2EgoDVC - EgoYC2, a novel egocentric dataset, adapts procedural captions from YouCook2 to cooking videos re-recorded with head-mounted cameras. Unique in its weakly-paired approach, it aligns caption content with exocentric videos, distinguishing itself from other datasets focused on action labels. 2023. [paper]
  • EgoWholeBody - EgoWholeBody, a large synthetic dataset comprising 840,000 high-quality egocentric images captured across a diverse range of whole-body motion sequences, supporting whole-body motion estimation from a single egocentric camera. 2023. [paper]
  • Ego-Exo4D - Ego-Exo4D, a vast multimodal multiview video dataset capturing skilled human activities in both egocentric and exocentric perspectives (e.g., sports, music, dance). With 800+ participants in 13 cities, it offers 1,422 hours of combined footage, featuring diverse activities in 131 natural scene contexts, ranging from 1 to 42 minutes per video. 2023. [paper]
  • IT3DEgo - Addresses 3D instance tracking using egocentric sensors (AR/VR). Recorded in diverse indoor scenes with a HoloLens 2, it comprises 50 recordings (5+ minutes each) and evaluates tracking performance in 3D coordinates, leveraging camera pose and an allocentric representation. 2023. [paper]
  • Touch and Go - A dataset in which human data collectors walk through a variety of environments, probing objects with tactile sensors and simultaneously recording their actions on video. NeurIPS 2022. [paper]
  • EPIC-Visor - VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. NeurIPS 2022. [paper]
  • AssistQ - A new dataset comprising 529 question-answer samples derived from 100 newly filmed first-person videos. Each question must be answered with multi-step guidance inferred from visual details (e.g., button positions) and textual details (e.g., actions like press/turn). ECCV 2022. [paper]
  • EgoProceL - EgoProceL dataset focuses on the key-steps required to perform a task instead of every action in the video. EgoProceL consists of 62 hours of videos captured by 130 subjects performing 16 tasks. ECCV 2022. [paper]
  • EgoHOS - EgoHOS, a labeled dataset consisting of 11,243 egocentric images with per-pixel segmentation labels of hands and objects being interacted with during a diverse array of daily activities. Our dataset is the first to label detailed hand-object contact boundaries. ECCV 2022. [paper]
  • UnrealEgo - UnrealEgo, a new large-scale naturalistic dataset for egocentric 3D human pose estimation. It is the first dataset to provide in-the-wild stereo images with the largest variety of motions among existing egocentric datasets. ECCV 2022. [paper]
  • Assembly101 - Procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 “take-apart” toy vehicles. CVPR 2022. [paper]
  • EgoPAT3D - A large multimodality dataset of more than 1 million frames of RGB-D and IMU streams, with evaluation metrics based on high-quality 2D and 3D labels from semi-automatic annotation. CVPR 2022. [paper]
  • AGD20K - Affordance dataset constructed by collecting and labeling over 20K images from 36 affordance categories. CVPR 2022. [paper]
  • HOI4D - A large-scale 4D egocentric dataset with rich annotations, to catalyze the research of category-level human-object interaction. HOI4D consists of 2.4M RGB-D egocentric video frames over 4000 sequences collected by 4 participants interacting with 800 different object instances from 16 categories over 610 different indoor rooms. Frame-wise annotations for panoptic segmentation, motion segmentation, 3D hand pose, category-level object pose and hand action have also been provided, together with reconstructed object meshes and scene point clouds. CVPR 2022. [paper]
  • EgoPW - A dataset captured by a head-mounted fisheye camera and an auxiliary external camera, which provides an additional observation of the human body from a third-person perspective. CVPR 2022. [paper]
  • Ego4D - 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. CVPR 2022. [paper]
  • N-EPIC-Kitchens - N-EPIC-Kitchens, the first event-based camera extension of the large-scale EPIC-Kitchens dataset. CVPR 2022. [paper]
  • EPIC-SOUNDS - [paper]
  • EasyCom-Clustering - The first large-scale egocentric video face clustering dataset. 2022. [paper]
  • First2Third-Pose - A new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives. 2022. [paper]
  • HMDB - A large human motion database [paper]
  • TREK-100 - Object tracking in first person vision. ICCVW 2021. [paper]
  • MECCANO - 20 subjects assembling a toy motorbike. WACV 2021. [paper]
  • EPIC-Kitchens 2020 - Subjects performing unscripted actions in their native environments. IJCV 2021. [paper]
  • H2O - H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds. ICCV 2021. [paper]
  • HOMAGE - Home Action Genome (HOMAGE): a multi-view action dataset with multiple modalities and view-points supplemented with hierarchical activity and atomic action labels together with dense scene composition labels. CVPR 2021. [paper]
  • EgoCom - A natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. TPAMI 2020. [paper]
  • EGO-CH - 70 subjects visiting two cultural sites in Sicily, Italy. Pattern Recognition Letters 2020. [paper]
  • EPIC-Tent - 29 participants assembling a tent while wearing two head-mounted cameras. ICCV 2019. [paper]
  • EPIC-Kitchens 2018 - 32 subjects performing unscripted actions in their native environments. ECCV 2018. [paper]
  • Charades-Ego - Paired first-third person videos.
  • EGTEA Gaze+ - 32 subjects, 86 cooking sessions, 28 hours.
  • ADL - 20 subjects performing daily activities in their native environments.
  • CMU kitchen - Multimodal, 18 subjects cooking 5 different recipes: brownies, eggs, pizza, salad, sandwich.
  • EgoSeg - Long term actions (walking, running, driving, etc.).
  • First-Person Social Interactions - 8 subjects at Disney World.
  • UEC Dataset - Two choreographed datasets with different egoactions (walk, jump, climb, etc.) + 6 YouTube sports videos.
  • JPL - Interaction with a robot.
  • FPPA - Five subjects performing 5 daily actions.
  • UT Egocentric - 3-5 hours long videos capturing a person's day.
  • VINST/ Visual Diaries - 31 videos capturing the visual experience of a subject walking from metro station to work.
  • Bristol Egocentric Object Interaction (BEOID) - 8 subjects, six locations. Interaction with objects and environment.
  • Object Search Dataset - 57 sequences of 55 subjects on search and retrieval tasks.
  • UNICT-VEDI - Different subjects visiting a museum.
  • UNICT-VEDI-POI - Different subjects visiting a museum.
  • Simulated Egocentric Navigations - Simulated navigations of a virtual agent within a large building.
  • EgoCart - Egocentric images collected by a shopping cart in a retail store.
  • Unsupervised Segmentation of Daily Living Activities - Egocentric videos of daily activities.
  • Visual Market Basket Analysis - Egocentric images collected by a shopping cart in a retail store.
  • Location Based Segmentation of Egocentric Videos - Egocentric videos of daily activities.
  • Recognition of Personal Locations from Egocentric Videos - Egocentric video clips of daily activities.
  • EgoGesture - 2k videos from 50 subjects performing 83 gestures.
  • EgoHands - 48 videos of interactions between two people.
  • DoMSEV - 80 hours of videos of different activities.
  • DR(eye)VE - 74 videos of people driving.
  • THU-READ - 8 subjects performing 40 actions with a head-mounted RGBD camera.
  • EgoDexter - 4 sequences with 4 actors (2 female), and varying interactions with various objects against a cluttered background. [paper]
  • First-Person Hand Action (FPHA) - 3D hand-object interaction. Includes 1175 videos belonging to 45 different activity categories performed by 6 actors. [paper]
  • UTokyo Paired Ego-Video (PEV) - 1,226 pairs of first-person clips extracted from videos recorded synchronously during dyadic conversations.
  • UTokyo Ego-Surf - Contains 8 diverse groups of first-person videos recorded synchronously during face-to-face conversations.
  • TEgO: Teachable Egocentric Objects Dataset - Contains egocentric images of 19 distinct objects taken by two people for training a teachable object recognizer.
  • Multimodal Focused Interaction Dataset - Contains 377 minutes of continuous multimodal recording captured during 19 sessions, with 17 conversational partners in 18 different indoor/outdoor locations.
  • Online Action Detection - Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees Snoek & Tinne Tuytelaars, 2016.
  • Breakfast - [paper]
  • AVA Kinetics - [paper]
  • TV Human Interactions Dataset
  • FPV-O - [paper]
  • BON - [paper]
  • EgoTracks - [paper]
  • EGOK360 - [paper]
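Many of the trimmed-clip datasets listed above distribute their action annotations as simple tables (video file, start time, end time, action label). As a point of reference, the sketch below shows one generic way such annotations can be wrapped for training an action-recognition model with PyTorch and torchvision; the CSV layout, file paths, and the EgoClipDataset class are illustrative assumptions, not the loader of any specific dataset listed here.

```python
# Minimal, generic sketch of a trimmed-clip loader for egocentric action
# recognition. Assumes a hypothetical CSV with columns:
#   video_path,start_sec,end_sec,label
import csv
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_video

class EgoClipDataset(Dataset):
    """Yields (clip, label) pairs, where clip is a (C, T, H, W) float tensor."""

    def __init__(self, annotation_csv, num_frames=16):
        with open(annotation_csv) as f:
            self.items = list(csv.DictReader(f))
        self.num_frames = num_frames

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        # Decode only the annotated segment of the source video.
        frames, _, _ = read_video(
            item["video_path"],
            start_pts=float(item["start_sec"]),
            end_pts=float(item["end_sec"]),
            pts_unit="sec",
        )  # (T, H, W, C), uint8
        # Uniformly subsample a fixed number of frames and convert to the
        # (C, T, H, W) layout expected by most video backbones.
        t = torch.linspace(0, frames.shape[0] - 1, self.num_frames).long()
        clip = frames[t].permute(3, 0, 1, 2).float() / 255.0
        return clip, int(item["label"])

if __name__ == "__main__":
    # Hypothetical annotation file; batches can then feed any clip-based model.
    loader = DataLoader(EgoClipDataset("train_annotations.csv"),
                        batch_size=8, shuffle=True)
```

Datasets that provide additional modalities (depth, IMU, gaze, audio) would extend __getitem__ to return those streams alongside the RGB clip.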

Challenges

  • Ego4D - Episodic Memory, Hand-Object Interactions, AV Diarization, Social, Forecasting.
  • EPIC-Kitchens Challenge - Action Recognition, Action Detection, Action Anticipation, Unsupervised Domain Adaptation for Action Recognition, Multi-Instance Retrieval.
  • MECCANO - Multimodal Action Recognition (RGB-Depth-Gaze).
  • THUMOS - The THUMOS challenge on action recognition for videos "in the wild".

Theses and dissertations
