
A curated collection of publications at the forefront of Egocentric Human Activity Recognition (HAR) and Action Anticipation through Deep Learning

DeepEgoHA-3R

Deep Egocentric Human Action Anticipation & Activity Recognition Research Publications

Welcome to the DeepEgoHA^3R repository! Explore the forefront of Egocentric Human Activity Recognition (HAR) and Action Anticipation through Deep Learning. This collection features cutting-edge publications in multimodal computer vision, a focus of ongoing Ph.D. research at the Vision Lab of the University of Thessaly, Lamia. Dive into advancements in egocentric vision, particularly HAR and action anticipation. Contribute to our ongoing research by adding publications or insights. Let's collaborate at the intersection of deep learning and egocentric vision. Stay tuned for updates as we collectively navigate this captivating exploration!

Inspiration and Credits

This repository drew inspiration from the Egocentric Vision repository, which provided valuable insights and ideas for exploring Egocentric Human Action Anticipation & Activity Recognition. We extend our gratitude to the contributors and researchers behind that repository for their work in advancing the field.

Sections

Surveys

Egocentric Vision

Action Anticipation

Deep Learning Methodologies and Architectures

General and Basic Methodologies

Pre-Trained Models

Optical Flow

Transformers

Low Light

Transfer Learning

Datasets

  • IndustReal - The IndustReal dataset contains 84 videos, demonstrating how 27 participants perform maintenance and assembly procedures on a construction-toy assembly set. WACV 2024. [paper]
  • ENIGMA-51 - ENIGMA-51 is a new egocentric dataset acquired in an industrial scenario by 19 subjects who followed instructions to complete the repair of electrical boards using industrial tools (e.g., an electric screwdriver) and equipment (e.g., an oscilloscope). The 51 egocentric video sequences are densely annotated with a rich set of labels that enable the systematic study of human behavior in the industrial domain. WACV 2024. [paper]
  • VidChapters-7M - VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. NeurIPS 2023. [paper]
  • POV-Surgery - POV-Surgery, a large-scale, synthetic, egocentric dataset focusing on pose estimation for hands with different surgical gloves and three orthopedic surgical instruments, namely scalpel, friem, and diskplacer. The dataset consists of 53 sequences and 88,329 frames, featuring high-resolution RGB-D video streams with activity annotations, accurate 3D and 2D annotations for hand-object pose, and 2D hand-object segmentation masks. MICCAI 2023. [paper]
  • ARGO1M - Action Recognition Generalisation dataset (ARGO1M) from videos and narrations from Ego4D. ARGO1M is the first to test action generalisation across both scenario and location shifts, and is the largest domain generalisation dataset across images and video. ICCV 2023. [paper]
  • EgoObjects - EgoObjects, a large-scale egocentric dataset for fine-grained object understanding. It contains over 9K videos collected by 250 participants from 50+ countries using 4 wearable devices, and over 650K object annotations from 368 object categories. ICCV 2023. [paper]
  • HoloAssist - HoloAssist: a large-scale egocentric human interaction dataset that spans 166 hours of data captured by 350 unique instructor-performer pairs, wearing mixed-reality headsets during collaborative tasks. ICCV 2023. [paper]
  • AssemblyHands - AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. CVPR 2023. [paper]
  • EpicSoundingObject - Epic Sounding Object dataset with sounding object annotations to benchmark the localization performance in egocentric videos. CVPR 2023. [paper]
  • VOST - Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks. CVPR 2023. [paper]
  • Aria Digital Twin - Aria Digital Twin (ADT) - an egocentric dataset captured using Aria glasses with extensive object, environment, and human-level ground truth. This ADT release contains 200 sequences of real-world activities conducted by Aria wearers. It supports very challenging research problems such as 3D object detection and tracking, scene reconstruction and understanding, sim-to-real learning, and human pose prediction, while also inspiring new machine perception tasks for augmented reality (AR) applications. 2023. [paper]
  • WEAR - The dataset comprises data from 18 participants performing a total of 18 different workout activities with untrimmed inertial (acceleration) and camera (egocentric video) data recorded at 10 different outside locations. 2023. [paper]
  • EPIC Fields - EPIC Fields, an augmentation of EPIC-KITCHENS with 3D camera information. Like other datasets for neural rendering, EPIC Fields removes the complex and expensive step of reconstructing cameras using photogrammetry, and allows researchers to focus on modelling problems. 2023. [paper]
  • EGOFALLS - The dataset comprises 10,948 video samples from 14 subjects, focusing on falls among the elderly; multimodal descriptors are extracted from the egocentric camera videos. 2023. [paper]
  • Exo2EgoDVC - EgoYC2, a novel egocentric dataset, adapts procedural captions from YouCook2 to cooking videos re-recorded with head-mounted cameras. Unique in its weakly-paired approach, it aligns caption content with exocentric videos, distinguishing itself from other datasets focused on action labels. 2023. [paper]
  • EgoWholeBody - EgoWholeBody, a large synthetic dataset comprising 840,000 high-quality egocentric images captured across a diverse range of whole-body motion sequences, supporting whole-body motion estimation from a single egocentric camera. 2023. [paper]
  • Ego-Exo4D - Ego-Exo4D, a vast multimodal multiview video dataset capturing skilled human activities in both egocentric and exocentric perspectives (e.g., sports, music, dance). With 800+ participants in 13 cities, it offers 1,422 hours of combined footage, featuring diverse activities in 131 natural scene contexts, ranging from 1 to 42 minutes per video. 2023. [paper]
  • IT3DEgo - Addresses 3D instance tracking using egocentric sensors (AR/VR). Recorded in diverse indoor scenes with a HoloLens 2, it comprises 50 recordings (5+ minutes each) and evaluates tracking performance in 3D coordinates, leveraging camera pose and an allocentric representation. 2023. [paper]
  • Touch and Go - A dataset in which human data collectors walk through a variety of environments, probing objects with tactile sensors and simultaneously recording their actions on video. NeurIPS 2022. [paper]
  • EPIC-Visor - VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. NeurIPS 2022. [paper]
  • AssistQ - A new dataset comprising 529 question-answer samples derived from 100 newly filmed first-person videos. Each question must be answered with multi-step guidance inferred from visual details (e.g., button positions) and textual details (e.g., actions like press/turn). ECCV 2022. [paper]
  • EgoProceL - EgoProceL dataset focuses on the key-steps required to perform a task instead of every action in the video. EgoProceL consists of 62 hours of videos captured by 130 subjects performing 16 tasks. ECCV 2022. [paper]
  • EgoHOS - EgoHOS, a labeled dataset consisting of 11,243 egocentric images with per-pixel segmentation labels of hands and objects being interacted with during a diverse array of daily activities. Our dataset is the first to label detailed hand-object contact boundaries. ECCV 2022. [paper]
  • UnrealEgo - UnrealEgo, a new large-scale naturalistic dataset for egocentric 3D human pose estimation. It is the first dataset to provide in-the-wild stereo images with the largest variety of motions among existing egocentric datasets. ECCV 2022. [paper]
  • Assembly101 - Procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 “take-apart” toy vehicles. CVPR 2022. [paper]
  • EgoPAT3D - A large multimodality dataset of more than 1 million frames of RGB-D and IMU streams, with evaluation metrics based on high-quality 2D and 3D labels from semi-automatic annotation. CVPR 2022. [paper]
  • AGD20K - Affordance dataset constructed by collecting and labeling over 20K images from 36 affordance categories. CVPR 2022. [paper]
  • HOI4D - A large-scale 4D egocentric dataset with rich annotations, to catalyze the research of category-level human-object interaction. HOI4D consists of 2.4M RGB-D egocentric video frames over 4000 sequences collected by 4 participants interacting with 800 different object instances from 16 categories over 610 different indoor rooms. Frame-wise annotations for panoptic segmentation, motion segmentation, 3D hand pose, category-level object pose and hand action have also been provided, together with reconstructed object meshes and scene point clouds. CVPR 2022. [paper]
  • EgoPW - A dataset captured by a head-mounted fisheye camera and an auxiliary external camera, which provides an additional observation of the human body from a third-person perspective. CVPR 2022. [paper]
  • Ego4D - 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. CVPR 2022. [paper]
  • N-EPIC-Kitchens - N-EPIC-Kitchens, the first event-based camera extension of the large-scale EPIC-Kitchens dataset. CVPR 2022. [paper]
  • EPIC-SOUNDS - [paper]
  • EasyCom-Clustering - The first large-scale egocentric video face clustering dataset. 2022. [paper]
  • First2Third-Pose - A new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives. 2022. [paper]
  • HMDB - A large human motion database [paper]
  • TREK-100 - Object tracking in first person vision. ICCVW 2021. [paper]
  • MECCANO - 20 subjects assembling a toy motorbike. WACV 2021. [paper]
  • EPIC-Kitchens 2020 - Subjects performing unscripted actions in their native environments. IJCV 2021. [paper]
  • H2O - H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds. ICCV 2021. [paper]
  • HOMAGE - Home Action Genome (HOMAGE): a multi-view action dataset with multiple modalities and view-points supplemented with hierarchical activity and atomic action labels together with dense scene composition labels. CVPR 2021. [paper]
  • EgoCom - A natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. TPAMI 2020. [paper]
  • EGO-CH - 70 subjects visiting two cultural sites in Sicily, Italy. Pattern Recognition Letters 2020. [paper]
  • EPIC-Tent - 29 participants assembling a tent while wearing two head-mounted cameras. ICCV 2019. [paper]
  • EPIC-Kitchens 2018 - 32 subjects performing unscripted actions in their native environments. ECCV 2018. [paper]
  • Charades-Ego - Paired first-third person videos.
  • EGTEA Gaze+ - 32 subjects, 86 cooking sessions, 28 hours.
  • ADL - 20 subjects performing daily activities in their native environments.
  • CMU kitchen - Multimodal, 18 subjects cooking 5 different recipes: brownies, eggs, pizza, salad, sandwich.
  • EgoSeg - Long term actions (walking, running, driving, etc.).
  • First-Person Social Interactions - 8 subjects at Disney World.
  • UEC Dataset - Two choreographed datasets with different egoactions (walk, jump, climb, etc.) + 6 YouTube sports videos.
  • JPL - Interaction with a robot.
  • FPPA - Five subjects performing 5 daily actions.
  • UT Egocentric - 3-5 hours long videos capturing a person's day.
  • VINST/ Visual Diaries - 31 videos capturing the visual experience of a subject walking from metro station to work.
  • Bristol Egocentric Object Interaction (BEOID) - 8 subjects, six locations. Interaction with objects and environment.
  • Object Search Dataset - 57 sequences of 55 subjects on search and retrieval tasks.
  • UNICT-VEDI - Different subjects visiting a museum.
  • UNICT-VEDI-POI - Different subjects visiting a museum.
  • Simulated Egocentric Navigations - Simulated navigations of a virtual agent within a large building.
  • EgoCart - Egocentric images collected by a shopping cart in a retail store.
  • Unsupervised Segmentation of Daily Living Activities - Egocentric videos of daily activities.
  • Visual Market Basket Analysis - Egocentric images collected by a shopping cart in a retail store.
  • Location Based Segmentation of Egocentric Videos - Egocentric videos of daily activities.
  • Recognition of Personal Locations from Egocentric Videos - Egocentric video clips of daily activities.
  • EgoGesture - 2k videos from 50 subjects performing 83 gestures.
  • EgoHands - 48 videos of interactions between two people.
  • DoMSEV - 80 hours of videos of different activities.
  • DR(eye)VE - 74 videos of people driving.
  • THU-READ - 8 subjects performing 40 actions with a head-mounted RGBD camera.
  • EgoDexter - 4 sequences with 4 actors (2 female), and varying interactions with various objects against a cluttered background. [paper]
  • First-Person Hand Action (FPHA) - 3D hand-object interaction. Includes 1175 videos belonging to 45 different activity categories performed by 6 actors. [paper]
  • UTokyo Paired Ego-Video (PEV) - 1,226 pairs of first-person clips extracted from videos recorded synchronously during dyadic conversations.
  • UTokyo Ego-Surf - Contains 8 diverse groups of first-person videos recorded synchronously during face-to-face conversations.
  • TEgO: Teachable Egocentric Objects Dataset - Contains egocentric images of 19 distinct objects taken by two people for training a teachable object recognizer.
  • Multimodal Focused Interaction Dataset - Contains 377 minutes of continuous multimodal recording captured during 19 sessions, with 17 conversational partners in 18 different indoor/outdoor locations.
  • Online Action Detection - Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees Snoek & Tinne Tuytelaars, 2016.
  • Breakfast - [paper]
  • AVA Kinetics - [paper]
  • TV Human Interactions Dataset
  • FPV-O - [paper]
  • BON - [paper]
  • EgoTracks - [paper]
  • EGOK360 - [paper]
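Many of the trimmed-clip datasets listed above distribute their action annotations as simple tables (video file, start time, end time, action label). As a point of reference, the sketch below shows one generic way such annotations can be wrapped for training an action-recognition model with PyTorch and torchvision; the CSV layout, file paths, and the EgoClipDataset class are illustrative assumptions, not the loader of any specific dataset listed here.

```python
# Minimal, generic sketch of a trimmed-clip loader for egocentric action
# recognition. Assumes a hypothetical CSV with columns:
#   video_path,start_sec,end_sec,label
import csv
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_video

class EgoClipDataset(Dataset):
    """Yields (clip, label) pairs, where clip is a (C, T, H, W) float tensor."""

    def __init__(self, annotation_csv, num_frames=16):
        with open(annotation_csv) as f:
            self.items = list(csv.DictReader(f))
        self.num_frames = num_frames

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        # Decode only the annotated segment of the source video.
        frames, _, _ = read_video(
            item["video_path"],
            start_pts=float(item["start_sec"]),
            end_pts=float(item["end_sec"]),
            pts_unit="sec",
        )  # (T, H, W, C), uint8
        # Uniformly subsample a fixed number of frames and convert to the
        # (C, T, H, W) layout expected by most video backbones.
        t = torch.linspace(0, frames.shape[0] - 1, self.num_frames).long()
        clip = frames[t].permute(3, 0, 1, 2).float() / 255.0
        return clip, int(item["label"])

if __name__ == "__main__":
    # Hypothetical annotation file; batches can then feed any clip-based model.
    loader = DataLoader(EgoClipDataset("train_annotations.csv"),
                        batch_size=8, shuffle=True)
```

Datasets that provide additional modalities (depth, IMU, gaze, audio) would extend __getitem__ to return those streams alongside the RGB clip.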

Challenges

  • Ego4D - Episodic Memory, Hand-Object Interactions, AV Diarization, Social, Forecasting.
  • EPIC-Kitchens Challenge - Action Recognition, Action Detection, Action Anticipation, Unsupervised Domain Adaptation for Action Recognition, Multi-Instance Retrieval.
  • MECCANO - Multimodal Action Recognition (RGB-Depth-Gaze).
  • THUMOS - The THUMOS challenge on action recognition for videos "in the wild".

Theses and dissertations
