mxmdpc / HOI-Learning-List

A list of Human-Object Interaction Learning.

HOI-Learning-List

A collection of recent (2015-now) Human-Object Interaction Learning studies. If you find any errors or problems, please feel free to comment.

A list of Transformer-based vision works: https://github.com/DirtyHarryLYL/Transformer-in-Vision.

Dataset/Benchmark


Video HOI Datasets

Method

HOI Image Generation

  • Exploiting Relationship for Complex-scene Image Generation (arXiv 2021.04) [Paper]

  • Specifying Object Attributes and Relations in Interactive Scene Generation (arXiv 2019.11) [Paper]

HOI Recognition: Image-based, to recognize all the HOIs in one image.


Unseen or zero-shot learning (image-level recognition).

  • ICompass (ICCV2021) [Paper], [Code]

  • Compositional Learning for Human Object Interaction (ECCV2018) [Paper]

  • Zero-Shot Human-Object Interaction Recognition via Affordance Graphs (Sep. 2020) [Paper]


HOI Detection: Instance-based, to detect the human-object pairs and classify the interactions.


Unseen or zero/low-shot or weakly-supervised learning (instance-level detection).

  • Align-Former (BMVC 2021), [Paper]

  • Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection (ICCV2021) [Paper], [Code]

  • DGIG-Net (TOC2021) [Paper]

  • ATL (CVPR2021) [Paper], [Code]

  • FCL (CVPR2021) [Paper], [Code]

  • Detecting Human-Object Interaction with Mixed Supervision (WACV 2021) [Paper]

  • ConsNet (ACMMM2020) [Paper] [Code]

  • Zero-Shot Human-Object Interaction Recognition via Affordance Graphs (Sep. 2020) [Paper]

  • VCL (ECCV2020) [Paper] [Code]

  • HOID (CVPR2020) [Code] [Paper]

  • Novel Human-Object Interaction Detection via Adversarial Domain Generalization (May. 2020) [Paper]

  • Analogy (ICCV2019) [Code] [Paper]

  • Functional (AAAI2020) [Paper]

  • Scaling Human-Object Interaction Recognition through Zero-Shot Learning (WACV2018) [Paper]


Video HOI methods

  • Detecting Human-Object Relationships in Videos (ICCV2021) [Paper]

  • STIGPN (Aug 2021), [Paper], [Code]

  • VidHOI (May 2021), [Paper]

  • LIGHTEN (ACMMM2020) [Paper] [Code]

  • Generating Videos of Zero-Shot Compositions of Actions and Objects (Jul 2020), HOI GAN, [Paper]

  • Grounded Human-Object Interaction Hotspots from Video (ICCV2019) [Code] [Paper]

  • GPNN (ECCV2018) [Code] [Paper]


Result

Proposed by TIN (TPAMI version, Transferable Interactiveness Network). It is built on HAKE data and includes 110K+ images and 520 HOIs (the 80 "no_interaction" HOIs of HICO-DET are excluded to avoid incomplete labeling). Its data distribution is more severely long-tailed, which makes it more difficult.

Detector: COCO pre-trained

| Method | mAP |
|---|---|
| iCAN | 11.00 |
| iCAN+NIS | 13.13 |
| TIN | 15.38 |

HICO-DET:

1) Detector: COCO pre-trained

| Method | Pub | Full(def) | Rare(def) | Non-Rare(def) | Full(ko) | Rare(ko) | Non-Rare(ko) |
|---|---|---|---|---|---|---|---|
| Shen et al. | WACV2018 | 6.46 | 4.24 | 7.12 | - | - | - |
| HO-RCNN | WACV2018 | 7.81 | 5.37 | 8.54 | 10.41 | 8.94 | 10.85 |
| InteractNet | CVPR2018 | 9.94 | 7.16 | 10.77 | - | - | - |
| Turbo | AAAI2019 | 11.40 | 7.30 | 12.60 | - | - | - |
| GPNN | ECCV2018 | 13.11 | 9.34 | 14.23 | - | - | - |
| Xu et al. | ICCV2019 | 14.70 | 13.26 | 15.13 | - | - | - |
| iCAN | BMVC2018 | 14.84 | 10.45 | 16.15 | 16.26 | 11.33 | 17.73 |
| Wang et al. | ICCV2019 | 16.24 | 11.16 | 17.75 | 17.73 | 12.78 | 19.21 |
| Lin et al. | IJCAI2020 | 16.63 | 11.30 | 18.22 | 19.22 | 14.56 | 20.61 |
| Functional (suppl) | AAAI2020 | 16.96 | 11.73 | 18.52 | - | - | - |
| Interactiveness | CVPR2019 | 17.03 | 13.42 | 18.11 | 19.17 | 15.51 | 20.26 |
| No-Frills | ICCV2019 | 17.18 | 12.17 | 18.68 | - | - | - |
| RPNN | ICCV2019 | 17.35 | 12.78 | 18.71 | - | - | - |
| PMFNet | ICCV2019 | 17.46 | 15.65 | 18.00 | 20.34 | 17.47 | 21.20 |
| SIGN | ICME2020 | 17.51 | 15.31 | 18.53 | 20.49 | 17.53 | 21.51 |
| Interactiveness-optimized | CVPR2019 | 17.54 | 13.80 | 18.65 | 19.75 | 15.70 | 20.96 |
| Liu et al. | arXiv | 17.55 | 20.61 | - | - | - | - |
| Wang et al. | ECCV2020 | 17.57 | 16.85 | 17.78 | 21.00 | 20.74 | 21.08 |
| In-GraphNet | IJCAI-PRICAI 2020 | 17.72 | 12.93 | 19.31 | - | - | - |
| HOID | CVPR2020 | 17.85 | 12.85 | 19.34 | - | - | - |
| MLCNet | ICMR2020 | 17.95 | 16.62 | 18.35 | 22.28 | 20.73 | 22.74 |
| SAG | arXiv | 18.26 | 13.40 | 19.71 | - | - | - |
| Sarullo et al. | arXiv | 18.74 | - | - | - | - | - |
| DRG | ECCV2020 | 19.26 | 17.74 | 19.71 | 23.40 | 21.75 | 23.89 |
| Analogy | ICCV2019 | 19.40 | 14.60 | 20.90 | - | - | - |
| VCL | ECCV2020 | 19.43 | 16.55 | 20.29 | 22.00 | 19.09 | 22.87 |
| VS-GATs | arXiv | 19.66 | 15.79 | 20.81 | - | - | - |
| VSGNet | CVPR2020 | 19.80 | 16.05 | 20.91 | - | - | - |
| PFNet | CVM | 20.05 | 16.66 | 21.07 | 24.01 | 21.09 | 24.89 |
| ATL(w/ affordance) | CVPR2021 | 20.08 | 15.57 | 21.43 | - | - | - |
| ATL | CVPR2021 | 21.07 | 16.79 | 22.35 | - | - | - |
| FCMNet | ECCV2020 | 20.41 | 17.34 | 21.56 | 22.04 | 18.97 | 23.12 |
| ACP | ECCV2020 | 20.59 | 15.92 | 21.98 | - | - | - |
| PD-Net | ECCV2020 | 20.81 | 15.90 | 22.28 | 24.78 | 18.88 | 26.54 |
| SG2HOI | ICCV2021 | 20.93 | 18.24 | 21.78 | 24.83 | 20.52 | 25.32 |
| TIN-PAMI | TPAMI2021 | 20.93 | 18.95 | 21.32 | 23.02 | 20.96 | 23.42 |
| PMN | arXiv | 21.21 | 17.60 | 22.29 | - | - | - |
| IPGN | TIP2021 | 21.26 | 18.47 | 22.07 | - | - | - |
| DJ-RN | CVPR2020 | 21.34 | 18.53 | 22.18 | 23.69 | 20.64 | 24.60 |
| OSGNet | IEEE Access | 21.40 | 18.12 | 22.38 | - | - | - |
| DIRV | AAAI2021 | 21.78 | 16.38 | 23.39 | 25.52 | 20.84 | 26.92 |
| SCG | ICCV2021 | 21.85 | 18.11 | 22.97 | - | - | - |
| HRNet | TIP2021 | 21.93 | 16.30 | 23.62 | 25.22 | 18.75 | 27.15 |
| ConsNet | ACMMM2020 | 22.15 | 17.55 | 23.52 | 26.57 | 20.8 | 28.3 |
| IDN | NeurIPS2020 | 23.36 | 22.47 | 23.63 | 26.43 | 25.01 | 26.85 |
| QAHOI-Res50 | arXiv2021 | 24.35 | 16.18 | 26.80 | - | - | - |
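
The Full, Rare, and Non-Rare columns in the table above are mean AP values over different subsets of HOI categories. The sketch below shows one way such a split could be computed. It assumes the commonly used HICO-DET convention that a category counts as rare when it has fewer than 10 training instances; that threshold is not stated in this list, and the function name `split_map`, the category names, and all numbers are made-up illustrations.

```python
from statistics import mean

# Assumption: a category is "rare" if it has fewer than 10 training instances.
RARE_THRESHOLD = 10


def split_map(ap_per_hoi, train_counts, rare_threshold=RARE_THRESHOLD):
    """Return (full, rare, non_rare) mAP from per-category AP values."""
    rare = [ap for hoi, ap in ap_per_hoi.items()
            if train_counts[hoi] < rare_threshold]
    non_rare = [ap for hoi, ap in ap_per_hoi.items()
                if train_counts[hoi] >= rare_threshold]
    full = mean(list(ap_per_hoi.values()))
    rare_map = mean(rare) if rare else float("nan")
    non_rare_map = mean(non_rare) if non_rare else float("nan")
    return full, rare_map, non_rare_map


# Toy per-category AP values and training-instance counts (hypothetical).
ap = {"ride bicycle": 0.60, "hold cup": 0.40,
      "wash elephant": 0.10, "lick knife": 0.20}
counts = {"ride bicycle": 500, "hold cup": 300,
          "wash elephant": 4, "lick knife": 7}

full, rare, non_rare = split_map(ap, counts)
print(f"Full {full:.3f}  Rare {rare:.3f}  Non-Rare {non_rare:.3f}")
```

Because rare categories contribute few of the categories in the Full average, a method can rank well on Full while still being weak on Rare, which is why the tables report the three numbers separately.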

2) Detector: pre-trained on COCO and fine-tuned on the HICO-DET train set (with GT human-object pair boxes), or a one-stage detector (point-based, transformer-based)

A fine-tuned detector learns to detect only the interactive humans and objects (those with interactiveness), which suppresses many wrong pairings (non-interactive human-object pairs) and boosts performance.
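
To see why suppressing non-interactive detections matters, it helps to count candidate pairs. The following is a minimal sketch, assuming a two-stage pipeline that pairs every detected human with every other detected box; the detection dictionary layout and the helper `candidate_pairs` are hypothetical and not taken from any listed method.

```python
from itertools import product


def candidate_pairs(detections, human_label="person"):
    """Pair each detected human with every other detected box."""
    humans = [d for d in detections if d["label"] == human_label]
    # A human may also interact with another human, so the object side
    # is the full detection list minus the human itself.
    return [(h, o) for h, o in product(humans, detections) if h is not o]


# Toy detections for one image (boxes are x1, y1, x2, y2; values are made up).
detections = [
    {"label": "person", "box": (10, 20, 110, 220)},
    {"label": "person", "box": (200, 30, 290, 230)},
    {"label": "bicycle", "box": (0, 120, 130, 240)},
    {"label": "cup", "box": (300, 50, 320, 80)},
    {"label": "bench", "box": (180, 150, 330, 240)},
]

pairs = candidate_pairs(detections)
print(len(pairs))  # 2 humans x 5 boxes, minus 2 self-pairs = 8
```

Even in this small example only a few of the eight pairs would be truly interactive. Every extra non-interactive box the detector emits adds more negative pairs that the interaction classifier has to reject.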

| Method | Pub | Full(def) | Rare(def) | Non-Rare(def) | Full(ko) | Rare(ko) | Non-Rare(ko) |
|---|---|---|---|---|---|---|---|
| UniDet | ECCV2020 | 17.58 | 11.72 | 19.33 | 19.76 | 14.68 | 21.27 |
| IP-Net | CVPR2020 | 19.56 | 12.79 | 21.58 | 22.05 | 15.77 | 23.92 |
| RR-Net | arXiv | 20.72 | 13.21 | 22.97 | - | - | - |
| PPDM (paper) | CVPR2020 | 21.10 | 14.46 | 23.09 | - | - | - |
| PPDM (github-hourglass104) | CVPR2020 | 21.73/21.94 | 13.78/13.97 | 24.10/24.32 | 24.58/24.81 | 16.65/17.09 | 26.84/27.12 |
| Functional | AAAI2020 | 21.96 | 16.43 | 23.62 | - | - | - |
| SABRA-Res50 | arXiv | 23.48 | 16.39 | 25.59 | 28.79 | 22.75 | 30.54 |
| VCL | ECCV2020 | 23.63 | 17.21 | 25.55 | 25.98 | 19.12 | 28.03 |
| PST | ICCV2021 | 23.93 | 14.98 | 26.60 | 26.42 | 17.61 | 29.05 |
| SABRA-Res50FPN | arXiv | 24.12 | 15.91 | 26.57 | 29.65 | 22.92 | 31.65 |
| DRG | ECCV2020 | 24.53 | 19.47 | 26.04 | 27.98 | 23.11 | 29.43 |
| HOTR | CVPR2021 | 25.10 | 17.34 | 27.42 | - | - | - |
| ConsNet-F | ACMMM2020 | 25.94 | 19.35 | 27.91 | 30.34 | 23.4 | 32.41 |
| SABRA-Res152 | arXiv | 26.09 | 16.29 | 29.02 | 31.08 | 23.44 | 33.37 |
| QAHOI-Res50 | arXiv2021 | 26.18 | 18.06 | 28.61 | - | - | - |
| IDN | NeurIPS2020 | 26.29 | 22.61 | 27.39 | 28.24 | 24.47 | 29.37 |
| Zou et al. | CVPR2021 | 26.61 | 19.15 | 28.84 | 29.13 | 20.98 | 31.57 |
| ATL | CVPR2021 | 27.68 | 20.31 | 29.89 | 30.05 | 22.40 | 32.34 |
| GTNet | arXiv | 28.03 | 22.73 | 29.61 | 29.98 | 24.13 | 31.73 |
| ATL(w/ affordance) | CVPR2021 | 28.53 | 21.64 | 30.59 | 31.18 | 24.15 | 33.29 |
| AS-Net | CVPR2021 | 28.87 | 24.25 | 30.25 | 31.74 | 27.07 | 33.14 |
| QPIC-Res50 | CVPR2021 | 29.07 | 21.85 | 31.23 | 31.68 | 24.14 | 33.93 |
| FCL | CVPR2021 | 29.12 | 23.67 | 30.75 | 31.31 | 25.62 | 33.02 |
| GGNet | CVPR2021 | 29.17 | 22.13 | 30.84 | 33.50 | 26.67 | 34.89 |
| QPIC-Res101 | CVPR2021 | 29.90 | 23.92 | 31.69 | 32.38 | 26.06 | 34.27 |
| SCG | ICCV2021 | 29.26 | 24.61 | 30.65 | 32.87 | 27.89 | 34.35 |
| PhraseHOI | AAAI2022 | 30.03 | 23.48 | 31.99 | 33.74 | 27.35 | 35.64 |
| OCN | AAAI2022 | 31.43 | 25.80 | 33.11 | 65.3 | 67.1 | |
| CDN | NeurIPS2021 | 32.07 | 27.19 | 33.53 | 34.79 | 29.48 | 36.38 |
| DEFR | arXiv2021 | 32.35 | 33.45 | 32.02 | - | - | - |
| UPT | arXiv2021 | 32.62 | 28.62 | 33.81 | 36.08 | 31.41 | 37.47 |
| QAHOI-Swin-Large-ImageNet-22K | arXiv2021 | 35.78 | 29.80 | 37.56 | 37.59 | 31.66 | 39.36 |

3) Ground Truth human-object pair boxes (only evaluating HOI recognition)

| Method | Pub | Full(def) | Rare(def) | Non-Rare(def) |
|---|---|---|---|---|
| iCAN | BMVC2018 | 33.38 | 21.43 | 36.95 |
| Interactiveness | CVPR2019 | 34.26 | 22.90 | 37.65 |
| Analogy | ICCV2019 | 34.35 | 27.57 | 36.38 |
| ATL | CVPR2021 | 43.32 | 33.84 | 46.15 |
| IDN | NeurIPS2020 | 43.98 | 40.27 | 45.09 |
| ATL(w/ affordance) | CVPR2021 | 44.27 | 35.52 | 46.89 |
| FCL | CVPR2021 | 45.25 | 36.27 | 47.94 |
| GTNet | arXiv | 46.45 | 35.10 | 49.84 |
| SCG | ICCV2021 | 51.53 | 41.01 | 54.67 |
| ConsNet | ACMMM2020 | 53.04 | 38.79 | 57.3 |

4) Enhanced with HAKE:

| Method | Pub | Full(def) | Rare(def) | Non-Rare(def) | Full(ko) | Rare(ko) | Non-Rare(ko) |
|---|---|---|---|---|---|---|---|
| iCAN | BMVC2018 | 14.84 | 10.45 | 16.15 | 16.26 | 11.33 | 17.73 |
| iCAN + HAKE-HICO-DET | CVPR2020 | 19.61 (+4.77) | 17.29 | 20.30 | 22.10 | 20.46 | 22.59 |
| Interactiveness | CVPR2019 | 17.03 | 13.42 | 18.11 | 19.17 | 15.51 | 20.26 |
| Interactiveness + HAKE-HICO-DET | CVPR2020 | 22.12 (+5.09) | 20.19 | 22.69 | 24.06 | 22.19 | 24.62 |
| Interactiveness + HAKE-Large | CVPR2020 | 22.66 (+5.63) | 21.17 | 23.09 | 24.53 | 23.00 | 24.99 |
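
The gains in parentheses are absolute differences in Full(def) mAP between each HAKE-enhanced model and its baseline row. A small sketch that recomputes them from the numbers in the table (the variable names are only illustrative):

```python
# Full(def) mAP of the baselines, copied from the table above.
baselines = {"iCAN": 14.84, "Interactiveness": 17.03}

# (enhanced model, baseline it is compared with, Full(def) mAP)
enhanced = [
    ("iCAN + HAKE-HICO-DET", "iCAN", 19.61),
    ("Interactiveness + HAKE-HICO-DET", "Interactiveness", 22.12),
    ("Interactiveness + HAKE-Large", "Interactiveness", 22.66),
]

gains = {}
for name, base, value in enhanced:
    # Absolute gain in mAP points, rounded to the table's two decimals.
    gains[name] = round(value - baselines[base], 2)
    print(f"{name}: +{gains[name]:.2f}")
```

The recomputed values (+4.77, +5.09, +5.63) match the figures reported in the table.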

5) Zero-Shot HOI detection:

Unseen action-object combination scenario (UC)

| Method | Pub | Detector | Full(def) | Seen(def) | Unseen(def) |
|---|---|---|---|---|---|
| Shen et al. | WACV2018 | COCO | 6.26 | - | 5.62 |
| Functional | AAAI2020 | HICO-DET | 12.45 ± 0.16 | 12.74 ± 0.34 | 11.31 ± 1.03 |
| VCL | ECCV2020 | HICO-DET | 18.06 | 18.52 | 16.22 |
| ATL(w/ affordance) | CVPR2021 | HICO-DET | 18.67 | 18.78 | 18.25 |
| FCL | CVPR2021 | HICO-DET | 19.37 | 19.55 | 18.66 |
| ConsNet | ACMMM2020 | COCO | 19.81 ± 0.32 | 20.51 ± 0.62 | 16.99 ± 1.67 |

Unseen object scenario (UO)

| Method | Pub | Detector | Full(def) | Seen(def) | Unseen(def) |
|---|---|---|---|---|---|
| Functional | AAAI2020 | HICO-DET | 13.84 | 14.36 | 11.22 |
| FCL | CVPR2021 | HICO-DET | 19.87 | 20.74 | 15.54 |
| ConsNet | ACMMM2020 | COCO | 20.71 | 20.99 | 19.27 |

Unseen action scenario (UA)

| Method | Pub | Detector | Full(def) | Seen(def) | Unseen(def) |
|---|---|---|---|---|---|
| ConsNet | ACMMM2020 | COCO | 19.04 | 20.02 | 14.12 |

Detector: COCO pre-trained

| Method | mAP |
|---|---|
| iCAN | 8.14 |
| Interactiveness | 8.22 |
| Analogy(reproduced) | 9.72 |
| DJ-RN | 10.37 |

V-COCO: Scenario1

1) Detector: COCO pre-trained or one-stage detector

| Method | Pub | AP(role) |
|---|---|---|
| Gupta et al. | arXiv | 31.8 |
| InteractNet | CVPR2018 | 40.0 |
| Turbo | AAAI2019 | 42.0 |
| GPNN | ECCV2018 | 44.0 |
| iCAN | BMVC2018 | 45.3 |
| Xu et al. | CVPR2019 | 45.9 |
| Wang et al. | ICCV2019 | 47.3 |
| UniDet | ECCV2020 | 47.5 |
| Interactiveness | CVPR2019 | 47.8 |
| Lin et al. | IJCAI2020 | 48.1 |
| VCL | ECCV2020 | 48.3 |
| Zhou et al. | CVPR2020 | 48.9 |
| In-GraphNet | IJCAI-PRICAI 2020 | 48.9 |
| Interactiveness-optimized | CVPR2019 | 49.0 |
| TIN-PAMI | TPAMI2021 | 49.1 |
| IP-Net | CVPR2020 | 51.0 |
| DRG | ECCV2020 | 51.0 |
| VSGNet | CVPR2020 | 51.8 |
| PMN | arXiv | 51.8 |
| PMFNet | ICCV2019 | 52.0 |
| Liu et al. | arXiv | 52.28 |
| FCL | CVPR2021 | 52.35 |
| PD-Net | ECCV2020 | 52.6 |
| Wang et al. | ECCV2020 | 52.7 |
| PFNet | CVM | 52.8 |
| Zou et al. | CVPR2021 | 52.9 |
| SIGN | ICME2020 | 53.1 |
| ACP | ECCV2020 | 52.98 (53.23) |
| FCMNet | ECCV2020 | 53.1 |
| HRNet | TIP2021 | 53.1 |
| ConsNet | ACMMM2020 | 53.2 |
| IDN | NeurIPS2020 | 53.3 |
| SG2HOI | ICCV2021 | 53.3 |
| OSGNet | IEEE Access | 53.43 |
| SABRA-Res50 | arXiv | 53.57 |
| IPGN | TIP2021 | 53.79 |
| AS-Net | CVPR2021 | 53.9 |
| RR-Net | arXiv | 54.2 |
| SCG | ICCV2021 | 54.2 |
| SABRA-Res50FPN | arXiv | 54.69 |
| GGNet | CVPR2021 | 54.7 |
| MLCNet | ICMR2020 | 55.2 |
| HOTR | CVPR2021 | 55.2 |
| DIRV | AAAI2021 | 56.1 |
| SABRA-Res152 | arXiv | 56.62 |
| PhraseHOI | AAAI2022 | 57.4 |
| GTNet | arXiv | 58.29 |
| QPIC-Res101 | CVPR2021 | 58.3 |
| QPIC-Res50 | CVPR2021 | 58.8 |
| UPT-ResNet-101-DC5 | arXiv2021 | 61.3 |
| CDN | NeurIPS2021 | 63.91 |

2) Enhanced with HAKE:

| Method | Pub | AP(role) |
|---|---|---|
| iCAN | BMVC2018 | 45.3 |
| iCAN + HAKE-Large (transfer learning) | CVPR2020 | 49.2 (+3.9) |
| Interactiveness | CVPR2019 | 47.8 |
| Interactiveness + HAKE-Large (transfer learning) | CVPR2020 | 51.0 (+3.2) |

based on V-COCO

| Method | Pub | Full | Seen | Unseen |
|---|---|---|---|---|
| VCL | ECCV2020 | 23.53 | 8.29 | 35.36 |
| ATL(w/ affordance) | CVPR2021 | 23.40 | 8.01 | 35.34 |

HICO

1) Default

| Method | mAP |
|---|---|
| R*CNN | 28.5 |
| Girdhar et al. | 34.6 |
| Mallya et al. | 36.1 |
| Pairwise | 39.9 |
| DEFR-base | 44.1 |
| DEFR-CLIP | 60.5 |
| DEFR/16 CLIP | 65.6 |

2) Enhanced with HAKE:

| Method | mAP |
|---|---|
| Mallya et al. | 36.1 |
| Mallya et al.+HAKE-HICO | 45.0 (+8.9) |
| Pairwise | 39.9 |
| Pairwise+HAKE-HICO | 45.9 (+6.0) |
| Pairwise+HAKE-Large | 46.3 (+6.4) |
