HOI-Learning-List

Some recent (2015-now) Human-Object Interaction Learning studies. If you find any errors or problems, please feel free to comment.

A list of Transformer-based vision works: https://github.com/DirtyHarryLYL/Transformer-in-Vision.

Dataset/Benchmark

DIABOLO [Paper], [Website]
Bongard-HOI [Paper]
HOI-COCO (CVPR2021) [Website]
PaStaNet-HOI (TPAMI2021) [Benchmark]
HAKE (CVPR2020) [YouTube] [bilibili] [Website] [Paper] [HAKE-Action-Torch] [HAKE-Action-TF]
Ambiguous-HOI (CVPR2020) [Website] [Paper]
HICO-DET (WACV2018) [Website] [Paper]
HCVRD (AAAI2018) [Website] [Paper]
V-COCO (May 2015) [Website] [Paper]
HICO (ICCV2015) [Website] [Paper]
OpenImage [Website] [Paper]
PIC [Website]

Video HOI Datasets

VidHOI [Paper]
AVA [Website], HOIs (human-object, human-human) and pose (body motion) actions
Action Genome [Website], action verbs and spatial relationships
CAD120 [Paper], [Website]
Sth-else [Paper], [Website]

Method

HOI Image Generation

Exploiting Relationship for Complex-scene Image Generation (arXiv 2021.04) [Paper]
Specifying Object Attributes and Relations in Interactive Scene Generation (arXiv 2019.11) [Paper]

HOI Recognition: Image-based, to recognize all the HOIs in one image.

DEFR (arXiv 2021.12) [Paper]
Interaction Compass (ICCV 2021) [Paper]
DEFR-CLIP (arXiv 2021.07) [Paper]
PaStaNet: Toward Human Activity Knowledge Engine (CVPR2020) [Code] [Data] [Paper] [YouTube] [bilibili]
Pairwise (ECCV2018) [Paper]
Attentional Pooling for Action Recognition (NIPS2017) [Code] [Paper]
Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering (ECCV2016) [Code] [Paper]
Contextual Action Recognition with R*CNN (ICCV2015) [Code] [Paper]
HOCNN (ICCV2015) [Code] [Paper]
SGAP-Net (AAAI2020) [Paper]

Unseen or zero-shot learning (image-level recognition).

ICompass (ICCV2021) [Paper], [Code]
Compositional Learning for Human Object Interaction (ECCV2018) [Paper]
Zero-Shot Human-Object Interaction Recognition via Affordance Graphs (Sep. 2020) [Paper]

HOI Detection: Instance-based, to detect the human-object pairs and classify the interactions.

OCN (AAAI 2022) [Paper], [Code]
QAHOI (arXiv 2021) [Paper], [Code]
PhraseHOI (AAAI 2022) [Paper]
DEFR (arXiv 2021.12) [Paper]
UPT (arXiv 2021) [Paper], [Code]
HRNet (TIP 2021) [Paper]
ACP++ (TIP 2021) [Paper], [Code]
SG2HOI (ICCV 2021) [Paper]
CDN (NeurIPS 2021) [Paper]
GTNet (arXiv 2021.8) [Paper], [Code]
HOI-MO-Net (IVC 2021) [Paper]
IPGN (TIP 2021.7) [Paper]
SCG (ICCV 2021, SAG, v2) [Paper], [Code]
Human Object Interaction Detection using Two-Direction Spatial Enhancement and Exclusive Object Prior (arXiv) [Paper]
PST (ICCV2021) [Paper]
RR-Net (arXiv 2021.5) [Paper]
HOTR (CVPR2021) [Paper], [Code]
GGNet (CVPR2021) [Paper], [Code]
ATL (CVPR2021) [Paper], [Code]
FCL (CVPR2021) [Paper], [Code]
AS-Net (CVPR2021) [Paper], [Code]
End-to-End Human Object Interaction Detection with HOI Transformer (CVPR2021), [Paper], [Code]
QPIC (CVPR2021) [Paper], [Code]
TIN (TPAMI2021) [Paper], [Code]
IDN (NeurIPS2020) [Paper] [Code]
DIRV (AAAI2021) [Paper]
DecAug (AAAI2021) [Paper]
OSGNet (IEEE Access) [Paper]
PFNet (CVM) [Paper]
UniDet (ECCV2020) [Paper]
DRG (ECCV2020) [Paper] [Code]
FCMNet (ECCV2020) [Paper]
Contextual Heterogeneous Graph Network for Human-Object Interaction Detection (ECCV2020) [Paper]
PD-Net (ECCV2020) [Paper-1] [Paper-2] [Code]
VCL (ECCV2020) [Paper] [Code]
ACP (ECCV2020) [Paper] [Code]
ConsNet (ACMMM2020) [Paper] [Code], HICO-DET Python API: A general Python toolkit for the HICO-DET dataset, including APIs for data loading & processing, human-object pair IoU & NMS calculation, and standard evaluation. [Code] [Documentation]
Action-Guided Attention Mining and Relation Reasoning Network for Human-Object Interaction Detection (IJCAI2020) [Paper]
PaStaNet (CVPR2020) [Code] [Data] [Paper] [YouTube] [bilibili]
DJ-RN (CVPR2020) [Code] [Paper]
Cascaded Human-Object Interaction Recognition (CVPR2020) [Code] [Paper]
PPDM (CVPR2020) [Code] [Paper]
IP-Net (CVPR2020) [Code] [Paper]
VSGNet (CVPR2020) [Code] [Paper]
HOID (CVPR2020) [Code] [Paper]
Diagnosing Rarity in Human-Object Interaction Detection (CVPRW2020) [Paper]
MLCNet (ICMR2020) [Paper]
SIGN (ICME2020) [Paper]
In-GraphNet (IJCAI-PRICAI 2020) [Paper]
PMFNet (ICCV2019) [Code] [Paper]
No-Frills (ICCV2019) [Code] [Paper]
Analogy (ICCV2019) [Code] [Paper]
RPNN (ICCV2019) [Paper]
Deep Contextual Attention for Human-Object Interaction Detection (ICCV2019) [Paper]
Interactiveness (CVPR2019) [Code] [Paper]
Turbo (AAAI2019) [Paper]
GPNN (ECCV2018) [Code] [Paper]
iCAN (BMVC2018) [Code] [Paper]
InteractNet (CVPR2018) [Paper]
Scaling Human-Object Interaction Recognition through Zero-Shot Learning (WACV2018) [Paper]
HO-RCNN (WACV2018) [Code] [Paper]
VS-GATs (Mar. 2020) [Paper]
Classifying All Interacting Pairs in a Single Shot (Jan. 2020) [Paper]
Novel Human-Object Interaction Detection via Adversarial Domain Generalization (May. 2020) [Paper]
PMN (Jul. 2020) [Paper] [Code]
SAG (Dec 2020) [Paper] [Code]
SABRA (Dec 2020) [Paper]
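
The HICO-DET Python API listed with ConsNet above mentions human-object pair IoU. As a rough illustration (not that toolkit's actual API), the HICO-DET matching criterion can be written as the minimum of the human-box IoU and the object-box IoU, with a detected pair counted as matching a ground-truth pair when that value exceeds 0.5. The function names and the (x1, y1, x2, y2) box format below are assumptions made for this sketch.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_iou(pred_human, pred_object, gt_human, gt_object):
    """Pair IoU: the weaker of the human overlap and the object overlap."""
    return min(box_iou(pred_human, gt_human), box_iou(pred_object, gt_object))


def is_match(pred_human, pred_object, gt_human, gt_object, thresh=0.5):
    """True when both boxes overlap their ground truth by more than thresh."""
    return pair_iou(pred_human, pred_object, gt_human, gt_object) > thresh


# Hypothetical boxes: the human box is exact, the object box is shifted by half its width.
human = (0, 0, 10, 10)
gt_obj = (20, 0, 30, 10)
pred_obj = (25, 0, 35, 10)
print(is_match(human, pred_obj, human, gt_obj))  # False
```

Taking the minimum means a pair is rejected as soon as either box is poorly localized, even when the other one is perfect.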

Unseen or zero/low-shot or weakly-supervised learning (instance-level detection).

Align-Former (BMVC 2021), [Paper]
Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection (ICCV2021) [Paper], [Code]
DGIG-Net (TOC2021) [Paper]
ATL (CVPR2021) [Paper], [Code]
FCL (CVPR2021) [Paper], [Code]
Detecting Human-Object Interaction with Mixed Supervision (WACV 2021) [Paper]
ConsNet (ACMMM2020) [Paper] [Code]
Zero-Shot Human-Object Interaction Recognition via Affordance Graphs (Sep. 2020) [Paper]
VCL (ECCV2020) [Paper] [Code]
HOID (CVPR2020) [Code] [Paper]
Novel Human-Object Interaction Detection via Adversarial Domain Generalization (May. 2020) [Paper]
Analogy (ICCV2019) [Code] [Paper]
Functional (AAAI2020) [Paper]
Scaling Human-Object Interaction Recognition through Zero-Shot Learning (WACV2018) [Paper]

Video HOI methods

Detecting Human-Object Relationships in Videos (ICCV2021) [Paper]
STIGPN (Aug 2021), [Paper], [Code]
VidHOI (May 2021), [Paper]
LIGHTEN (ACMMM2020) [Paper] [Code]
Generating Videos of Zero-Shot Compositions of Actions and Objects (Jul 2020), HOI GAN, [Paper]
Grounded Human-Object Interaction Hotspots from Video (ICCV2019) [Code] [Paper]
GPNN (ECCV2018) [Code] [Paper]

Result

PaStaNet-HOI:

Proposed by TIN (TPAMI version, Transferable Interactiveness Network). It is built on HAKE data and includes 110K+ images and 520 HOIs (excluding the 80 "no_interaction" HOIs of HICO-DET to avoid incomplete labeling). Its data distribution is more severely long-tailed, which makes it more difficult.

Detector: COCO pre-trained

Method	mAP
iCAN	11.00
iCAN+NIS	13.13
TIN	15.38

HICO-DET:

1) Detector: COCO pre-trained

Method	Pub	Full(def)	Rare(def)	Non-Rare(def)	Full(ko)	Rare(ko)	Non-Rare(ko)
Shen et al.	WACV2018	6.46	4.24	7.12	-	-	-
HO-RCNN	WACV2018	7.81	5.37	8.54	10.41	8.94	10.85
InteractNet	CVPR2018	9.94	7.16	10.77	-	-	-
Turbo	AAAI2019	11.40	7.30	12.60	-	-	-
GPNN	ECCV2018	13.11	9.34	14.23	-	-	-
Xu et al.	ICCV2019	14.70	13.26	15.13	-	-	-
iCAN	BMVC2018	14.84	10.45	16.15	16.26	11.33	17.73
Wang et al.	ICCV2019	16.24	11.16	17.75	17.73	12.78	19.21
Lin et al.	IJCAI2020	16.63	11.30	18.22	19.22	14.56	20.61
Functional (suppl)	AAAI2020	16.96	11.73	18.52	-	-	-
Interactiveness	CVPR2019	17.03	13.42	18.11	19.17	15.51	20.26
No-Frills	ICCV2019	17.18	12.17	18.68	-	-	-
RPNN	ICCV2019	17.35	12.78	18.71	-	-	-
PMFNet	ICCV2019	17.46	15.65	18.00	20.34	17.47	21.20
SIGN	ICME2020	17.51	15.31	18.53	20.49	17.53	21.51
Interactiveness-optimized	CVPR2019	17.54	13.80	18.65	19.75	15.70	20.96
Liu et al.	arXiv	17.55	20.61	-	-	-	-
Wang et al.	ECCV2020	17.57	16.85	17.78	21.00	20.74	21.08
In-GraphNet	IJCAI-PRICAI 2020	17.72	12.93	19.31	-	-	-
HOID	CVPR2020	17.85	12.85	19.34	-	-	-
MLCNet	ICMR2020	17.95	16.62	18.35	22.28	20.73	22.74
SAG	arXiv	18.26	13.40	19.71	-	-	-
Sarullo et al.	arXiv	18.74	-	-	-	-	-
DRG	ECCV2020	19.26	17.74	19.71	23.40	21.75	23.89
Analogy	ICCV2019	19.40	14.60	20.90	-	-	-
VCL	ECCV2020	19.43	16.55	20.29	22.00	19.09	22.87
VS-GATs	arXiv	19.66	15.79	20.81	-	-	-
VSGNet	CVPR2020	19.80	16.05	20.91	-	-	-
PFNet	CVM	20.05	16.66	21.07	24.01	21.09	24.89
ATL(w/ affordance)	CVPR2021	20.08	15.57	21.43	-	-	-
ATL	CVPR2021	21.07	16.79	22.35	-	-	-
FCMNet	ECCV2020	20.41	17.34	21.56	22.04	18.97	23.12
ACP	ECCV2020	20.59	15.92	21.98	-	-	-
PD-Net	ECCV2020	20.81	15.90	22.28	24.78	18.88	26.54
SG2HOI	ICCV2021	20.93	18.24	21.78	24.83	20.52	25.32
TIN-PAMI	TPAMI2021	20.93	18.95	21.32	23.02	20.96	23.42
PMN	arXiv	21.21	17.60	22.29	-	-	-
IPGN	TIP2021	21.26	18.47	22.07	-	-	-
DJ-RN	CVPR2020	21.34	18.53	22.18	23.69	20.64	24.60
OSGNet	IEEE Access	21.40	18.12	22.38	-	-	-
DIRV	AAAI2021	21.78	16.38	23.39	25.52	20.84	26.92
SCG	ICCV2021	21.85	18.11	22.97	-	-	-
HRNet	TIP2021	21.93	16.30	23.62	25.22	18.75	27.15
ConsNet	ACMMM2020	22.15	17.55	23.52	26.57	20.8	28.3
IDN	NeurIPS2020	23.36	22.47	23.63	26.43	25.01	26.85
QAHOI-Res50	arXiv2021	24.35	16.18	26.80	-	-	-
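
In these tables, Full is the mAP over all HOI categories, Rare covers the rare categories, and the last column covers the remaining ones. Assuming the standard HICO-DET split of 600 categories into 138 rare and 462 non-rare (an assumption not stated in this list), Full should be close to the count-weighted average of the other two columns, which gives a quick sanity check for a transcribed row. The helper name `expected_full` is hypothetical.

```python
# Assumed standard HICO-DET category split (600 HOI categories in total).
N_RARE, N_NON_RARE = 138, 462


def expected_full(rare_map, non_rare_map):
    """Count-weighted average of the rare and non-rare mAP values."""
    total = N_RARE + N_NON_RARE
    return (N_RARE * rare_map + N_NON_RARE * non_rare_map) / total


# HO-RCNN row, default setting: Full 7.81, Rare 5.37, non-rare 8.54.
print(round(expected_full(5.37, 8.54), 2))  # 7.81
```

A row whose reported Full value is far from this estimate is worth re-checking against the paper.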

2) Detector: pre-trained on COCO, fine-tuned on HICO-DET train set (with GT human-object pair boxes) or one-stage detector (point-based, transformer-based)

A fine-tuned detector learns to detect only the interactive humans and objects (those with interactiveness), which suppresses many wrong pairings (non-interactive human-object pairs) and boosts performance.
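
To make the pairing argument concrete: a two-stage pipeline typically pairs every detected human with every detected object, so the number of candidates grows as the product of the two counts. The sketch below uses made-up detections and a hypothetical `interactive` flag standing in for what a fine-tuned detector would keep. It is an illustration only, not the code of any method listed here.

```python
from itertools import product


def candidate_pairs(humans, objects):
    """Exhaustively pair every human detection with every object detection."""
    return list(product(humans, objects))


# Made-up detections; "interactive" is a hypothetical flag for this sketch.
humans = [
    {"id": "h1", "interactive": True},
    {"id": "h2", "interactive": False},
    {"id": "h3", "interactive": False},
]
objects = [
    {"id": "o1", "interactive": True},
    {"id": "o2", "interactive": True},
    {"id": "o3", "interactive": False},
    {"id": "o4", "interactive": False},
]

all_pairs = candidate_pairs(humans, objects)
filtered = candidate_pairs(
    [h for h in humans if h["interactive"]],
    [o for o in objects if o["interactive"]],
)
print(len(all_pairs), len(filtered))  # 12 2
```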

Method	Pub	Full(def)	Rare(def)	Non-Rare(def)	Full(ko)	Rare(ko)	Non-Rare(ko)
UniDet	ECCV2020	17.58	11.72	19.33	19.76	14.68	21.27
IP-Net	CVPR2020	19.56	12.79	21.58	22.05	15.77	23.92
RR-Net	arXiv	20.72	13.21	22.97	-	-	-
PPDM (paper)	CVPR2020	21.10	14.46	23.09	-	-	-
PPDM (github-hourglass104)	CVPR2020	21.73/21.94	13.78/13.97	24.10/24.32	24.58/24.81	16.65/17.09	26.84/27.12
Functional	AAAI2020	21.96	16.43	23.62	-	-	-
SABRA-Res50	arXiv	23.48	16.39	25.59	28.79	22.75	30.54
VCL	ECCV2020	23.63	17.21	25.55	25.98	19.12	28.03
PST	ICCV2021	23.93	14.98	26.60	26.42	17.61	29.05
SABRA-Res50FPN	arXiv	24.12	15.91	26.57	29.65	22.92	31.65
DRG	ECCV2020	24.53	19.47	26.04	27.98	23.11	29.43
HOTR	CVPR2021	25.10	17.34	27.42	-	-	-
ConsNet-F	ACMMM2020	25.94	19.35	27.91	30.34	23.4	32.41
SABRA-Res152	arXiv	26.09	16.29	29.02	31.08	23.44	33.37
QAHOI-Res50	arXiv2021	26.18	18.06	28.61	-	-	-
IDN	NeurIPS2020	26.29	22.61	27.39	28.24	24.47	29.37
Zou et al.	CVPR2021	26.61	19.15	28.84	29.13	20.98	31.57
ATL	CVPR2021	27.68	20.31	29.89	30.05	22.40	32.34
GTNet	arXiv	28.03	22.73	29.61	29.98	24.13	31.73
ATL(w/ affordance)	CVPR2021	28.53	21.64	30.59	31.18	24.15	33.29
AS-Net	CVPR2021	28.87	24.25	30.25	31.74	27.07	33.14
QPIC-Res50	CVPR2021	29.07	21.85	31.23	31.68	24.14	33.93
FCL	CVPR2021	29.12	23.67	30.75	31.31	25.62	33.02
GGNet	CVPR2021	29.17	22.13	30.84	33.50	26.67	34.89
QPIC-Res101	CVPR2021	29.90	23.92	31.69	32.38	26.06	34.27
SCG	ICCV2021	29.26	24.61	30.65	32.87	27.89	34.35
PhraseHOI	AAAI2022	30.03	23.48	31.99	33.74	27.35	35.64
OCN	AAAI2022	31.43	25.80	33.11	65.3	67.1
CDN	NeurIPS2021	32.07	27.19	33.53	34.79	29.48	36.38
DEFR	arXiv2021	32.35	33.45	32.02	-	-	-
UPT	arXiv2021	32.62	28.62	33.81	36.08	31.41	37.47
QAHOI-Swin-Large-ImageNet-22K	arXiv2021	35.78	29.80	37.56	37.59	31.66	39.36

3) Ground Truth human-object pair boxes (only evaluating HOI recognition)

Method	Pub	Full(def)	Rare(def)	Non-Rare(def)
iCAN	BMVC2018	33.38	21.43	36.95
Interactiveness	CVPR2019	34.26	22.90	37.65
Analogy	ICCV2019	34.35	27.57	36.38
ATL	CVPR2021	43.32	33.84	46.15
IDN	NeurIPS2020	43.98	40.27	45.09
ATL(w/ affordance)	CVPR2021	44.27	35.52	46.89
FCL	CVPR2021	45.25	36.27	47.94
GTNet	arXiv	46.45	35.10	49.84
SCG	ICCV2021	51.53	41.01	54.67
ConsNet	ACMMM2020	53.04	38.79	57.3

4) Enhanced with HAKE:

Method	Pub	Full(def)	Rare(def)	Non-Rare(def)	Full(ko)	Rare(ko)	Non-Rare(ko)
iCAN	BMVC2018	14.84	10.45	16.15	16.26	11.33	17.73
iCAN + HAKE-HICO-DET	CVPR2020	19.61 (+4.77)	17.29	20.30	22.10	20.46	22.59
Interactiveness	CVPR2019	17.03	13.42	18.11	19.17	15.51	20.26
Interactiveness + HAKE-HICO-DET	CVPR2020	22.12 (+5.09)	20.19	22.69	24.06	22.19	24.62
Interactiveness + HAKE-Large	CVPR2020	22.66 (+5.63)	21.17	23.09	24.53	23.00	24.99
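
The gains in parentheses are differences in Full(def) mAP against the corresponding baseline row. A small check using only the values from the table above (the variable and function names are chosen for this sketch):

```python
# Full(def) mAP of the baselines, copied from the table.
baseline = {"iCAN": 14.84, "Interactiveness": 17.03}

# (row name, baseline it is compared with, Full(def) mAP), copied from the table.
enhanced = [
    ("iCAN + HAKE-HICO-DET", "iCAN", 19.61),
    ("Interactiveness + HAKE-HICO-DET", "Interactiveness", 22.12),
    ("Interactiveness + HAKE-Large", "Interactiveness", 22.66),
]


def gain(enhanced_map, baseline_map):
    """Absolute mAP improvement, rounded to two decimals."""
    return round(enhanced_map - baseline_map, 2)


for name, base, value in enhanced:
    print(f"{name}: +{gain(value, baseline[base]):.2f}")
# iCAN + HAKE-HICO-DET: +4.77
# Interactiveness + HAKE-HICO-DET: +5.09
# Interactiveness + HAKE-Large: +5.63
```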

5) Zero-Shot HOI detection:

Unseen action-object combination scenario (UC)

Method	Pub	Detector	Full(def)	Seen(def)	Unseen(def)
Shen et al.	WACV2018	COCO	6.26	-	5.62
Functional	AAAI2020	HICO-DET	12.45 ± 0.16	12.74 ± 0.34	11.31 ± 1.03
VCL	ECCV2020	HICO-DET	18.06	18.52	16.22
ATL(w/ affordance)	CVPR2021	HICO-DET	18.67	18.78	18.25
FCL	CVPR2021	HICO-DET	19.37	19.55	18.66
ConsNet	ACMMM2020	COCO	19.81 ± 0.32	20.51 ± 0.62	16.99 ± 1.67

Unseen object scenario (UO)

Method	Pub	Detector	Full(def)	Seen(def)	Unseen(def)
Functional	AAAI2020	HICO-DET	13.84	14.36	11.22
FCL	CVPR2021	HICO-DET	19.87	20.74	15.54
ConsNet	ACMMM2020	COCO	20.71	20.99	19.27

Unseen action scenario (UA)

Method	Pub	Detector	Full(def)	Seen(def)	Unseen(def)
ConsNet	ACMMM2020	COCO	19.04	20.02	14.12

Ambiguous-HOI

Detector: COCO pre-trained

Method	mAP
iCAN	8.14
Interactiveness	8.22
Analogy(reproduced)	9.72
DJ-RN	10.37

V-COCO: Scenario1

1) Detector: COCO pre-trained or one-stage detector

Method	Pub	AP(role)
Gupta et al.	arXiv	31.8
InteractNet	CVPR2018	40.0
Turbo	AAAI2019	42.0
GPNN	ECCV2018	44.0
iCAN	BMVC2018	45.3
Xu et al.	CVPR2019	45.9
Wang et al.	ICCV2019	47.3
UniDet	ECCV2020	47.5
Interactiveness	CVPR2019	47.8
Lin et al.	IJCAI2020	48.1
VCL	ECCV2020	48.3
Zhou et al.	CVPR2020	48.9
In-GraphNet	IJCAI-PRICAI 2020	48.9
Interactiveness-optimized	CVPR2019	49.0
TIN-PAMI	TPAMI2021	49.1
IP-Net	CVPR2020	51.0
DRG	ECCV2020	51.0
VSGNet	CVPR2020	51.8
PMN	arXiv	51.8
PMFNet	ICCV2019	52.0
Liu et al.	arXiv	52.28
FCL	CVPR2021	52.35
PD-Net	ECCV2020	52.6
Wang et al.	ECCV2020	52.7
PFNet	CVM	52.8
Zou et al.	CVPR2021	52.9
SIGN	ICME2020	53.1
ACP	ECCV2020	52.98 (53.23)
FCMNet	ECCV2020	53.1
HRNet	TIP2021	53.1
ConsNet	ACMMM2020	53.2
IDN	NeurIPS2020	53.3
SG2HOI	ICCV2021	53.3
OSGNet	IEEE Access	53.43
SABRA-Res50	arXiv	53.57
IPGN	TIP2021	53.79
AS-Net	CVPR2021	53.9
RR-Net	arXiv	54.2
SCG	ICCV2021	54.2
SABRA-Res50FPN	arXiv	54.69
GGNet	CVPR2021	54.7
MLCNet	ICMR2020	55.2
HOTR	CVPR2021	55.2
DIRV	AAAI2021	56.1
SABRA-Res152	arXiv	56.62
PhraseHOI	AAAI2022	57.4
GTNet	arXiv	58.29
QPIC-Res101	CVPR2021	58.3
QPIC-Res50	CVPR2021	58.8
UPT-ResNet-101-DC5	arXiv2021	61.3
CDN	NeurIPS2021	63.91

2) Enhanced with HAKE:

Method	Pub	AP(role)
iCAN	BMVC2018	45.3
iCAN + HAKE-Large (transfer learning)	CVPR2020	49.2 (+3.9)
Interactiveness	CVPR2019	47.8
Interactiveness + HAKE-Large (transfer learning)	CVPR2020	51.0 (+3.2)

HOI-COCO:

based on V-COCO

Method	Pub	Full	Seen	Unseen
VCL	ECCV2020	23.53	8.29	35.36
ATL(w/ affordance)	CVPR2021	23.40	8.01	35.34

HICO

1) Default

Method	mAP
R*CNN	28.5
Girdhar et al.	34.6
Mallya et al.	36.1
Pairwise	39.9
DEFR-base	44.1
DEFR-CLIP	60.5
DEFR/16 CLIP	65.6

2) Enhanced with HAKE:

Method	mAP
Mallya et al.	36.1
Mallya et al.+HAKE-HICO	45.0 (+8.9)
Pairwise	39.9
Pairwise+HAKE-HICO	45.9 (+6.0)
Pairwise+HAKE-Large	46.3 (+6.4)
