fcjian / TOOD

TOOD: Task-aligned One-stage Object Detection, ICCV2021 Oral

Layer Attention instead of Channel Attention?

iumyx2612 opened this issue · comments

Why did you choose Layer Attention instead of normal Channel Attention?
The task-interactive features are concatenated after N consecutive conv layers, so applying Channel Attention on top could further separate each channel toward its specific task. Layer Attention also performs a separation along the channel dimension, but it can only separate the channels in groups of 6 (one group per layer), right?

The tasks of object classification and localization have different targets, and thus focus on different types of features (e.g. different levels or receptive fields). The N interactive layers in T-head have different effective receptive fields, which allow them to capture multiple levels of semantics. The layer attention is designed to make full use of this rich information by computing more meaningful features from those layers.

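For readers unfamiliar with the design, here is a minimal PyTorch sketch of what layer attention over the N interactive layers can look like. The module name, shapes, and the FC-based gate are illustrative assumptions for this discussion, not the repository's actual implementation (which may differ, e.g. in how the weights are computed).

```python
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    """Illustrative sketch: compute one scalar weight per interactive layer and
    re-weight that layer's whole feature map (not the repository's exact code)."""
    def __init__(self, num_layers: int, channels: int, reduction: int = 8):
        super().__init__()
        total = num_layers * channels
        self.fc = nn.Sequential(
            nn.Linear(total, total // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(total // reduction, num_layers),  # only N outputs: one per layer
            nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: list of N tensors of shape (B, C, H, W) from the interactive conv layers
        x = torch.cat(feats, dim=1)                     # (B, N*C, H, W)
        gap = x.mean(dim=(2, 3))                        # global average pooling -> (B, N*C)
        w = self.fc(gap)                                # (B, N) layer weights
        out = [f * w[:, k].view(-1, 1, 1, 1) for k, f in enumerate(feats)]
        return torch.cat(out, dim=1)                    # re-weighted, concatenated features
```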

I understand the intuition behind Layer Attention. However, Layer Attention is just a special case of Channel Attention if we apply Channel Attention to the concatenated feature maps from the N interactive layers. So Channel Attention can also capture multi-level semantic features, no? 🤔

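For concreteness, the re-weighting step of layer attention is what channel attention over the concatenation reduces to when the weights are tied within each layer's group of channels. A quick sketch (shapes are assumptions, and this ignores how the weights themselves are computed):

```python
import torch

B, N, C, H, W = 2, 6, 256, 20, 20
x = torch.randn(B, N * C, H, W)               # concatenated interactive features

layer_w = torch.rand(B, N)                    # layer attention: one weight per layer
tied_w = layer_w.repeat_interleave(C, dim=1)  # tie the weight across each layer's C channels

out_channel_style = x * tied_w.view(B, N * C, 1, 1)  # "channel attention" with tied weights
out_layer_style = torch.cat(
    [x[:, k * C:(k + 1) * C] * layer_w[:, k].view(B, 1, 1, 1) for k in range(N)], dim=1
)
print(torch.allclose(out_channel_style, out_layer_style))  # True: the two coincide
```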
@fcjian Hello, can we discuss this when you have free time?

@iumyx2612 The typical channel-wise attention is applied to a single layer. In our view, conducting channel-wise attention across multiple layers can be seen as a combination of the typical channel-wise attention and layer attention. So it can also capture multi-level semantic features, but it requires more parameters and FLOPs.

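As a rough, illustrative comparison of the extra cost, assuming a squeeze-and-excitation-style two-FC gate with reduction ratio r (the values and numbers below are assumptions, not measured from this repo):

```python
# N interactive layers, C channels per layer, reduction ratio r (assumed values)
N, C, r = 6, 256, 8
total = N * C                                   # channels of the concatenated feature

# channel attention over the concatenation: FC(total -> total/r) + FC(total/r -> total)
channel_attn_params = total * (total // r) + (total // r) * total

# layer attention: FC(total -> total/r) + FC(total/r -> N)
layer_attn_params = total * (total // r) + (total // r) * N

print(f"channel attention params: {channel_attn_params:,}")  # 589,824
print(f"layer attention params:   {layer_attn_params:,}")    # 296,064
```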

Thank you very much, understood