weleen / AGRL.pytorch

[TIP2020] Adaptive Graph Representation Learning for Video Person Re-identification

Home Page: https://sites.google.com/site/yimingwu0/research/adaptive-graph-representation-learning-for-video-person-re-identification


Question about some part of the architecture


I have read your 'AGRL' paper many times, but I still don't understand some parts of it. Could you please help me understand them better?

First, what does the 'global branch' do, and why do you use it?
Second, after conv5 there is no image anymore, only feature maps. How do you apply pyramid pooling and match the joints (extracted by AlphaPose) to the regions? In the feature maps, the joint locations are not the same as in the original images.
Third, what is 'N' in the input? If it is the number of regions, I don't understand, because there are no regions before pyramid pooling.

@Lisa9797 Thanks for your questions.

  1. The Global Branch aims to capture global information, while the Graph Branch captures part-level information. The two branches use different pooling modules (GAP vs. Pyramid Pooling).
  2. As described in Section III.A, the output of res_conv5 is an $N \times T \times D \times h \times w$ feature map. Pyramid pooling is applied over the last two (spatial) dimensions, which yields $N \times T \times D \times 7$ features. As for the joint locations, we roughly assign each region a human part label (see Figure 3 for an example); a code sketch follows this list.
  3. As described in the caption of Figure 2, N is the number of regions for an individual image, which is set to 7 in our paper.
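
For a concrete picture of the two pooling paths, here is a minimal PyTorch sketch. It assumes the 7 regions come from a pyramid of 1 + 2 + 4 horizontal stripes (a common choice in re-ID); the function name, pyramid levels, and tensor sizes are illustrative assumptions, not taken verbatim from the repository.

```python
import torch
import torch.nn.functional as F

def global_and_pyramid_features(feat):
    # feat: (N*T, D, h, w) feature maps from res_conv5 (batch and time flattened)
    # Global branch: global average pooling over the spatial dims -> (N*T, D)
    global_feat = F.adaptive_avg_pool2d(feat, 1).flatten(1)

    # Graph branch: pyramid pooling over the last two (spatial) dimensions.
    # Assumed pyramid: each level splits the map into horizontal stripes,
    # 1 + 2 + 4 = 7 regions in total.
    regions = []
    for n_stripes in (1, 2, 4):
        pooled = F.adaptive_avg_pool2d(feat, (n_stripes, 1))  # (N*T, D, n_stripes, 1)
        regions.append(pooled.squeeze(-1).permute(0, 2, 1))   # (N*T, n_stripes, D)
    region_feat = torch.cat(regions, dim=1)  # (N*T, 7, D): one D-dim vector per region
    return global_feat, region_feat

# Usage: 4 tracklets of 8 frames each, 2048-dim maps of spatial size 16x8
b, t, d, h, w = 4, 8, 2048, 16, 8
feat = torch.randn(b * t, d, h, w)
g, r = global_and_pyramid_features(feat)
print(g.shape, r.shape)  # torch.Size([32, 2048]) torch.Size([32, 7, 2048])
```

Because each of the 7 stripes covers a fixed vertical band of the feature map, a human part label (head, torso, legs, etc.) can be roughly assigned to each stripe, which is how the AlphaPose joints are matched to regions without needing exact joint coordinates in feature-map space.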