chenyilun95 / tf-cpn

Cascaded Pyramid Network for Multi-Person Pose Estimation (CVPR 2018)

about heatmap size

akziq opened this issue · comments

commented

Hi @chenyilun95, great work!
What if the heatmap were generated at the same size as the input image (image: 256x192, heatmap: 256x192)? Would that increase AP thanks to the pixel-to-pixel match?
Thanks.

The first question is how to upsample to generate the final output heatmap. There are a few options:

  1. Bilinear upsampling: it gives more accurate gradient back-propagation for each pixel. At test time, however, directly upsampling cannot add real resolution to the heatmap, which probably reduces the gain. A similar experiment was done in https://github.com/chenyilun95/tf-cpn/issues/4, which may suggest that better gradients at high resolution do not help.
  2. Skip connections to the lower-level feature maps: their semantics are probably not clear enough.
  3. Deconvolution: recent work (Simple Baselines for Human Pose Estimation and Tracking) reports that deconvolution layers work fine, but it still only upsamples the output to 64x48. If that works, it might work at a higher output resolution as well.
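To make the deconvolution option concrete, here is a minimal sketch of the size arithmetic only. The kernel size 4, stride 2, padding 1 settings and the stride-32 backbone are assumptions chosen for illustration, and `deconv_out` and `upsample_chain` are hypothetical helper names, not part of tf-cpn.

```python
def deconv_out(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Output length of one transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def upsample_chain(size, num_layers):
    """Apply `num_layers` identical transposed convolutions in sequence."""
    for _ in range(num_layers):
        size = deconv_out(size)
    return size


# Assumed backbone output for a 256x192 input at stride 32: 8x6.
h, w = 256 // 32, 192 // 32

print(upsample_chain(h, 3), upsample_chain(w, 3))  # 64 48
print(upsample_chain(h, 5), upsample_chain(w, 5))  # 256 192
```

Under these assumptions, three layers reach 64x48 and two more would reach the full 256x192. Whether the extra layers help accuracy is exactly the open question here.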

Nevertheless, that's only my view. Experimental results speak louder!

commented

I apologize for being ambiguous.
My question is this: the network's last layer outputs 64x48, which is (W/4, H/4).
What about changing the last layer's output to 256x192, which is (W, H)?
The path would then be: orig-img (W, H) -> (W/2, H/2) -> (W/4, H/4) -> ... -> (W/4, H/4) -> (W/2, H/2) -> (W, H) (pre-heatmap).

Would a pixel-to-pixel match between orig-img and pre-heatmap increase AP?

Thanks for your response.
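One way to quantify what a pixel-to-pixel match could buy is the quantization error of plain argmax decoding. The sketch below assumes the simple convention `x_img = x_hm * stride` with no sub-pixel refinement, which real decoders often add; `decode_error` and `max_error` are hypothetical names used only for this illustration.

```python
def decode_error(x_true, stride):
    """Error in image pixels after snapping x_true to the nearest heatmap cell."""
    x_hm = round(x_true / stride)      # nearest heatmap index
    return abs(x_hm * stride - x_true)  # decode back and compare


def max_error(width, stride):
    """Worst-case decoding error over all integer image coordinates."""
    return max(decode_error(x, stride) for x in range(width))


print(max_error(192, 4))  # 2
print(max_error(192, 1))  # 0
```

So under this assumption a (W/4, H/4) heatmap costs at most 2 pixels of localization error per axis, which bounds the gain that a full-resolution output could recover from quantization alone.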

Excuse me... I'm confused now... How would you change the last layer's output to 256x192?

commented

For example:
1. Add an intermediate layer at (W/2, H/2) using bilinear upsampling, deconvolution, or a skip connection.
2. Then apply bilinear upsampling, deconvolution, or a skip connection again to reach (W, H).
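As a rough illustration of the bilinear variant of these two steps, the sketch below upsamples a tiny heatmap by 2x twice in pure Python. The half-pixel-center sampling with edge clamping is an assumption (frameworks differ on this), and `upsample2x_bilinear` and `argmax2d` are hypothetical helpers.

```python
def upsample2x_bilinear(hm):
    """Upsample a 2-D list by 2x (half-pixel centers, edges clamped)."""
    h, w = len(hm), len(hm[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for y in range(2 * h):
        sy = min(max((y + 0.5) / 2 - 0.5, 0.0), h - 1)  # source row
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        for x in range(2 * w):
            sx = min(max((x + 0.5) / 2 - 0.5, 0.0), w - 1)  # source column
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = hm[y0][x0] * (1 - fx) + hm[y0][x1] * fx
            bot = hm[y1][x0] * (1 - fx) + hm[y1][x1] * fx
            out[y][x] = top * (1 - fy) + bot * fy
    return out


def argmax2d(hm):
    """Return (row, col) of the first maximum value."""
    best = max(max(row) for row in hm)
    for r, row in enumerate(hm):
        for c, v in enumerate(row):
            if v == best:
                return r, c


# Toy (H/4, W/4) heatmap: 4 rows x 3 columns with one peak at row 2, col 1.
coarse = [[0.0] * 3 for _ in range(4)]
coarse[2][1] = 1.0

half = upsample2x_bilinear(coarse)  # (H/2, W/2)
full = upsample2x_bilinear(half)    # (H, W)
print(len(full), len(full[0]), argmax2d(full))
```

The upsampled peak stays around the scaled position of the coarse peak (centered near row 9.5, column 5.5 in this toy case). Interpolation alone adds no new localization information, so any gain would have to come from learned layers or from training with a full-resolution target.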

Hmm... then I think my comments above already answer that. In general, I tend to think it won't work, considering both efficiency and effectiveness.
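On the efficiency side, a back-of-the-envelope count of output heatmap elements shows the scale of the cost. The keypoint count of 17 is an assumption (the COCO keypoint set), and `heatmap_elements` is a hypothetical helper.

```python
NUM_KEYPOINTS = 17  # assumption: COCO keypoint count


def heatmap_elements(height, width, num_keypoints=NUM_KEYPOINTS):
    """Number of values in one person's output heatmap tensor."""
    return num_keypoints * height * width


quarter = heatmap_elements(64, 48)    # (W/4, H/4) output
full = heatmap_elements(256, 192)     # (W, H) output
print(quarter, full, full // quarter)  # 52224 835584 16
```

The output tensor grows by 16x, and any layers that run at the higher resolutions add activation memory and compute on top of that.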

commented

@chenyilun95, thank you, I get it.
I notice that most people set the last layer's output to 64x64 (Hourglass Net, etc.) or 64x48 (yours).
So is (W/4, H/4) the best practice for the last layer's output?
Thanks for your response. I will close this issue.