LuoweiZhou / VLP

Vision-Language Pre-training for Image Captioning and Question Answering

Require a quick start for simple usage...

wubowen416 opened this issue · comments

Hi, I just want to test the captioning results on some raw images. I have read vlp/decode_img2txt.py, but the settings are a bit complicated for me, for example, the expected input image size.

It would be very kind of you to provide a simple usage example.

I really appreciate any help you can provide.

I second this.
Would be much appreciated

@wubowen416 @mikkelmedm Thanks for your interest in our work. There is currently no easy way. You will need to extract image features using Detectron (https://github.com/LuoweiZhou/detectron-vlp#vlp) and then set the paths to your feature/proposal .h5 files accordingly. We were using Caffe2 Detectron back in 2019, which has compatibility issues with PyTorch Detectron2, so an end-to-end solution for the current repo is quite tricky.
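For reference, here is a minimal sketch of packaging per-image region features into an .h5 file with h5py. The group/dataset names ("features", "boxes") and the shapes (100 regions, 2048-d features, 4-d boxes) are assumptions for illustration only; the exact layout expected by decode_img2txt.py comes from the detectron-vlp extraction scripts, so check their output before relying on this.

```python
import h5py
import numpy as np

# Hypothetical layout: one group per image id, with region features
# and proposal boxes as datasets. Shapes and names are assumptions,
# not the exact format the VLP loader expects.
num_regions, feat_dim = 100, 2048
features = np.random.rand(num_regions, feat_dim).astype(np.float32)
boxes = np.random.rand(num_regions, 4).astype(np.float32)

with h5py.File("region_feat.h5", "w") as f:
    grp = f.create_group("coco_img_000001")        # one group per image id
    grp.create_dataset("features", data=features)  # region features
    grp.create_dataset("boxes", data=boxes)        # proposal boxes

# Reading the features back:
with h5py.File("region_feat.h5", "r") as f:
    feats = f["coco_img_000001/features"][()]
    print(feats.shape)  # (100, 2048)
```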

Thanks for the reply and info

Do you mean that the input to this repo is the output of Detectron on images? So after the features are extracted with Detectron, an end-to-end pipeline (detection features -> caption) can be realized within this repo?

Yes, except that you will also need to set up the annotation file fed in via --src_file (see $DATA_ROOT/COCO/annotations/dataset_coco.json for the example format).
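To illustrate, below is a hedged sketch of building a minimal annotation file in the style of the Karpathy-split dataset_coco.json. The field names follow the common Karpathy format, but verify them against the example file under $DATA_ROOT/COCO/annotations/dataset_coco.json before use; the filenames and caption text here are placeholders.

```python
import json

# One entry per image, in the Karpathy-split style. The caption is a
# placeholder -- at decode time you only need the image metadata.
entry = {
    "filepath": "my_images",   # sub-directory under the image root (assumption)
    "filename": "dog.jpg",     # should match the image id used in the feature files
    "imgid": 0,
    "split": "test",           # decoding is typically run on the "test" split
    "sentids": [0],
    "sentences": [
        {"raw": "a placeholder caption .",
         "tokens": ["a", "placeholder", "caption", "."],
         "imgid": 0, "sentid": 0}
    ],
}

with open("my_annotations.json", "w") as f:
    json.dump({"images": [entry], "dataset": "coco"}, f, indent=2)
```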

Hey @LuoweiZhou
Would this work for feature extraction https://github.com/airsplay/py-bottom-up-attention?

Many thanks. I will try!

@amil-rp-work No, our detection model is trained separately from the original bottom-up detector, even though we use the same VG data and annotations (with slightly different configurations). Our pre-trained models will not work with the original bottom-up features. If you do your own training, those features should still work fine.