zyang-ur / onestage_grounding

A Fast and Accurate One-Stage Approach to Visual Grounding, ICCV 2019 (Oral)

Training on custom image

allenkei opened this issue

Hi,

Thank you for the great work!

Just wondering how I can test on my own images, for which I have referring sentences. What do I need to prepare or process for testing?

Thank you very much for helping in advance.

Hi allenkei,

Thank you for the question.

The short answer is to prepare the encoded image and word indexes that the training loop consumes:

for batch_idx, (imgs, word_id, word_mask, bbox) in enumerate(train_loader):

You could print the types and values at this line to get a better understanding of the required inputs.
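For reference, here is a minimal sketch of inspecting one batch; the variable names follow the loop above, and the exact shapes printed depend on your image size and query length settings.

for batch_idx, (imgs, word_id, word_mask, bbox) in enumerate(train_loader):
    print(type(imgs), imgs.shape)            # image tensor, e.g. [batch, 3, 256, 256]
    print(type(word_id), word_id.shape)      # BERT token ids for each query
    print(type(word_mask), word_mask.shape)  # attention mask over the query tokens
    print(type(bbox), bbox.shape)            # ground-truth box for each image
    break  # one batch is enough to see the expected types and shapes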

If you just have a few images and wish to run a quick demo, you could prepare the processed inputs and feed them to a pre-trained model for inference. Specifically, get the resized image:

img, _, ratio, dw, dh = letterbox(img, None, self.imsize)

and the word indexes (the block under the "## encode phrase to bert input" comment in the dataset code).
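As a rough illustration, a single image and phrase could be prepared along these lines. Here letterbox is the repo's resize helper referenced above (its import path below is a guess), the BERT setup mirrors the repo's use of pytorch_pretrained_bert, and the image size and query length are assumed defaults; the repo's dataset also normalizes the image tensor, which is omitted here.

import cv2
import torch
from pytorch_pretrained_bert.tokenization import BertTokenizer
from utils.transforms import letterbox  # repo helper; adjust the path if it differs in your checkout

imsize, query_len = 256, 128                          # assumed defaults; check the training args
img = cv2.imread('my_image.jpg')
img, _, ratio, dw, dh = letterbox(img, None, imsize)  # keep ratio/dw/dh to map the predicted box back
img = torch.from_numpy(img.transpose(2, 0, 1)).float().unsqueeze(0)  # HWC -> [1, 3, imsize, imsize]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = ['[CLS]'] + tokenizer.tokenize('the sand on the left of the screen')[:query_len - 2] + ['[SEP]']
word_id = tokenizer.convert_tokens_to_ids(tokens)
word_mask = [1] * len(word_id)
word_id = word_id + [0] * (query_len - len(word_id))        # pad to the fixed query length
word_mask = word_mask + [0] * (query_len - len(word_mask))
word_id = torch.tensor([word_id])
word_mask = torch.tensor([word_mask])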

If you wish to train and test on another dataset, you could check how the dataset index list is loaded:

self.images += torch.load(imgset_path)

and try to generate a similar data structure for your own data.
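A quick way to see what that structure looks like is to load one of the shipped index files and print an entry; the path below is just a placeholder for whichever split you downloaded.

import torch

entries = torch.load('data/flickr/flickr_train.pth')  # placeholder path to a shipped index file
print(len(entries))  # number of samples in the split
print(entries[0])    # inspect one entry to see the fields your custom index needs to mirror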

Please let me know if you have any questions. Thank you :)

Dear Zhengyuan,

Thank you for such a detailed answer!

I have tried to put my images in one folder and then generate the same data structure as in the (.pth) dataset index list you suggested. Below I have some questions regarding the structure. Take one image as an example:

('7897.jpg',
 '7897_5.pth',                     <- what is this, and how do I generate this file?
 [0, 174, 479, 359],
 'the sand on the left of the screen',
 [('r1', ['sand']),                <- what do r1-r8 represent, and how can I
  ('r2', ['none']),                   generate the same output for this part?
  ('r3', ['none']),
  ('r4', ['none']),
  ('r5', ['prep_on_left']),
  ('r6', ['screen']),
  ('r7', ['none']),
  ('r8', ['none'])])

If I understand correctly, I can train on my own dataset if I have all the images in the target folder and a dataset index list with the same structure. Do I still need to provide anything else besides the images and the (.pth) index file?

Thank you so much again for patiently answering my questions. Sorry for so many questions from a beginner like me...

Thank you for the questions :)

Sorry for the confusion; the .pth and r1-r8 fields are related to some early explorations that were later suspended and should have been removed from the data index cache (the .pth is a mask annotation, and r1-r8 are sentence parsing results).

In short, these fields are not necessary; the image, box, and query ('7897.jpg', [0, 174, 479, 359], 'the sand on the left of the screen') are the required inputs, which the dataset unpacks here:

img_file, bbox, phrase = self.images[idx]

The index for the Flickr dataset should already have that unrelated data cache cleaned out.
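So a custom index could be as small as a list of (image file, box, query) tuples saved with torch.save; the file names, the save path, and the [x1, y1, x2, y2] box convention below are assumptions to verify against the dataset code.

import torch

my_index = [
    ('7897.jpg', [0, 174, 479, 359], 'the sand on the left of the screen'),
    ('my_0001.jpg', [34, 120, 310, 405], 'the dog next to the bench'),
    # ... one tuple per referring expression: image filename, box, query
]
torch.save(my_index, 'data/mydataset/mydataset_train.pth')  # placeholder path for your split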

Dear Zhengyuan,

Thank you so much again for your detailed and supportive answer! I will construct the index accordingly and start training the model. I very much enjoyed reading your great work.