zyang-ur / onestage_grounding

A Fast and Accurate One-Stage Approach to Visual Grounding, ICCV 2019 (Oral)

Training on custom image

allenkei opened this issue

Hi,

Thank you for the great work!

Just wondering how I can test on my own images, for which I have referring sentences. What do I need to prepare or process for testing?

Thank you very much for helping in advance.

Hi allenkei,

Thank you for the question.

The short answer is to prepare the encoded image and word indexes that the training loop consumes:

for batch_idx, (imgs, word_id, word_mask, bbox) in enumerate(train_loader):

You could print the types and values at this line to get a better understanding of the required inputs.
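For reference, here is a minimal sketch of inspecting one batch; the variable names follow the loop above, and the exact shapes printed depend on your image size and query length settings.

for batch_idx, (imgs, word_id, word_mask, bbox) in enumerate(train_loader):
    print(type(imgs), imgs.shape)            # image tensor, e.g. [batch, 3, 256, 256]
    print(type(word_id), word_id.shape)      # BERT token ids for each query
    print(type(word_mask), word_mask.shape)  # attention mask over the query tokens
    print(type(bbox), bbox.shape)            # ground-truth box for each image
    break  # one batch is enough to see the expected types and shapes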

If you just have a few images and wish to run a quick demo, you could prepare the processed inputs and feed them to a pre-trained model for inference. Specifically, get the resized image:

img, _, ratio, dw, dh = letterbox(img, None, self.imsize)

and the word indexes (the block under the "## encode phrase to bert input" comment in the dataset code).
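As a rough illustration, a single image and phrase could be prepared along these lines. Here letterbox is the repo's resize helper referenced above (its import path below is a guess), the BERT setup mirrors the repo's use of pytorch_pretrained_bert, and the image size and query length are assumed defaults; the repo's dataset also normalizes the image tensor, which is omitted here.

import cv2
import torch
from pytorch_pretrained_bert.tokenization import BertTokenizer
from utils.transforms import letterbox  # repo helper; adjust the path if it differs in your checkout

imsize, query_len = 256, 128                          # assumed defaults; check the training args
img = cv2.imread('my_image.jpg')
img, _, ratio, dw, dh = letterbox(img, None, imsize)  # keep ratio/dw/dh to map the predicted box back
img = torch.from_numpy(img.transpose(2, 0, 1)).float().unsqueeze(0)  # HWC -> [1, 3, imsize, imsize]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = ['[CLS]'] + tokenizer.tokenize('the sand on the left of the screen')[:query_len - 2] + ['[SEP]']
word_id = tokenizer.convert_tokens_to_ids(tokens)
word_mask = [1] * len(word_id)
word_id = word_id + [0] * (query_len - len(word_id))        # pad to the fixed query length
word_mask = word_mask + [0] * (query_len - len(word_mask))
word_id = torch.tensor([word_id])
word_mask = torch.tensor([word_mask])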

If you wish to train and test on another dataset, you could check how the dataset index list is loaded:

self.images += torch.load(imgset_path)

and try to generate a similar data structure for your own data.
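A quick way to see what that structure looks like is to load one of the shipped index files and print an entry; the path below is just a placeholder for whichever split you downloaded.

import torch

entries = torch.load('data/flickr/flickr_train.pth')  # placeholder path to a shipped index file
print(len(entries))  # number of samples in the split
print(entries[0])    # inspect one entry to see the fields your custom index needs to mirror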

Please let me know if you have any questions. Thank you :)

Dear Zhengyuan,

Thank you for such a detailed answer!

I have tried to put my images in one folder and then generate the same data structure as in the (.pth) dataset index list you suggested. Below I have some questions regarding the structure. Take one image as an example:

('7897.jpg',
 '7897_5.pth',                     <- what is this, and how do I generate this file?
 [0, 174, 479, 359],
 'the sand on the left of the screen',
 [('r1', ['sand']),                <- what do r1-r8 represent, and how can I
  ('r2', ['none']),                   generate the same output for this part?
  ('r3', ['none']),
  ('r4', ['none']),
  ('r5', ['prep_on_left']),
  ('r6', ['screen']),
  ('r7', ['none']),
  ('r8', ['none'])])

If I understand correctly, I can train on my own dataset if I have all the images in the target folder and a dataset index list with the same structure. Do I still need to provide anything else besides the images and the (.pth) index file?

Thank you so much again for patiently answering my questions. Sorry for so many questions from a beginner like me...

Thank you for the questions :)

Sorry for the confusion; the .pth and r1-r8 fields are related to some early explorations that were later suspended and should have been removed from the data index cache (the .pth is a mask annotation, and r1-r8 are sentence parsing results).

In short, these fields are not necessary; the image, box, and query ('7897.jpg', [0, 174, 479, 359], 'the sand on the left of the screen') are the required inputs, which the dataset unpacks here:

img_file, bbox, phrase = self.images[idx]

The index for the Flickr dataset should already have that unrelated data cache cleaned out.
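So a custom index could be as small as a list of (image file, box, query) tuples saved with torch.save; the file names, the save path, and the [x1, y1, x2, y2] box convention below are assumptions to verify against the dataset code.

import torch

my_index = [
    ('7897.jpg', [0, 174, 479, 359], 'the sand on the left of the screen'),
    ('my_0001.jpg', [34, 120, 310, 405], 'the dog next to the bench'),
    # ... one tuple per referring expression: image filename, box, query
]
torch.save(my_index, 'data/mydataset/mydataset_train.pth')  # placeholder path for your split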

Dear Zhengyuan,

Thank you so much again for your detailed and supportive answer! I will construct the index accordingly and start training the model. I very much enjoyed reading your great work.