zengyan-97 / X-VLM

X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)

About batch sampling `iter_perc`

yangbang18 opened this issue · comments

Thanks for your code.

I note that in your paper, you said "We sample the data by making half of the images in a batch containing bounding box annotations".

But the code is:

X-VLM/Pretrain.py

Lines 82 to 121 in e7b9602

```python
if random.random() < config['regions']['iter_perc']:
    try:
        region_batch = next(subarea_iter)
    except StopIteration:
        subarea_iter = iter(region_loader)
        region_batch = next(subarea_iter)

    image, region_batch = region_batch[0].to(device, non_blocking=True), [
        t.to(device) if t is not None else None for t in region_batch[1:]]

    idx_to_group_img, text_ids, text_atts, text_ids_masked, masked_pos, masked_ids, \
        image_atts, target_bbox, is_image = region_batch

    if config['calc_image_bbox_loss']:
        is_image = None

    optimizer.zero_grad()
    loss_itc, loss_itm, loss_mlm, loss_bbox, loss_giou = \
        model(image, text_ids, text_atts, text_ids_masked=text_ids_masked,
              masked_pos=masked_pos, masked_ids=masked_ids,
              image_atts=image_atts, idx_to_group_img=idx_to_group_img,
              target_bbox=target_bbox, is_image=is_image, ret_bbox_loss=True)

    loss = loss_itc + loss_itm + loss_mlm + loss_bbox + loss_giou

    accelerator.backward_step(loss, optimizer)

    accelerator_clip_grad_norm = float(config['accelerator']['CLIP_GRAD_NORM'])
    if accelerator_clip_grad_norm > 0:
        accelerator.optimizer_step(optimizer, model, accelerator_clip_grad_norm)
    optimizer.step()

    metric_logger.update(loss_bbox=loss_bbox.item())
    metric_logger.update(loss_giou=loss_giou.item())
else:
    # fix it
    metric_logger.update(loss_bbox=0.5)
    metric_logger.update(loss_giou=0.5)

image, batch = batch[0].to(device, non_blocking=True), [
    t.to(device) if t is not None else None for t in batch[1:]]
text_ids, text_atts, text_ids_masked, masked_pos, masked_ids = batch
```

The `iter_perc` you used is 0.5, which means that only 50% of the time does the model take both a batch of image-text-box data and a batch of image-text data as input; otherwise, the model takes only a batch of image-text data.
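To make the sampling behaviour concrete, here is a minimal simulation (not the actual training loop; the helper name `count_region_steps` is just for illustration) of how often the `random.random() < iter_perc` branch fires:

```python
import random

def count_region_steps(iter_perc, num_steps, seed=0):
    """Count how many of num_steps training iterations would also
    draw an image-text-box (region) batch, mimicking the
    `random.random() < config['regions']['iter_perc']` check."""
    rng = random.Random(seed)
    return sum(1 for _ in range(num_steps) if rng.random() < iter_perc)

steps = 10_000
# With iter_perc = 0.5, roughly half of the iterations see box data.
print(count_region_steps(0.5, steps) / steps)   # approximately 0.5
# With iter_perc = 1.0, every iteration sees box data,
# which would match the paper's "half of the images in a batch" claim
# at the level of iterations rather than images.
print(count_region_steps(1.0, steps) / steps)
```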

Therefore, it seems that `iter_perc = 1.0` would fit your statement in the paper.

Judging from the ablation study in Table 4, you have evidently tested `iter_perc = 0.0` (corresponding to the "X-VLM w/o all" model).

So, have you tested other values of `iter_perc` (e.g., 1.0)?

Hi,

Sorry for my late reply.
I used `iter_perc = 1.0` (or at least I did in X2-VLM); I don't remember clearly.