zengyan-97 / X-VLM

X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)

About batch sampling `iter_perc`

yangbang18 opened this issue · comments

Thanks for your code.

I note that in your paper, you said "We sample the data by making half of the images in a batch containing bounding box annotations".

But the code is:

X-VLM/Pretrain.py

Lines 82 to 121 in e7b9602

```python
if random.random() < config['regions']['iter_perc']:
    try:
        region_batch = next(subarea_iter)
    except StopIteration:
        subarea_iter = iter(region_loader)
        region_batch = next(subarea_iter)

    image, region_batch = region_batch[0].to(device, non_blocking=True), [
        t.to(device) if t is not None else None for t in region_batch[1:]]

    idx_to_group_img, text_ids, text_atts, text_ids_masked, masked_pos, masked_ids, \
        image_atts, target_bbox, is_image = region_batch

    if config['calc_image_bbox_loss']:
        is_image = None

    optimizer.zero_grad()
    loss_itc, loss_itm, loss_mlm, loss_bbox, loss_giou = \
        model(image, text_ids, text_atts, text_ids_masked=text_ids_masked,
              masked_pos=masked_pos, masked_ids=masked_ids,
              image_atts=image_atts, idx_to_group_img=idx_to_group_img,
              target_bbox=target_bbox, is_image=is_image, ret_bbox_loss=True)

    loss = loss_itc + loss_itm + loss_mlm + loss_bbox + loss_giou

    accelerator.backward_step(loss, optimizer)

    accelerator_clip_grad_norm = float(config['accelerator']['CLIP_GRAD_NORM'])
    if accelerator_clip_grad_norm > 0:
        accelerator.optimizer_step(optimizer, model, accelerator_clip_grad_norm)
    optimizer.step()

    metric_logger.update(loss_bbox=loss_bbox.item())
    metric_logger.update(loss_giou=loss_giou.item())
else:
    # fix it
    metric_logger.update(loss_bbox=0.5)
    metric_logger.update(loss_giou=0.5)

image, batch = batch[0].to(device, non_blocking=True), [
    t.to(device) if t is not None else None for t in batch[1:]]
text_ids, text_atts, text_ids_masked, masked_pos, masked_ids = batch
```

The `iter_perc` you used is 0.5, which means that only 50% of the time does the model take both a batch of image-text-box data and a batch of image-text data as input; otherwise, the model takes only a batch of image-text data.
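To make the sampling behaviour concrete, here is a minimal simulation (not the actual training loop; the helper name `count_region_steps` is just for illustration) of how often the `random.random() < iter_perc` branch fires:

```python
import random

def count_region_steps(iter_perc, num_steps, seed=0):
    """Count how many of num_steps training iterations would also
    draw an image-text-box (region) batch, mimicking the
    `random.random() < config['regions']['iter_perc']` check."""
    rng = random.Random(seed)
    return sum(1 for _ in range(num_steps) if rng.random() < iter_perc)

steps = 10_000
# With iter_perc = 0.5, roughly half of the iterations see box data.
print(count_region_steps(0.5, steps) / steps)   # approximately 0.5
# With iter_perc = 1.0, every iteration sees box data,
# which would match the paper's "half of the images in a batch" claim
# at the level of iterations rather than images.
print(count_region_steps(1.0, steps) / steps)
```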

Therefore, it seems that `iter_perc = 1.0` would fit your statement in the paper.

Judging from the ablation study in Table 4, you have evidently tested `iter_perc = 0.0` (corresponding to the "X-VLM w/o all" model).

So, have you tested other values of `iter_perc` (e.g., 1.0)?

Hi,

Sorry for my late reply.
I used `iter_perc = 1.0` (or at least I did in X2-VLM); I don't remember clearly.