aimagelab / meshed-memory-transformer

Meshed-Memory Transformer for Image Captioning. CVPR 2020

Generating HDF5 detections from custom dataset or bottom-up-attention TSV

SandroJijavadze opened this issue

commented

I have a custom dataset.

I have generated the detections TSV using https://github.com/airsplay/py-bottom-up-attention,
but the model requires HDF5.

The TSV has these fields for each example:

{
   'image_id': image_id,
   'image_h': np.size(im, 0),
   'image_w': np.size(im, 1),
   'num_boxes' : len(keep_boxes),
   'boxes': base64.b64encode(cls_boxes[keep_boxes]),
   'features': base64.b64encode(pool5[keep_boxes])
}  

When I examine the COCO detections file, I see, for example:

>>> dts["35368_boxes"]
<HDF5 dataset "35368_boxes": shape (37, 4), type "<f4">
>>> dts["35368_features"]
<HDF5 dataset "35368_features": shape (37, 2048), type "<f4">
>>> dts["35368_cls_prob"]
<HDF5 dataset "35368_cls_prob": shape (37, 1601), type "<f4">
>>> dts["35368_boxes"][36]
array([349.57147, 154.07967, 420.0327 , 408.64462], dtype=float32)

I'll try to work out from the code how to convert my TSV to the required HDF5 myself, but a guide would be appreciated.

Thank you.

commented

Did you solve this problem?

commented

@whongchen
No, unfortunately.
I am going to try to figure out the process myself this week and will post an update if I do.
Please comment if you find anything useful.

I'm working on this too. I still haven't done it myself, but I think you just need to convert the TSV into an HDF5 file; it has nothing to do with the M2T or py-bottom-up-attention code.
Read your TSV using csv or pandas, then use a library like h5py to save the data in HDF5 format under names ending in "_boxes", "_features" and "_cls_prob", which hold the bounding box corners, feature vectors and class probabilities respectively, as specified in the M2T repo README.
I believe it would be straightforward, though I don't know how much time it would take.
Let me know if you manage to do it.
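
For anyone landing here, the conversion described above could look roughly like the sketch below. It is only an illustration under several assumptions: the TSV columns follow the dict shown in the first post, the base64 fields are plain text holding raw float32 values, and the image ids are numeric. The names `decode_array` and `tsv_to_hdf5` are made up for this example. The TSV in this thread has no class probabilities, so no "_cls_prob" dataset is written.

```python
import base64
import csv

import numpy as np

# Column order assumed to match the dict shown in the first post.
FIELDNAMES = ["image_id", "image_h", "image_w", "num_boxes", "boxes", "features"]


def decode_array(b64_text, num_boxes):
    """Decode a base64 field into a float32 array with one row per box."""
    raw = base64.b64decode(b64_text)
    return np.frombuffer(raw, dtype=np.float32).reshape(num_boxes, -1)


def tsv_to_hdf5(tsv_path, hdf5_path):
    """Write <image_id>_boxes and <image_id>_features datasets for each TSV row."""
    import h5py  # imported here so decode_array can be used without h5py installed

    # Feature fields are very long strings, so raise the csv field size limit.
    csv.field_size_limit(2**31 - 1)
    with open(tsv_path, newline="") as tsv_file, h5py.File(hdf5_path, "w") as out:
        reader = csv.DictReader(tsv_file, delimiter="\t", fieldnames=FIELDNAMES)
        for row in reader:
            key = str(int(row["image_id"]))  # drops leading zeros; assumes numeric ids
            num_boxes = int(row["num_boxes"])
            out.create_dataset(key + "_boxes", data=decode_array(row["boxes"], num_boxes))
            out.create_dataset(key + "_features", data=decode_array(row["features"], num_boxes))
```

If your TSV was written with a different dtype, or the base64 fields were saved as Python bytes literals (starting with b'), the decoding step would need adjusting.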

Hi everyone,
thank you @eugeniotonanzi for your answer; that should solve the problem exactly.
Once you have an HDF5 file for your custom dataset in the same format, the model should work as expected.
Let us know if you have any other issues.
Best,
Matteo
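
One way to check that a custom file is "in the same format" is to compare its shapes with the COCO example from the first post, where features are (N, 2048) and boxes are (N, 4). The helper below is a hypothetical sketch, not part of the M2T code; `check_detections` and its parameters are names invented for this example.

```python
def check_detections(store, image_id, feature_dim=2048):
    """Return a list of problems for one image; an empty list means none were found.

    `store` is any mapping from dataset name to an object with a `.shape`,
    for example an open h5py.File.
    """
    features_key = "%d_features" % image_id
    if features_key not in store:
        return ["missing dataset %s" % features_key]

    shape = tuple(store[features_key].shape)
    if len(shape) != 2 or shape[1] != feature_dim:
        return ["%s has shape %s, expected (N, %d)" % (features_key, shape, feature_dim)]

    problems = []
    boxes_key = "%d_boxes" % image_id
    if boxes_key in store:
        # Boxes are optional here, but if present they should have one row per feature row.
        boxes_shape = tuple(store[boxes_key].shape)
        if boxes_shape != (shape[0], 4):
            problems.append(
                "%s has shape %s, expected (%d, 4)" % (boxes_key, boxes_shape, shape[0])
            )
    return problems
```

With a real file you would pass an open handle, e.g. `with h5py.File(path, "r") as f: print(check_detections(f, 35368))`, where `path` points to your own detections file.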

commented

That solved it, closing this issue.
Thank you.

Have you solved this problem? Would it be possible to release the relevant code? Thank you!

Hi, have you solved this problem? Would it be possible to release the relevant code? Thank you very much.

@eugeniotonanzi thanks for your advice. I'm working on it right now, but maybe you've already implemented it?

commented

@hwbhwbgao @ksz-creat @MikeMACintosh
I didn't see your replies.
Unfortunately, I can't share the whole code, but I will share the relevant bits.
I modified two methods in https://github.com/peteanderson80/bottom-up-attention:

def get_detections_from_im(net, im_file, image_id, conf_thresh=0.2):
    im = cv2.imread(im_file)
    scores, boxes, attr_scores, rel_scores = im_detect(net, im)

    # Keep the original boxes, don't worry about the regression bbox outputs
    rois = net.blobs['rois'].data.copy()
    # unscale back to raw image space
    blobs, im_scales = _get_blobs(im, None)

    cls_boxes = rois[:, 1:5] / im_scales[0]
    cls_prob = net.blobs['cls_prob'].data
    pool5 = net.blobs['pool5_flat'].data

    # Keep only the best detections
    max_conf = np.zeros((rois.shape[0]))
    for cls_ind in range(1,cls_prob.shape[1]):
        cls_scores = scores[:, cls_ind]
        dets = np.hstack((cls_boxes, cls_scores[:, np.newaxis])).astype(np.float32)
        keep = np.array(nms(dets, cfg.TEST.NMS))
        max_conf[keep] = np.where(cls_scores[keep] > max_conf[keep], cls_scores[keep], max_conf[keep])

    keep_boxes = np.where(max_conf >= conf_thresh)[0]
    if len(keep_boxes) < MIN_BOXES:
        keep_boxes = np.argsort(max_conf)[::-1][:MIN_BOXES]
    elif len(keep_boxes) > MAX_BOXES:
        keep_boxes = np.argsort(max_conf)[::-1][:MAX_BOXES]
    featureid = "".join([s.lstrip("0") for s in image_id.split() if s.isdigit()])
    num_boxes = len(keep_boxes)
    cls_boxes = cls_boxes[keep_boxes].reshape((num_boxes, 4))
    cls_features = pool5[keep_boxes].reshape(num_boxes, 2048)
    cls_prob = cls_prob[keep_boxes].reshape(num_boxes, 1601)

    return (featureid + "_boxes", cls_boxes), (featureid + "_features", cls_features), (featureid + "_cls_prob", cls_prob)

https://github.com/peteanderson80/bottom-up-attention/blob/master/tools/generate_tsv.py#L140

def generate_hdf5(gpu_id, prototxt, weights, image_ids, outfile):
    wanted_ids = set([int(image_id[1]) for image_id in image_ids])
    found_ids = set()

    missing = wanted_ids - found_ids
    if len(missing) == 0:
        print 'GPU {:d}: already completed {:d}'.format(gpu_id, len(image_ids))
    else:
        print 'GPU {:d}: missing {:d}/{:d}'.format(gpu_id, len(missing), len(image_ids))
    if len(missing) > 0:
        caffe.set_mode_gpu()
        caffe.set_device(gpu_id)
        net = caffe.Net(prototxt, caffe.TEST, weights=weights)
        with h5py.File(outfile, 'w') as h5pyfile:
           # writer = csv.DictWriter(tsvfile, delimiter = '\t', fieldnames = FIELDNAMES)
            _t = {'misc' : Timer()}
            count = 0
            for im_file,image_id in image_ids:
                if int(image_id) in missing:
                    _t['misc'].tic()
                    boxes, features, probabilities = get_detections_from_im(net, im_file, image_id)
                    h5pyfile.create_dataset(boxes[0], data=boxes[1])
                    h5pyfile.create_dataset(features[0], data=features[1])
                    h5pyfile.create_dataset(probabilities[0], data=probabilities[1])
                    if (count % 100) == 0:
                        print 'GPU {:d}: {:d}/{:d} {:.3f}s (projected finish: {:.2f} hours)' \
                              .format(gpu_id, count+1, len(missing), _t['misc'].average_time,
                              _t['misc'].average_time*(len(missing)-count)/3600)
                    count += 1

Also, depending on how you have arranged your data, you will need to modify the "load_image_ids" method.
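
As a rough illustration of what such a modification might look like for a flat folder of images, here is a sketch. It is not the code used in this thread. It assumes every file name ends in a number that is unique within the folder, and it returns the ids as digit strings because the snippets above call both `int(image_id)` and `image_id.split()` on them. The signature and the `extensions` parameter are assumptions for this example.

```python
import os
import re


def load_image_ids(image_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return (file_path, image_id) pairs for every image file in image_dir.

    The id is the run of digits at the end of the file name, kept as a string,
    e.g. img_000123.jpg -> "000123".
    """
    split = []
    for name in sorted(os.listdir(image_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in extensions:
            continue  # skip anything that is not an image
        match = re.search(r"(\d+)$", stem)
        if match is None:
            raise ValueError("no numeric id in file name: " + name)
        split.append((os.path.join(image_dir, name), match.group(1)))
    return split
```

The resulting list has the same (file path, id) shape that generate_hdf5 above iterates over.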

You can use this docker image for environment:
https://hub.docker.com/r/airsplay/bottom-up-attention

Thank you very much!

Thank you very much for the work you did. By the way, I am not familiar with Docker; could you please tell me how to use the Docker image you provided? What should I modify? Looking forward to your reply!