rstrudel / segmenter

[ICCV2021] Official PyTorch implementation of Segmenter: Transformer for Semantic Segmentation

Code to compute images/sec

prabhuteja12 opened this issue

Hi,

Thank you for the cool work!

I see that you report images/sec, and mention the following in the paper:

To compute the images per second, we use a V100 GPU, fix the image resolution to 512 and for each model we maximize the batch size allowed by memory for a fair comparison.

I'm trying to do the same; however, I'm unable to reproduce the images/sec numbers you report in the paper.

I'm using a snippet based on PyTorch's benchmark utilities, as follows:

    import torch
    from torch.utils.benchmark import Timer

    batch = torch.rand(args.batch_size, *input_shape).cuda()
    model(batch)  # one forward pass to initialize CUDA before timing
    n_runs = 10

    # Timer runs the statement n_runs times and reports statistics;
    # m.mean is the average time per run in seconds
    t = Timer(stmt="model.forward(batch)", globals={"model": model, "batch": batch})
    m = t.timeit(n_runs)

The batch size that fits on a V100 with the ViT-T backbone is about 140, and the above code reports a timing of 0.62 seconds, so I compute images/sec = 140/0.62 = 225.8. This is almost half the number in Table 3. Can you please help me figure out what I need to do to get the reported result?
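One way to find that maximal batch size empirically is to double it until a forward pass no longer fits in memory. A rough sketch (not from the paper; `find_max_batch_size` is an ad-hoc helper and only probes powers of two):

    import torch

    def find_max_batch_size(model, resolution=512):
        # double the batch size until a forward pass runs out of memory
        bs = 1
        while True:
            try:
                x = torch.randn(2 * bs, 3, resolution, resolution, device="cuda")
                with torch.no_grad():
                    model(x)
                bs *= 2
            except RuntimeError:  # CUDA out of memory
                torch.cuda.empty_cache()
                return bs  # largest power of two that fits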

Thank you!

Hi @prabhuteja12,

To perform robust benchmarking, you can check the timm repository, in particular the following file:

https://github.com/rwightman/pytorch-image-models/blob/master/benchmark.py#L244

Two things I do not see in your snippet but that you can spot in timm are the following (a minimal sketch combining both comes after the list):

  1. warmup steps
  2. torch.no_grad
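
Roughly, they combine like this (a minimal sketch, not the actual timm code; `model` and `batch` are assumed to already live on the GPU):

    import time

    import torch

    @torch.no_grad()  # (2) no gradient bookkeeping while benchmarking
    def benchmark(model, batch, warmup_iters=10, num_iters=50):
        # (1) warmup: let cuDNN pick algorithms and caches settle
        for _ in range(warmup_iters):
            model(batch)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(num_iters):
            model(batch)
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
        return num_iters * batch.shape[0] / (time.perf_counter() - start)  # images/sec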

I hope this helps!

Hi @rstrudel,

Thank you for the quick reply, and the pointers to timm.

  • The PyTorch Timer documentation says that it runs a warmup before timing.
  • I missed it when pasting the snippet, but there is torch.set_grad_enabled(False) in my code.

Additionally, timm uses perf_counter for timing, and so does PyTorch's Timer.

So I'm unable to figure out what is causing this massive drop in speed. If you have access to your specific benchmarking script, can you share that?

Our snippet is based on the timm benchmarking code, and we do not plan to release a clean version of it for now.
Table 3 refers to Segmenter models with a linear decoder. Did you make sure you measured timings with this model and not with the mask decoder (which is more expensive)?
I will try to check this in the coming weeks.

@rstrudel I think I was able to get numbers in that range. The data and the model need to be in float16. Please confirm that this is what you did as well.
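
For reference, this is roughly what I mean (a minimal sketch; `model` stands for the Segmenter model, with the batch size and resolution from above):

    import torch

    # cast both the weights and the inputs to float16
    model = model.half().cuda().eval()
    x = torch.randn(140, 3, 512, 512, device="cuda", dtype=torch.float16)
    with torch.no_grad():
        y = model(x)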

Thanks!

Here is the snippet of code I used to check throughput.
Indeed, good call: I was using mixed precision, well spotted!

Your use of Timer is probably a cleaner way of doing it; I'm not sure it existed (or at least I did not know about it) at the time of the experiments.

import time

import torch

device = torch.device("cuda")


@torch.no_grad()
def compute_throughput(model, batch_size, resolution):
    torch.cuda.empty_cache()
    warmup_iters = 3
    num_iters = 30

    timing = []

    x = torch.randn(batch_size, 3, resolution, resolution, device=device)

    # warmup: a few untimed forward passes so memory allocation and
    # algorithm selection do not pollute the measurements
    for _ in range(warmup_iters):
        with torch.cuda.amp.autocast():
            y = model(x)

    MB = 1024.0 * 1024.0
    torch.cuda.synchronize()
    for i in range(num_iters):
        if i == 0:
            # peak GPU memory after warmup, in MB
            memory = int(torch.cuda.max_memory_allocated() / MB)
            print(f"memory: {memory} MB")
        start = time.time()
        # mixed-precision forward pass
        with torch.cuda.amp.autocast():
            y = model(x)
        # wait for the GPU to finish before stopping the clock
        torch.cuda.synchronize()
        timing.append(time.time() - start)

    timing = torch.as_tensor(timing, dtype=torch.float32)
    return batch_size / timing.mean()
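
For example, with the ViT-T setting discussed above, it could be called like this (hypothetical invocation; `model` stands for the Segmenter model):

    ips = compute_throughput(model.to(device).eval(), batch_size=140, resolution=512)
    print(f"{float(ips):.1f} images/sec")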

Great! Thank you!