OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2



Asynchronous execution: High latency when retrieving results

mvidela31 opened this issue · comments

Hi everyone and thanks for this amazing work!

I tried asynchronous execution to speed up generation, following the documentation example:

async_results = []
for batch in batch_generator(): # For-loop 1
    async_results.extend(generator.generate_batch(batch, asynchronous=True))

for async_result in async_results: # For-loop 2
    print(async_result.result())  # This method blocks until the result is available.

First I ran generator.generate_batch(batch, asynchronous=True) on a dataset of 1,000 samples with a batch size of 128 and device_index=[0, 1, 2, 3] (4 x NVIDIA Tesla T4). For-loop 1 finishes quickly and for-loop 2 completes almost immediately (~3 s). However, when I ran the same code on a dataset of 100,000 samples, for-loop 1 completed in ~20 min (which is fine for me), but for-loop 2 took ~5 min just to retrieve the results of the first 1,000 samples (the same samples as in the first run).

I think this performance difference (3 s vs. 5 min in async_result.result() on the same 1,000 samples) could be related to the limited queue size mentioned in the documentation. Is there a way to speed up the retrieval of the asynchronous results (for-loop 2) so that the processing speed of the first run is recovered?
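One way to avoid accumulating all futures before reading any of them is to interleave submission and retrieval with a small in-flight window. This is a minimal sketch of that pattern; it uses concurrent.futures as a stand-in for CTranslate2's async results (which also expose a blocking .result()), and fake_generate_batch / MAX_IN_FLIGHT are hypothetical names for illustration:

```python
import concurrent.futures
import time

# Stand-in for the translation backend: a thread pool whose futures
# expose .result(), similar to CTranslate2's asynchronous results.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def fake_generate_batch(batch):
    time.sleep(0.01)            # simulate per-batch GPU work
    return [x * 2 for x in batch]

def batch_generator():
    data = list(range(100))
    for i in range(0, len(data), 10):
        yield data[i:i + 10]

# Keep at most MAX_IN_FLIGHT batches pending; consume the oldest result
# as soon as the window is full, instead of submitting everything first.
MAX_IN_FLIGHT = 8
pending = []
results = []
for batch in batch_generator():
    pending.append(executor.submit(fake_generate_batch, batch))
    if len(pending) >= MAX_IN_FLIGHT:
        results.extend(pending.pop(0).result())  # oldest batch first

for future in pending:          # drain the remaining in-flight batches
    results.extend(future.result())

executor.shutdown()
print(len(results))
```

Because retrieval overlaps with ongoing computation, the consumer never faces a long backlog of results whose batches have not started yet.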

How do you measure the time for the results of the first 1,000 samples? Normally, when 100,000 samples are passed through for-loop 1, the first 1,000 samples have to finish before their slots in the queue are freed.
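The backpressure described above can be illustrated with a toy bounded queue: once the queue is full, submission blocks until a worker frees a slot, so by the time the submission loop returns, the earliest batches have already been processed. The queue size and timings here are arbitrary assumptions, not CTranslate2's actual internals:

```python
import queue
import threading
import time

# Toy model of a bounded work queue (maxsize=4 is an assumption).
work = queue.Queue(maxsize=4)
done = {}

def worker():
    # Consume batch ids until the None sentinel arrives.
    while True:
        item = work.get()
        if item is None:
            break
        time.sleep(0.01)        # simulate translating one batch
        done[item] = item * 2   # store the finished "result"
        work.task_done()

t = threading.Thread(target=worker)
t.start()

for batch_id in range(16):
    # Blocks whenever 4 batches are already queued, so early batches
    # finish while later ones are still being submitted.
    work.put(batch_id)

work.put(None)                  # signal the worker to stop
t.join()
print(len(done))
```

This is why the timing of for-loop 1 already includes most of the computation for the early batches: their results only become retrievable after the queue has cycled them through.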