Performance issues in /data_provider (by P3)
DLPerf opened this issue · comments
Hello! I've found a performance issue in /data_provider/data_feed_pipline.py: `batch()` should be called before `map()`, which could make your program more efficient. Here is the TensorFlow documentation that supports it.
Detailed description is listed below:
`dataset.batch(batch_size, drop_remainder=True)` (here) should be called before `dataset.map(map_func=tf_io_pipline_tools.decode, num_parallel_calls=CFG.TRAIN.CPU_MULTI_PROCESS_NUMS)` (here), `dataset.map(map_func=tf_io_pipline_tools.augment_for_train, num_parallel_calls=CFG.TRAIN.CPU_MULTI_PROCESS_NUMS)` (here), `dataset.map(map_func=tf_io_pipline_tools.augment_for_test, num_parallel_calls=CFG.TRAIN.CPU_MULTI_PROCESS_NUMS)` (here), and `dataset.map(map_func=tf_io_pipline_tools.normalize, num_parallel_calls=CFG.TRAIN.CPU_MULTI_PROCESS_NUMS)` (here).
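To illustrate the suggested reordering, here is a minimal sketch of the two pipeline orders. The `normalize` function below is a hypothetical stand-in for `tf_io_pipline_tools.normalize`, not the repo's actual implementation; it is chosen so that it broadcasts over a leading batch dimension and therefore works in either position.

```python
import tensorflow as tf

def normalize(x):
    # Hypothetical stand-in for tf_io_pipline_tools.normalize: plain
    # elementwise arithmetic broadcasts over any leading batch dimension,
    # so the same function works before or after batch().
    return tf.cast(x, tf.float32) / 255.0

images = tf.random.uniform([32, 8, 8, 3], maxval=256, dtype=tf.int32)

# Original order: run map() once per element, then batch.
ds_map_first = (tf.data.Dataset.from_tensor_slices(images)
                .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
                .batch(4, drop_remainder=True))

# Suggested order: batch first, then run map() once per batch
# (the "vectorized mapping" pattern from the tf.data performance guide).
ds_batch_first = (tf.data.Dataset.from_tensor_slices(images)
                  .batch(4, drop_remainder=True)
                  .map(normalize, num_parallel_calls=tf.data.AUTOTUNE))
```

Both pipelines yield identical batches; the second amortizes the per-call overhead of `map()` across the whole batch, which is where the potential speedup comes from.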
Besides, you need to check whether each function called in `map()` (e.g., `tf_io_pipline_tools.normalize` in `dataset.map(map_func=tf_io_pipline_tools.normalize, num_parallel_calls=CFG.TRAIN.CPU_MULTI_PROCESS_NUMS)`) is affected by the reordering, so that the changed code still works properly. For example, if `tf_io_pipline_tools.normalize` expected data of shape (x, y, z) before the fix, it will now receive data of shape (batch_size, x, y, z).
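As a sketch of the kind of breakage to watch for: a map function written against a single (x, y, z) image can silently compute the wrong thing on a (batch_size, x, y, z) input if it reduces over hard-coded axes. The `center_per_image` function below is a hypothetical example, not code from this repo.

```python
import tensorflow as tf

def center_per_image(img):
    # Written for a single (x, y, z) image: reduces over axes 0 and 1.
    # On a (batch_size, x, y, z) input these axes would wrongly span
    # the batch and height dimensions.
    mean = tf.reduce_mean(img, axis=[0, 1], keepdims=True)
    return img - mean

def center_per_image_batched(imgs):
    # Batched version: reduce over the spatial axes only, leaving the
    # leading batch dimension intact.
    mean = tf.reduce_mean(imgs, axis=[1, 2], keepdims=True)
    return imgs - mean

batch = tf.random.uniform([4, 8, 8, 3])
single = center_per_image(batch[0])           # per-image result
batched = center_per_image_batched(batch)[0]  # same image, batched path
```

The batched variant should match applying the per-image version to each element; any map function that reduces, reshapes, or indexes by absolute axis needs this kind of adjustment before `map()` is moved after `batch()`.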
Looking forward to your reply. Btw, I'd be glad to create a PR to fix it if you are too busy.
@DLPerf I have tested this in my local environment but found no obvious performance boost. Did you get a good test result yourself? :)