tensorflow / profiler

A profiling and performance analysis tool for TensorFlow

Why is "cpu execution time" > "accelerator execution time" for most ops in the TensorFlow Profiler result?

alphaRGB opened this issue · comments

I profiled an NLP model (implemented with the tf.keras API) on GPU using the tensorflow.python.profiler.model_analyzer.Profiler API. In the profiling result, cpu_execution_time is longer than accelerator_execution_time for most ops, which seems unreasonable; I would expect accelerator_execution_time > cpu_execution_time. I want to know the cause of this. Thanks.

Since the model is implemented with the tf.keras API, in order to use tensorflow.python.profiler.model_analyzer.Profiler I disable eager execution by calling tf.compat.v1.disable_eager_execution() before creating/instantiating the model. I am not sure whether cpu_execution_time > accelerator_execution_time is caused by running the tf.keras model in disabled-eager/graph mode.
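For reference, the call order matters. A minimal sketch (build_model is a hypothetical stand-in for the model constructor):

import tensorflow as tf

# Eager execution must be disabled before the tf.keras model is created,
# otherwise its layers execute eagerly and there is no session graph for
# model_analyzer.Profiler to analyze.
tf.compat.v1.disable_eager_execution()

model = build_model()  # hypothetical constructor for the tf.keras model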

Env

  • tensorflow==2.2.0
  • cuda 10.2
  • python==3.7.4

Profiler output:

Doc:
op: The nodes are operation kernel type, such as MatMul, Conv2D. Graph nodes belonging to the same type are aggregated together.
total execution time: Sum of accelerator execution time and cpu execution time.
cpu execution time: The time from the start to the end of the operation. It's the sum of actual cpu run time plus the time that it spends waiting if part of computation is launched asynchronously.
accelerator execution time: Time spent executing on the accelerator. This is normally measured by the actual hardware library.
occurrence: The number of times it occurs

Profile:
node name | total execution time | accelerator execution time | cpu execution time | op occurrence (run|defined)
Conv2D                        329.31ms (100.00%, 9.26%),     192.08ms (100.00%, 21.12%),      137.23ms (100.00%, 5.18%),  1820|1826
MatMul                         340.03ms (90.74%, 9.56%),      140.30ms (78.88%, 15.43%),       199.74ms (94.82%, 7.55%),  7553|7566
BatchMatMulV2                  242.11ms (81.18%, 6.81%),      133.73ms (63.45%, 14.71%),       108.38ms (87.27%, 4.09%),  1810|3632
Mul                            338.77ms (74.37%, 9.53%),        61.64ms (48.74%, 6.78%),      277.12ms (83.17%, 10.47%), 14205|23722
BiasAdd                        203.19ms (64.84%, 5.71%),        41.75ms (41.97%, 4.59%),       161.45ms (72.70%, 6.10%),  9073|9091
Softmax                        143.22ms (59.13%, 4.03%),        39.23ms (37.38%, 4.31%),       103.99ms (66.60%, 3.93%),  1816|1816
Transpose                      184.86ms (55.10%, 5.20%),        38.88ms (33.06%, 4.28%),       145.98ms (62.67%, 5.52%),  7252|9081
RandomUniform                  121.43ms (49.90%, 3.41%),        34.33ms (28.79%, 3.77%),        87.10ms (57.16%, 3.29%),  4845|4892
AddV2                          192.17ms (46.49%, 5.40%),        30.91ms (25.01%, 3.40%),       161.25ms (53.87%, 6.09%),  9065|9094
GreaterEqual                   106.96ms (41.09%, 3.01%),        21.57ms (21.61%, 2.37%),        85.39ms (47.78%, 3.23%),  4838|4847
Cast                           110.81ms (38.08%, 3.12%),        20.94ms (19.24%, 2.30%),        89.87ms (44.55%, 3.40%),  4839|6057
Mean                           143.99ms (34.96%, 4.05%),        20.66ms (16.94%, 2.27%),       123.33ms (41.15%, 4.66%),  6044|6062
ArgMax                          20.96ms (30.91%, 0.59%),        13.77ms (14.67%, 1.51%),         7.19ms (36.49%, 0.27%),    300|300
SquaredDifference               75.66ms (30.32%, 2.13%),        12.66ms (13.15%, 1.39%),        63.00ms (36.22%, 2.38%),  3022|3031
Sub                             72.71ms (28.20%, 2.04%),        11.79ms (11.76%, 1.30%),        60.93ms (33.84%, 2.30%),  3027|3377
RealDiv                         45.07ms (26.15%, 1.27%),         9.79ms (10.46%, 1.08%),        35.28ms (31.54%, 1.33%),  1816|1816
Rsqrt                           62.40ms (24.88%, 1.75%),          8.73ms (9.39%, 0.96%),        53.67ms (30.21%, 2.03%),  3022|3031
SelectV2                        60.66ms (23.13%, 1.71%),          6.68ms (8.43%, 0.74%),        53.97ms (28.18%, 2.04%),   916|1218
Pad                             30.00ms (21.42%, 0.84%),          4.59ms (7.69%, 0.51%),        25.41ms (26.14%, 0.96%),   898|1502
StridedSlice                    32.86ms (20.58%, 0.92%),          4.16ms (7.19%, 0.46%),        28.70ms (25.18%, 1.08%),  1499|3622
Relu                            19.98ms (19.66%, 0.56%),          3.83ms (6.73%, 0.42%),        16.15ms (24.10%, 0.61%),    910|913
ResourceGather                  10.13ms (19.09%, 0.29%),          1.23ms (6.31%, 0.14%),         8.90ms (23.49%, 0.34%),    302|302


In the above table, only Conv2D and BatchMatMulV2 have accelerator_execution_time > cpu_execution_time; for all other ops, cpu_execution_time > accelerator_execution_time. (The totals are consistent with the doc above, e.g. Conv2D: 192.08ms accelerator + 137.23ms cpu = 329.31ms total.)
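To see where these per-op numbers come from, the raw per-device StepStats in the run_metadata collected by the test code below can be inspected. A minimal sketch, assuming a populated run_metadata; on GPU, kernel times are typically recorded under .../stream:... devices while the launch/scheduling cost appears under the plain CPU/GPU device:

# Minimal sketch: sum the raw per-device op times in the collected trace.
# Assumes run_metadata was filled by sess.run(..., options=run_options,
# run_metadata=run_metadata) as in the test code below.
for dev_stats in run_metadata.step_stats.dev_stats:
    total_micros = sum(ns.all_end_rel_micros for ns in dev_stats.node_stats)
    print('{}: {} ops, {} us'.format(
        dev_stats.device, len(dev_stats.node_stats), total_micros))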


Test code

import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.python.profiler import model_analyzer, option_builder
from tensorflow.python.client import timeline

src = np.ones([1,1,240, 348], dtype=np.float32)
tgt = np.ones([1,61], dtype=np.int32)
src_lengths = np.array([348], dtype=np.int32)

# Input
padded_input = tf.convert_to_tensor(src)
input_lengths = tf.convert_to_tensor(src_lengths)
padded_target = tf.convert_to_tensor(tgt)

# preprocess
# seq_in, seq_out = tf_model.decoder.preprocess(padded_target)
seq_in, seq_out = K.ones([1,1000], dtype=tf.int32), K.ones([1,1000], dtype=tf.int32)
subsequent_mask_ = K.ones((seq_in.shape[1], seq_out.shape[1]), dtype=tf.int8)

    
print('=================Disable Eager================')
# set context: disable eager execution
tf.compat.v1.disable_eager_execution()

# create Input
padded_input_h = tf.compat.v1.ones(shape=padded_input.shape, dtype=padded_input.dtype)
input_lengths_h = tf.compat.v1.constant([348], dtype=tf.int32)
padded_target_h = tf.compat.v1.ones(shape=padded_target.shape, dtype=padded_target.dtype)
seq_in_pad_h = tf.compat.v1.ones(shape=seq_in.shape, dtype=seq_in.dtype)
seq_out_pad_h = tf.compat.v1.ones(shape=seq_out.shape, dtype=seq_out.dtype)
subsequent_mask_h = tf.compat.v1.ones(shape=subsequent_mask_.shape, dtype=subsequent_mask_.dtype)

# Create NLP model forward pass (tf_model is the model under test, defined elsewhere)
model_outs = tf_model(
        padded_input_h, 
        input_lengths_h,
        padded_target_h,
        seq_in_pad=seq_in_pad_h, 
        seq_out_pad=seq_out_pad_h, 
        subsequent_mask_=subsequent_mask_h)

# Create a Session
with tf.compat.v1.Session() as sess:
    # run variable initializers
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(tf.compat.v1.local_variables_initializer())

    # warm-up runs
    for i in range(3):
        outs = sess.run(model_outs)
        print(type(outs))
        print('='*50 + 'warm-up:{}'.format(i + 1) + '='*50)
    print('==============Warm-up done')

    # Create Profiler
    profiler = model_analyzer.Profiler(graph=sess.graph)
    run_options = tf.compat.v1.RunOptions(trace_level=tf.compat.v1.RunOptions.FULL_TRACE)
    # RunMetadata to collect the trace
    run_metadata = tf.compat.v1.RunMetadata()
    # ProfileOptionBuilder
    profile_op_opt_builder = option_builder.ProfileOptionBuilder(
        option_builder.ProfileOptionBuilder.time_and_memory())
    profile_op_opt_builder.select(['micros', 'occurrence'])
    profile_op_opt_builder.order_by('accelerator_micros')
    profile_op_opt_builder.with_max_depth(100000)
    profile_op_opt_builder.with_file_output('profiler_fileoutput_nvidia.txt')

    # Run model once with full tracing
    outs = sess.run(
        model_outs,
        options=run_options,
        run_metadata=run_metadata
    )

    profiler.add_step(1, run_meta=run_metadata)
    profiler.profile_operations(profile_op_opt_builder.build())

    print('='*20 + 'Profile done!' + '='*20)
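    # Optionally dump a Chrome trace from the same run_metadata (open it at
    # chrome://tracing). A minimal sketch using the `timeline` module imported
    # above; 'timeline_nvidia.json' is an arbitrary output path.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline_nvidia.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())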
commented

@ckluk Sorry, I cannot upload a screenshot due to my company's information security regulations, but the profiler output above is the raw output of the TF profiler. I don't use TensorBoard, only the tensorflow.python.profiler.model_analyzer.Profiler API.

commented

@ckluk OK, thank you for your advice. I have another question about the profiler: what is the relationship between the end-to-end execution time, cpu_execution_time, and accelerator_execution_time when I profile a model on GPU?

cpu_execution_time and accelerator_execution_time can be obtained from the Profiler.
The end-to-end execution time is measured as end-to-end execution time = t_end - t_start:

t_start = time.time()
outs = model(dump_input)
# ... wait for GPU and CPU execution to finish
t_end = time.time()
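To make sure the timer includes all pending GPU work, the results can be copied back to the host before stopping the timer. A minimal sketch for eager mode (model and dump_input are placeholders for the real model and input):

import time
import numpy as np
import tensorflow as tf

t_start = time.time()
outs = model(dump_input)
# Eager kernels are dispatched asynchronously; copying results back to the
# host blocks until the GPU has actually finished.
outs_host = [np.asarray(t) for t in tf.nest.flatten(outs)]
t_end = time.time()
print('end-to-end: {:.3f} ms'.format((t_end - t_start) * 1000))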

I have profiled several models and always found that end-to-end execution time < cpu_execution_time + accelerator_execution_time. I know the GPU computes asynchronously: the GPU mainly executes op kernels, while the CPU mainly schedules ops. So I think end-to-end execution time = max(cpu_execution_time, accelerator_execution_time) should be correct. I am not sure whether this conclusion is right, because I have only observed it experimentally and I am not very clear about the internals of the TF Profiler.
