tensorflow / profiler

A profiling and performance analysis tool for TensorFlow

Why is "cpu execution time" > "accelerator execution time" for most ops in the TensorFlow Profiler result?

alphaRGB opened this issue · comments

I profiled an NLP model (implemented with the tf.keras API) on GPU using the tensorflow.python.profiler.model_analyzer.Profiler API. In the profiling result, cpu_execution_time is longer than accelerator_execution_time for most ops, which seems unreasonable; I would expect accelerator_execution_time > cpu_execution_time. I want to know the cause of this. Thanks.

Since the model is implemented with the tf.keras API, in order to use tensorflow.python.profiler.model_analyzer.Profiler I disable eager execution by calling tf.compat.v1.disable_eager_execution() before creating/instantiating the model. I am not sure whether cpu_execution_time > accelerator_execution_time is caused by running the tf.keras model in disabled-eager/graph mode.
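For reference, the call order matters. A minimal sketch (build_model is a hypothetical stand-in for the model constructor):

import tensorflow as tf

# Eager execution must be disabled before the tf.keras model is created,
# otherwise its layers execute eagerly and there is no session graph for
# model_analyzer.Profiler to analyze.
tf.compat.v1.disable_eager_execution()

model = build_model()  # hypothetical constructor for the tf.keras model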

Env

  • tensorflow==2.2.0
  • cuda 10.2
  • python==3.7.4

Profiler output:

Doc:
op: The nodes are operation kernel type, such as MatMul, Conv2D. Graph nodes belonging to the same type are aggregated together.
total execution time: Sum of accelerator execution time and cpu execution time.
cpu execution time: The time from the start to the end of the operation. It's the sum of actual cpu run time plus the time that it spends waiting if part of computation is launched asynchronously.
accelerator execution time: Time spent executing on the accelerator. This is normally measured by the actual hardware library.
occurrence: The number of times it occurs

Profile:
node name | total execution time | accelerator execution time | cpu execution time | op occurrence (run|defined)
Conv2D                        329.31ms (100.00%, 9.26%),     192.08ms (100.00%, 21.12%),      137.23ms (100.00%, 5.18%),  1820|1826
MatMul                         340.03ms (90.74%, 9.56%),      140.30ms (78.88%, 15.43%),       199.74ms (94.82%, 7.55%),  7553|7566
BatchMatMulV2                  242.11ms (81.18%, 6.81%),      133.73ms (63.45%, 14.71%),       108.38ms (87.27%, 4.09%),  1810|3632
Mul                            338.77ms (74.37%, 9.53%),        61.64ms (48.74%, 6.78%),      277.12ms (83.17%, 10.47%), 14205|23722
BiasAdd                        203.19ms (64.84%, 5.71%),        41.75ms (41.97%, 4.59%),       161.45ms (72.70%, 6.10%),  9073|9091
Softmax                        143.22ms (59.13%, 4.03%),        39.23ms (37.38%, 4.31%),       103.99ms (66.60%, 3.93%),  1816|1816
Transpose                      184.86ms (55.10%, 5.20%),        38.88ms (33.06%, 4.28%),       145.98ms (62.67%, 5.52%),  7252|9081
RandomUniform                  121.43ms (49.90%, 3.41%),        34.33ms (28.79%, 3.77%),        87.10ms (57.16%, 3.29%),  4845|4892
AddV2                          192.17ms (46.49%, 5.40%),        30.91ms (25.01%, 3.40%),       161.25ms (53.87%, 6.09%),  9065|9094
GreaterEqual                   106.96ms (41.09%, 3.01%),        21.57ms (21.61%, 2.37%),        85.39ms (47.78%, 3.23%),  4838|4847
Cast                           110.81ms (38.08%, 3.12%),        20.94ms (19.24%, 2.30%),        89.87ms (44.55%, 3.40%),  4839|6057
Mean                           143.99ms (34.96%, 4.05%),        20.66ms (16.94%, 2.27%),       123.33ms (41.15%, 4.66%),  6044|6062
ArgMax                          20.96ms (30.91%, 0.59%),        13.77ms (14.67%, 1.51%),         7.19ms (36.49%, 0.27%),    300|300
SquaredDifference               75.66ms (30.32%, 2.13%),        12.66ms (13.15%, 1.39%),        63.00ms (36.22%, 2.38%),  3022|3031
Sub                             72.71ms (28.20%, 2.04%),        11.79ms (11.76%, 1.30%),        60.93ms (33.84%, 2.30%),  3027|3377
RealDiv                         45.07ms (26.15%, 1.27%),         9.79ms (10.46%, 1.08%),        35.28ms (31.54%, 1.33%),  1816|1816
Rsqrt                           62.40ms (24.88%, 1.75%),          8.73ms (9.39%, 0.96%),        53.67ms (30.21%, 2.03%),  3022|3031
SelectV2                        60.66ms (23.13%, 1.71%),          6.68ms (8.43%, 0.74%),        53.97ms (28.18%, 2.04%),   916|1218
Pad                             30.00ms (21.42%, 0.84%),          4.59ms (7.69%, 0.51%),        25.41ms (26.14%, 0.96%),   898|1502
StridedSlice                    32.86ms (20.58%, 0.92%),          4.16ms (7.19%, 0.46%),        28.70ms (25.18%, 1.08%),  1499|3622
Relu                            19.98ms (19.66%, 0.56%),          3.83ms (6.73%, 0.42%),        16.15ms (24.10%, 0.61%),    910|913
ResourceGather                  10.13ms (19.09%, 0.29%),          1.23ms (6.31%, 0.14%),         8.90ms (23.49%, 0.34%),    302|302


In the above table, only Conv2D and BatchMatMulV2 have accelerator_execution_time > cpu_execution_time; for all other ops, cpu_execution_time > accelerator_execution_time. (The totals are consistent with the doc above, e.g. Conv2D: 192.08ms accelerator + 137.23ms cpu = 329.31ms total.)
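To see where these per-op numbers come from, the raw per-device StepStats in the run_metadata collected by the test code below can be inspected. A minimal sketch, assuming a populated run_metadata; on GPU, kernel times are typically recorded under .../stream:... devices while the launch/scheduling cost appears under the plain CPU/GPU device:

# Minimal sketch: sum the raw per-device op times in the collected trace.
# Assumes run_metadata was filled by sess.run(..., options=run_options,
# run_metadata=run_metadata) as in the test code below.
for dev_stats in run_metadata.step_stats.dev_stats:
    total_micros = sum(ns.all_end_rel_micros for ns in dev_stats.node_stats)
    print('{}: {} ops, {} us'.format(
        dev_stats.device, len(dev_stats.node_stats), total_micros))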


Test code

import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.python.profiler import model_analyzer, option_builder
from tensorflow.python.client import timeline

src = np.ones([1,1,240, 348], dtype=np.float32)
tgt = np.ones([1,61], dtype=np.int32)
src_lengths = np.array([348], dtype=np.int32)

# Input
padded_input = tf.convert_to_tensor(src)
input_lengths = tf.convert_to_tensor(src_lengths)
padded_target = tf.convert_to_tensor(tgt)

# preprocess
# seq_in, seq_out = tf_model.decoder.preprocess(padded_target)
seq_in, seq_out = K.ones([1,1000], dtype=tf.int32), K.ones([1,1000], dtype=tf.int32)
subsequent_mask_ = K.ones((seq_in.shape[1], seq_out.shape[1]), dtype=tf.int8)

    
print('=================Disable Eager================')
# set context: disable eager execution
tf.compat.v1.disable_eager_execution()

# create Input
padded_input_h = tf.compat.v1.ones(shape=padded_input.shape, dtype=padded_input.dtype)
input_lengths_h = tf.compat.v1.constant([348], dtype=tf.int32)
padded_target_h = tf.compat.v1.ones(shape=padded_target.shape, dtype=padded_target.dtype)
seq_in_pad_h = tf.compat.v1.ones(shape=seq_in.shape, dtype=seq_in.dtype)
seq_out_pad_h = tf.compat.v1.ones(shape=seq_out.shape, dtype=seq_out.dtype)
subsequent_mask_h = tf.compat.v1.ones(shape=subsequent_mask_.shape, dtype=subsequent_mask_.dtype)

# Create NLP model forward pass (tf_model is the model under test, defined elsewhere)
model_outs = tf_model(
        padded_input_h, 
        input_lengths_h,
        padded_target_h,
        seq_in_pad=seq_in_pad_h, 
        seq_out_pad=seq_out_pad_h, 
        subsequent_mask_=subsequent_mask_h)

# Create a Session
with tf.compat.v1.Session() as sess:
    # run variable initializers
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(tf.compat.v1.local_variables_initializer())

    # warm-up runs
    for i in range(3):
        outs = sess.run(model_outs)
        print(type(outs))
        print('='*50 + 'warm-up:{}'.format(i + 1) + '='*50)
    print('==============Warm-up done')

    # Create Profiler
    profiler = model_analyzer.Profiler(graph=sess.graph)
    run_options = tf.compat.v1.RunOptions(trace_level=tf.compat.v1.RunOptions.FULL_TRACE)
    # RunMetadata to collect the trace
    run_metadata = tf.compat.v1.RunMetadata()
    # ProfileOptionBuilder
    profile_op_opt_builder = option_builder.ProfileOptionBuilder(
        option_builder.ProfileOptionBuilder.time_and_memory())
    profile_op_opt_builder.select(['micros', 'occurrence'])
    profile_op_opt_builder.order_by('accelerator_micros')
    profile_op_opt_builder.with_max_depth(100000)
    profile_op_opt_builder.with_file_output('profiler_fileoutput_nvidia.txt')

    # Run model once with full tracing
    outs = sess.run(
        model_outs,
        options=run_options,
        run_metadata=run_metadata
    )

    profiler.add_step(1, run_meta=run_metadata)
    profiler.profile_operations(profile_op_opt_builder.build())

    print('='*20 + 'Profile done!' + '='*20)
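    # Optionally dump a Chrome trace from the same run_metadata (open it at
    # chrome://tracing). A minimal sketch using the `timeline` module imported
    # above; 'timeline_nvidia.json' is an arbitrary output path.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline_nvidia.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())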
commented

@ckluk Sorry, I cannot upload a screenshot due to my company's information security regulations, but the profiler output above is the raw output of the TF profiler. I don't use TensorBoard, only the tensorflow.python.profiler.model_analyzer.Profiler API.

commented

@ckluk OK, thank you for your advice. I have another question about the profiler: what is the relationship between the end-to-end execution time, cpu_execution_time, and accelerator_execution_time when I profile a model on GPU?

cpu_execution_time and accelerator_execution_time can be obtained from the Profiler.
The end-to-end execution time is measured as end-to-end execution time = t_end - t_start:

t_start = time.time()
outs = model(dump_input)
# ... wait for GPU and CPU execution to finish
t_end = time.time()
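To make sure the timer includes all pending GPU work, the results can be copied back to the host before stopping the timer. A minimal sketch for eager mode (model and dump_input are placeholders for the real model and input):

import time
import numpy as np
import tensorflow as tf

t_start = time.time()
outs = model(dump_input)
# Eager kernels are dispatched asynchronously; copying results back to the
# host blocks until the GPU has actually finished.
outs_host = [np.asarray(t) for t in tf.nest.flatten(outs)]
t_end = time.time()
print('end-to-end: {:.3f} ms'.format((t_end - t_start) * 1000))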

I have profiled several models and always found that end-to-end execution time < cpu_execution_time + accelerator_execution_time. I know the GPU computes asynchronously: the GPU mainly executes op kernels, while the CPU mainly schedules ops. So I think end-to-end execution time = max(cpu_execution_time, accelerator_execution_time) should be correct. I am not sure whether this conclusion is right, because I have only observed it experimentally and I am not very clear about the internals of the TF Profiler.
