Why do most ops show "cpu execution time" > "accelerator execution time" in the TensorFlow Profiler result?
alphaRGB opened this issue · comments
I profiled an NLP model (implemented with the tf.keras API) using the tensorflow.python.profiler.model_analyzer.Profiler
API on GPU. In the profiling result, cpu_execution_time is longer than accelerator_execution_time for most ops, which seems unreasonable to me. I would expect accelerator_execution_time > cpu_execution_time. I would like to know the cause of this, thanks.
Since the model is implemented with the tf.keras API, in order to use tensorflow.python.profiler.model_analyzer.Profiler
I call tf.compat.v1.disable_eager_execution()
before instantiating the model. I am not sure whether cpu_execution_time > accelerator_execution_time is caused by running the tf.keras
model in graph mode (eager execution disabled).
Env
- tensorflow==2.2.0
- cuda 10.2
- python==3.7.4
Profiler output:
Doc:
op: The nodes are operation kernel type, such as MatMul, Conv2D. Graph nodes belonging to the same type are aggregated together.
total execution time: Sum of accelerator execution time and cpu execution time.
cpu execution time: The time from the start to the end of the operation. It's the sum of actual cpu run time plus the time that it spends waiting if part of computation is launched asynchronously.
accelerator execution time: Time spent executing on the accelerator. This is normally measured by the actual hardware library.
occurrence: The number of times it occurs
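The cpu execution time definition above is worth a concrete illustration: when an op launches its kernel asynchronously and then waits on it, the CPU-side measurement covers the accelerator's work too. A toy sketch in plain Python (a worker thread stands in for the GPU stream; the 50 ms kernel duration is hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def accelerator_kernel():
    # stand-in for a GPU kernel (hypothetical 50 ms of device work)
    time.sleep(0.05)

stream = ThreadPoolExecutor(max_workers=1)

t0 = time.perf_counter()
fut = stream.submit(accelerator_kernel)   # asynchronous launch: returns at once
launch_cost = time.perf_counter() - t0    # just the enqueue overhead

fut.result()                              # the op's CPU side blocks until done
cpu_side_time = time.perf_counter() - t0  # includes the accelerator's 50 ms

print(launch_cost < cpu_side_time)        # True: the launch alone is cheap
print(cpu_side_time >= 0.05)              # True: CPU-side time covers the kernel
```

Under this definition, a CPU-side number larger than the pure kernel time is expected whenever the op waits on asynchronously launched work.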
Profile:
node name | total execution time | accelerator execution time | cpu execution time | op occurrence (run|defined)
Conv2D 329.31ms (100.00%, 9.26%), 192.08ms (100.00%, 21.12%), 137.23ms (100.00%, 5.18%), 1820|1826
MatMul 340.03ms (90.74%, 9.56%), 140.30ms (78.88%, 15.43%), 199.74ms (94.82%, 7.55%), 7553|7566
BatchMatMulV2 242.11ms (81.18%, 6.81%), 133.73ms (63.45%, 14.71%), 108.38ms (87.27%, 4.09%), 1810|3632
Mul 338.77ms (74.37%, 9.53%), 61.64ms (48.74%, 6.78%), 277.12ms (83.17%, 10.47%), 14205|23722
BiasAdd 203.19ms (64.84%, 5.71%), 41.75ms (41.97%, 4.59%), 161.45ms (72.70%, 6.10%), 9073|9091
Softmax 143.22ms (59.13%, 4.03%), 39.23ms (37.38%, 4.31%), 103.99ms (66.60%, 3.93%), 1816|1816
Transpose 184.86ms (55.10%, 5.20%), 38.88ms (33.06%, 4.28%), 145.98ms (62.67%, 5.52%), 7252|9081
RandomUniform 121.43ms (49.90%, 3.41%), 34.33ms (28.79%, 3.77%), 87.10ms (57.16%, 3.29%), 4845|4892
AddV2 192.17ms (46.49%, 5.40%), 30.91ms (25.01%, 3.40%), 161.25ms (53.87%, 6.09%), 9065|9094
GreaterEqual 106.96ms (41.09%, 3.01%), 21.57ms (21.61%, 2.37%), 85.39ms (47.78%, 3.23%), 4838|4847
Cast 110.81ms (38.08%, 3.12%), 20.94ms (19.24%, 2.30%), 89.87ms (44.55%, 3.40%), 4839|6057
Mean 143.99ms (34.96%, 4.05%), 20.66ms (16.94%, 2.27%), 123.33ms (41.15%, 4.66%), 6044|6062
ArgMax 20.96ms (30.91%, 0.59%), 13.77ms (14.67%, 1.51%), 7.19ms (36.49%, 0.27%), 300|300
SquaredDifference 75.66ms (30.32%, 2.13%), 12.66ms (13.15%, 1.39%), 63.00ms (36.22%, 2.38%), 3022|3031
Sub 72.71ms (28.20%, 2.04%), 11.79ms (11.76%, 1.30%), 60.93ms (33.84%, 2.30%), 3027|3377
RealDiv 45.07ms (26.15%, 1.27%), 9.79ms (10.46%, 1.08%), 35.28ms (31.54%, 1.33%), 1816|1816
Rsqrt 62.40ms (24.88%, 1.75%), 8.73ms (9.39%, 0.96%), 53.67ms (30.21%, 2.03%), 3022|3031
SelectV2 60.66ms (23.13%, 1.71%), 6.68ms (8.43%, 0.74%), 53.97ms (28.18%, 2.04%), 916|1218
Pad 30.00ms (21.42%, 0.84%), 4.59ms (7.69%, 0.51%), 25.41ms (26.14%, 0.96%), 898|1502
StridedSlice 32.86ms (20.58%, 0.92%), 4.16ms (7.19%, 0.46%), 28.70ms (25.18%, 1.08%), 1499|3622
Relu 19.98ms (19.66%, 0.56%), 3.83ms (6.73%, 0.42%), 16.15ms (24.10%, 0.61%), 910|913
ResourceGather 10.13ms (19.09%, 0.29%), 1.23ms (6.31%, 0.14%), 8.90ms (23.49%, 0.34%), 302|302
In the table above, only Conv2D and BatchMatMulV2 have accelerator_execution_time > cpu_execution_time; for every other op, cpu_execution_time > accelerator_execution_time.
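One plausible reading of the table: the CPU-side cost per launch (scheduling plus, per the doc above, any wait on asynchronous work) is roughly constant per op, while small elementwise kernels finish in microseconds on the GPU, so the launch cost easily dominates. A quick back-of-the-envelope using the totals and run counts from the rows above:

```python
# Per-occurrence averages computed from the table totals (Conv2D, Mul, BiasAdd rows).
# Tuples are (accelerator_ms_total, cpu_ms_total, run_count).
rows = {
    "Conv2D":  (192.08, 137.23, 1820),
    "Mul":     (61.64, 277.12, 14205),
    "BiasAdd": (41.75, 161.45, 9073),
}
for name, (acc_ms, cpu_ms, n) in rows.items():
    print("%s: %.1f us/op on accelerator, %.1f us/op on cpu"
          % (name, acc_ms / n * 1000, cpu_ms / n * 1000))
```

Conv2D averages roughly 105 us of kernel time per call, comfortably above its ~75 us CPU side, whereas Mul and BiasAdd kernels finish in ~4-5 us each while their CPU side is ~18-20 us per call, which matches the pattern in the table.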
Test code
```python
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.python.profiler import model_analyzer, option_builder

src = np.ones([1, 1, 240, 348], dtype=np.float32)
tgt = np.ones([1, 61], dtype=np.int32)
src_lengths = np.array([348], dtype=np.int32)

# Input
padded_input = tf.convert_to_tensor(src)
input_lengths = tf.convert_to_tensor(src_lengths)
padded_target = tf.convert_to_tensor(tgt)

# preprocess
# seq_in, seq_out = tf_model.decoder.preprocess(padded_target)
seq_in, seq_out = K.ones([1, 1000], dtype=tf.int32), K.ones([1, 1000], dtype=tf.int32)
subsequent_mask_ = K.ones((seq_in.shape[1], seq_out.shape[1]), dtype=tf.int8)

print('=================Disable Eager================')
# Disable eager execution (must happen before the model is instantiated)
tf.compat.v1.disable_eager_execution()

# Create graph-mode inputs
padded_input_h = tf.compat.v1.ones(shape=padded_input.shape, dtype=padded_input.dtype)
input_lengths_h = tf.compat.v1.constant([348], dtype=tf.int32)
padded_target_h = tf.compat.v1.ones(shape=padded_target.shape, dtype=padded_target.dtype)
seq_in_pad_h = tf.compat.v1.ones(shape=seq_in.shape, dtype=seq_in.dtype)
seq_out_pad_h = tf.compat.v1.ones(shape=seq_out.shape, dtype=seq_out.dtype)
subsequent_mask_h = tf.compat.v1.ones(shape=subsequent_mask_.shape, dtype=subsequent_mask_.dtype)

# Build the NLP model graph (tf_model is the tf.keras model under test, defined elsewhere)
model_outs = tf_model(
    padded_input_h,
    input_lengths_h,
    padded_target_h,
    seq_in_pad=seq_in_pad_h,
    seq_out_pad=seq_out_pad_h,
    subsequent_mask_=subsequent_mask_h)

# Create a session
with tf.compat.v1.Session() as sess:
    # Run init ops
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(tf.compat.v1.local_variables_initializer())

    # Warm-up
    for i in range(3):
        outs = sess.run(model_outs)
        print(type(outs))
        print('=' * 50 + 'warm-up:{}'.format(i + 1) + '=' * 50)
    print('==============Warm-up done')

    # Create the profiler and tracing objects
    profiler = model_analyzer.Profiler(graph=sess.graph)
    run_options = tf.compat.v1.RunOptions(trace_level=tf.compat.v1.RunOptions.FULL_TRACE)
    run_metadata = tf.compat.v1.RunMetadata()

    # Build profiling options: per-op time/memory, sorted by accelerator time
    profile_op_opt_builder = option_builder.ProfileOptionBuilder(
        option_builder.ProfileOptionBuilder.time_and_memory())
    profile_op_opt_builder.select(['micros', 'occurrence'])
    profile_op_opt_builder.order_by('accelerator_micros')
    profile_op_opt_builder.with_max_depth(100000)
    profile_op_opt_builder.with_file_output('profiler_fileoutput_nvidia.txt')

    # Run the model once with full tracing and profile the step
    outs = sess.run(
        model_outs,
        options=run_options,
        run_metadata=run_metadata
    )
    profiler.add_step(1, run_meta=run_metadata)
    profiler.profile_operations(profile_op_opt_builder.build())
    print('=' * 20 + 'Profile done!' + '=' * 20)
```
@ckluk Sorry, I cannot upload a screenshot due to my company's information security regulations, but the profiler output above is the raw output of the TF profiler. I don't use TensorBoard, only the tensorflow.python.profiler.model_analyzer.Profiler
API.
@ckluk OK, thank you for the advice. I have another question about the profiler: what is the relationship between end-to-end execution time, cpu_execution_time, and accelerator_execution_time when I profile a model on GPU?
cpu_execution_time and accelerator_execution_time can be obtained from the Profiler.
The end-to-end execution time is measured as t_end - t_start:

```python
t_start = time.time()
outs = model(dump_input)
# ... wait for GPU and CPU execution to finish
t_end = time.time()
```
Across several models I have profiled, I always find end-to-end execution time < cpu_execution_time + accelerator_execution_time. I know the GPU computes asynchronously: the GPU mainly executes op kernels, while the CPU mainly schedules ops. So I think end-to-end execution time ≈ max(cpu_execution_time, accelerator_execution_time) should hold. I am not sure whether this conclusion is correct, because I have only observed it experimentally and I am not clear about the internals of the TF Profiler.
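One consistent explanation, given the doc quoted earlier (cpu execution time includes time an op spends waiting on asynchronously launched work), is that the CPU and accelerator figures partly double-count the same wall-clock interval, so their sum can exceed the end-to-end time. A toy sketch with plain Python threads (timings hypothetical, a worker thread standing in for the GPU stream) showing that overlapped execution lands near the max rather than the sum:

```python
import time
from concurrent.futures import ThreadPoolExecutor

CPU_SCHED = 0.03    # hypothetical total CPU scheduling work
GPU_KERNELS = 0.06  # hypothetical total accelerator work

stream = ThreadPoolExecutor(max_workers=1)

t0 = time.perf_counter()
fut = stream.submit(time.sleep, GPU_KERNELS)  # kernels run asynchronously
time.sleep(CPU_SCHED)                         # CPU schedules the next ops meanwhile
fut.result()                                  # final sync before reading results
end_to_end = time.perf_counter() - t0

print(end_to_end >= max(CPU_SCHED, GPU_KERNELS))  # True: at least the max
print(end_to_end < CPU_SCHED + GPU_KERNELS)       # typically below the plain sum
```

This is only a sketch of the overlap, not how the TF profiler measures time internally; whether end-to-end time really tracks max(cpu, accelerator) depends on how much CPU scheduling actually overlaps with kernel execution in a given model.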