tensorflow / profiler

A profiling and performance analysis tool for TensorFlow

Wrong X axis on profiler's Step-time Graph

andreykramer opened this issue · comments

Based on the TensorFlow guide "Writing a training loop from scratch" I've created a reproducible example (see bottom) showing that I can't seem to get the step numbers on the Step-time Graph right. I adapted the tf.profiler.experimental.Trace example to trace steps [20, 29] of my training loop. The trace itself is correct:
[screenshot: trace view showing the traced steps]

But on the step-time graph in the overview page, the range of the X axis is [1,8]:
[screenshot: Step-time Graph with X axis from 1 to 8]

It's even worse in the actual code I'm trying to get the profiler working on, where I trace steps [20, 29] in the same way, but the resulting Step-time Graph looks like this:
[screenshot: Step-time Graph from the actual code]

Am I getting something wrong? Where does the [1, 8] range come from?

Thank you in advance.

Here's the code for the reproducible example:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import os

summary_dir = "./summaries/train"
os.makedirs(summary_dir, exist_ok=True)
summary_writer = tf.summary.create_file_writer(summary_dir)

"""
## Low-level handling of metrics
Let's add metrics monitoring to this basic loop.
You can readily reuse the built-in metrics (or custom ones you wrote) in such training
loops written from scratch. Here's the flow:
- Instantiate the metric at the start of the loop
- Call `metric.update_state()` after each batch
- Call `metric.result()` when you need to display the current value of the metric
- Call `metric.reset_states()` when you need to clear the state of the metric
(typically at the end of an epoch)
Let's use this knowledge to compute `SparseCategoricalAccuracy` on validation data at
the end of each epoch:
"""

# Get model
inputs = keras.Input(shape=(784,), name="digits")
x = layers.Dense(64, activation="relu", name="dense_1")(inputs)
x = layers.Dense(64, activation="relu", name="dense_2")(x)
outputs = layers.Dense(10, name="predictions")(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Instantiate an optimizer to train the model.
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
# Instantiate a loss function.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Prepare the training dataset.
batch_size = 64
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = np.reshape(x_train, (-1, 784))
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)
train_dataset = iter(train_dataset)


# Prepare the metrics.
train_acc_metric = keras.metrics.SparseCategoricalAccuracy()
val_acc_metric = keras.metrics.SparseCategoricalAccuracy()

"""
## Speeding-up your training step with `tf.function`
The default runtime in TensorFlow 2.0 is
[eager execution](https://www.tensorflow.org/guide/eager). As such, our training loop
above executes eagerly.
This is great for debugging, but graph compilation has a definite performance
advantage. Describing your computation as a static graph enables the framework
to apply global performance optimizations. This is impossible when
the framework is constrained to greedily execute one operation after another,
with no knowledge of what comes next.
You can compile into a static graph any function that takes tensors as input.
Just add a `@tf.function` decorator on it, like this:
"""


@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value

"""
Now, let's re-run our training loop with this compiled training step:
"""
# Iterate over the batches of the dataset.
for step in range(100):
    if step == 20:
        tf.profiler.experimental.start(summary_dir)
    elif step == 30:
        tf.profiler.experimental.stop()
    with tf.profiler.experimental.Trace('train', step_num=step, _r=1):
        x_batch_train, y_batch_train = next(train_dataset)

        loss_value = train_step(x_batch_train, y_batch_train)

        print(
            "Step %d:   Loss     %.4f"
            % (step, float(loss_value))
        )


# Display metrics at the end of each epoch.
train_acc = train_acc_metric.result()
print("Training acc over epoch: %.4f" % (float(train_acc),))

# Reset training metrics at the end of each epoch
train_acc_metric.reset_states()

I found the same issue with no need for a custom loop, just by calling fit() on a Keras model, like:

    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                                          histogram_freq=1,
                                                          profile_batch=1)

    history = model.fit(dataset_train,
                        epochs=epochs,
                        validation_data=dataset_val,
                        shuffle=False,
                        callbacks=[tensorboard_callback],
                        verbose=1)
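
For completeness, a minimal self-contained version of this setup would be roughly the sketch below; the model, dataset, and log_dir here are placeholders for illustration, not my actual code:

    import tensorflow as tf
    from tensorflow import keras

    # All names below (log_dir, dataset_train, model) are placeholders.
    log_dir = "./logs/profile_demo"

    # Tiny stand-in dataset and model, just to make the snippet runnable.
    (x_train, y_train), _ = keras.datasets.mnist.load_data()
    dataset_train = tf.data.Dataset.from_tensor_slices(
        (x_train.reshape(-1, 784).astype("float32") / 255.0, y_train)).batch(128)

    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(784,)),
        keras.layers.Dense(10),
    ])
    model.compile(optimizer="sgd",
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))

    # profile_batch=1 asks the callback to profile only the first batch.
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                                          histogram_freq=1,
                                                          profile_batch=1)

    history = model.fit(dataset_train,
                        epochs=1,
                        callbacks=[tensorboard_callback],
                        verbose=1)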

The expected result in the TensorBoard PROFILE tab is a Step-time Graph with only one step; instead I get a chart with steps from 18 to 105. See screenshot below. I will see if I can reduce the code to a smaller sample that still reproduces the issue.

[CORRECTION] In the meantime, I have downloaded and run the Colab sample locally, and it DID reproduce the issue: https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras

Using TF 2.4.1 and TB 2.4.1 installed with pip under Ubuntu 20.04 with CUDA 11.0.

[screenshot: Step-time Graph showing steps 18 to 105]

The problem can be reproduced by running the Colab example https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras

Steps to reproduce:

  1. Open the Colab example
  2. Runtime > Run all
  3. After execution has completed, scroll down to either TensorBoard visualization
  4. In TensorBoard, select PROFILE (you may have to find it under INACTIVE)
  5. From the Step-time Graph of PROFILE, open the Run (2) drop-down menu, and select the second run in the list (which actually was the first to run)
  6. The Step-time Graph is wrong, with incorrect step numbers

The difference between the two runs is in the Dataset pipeline. The first run, which shows the issue, builds the pipeline like:

ds_train = ds_train.map(normalize_img)
ds_train = ds_train.batch(128)

I have found that, if I add a cache() operation after the map() or the batch(), the Step-time Graph in TensorBoard seems correct.
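
Concretely, the workaround looks roughly like this (same ds_train and normalize_img as in the Colab, which I'm assuming are unchanged):

ds_train = ds_train.map(normalize_img)
ds_train = ds_train.cache()   # adding cache() here (or after batch()) makes the graph look right for me
ds_train = ds_train.batch(128)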

[screenshot: Step-time Graph]

I'm seeing the same. Using a profile_batch of 50,60 shows step numbers of 199, 392, 583, 778, 973, 1172, 1333, 1554, 1741.

Those numbers match the number of profiled steps (9), but the values themselves look completely random and vary from run to run.
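
In other words, something like the sketch below, passing a batch range to the TensorBoard callback (log_dir here is just a placeholder):

import tensorflow as tf

# My reading of "a profile_batch of 50,60": a range of batches to profile.
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir="./logs/profile_range",
    profile_batch=(50, 60))  # profile batches 50 through 60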

I can also see the same inconsistency in the step numbers on the overview page for my model code. It is a custom training loop similar to @andreykramer's.
[screenshot: overview page Step-time Graph]
I used the tf.profiler.experimental.Trace API to trace steps [48, 56) of my training loop. The trace events are correct, but the overview page graph is not. Also, if I change the number of GPUs then the step numbers on the overview page graph also change. Using TF 2.5.0 and CUDA 11.2.

I'm having the same issue. I'm also using tf.profiler.experimental to trace a script and the step numbers that appear seem really random.

I tried to run the example at https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras and had the same problem.