tensorflow / tpu

Reference models and tools for Cloud TPUs.

Home Page: https://cloud.google.com/tpu/


[Question] Expected throughput of Cloud TPU on embedding lookup?

pmixer opened this issue · comments

(This duplicates the question in tensorflow/recommenders#579; I'm just not sure which repo is the better place for this kind of question.)
Hi,

I recently read this blog post, https://cloud.google.com/blog/topics/developers-practitioners/building-large-scale-recommenders-using-cloud-tpus, found it very interesting, and I'm wondering about the raw performance of TPUEmbedding lookups. (It's quite easy to get performance numbers for tf.nn.(safe_)embedding_lookup(_sparse) and friends, but it's much harder to get TPUEmbedding lookup numbers.)
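
For context, here is roughly what I mean by "easily" benchmarking the plain ops: a minimal sketch on the default (CPU/GPU) runtime, reusing the table size, key count, and dtype from the TPU script below (not my exact script, and the exact numbers are not the point):

import time
import numpy as np
import tensorflow as tf

TABLE_SIZE = 1000000
EMB_DIM = 128
QUERY_KEY_NUM = 65536 * 8

# Dense float32 table and random int32 query keys.
table = tf.Variable(tf.random.uniform([TABLE_SIZE, EMB_DIM], dtype=tf.float32))
ids = tf.constant(np.random.choice(TABLE_SIZE, QUERY_KEY_NUM).astype(np.int32))

@tf.function
def lookup(ids):
  return tf.nn.embedding_lookup(table, ids)

lookup(ids)  # warmup / tracing

t_start = time.time()
result = lookup(ids).numpy()  # .numpy() forces the lookup to finish before stopping the clock
t_end = time.time()

# 4 bytes per float32 element gathered
print("embedding_lookup throughput: {:.3f}GB/s".format(
    QUERY_KEY_NUM * EMB_DIM * 4 / 1e9 / (t_end - t_start)))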

Based on the test script included in this repo, I wrote a piece of benchmarking code to test it:

# Copyright 2022 The TensorFlow Recommenders Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# https://github.com/tensorflow/recommenders/blob/main/tensorflow_recommenders/layers/embedding/tpu_embedding_layer_test.py
# Trying to measure TPU embedding lookup throughput; I could not find a lower-level API for this kind of test.

import time
import numpy as np
import tensorflow as tf

from tensorflow_recommenders.layers.embedding import tpu_embedding_layer

TABLE_SIZE = 1000000
EMB_DIM = 128
QUERY_KEY_NUM = 65536 * 8  # * 64 got killed: "Allocation of xxx exceeds 10% of free system memory."

class TPUEmbeddingLayerTest():

  def __init__(self):

    self.embedding_values = np.arange(TABLE_SIZE * EMB_DIM, dtype=np.float32)
    self.initializer = tf.constant_initializer(self.embedding_values)

    self.table_config = tf.tpu.experimental.embedding.TableConfig(
                                            vocabulary_size=TABLE_SIZE,
                                            dim=EMB_DIM,
                                            initializer=self.initializer,
                                            combiner='sum',
                                            name='embedding_table')

    self.feature_config = {
        'indices2embeddings': tf.tpu.experimental.embedding.FeatureConfig(
            table=self.table_config, name='indices2embeddings'),
    }

    self.batch_size = QUERY_KEY_NUM
    self.sample_size = 1

    # TODO(pehuang): draw samples randomly from a given distribution
    self.data_point_indices = np.zeros((self.batch_size, 2), dtype=np.int32)
    self.data_point_indices[:, 0] = np.arange(self.batch_size, dtype=np.int32)
    self.data_points = np.random.choice(TABLE_SIZE, QUERY_KEY_NUM)

    self.embedding_lookup_input_data = tf.SparseTensor(
        indices=self.data_point_indices,
        values=tf.convert_to_tensor(self.data_points, dtype=tf.int32), # fp64 embedding and int32 key by default?
        dense_shape=[self.batch_size, self.sample_size])

    self.dataset = tf.data.Dataset.from_tensors({'indices2embeddings': self.embedding_lookup_input_data})

  def embedding_lookup_throughput_test(self, optimizer_name='sgd', training=False):
    # resolver = tf.distribute.cluster_resolver.TPUClusterResolver('').connect('')
    # strategy = tf.distribute.TPUStrategy(resolver)
    strategy = tf.distribute.get_strategy() # Use the default strategy.

    with strategy.scope():
      embedding_layer = tpu_embedding_layer.TPUEmbedding(feature_config=self.feature_config, optimizer=None)
      input_args = {'batch_size': self.batch_size,
                    'shape': (),
                    'sparse': True,
                    'dtype': tf.int32}
      inputs = {'indices2embeddings': tf.keras.Input(**input_args, name='indices2embeddings')}
      embeddings = embedding_layer(inputs)
      self.model = tf.keras.Model(inputs=(inputs), outputs=(embeddings))

      dist = strategy.experimental_distribute_dataset(self.dataset, options=tf.distribute.InputOptions(experimental_fetch_to_device=False))
      dist_iter = iter(dist)

      def lookup(features):
        res = self.model(features)
        return res

      #  for _ in range(10): # warming up for 10 rounds fails once the single batch of data is used up, raising StopIteration
      #  result = strategy.run(lookup, args=(next(dist_iter),))

      t_start = time.time()
      result = strategy.run(lookup, args=(next(dist_iter),))
      t_end = time.time()
      #  import pdb; pdb.set_trace() # stop to check embeddings
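      # NOTE: the formula below counts float32 elements, not bytes; multiply by 4 for bytes.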
      print("embedding throughput: {}GB/s".format((QUERY_KEY_NUM * EMB_DIM) / 1e9 / (t_end - t_start)))
      # embedding throughput: 0.04273784880155698GB/s; this must include cold-start, D2H and H2D transfers, etc.,
      # but I cannot warm up or increase the table size for now, since I'm using the Cloud Shell provided TPU


if __name__ == '__main__':
  test = TPUEmbeddingLayerTest()
  test.embedding_lookup_throughput_test()

I used Cloud Shell to get access to a TPU as described in https://github.com/tensorflow/tpu, but the reported throughput is far below expectation (obviously due to various issues in my benchmark script). Could anyone help correct the script so the benchmark is run the right way? Or just provide some reference data, so I know what throughput to expect?
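
One issue I already suspect (just my guess): the dataset is built with from_tensors, so it only yields a single batch, which is why the warmup loop above dies with StopIteration. Repeating the dataset should at least allow warmup plus a separately timed run, along the lines of the sketch below (not verified beyond the default strategy on my side):

# Hypothetical tweak in __init__: repeat the single batch so the iterator can be drawn from more than once.
self.dataset = tf.data.Dataset.from_tensors(
    {'indices2embeddings': self.embedding_lookup_input_data}).repeat()

# ... and in embedding_lookup_throughput_test, warm up before timing:
for _ in range(10):
  strategy.run(lookup, args=(next(dist_iter),))

t_start = time.time()
result = strategy.run(lookup, args=(next(dist_iter),))
t_end = time.time()

The repeated batch looks up the same keys every time, though, so caching would probably flatter the number.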

THX!