GoogleCloudPlatform / mlops-on-gcp


Lengthy run-time for lab-02-tfx-pipeline

dougkelly opened this issue

Ran through lab-02-tfx-pipeline 3 times with the following run-times:

  • 1 hr 46 min
  • 2 hr 6 min
  • 1 hr 54 min

I was a bit concerned by this runtime on a small dataset (~500k examples), both for delivery and for motivating the use of CAIP Pipelines over the existing CAIP training and prediction services, so I wanted to flag it and discuss improvement opportunities.

There are a lot of deprecation warnings and non-fatal errors in the logs. I am still learning the KFP interface compared to the SmartEngine UI, so I wasn't sure how to view the runtimes of individual components in order to profile them. From what I can tell, the ordering by wall time is Trainer > Evaluator > Transform > CsvExampleGen.
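For reference, one way to get per-component wall times programmatically (a hedged sketch, not an official workflow): KFP runs execute as Argo workflows, and the run's workflow manifest records start/finish timestamps per node. Assuming a manifest JSON of the shape returned by `kfp.Client().get_run(run_id).pipeline_runtime.workflow_manifest`, the durations could be computed like this (the node names and timestamps below are illustrative, not from the lab):

```python
import json
from datetime import datetime

# Minimal illustrative sample of the Argo workflow manifest shape that
# kfp.Client().get_run(run_id).pipeline_runtime.workflow_manifest returns.
# Node names and timestamps here are made up for the example.
manifest = json.dumps({
    "status": {
        "nodes": {
            "run-abc-trainer": {
                "displayName": "Trainer",
                "startedAt": "2020-05-01T10:00:00Z",
                "finishedAt": "2020-05-01T11:10:00Z",
            },
            "run-abc-csvexamplegen": {
                "displayName": "CsvExampleGen",
                "startedAt": "2020-05-01T09:50:00Z",
                "finishedAt": "2020-05-01T10:00:00Z",
            },
        }
    }
})

def component_wall_times(workflow_manifest: str) -> dict:
    """Return {component display name: wall time in minutes}."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    nodes = json.loads(workflow_manifest)["status"]["nodes"]
    times = {}
    for node in nodes.values():
        started = datetime.strptime(node["startedAt"], fmt)
        finished = datetime.strptime(node["finishedAt"], fmt)
        times[node["displayName"]] = (finished - started).total_seconds() / 60
    return times

# Components sorted slowest-first, mirroring the ordering observed above.
print(sorted(component_wall_times(manifest).items(), key=lambda kv: -kv[1]))
```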

To improve performance, are there opportunities to:

  • Add additional worker machines / accelerators (GPU, TPU) to the Trainer?
  • Add additional worker machines to the Evaluator?
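To make the first bullet concrete, here is a hedged sketch of the scale-out knobs available when the Trainer runs on AI Platform Training. The dict keys follow AI Platform's TrainingInput spec (scaleTier, masterType, workerType, workerCount, acceleratorConfig); the project, region, machine types, accelerator, and worker count shown are illustrative assumptions, not values from the lab:

```python
# Illustrative scale-out config for a Trainer running on AI Platform
# Training. Keys follow the AI Platform TrainingInput spec; the specific
# values (project, region, machine types, counts) are placeholders.
ai_platform_training_args = {
    "project": "my-project",      # placeholder project ID
    "region": "us-central1",      # placeholder region
    "scaleTier": "CUSTOM",        # required to specify machine types below
    "masterType": "n1-standard-8",
    "masterConfig": {
        "acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}
    },
    "workerType": "n1-standard-8",
    "workerCount": 2,
}

# In a TFX pipeline using the AI Platform Trainer executor, a dict like
# this would typically be passed to the Trainer component's custom_config
# under the executor's training-args key, e.g.:
#   custom_config={ai_platform_trainer_executor.TRAINING_ARGS_KEY:
#                  ai_platform_training_args}
```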

I see the GKE cluster created has 2 nodes, with autoscaling enabled for up to 5. The cluster looks to be well within its memory and CPU limits, but one of the nodes did run an autoscaler pod. This guide https://cloud.google.com/ai-platform/pipelines/docs/configure-gke-cluster?hl=en_US#ensure mentions having at least 3 nodes (+1 node) with 2 CPUs and 4 GB memory each (+1 GB each). Perhaps mirroring this config and allocating more resources upfront would yield performance benefits?
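For comparison, a cluster matching that guide's minimums could be created along these lines (a sketch only; the cluster name and zone are placeholders, not the lab's actual setup):

```shell
# Sketch: 3-node cluster sized per the linked guide's minimums.
# n1-standard-2 provides 2 vCPUs and 7.5 GB memory per node, which is
# above the recommended 4 GB per node.
gcloud container clusters create kfp-cluster \
  --zone us-central1-a \
  --machine-type n1-standard-2 \
  --num-nodes 3 \
  --enable-autoscaling --min-nodes 3 --max-nodes 5 \
  --scopes cloud-platform
```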

Increasing the capacity of the cluster hosting KFP will not help. Most of the time is spent starting and executing Dataflow and AI Platform Training jobs. The long execution time for AI Platform Training does not look right.

I have fixed the slow training in AI Platform Training. The issue was a TensorBoard callback that was writing to GCS. The next step is to look at optimizing Dataflow, which will most likely be just an incremental change.

Thanks, excellent detective work and fix! I just ran through the lab again and saw a significant improvement: runtimes of 0:45:28 and 0:46:12 (the last run even added the InfraValidator component to the DAG). This gives us a lot more flexibility to reuse this lab, scoped to take about an hour end-to-end (reading + explanation + pipeline runtime), in a number of different training formats going forward, e.g. workshops, ASL, a future Qwiklab conversion, etc. Marking this as successfully fixed.