GoogleCloudPlatform / mlops-on-gcp


Lengthy run-time for lab-02-tfx-pipeline

dougkelly opened this issue

Ran through lab-02-tfx-pipeline 3 times with the following run-times:

  • 1 hr 46 min
  • 2 hr 6 min
  • 1 hr 54 min

I was a bit concerned by this runtime on a small dataset (~500k examples), both for delivery and for motivating the use of CAIP Pipelines over the existing CAIP training and prediction services, so I wanted to flag it and discuss improvement opportunities.

There are a lot of deprecation warnings and non-fatal errors in the logs. I am still learning the KFP interface compared to the SmartEngine UI, so I wasn't sure how to view the runtimes of individual components in order to profile them. From what I can tell, the ordering by wall time is Trainer > Evaluator > Transform > CsvExampleGen.
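For reference, one way to get per-component wall times programmatically (a hedged sketch, not an official workflow): KFP runs execute as Argo workflows, and the run's workflow manifest records start/finish timestamps per node. Assuming a manifest JSON of the shape returned by `kfp.Client().get_run(run_id).pipeline_runtime.workflow_manifest`, the durations could be computed like this (the node names and timestamps below are illustrative, not from the lab):

```python
import json
from datetime import datetime

# Minimal illustrative sample of the Argo workflow manifest shape that
# kfp.Client().get_run(run_id).pipeline_runtime.workflow_manifest returns.
# Node names and timestamps here are made up for the example.
manifest = json.dumps({
    "status": {
        "nodes": {
            "run-abc-trainer": {
                "displayName": "Trainer",
                "startedAt": "2020-05-01T10:00:00Z",
                "finishedAt": "2020-05-01T11:10:00Z",
            },
            "run-abc-csvexamplegen": {
                "displayName": "CsvExampleGen",
                "startedAt": "2020-05-01T09:50:00Z",
                "finishedAt": "2020-05-01T10:00:00Z",
            },
        }
    }
})

def component_wall_times(workflow_manifest: str) -> dict:
    """Return {component display name: wall time in minutes}."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    nodes = json.loads(workflow_manifest)["status"]["nodes"]
    times = {}
    for node in nodes.values():
        started = datetime.strptime(node["startedAt"], fmt)
        finished = datetime.strptime(node["finishedAt"], fmt)
        times[node["displayName"]] = (finished - started).total_seconds() / 60
    return times

# Components sorted slowest-first, mirroring the ordering observed above.
print(sorted(component_wall_times(manifest).items(), key=lambda kv: -kv[1]))
```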

To improve performance, are there opportunities to:

  • Add additional worker machines / accelerators (GPU, TPU) to the Trainer?
  • Add additional worker machines to the Evaluator?
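To make the first bullet concrete, here is a hedged sketch of the scale-out knobs available when the Trainer runs on AI Platform Training. The dict keys follow AI Platform's TrainingInput spec (scaleTier, masterType, workerType, workerCount, acceleratorConfig); the project, region, machine types, accelerator, and worker count shown are illustrative assumptions, not values from the lab:

```python
# Illustrative scale-out config for a Trainer running on AI Platform
# Training. Keys follow the AI Platform TrainingInput spec; the specific
# values (project, region, machine types, counts) are placeholders.
ai_platform_training_args = {
    "project": "my-project",      # placeholder project ID
    "region": "us-central1",      # placeholder region
    "scaleTier": "CUSTOM",        # required to specify machine types below
    "masterType": "n1-standard-8",
    "masterConfig": {
        "acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}
    },
    "workerType": "n1-standard-8",
    "workerCount": 2,
}

# In a TFX pipeline using the AI Platform Trainer executor, a dict like
# this would typically be passed to the Trainer component's custom_config
# under the executor's training-args key, e.g.:
#   custom_config={ai_platform_trainer_executor.TRAINING_ARGS_KEY:
#                  ai_platform_training_args}
```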

I see the GKE cluster created has 2 nodes, with autoscaling enabled for up to 5. The cluster looks to be well within its memory and CPU limits, but one of the nodes did run an autoscaler pod. This guide https://cloud.google.com/ai-platform/pipelines/docs/configure-gke-cluster?hl=en_US#ensure mentions having at least 3 nodes (+1 node) with 2 CPUs and 4 GB memory each (+1 GB each). Perhaps mirroring this config and allocating more resources upfront would yield performance benefits?
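For comparison, a cluster matching that guide's minimums could be created along these lines (a sketch only; the cluster name and zone are placeholders, not the lab's actual setup):

```shell
# Sketch: 3-node cluster sized per the linked guide's minimums.
# n1-standard-2 provides 2 vCPUs and 7.5 GB memory per node, which is
# above the recommended 4 GB per node.
gcloud container clusters create kfp-cluster \
  --zone us-central1-a \
  --machine-type n1-standard-2 \
  --num-nodes 3 \
  --enable-autoscaling --min-nodes 3 --max-nodes 5 \
  --scopes cloud-platform
```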

Increasing the capacity of the cluster hosting KFP will not help. Most of the time is spent starting and executing Dataflow and AI Platform Training jobs. The long execution time for AI Platform Training does not look right.

I have fixed the slow training in AI Platform Training. The issue was a TensorBoard callback that was writing to GCS. The next step is to look at optimizing Dataflow, which will most likely be just an incremental change.

Thanks, excellent detective work and fix! I just ran through the lab again and saw a significant improvement: runtimes of 0:45:28 and 0:46:12 (the last run even added the InfraValidator component to the DAG). This gives us a lot more flexibility to reuse this lab, scoped to take about an hour end-to-end (reading + explanation + pipeline runtime), in a number of different training formats going forward, e.g. workshops, ASL, a future Qwiklab conversion, etc. Marking this as successfully fixed.