horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai

Support Barrier Execution Mode for Horovod on Spark (Spark >= 2.4)

max-509 opened this issue

Is your feature request related to a problem? Please describe.
When I use Horovod on Spark to train a distributed DL model, Horovod performs additional steps to transfer data to the Horovod processes:

  • The DataFrame partitions are saved to distributed storage (for example, HDFS) using Petastorm.
  • The partitions are then read back from that storage with a client (for example, Hadoop) to deliver the data to the Horovod processes.

These extra steps can slow down processing when working with big data; the sketch after this list shows where the round trip happens in the current Estimator API.
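
For context, a minimal sketch (not taken from the Horovod docs verbatim) of the current Estimator-based flow, assuming a placeholder HDFS path, a toy model, and a tiny stand-in DataFrame. The `store` argument is the intermediate storage where the write/read round trip described above takes place:

```python
import tensorflow as tf
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

import horovod.spark.keras as hvd_keras
from horovod.spark.common.store import Store

spark = SparkSession.builder.appName("horovod-estimator-flow").getOrCreate()

# Tiny stand-in DataFrame; in practice this would be a large training set.
train_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0, 2.0, 3.0]), 1.0),
     (Vectors.dense([1.0, 0.0, 3.0, 2.0]), 0.0)],
    ["features", "label"],
)

# The store is the intermediate distributed storage (placeholder HDFS path).
store = Store.create("hdfs://namenode:8020/tmp/horovod_work_dir")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
optimizer = tf.keras.optimizers.Adam(0.001)

estimator = hvd_keras.KerasEstimator(
    num_proc=4,                 # number of Horovod processes
    store=store,
    model=model,
    optimizer=optimizer,
    loss="mse",
    feature_cols=["features"],
    label_cols=["label"],
    batch_size=32,
    epochs=2,
)

# fit() first materializes train_df in the store as Parquet (via Petastorm),
# and each Horovod process then reads it back: the round trip described above.
keras_model = estimator.fit(train_df)
```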

Describe the solution you'd like
I suggest using Barrier Execution Mode, which was introduced in Spark 2.4. Horovod could repartition the DataFrame to the number of executors and use mapInPandas() to convert each Spark DataFrame partition into an iterator of pd.DataFrame; enabling Arrow speeds up the conversion. Horovod can then turn the iterator of pd.DataFrame into a data loader for the specific DL framework. This logic could be wrapped into the [Torch|Keras|Lightning]Estimator classes, or exposed through a dedicated function such as horovod.spark.run_on_dataframe(), analogous to horovod.spark.run().
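
A rough, non-authoritative sketch of the Spark side of this idea, under these assumptions: run_on_dataframe() above stays hypothetical; the Horovod bootstrapping (gloo/MPI rendezvous, hvd.init()) that would run inside each task is elided; and the `barrier` flag on mapInPandas() requires Spark >= 3.5 (on Spark 2.4–3.4 the same effect can be built with df.rdd.barrier().mapPartitions()):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.taskcontext import BarrierTaskContext

spark = (
    SparkSession.builder
    .appName("horovod-barrier-sketch")
    # Arrow-based conversions for toPandas()/createDataFrame();
    # mapInPandas itself always transfers data via Arrow.
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

train_df = spark.read.parquet("hdfs:///path/to/training_data")  # placeholder path
num_workers = 4  # assumption: one Horovod process per barrier task


def train_fn(pdf_iter):
    """Runs once per barrier task; pdf_iter yields pandas.DataFrame chunks."""
    ctx = BarrierTaskContext.get()
    # ctx.getTaskInfos() exposes the addresses of all tasks in the stage,
    # which is what a Horovod rendezvous would need; hvd.init() would go here.
    ctx.barrier()  # all tasks enter the training step together
    for pdf in pdf_iter:
        # Convert each pandas chunk into the DL framework's data loader
        # and run training steps here (omitted in this sketch).
        pass
    yield pd.DataFrame({"rank": [ctx.partitionId()]})


result = (
    train_df.repartition(num_workers)
            .mapInPandas(train_fn, schema="rank long", barrier=True)
)
result.collect()  # triggers the barrier stage
```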

Describe alternatives you've considered
As I understand it, Databricks uses a similar design for HorovodRunner, and XGBoost on PySpark takes a similar approach.

Additional context
This feature could make Horovod on Spark more popular; Uber engineers also describe this problem in this presentation.
If you support this idea but don't have time to implement it, I can start implementing it as a contribution to Horovod.

I'll be waiting for your feedback.