JerryLead / SparkInternals

Notes talking about the design and implementation of Apache Spark

Why is the definition of dependencies different from the RDD paper?

endersuu opened this issue

From the paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing:

  • narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD
  • wide dependencies, where multiple child partitions may depend on a single partition of the parent RDD
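
For concreteness, here is a minimal spark-shell sketch (assuming the usual `sc` SparkContext provided by the shell; class names in the comments are the ones Spark uses internally) showing how the paper's two cases surface in `RDD.dependencies`:

```scala
// Assumes a running spark-shell session, where `sc` is the provided SparkContext.
val parent = sc.parallelize(1 to 100, numSlices = 4)

// map: each child partition reads exactly one parent partition -> narrow dependency
val mapped = parent.map(_ * 2)
println(mapped.dependencies)
// e.g. List(org.apache.spark.OneToOneDependency@...)

// groupBy: a child partition may read records from every parent partition -> wide dependency
val grouped = parent.groupBy(_ % 3)
println(grouped.dependencies)
// e.g. List(org.apache.spark.ShuffleDependency@...)
```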

However, the definition of dependencies in the JobLogicalPlan chapter is different:

  • NarrowDependency: each partition of the child RDD fully depends on a small number of partitions of its parent RDD. "Fully depends" (i.e., FullDependency) means that a child partition depends on the entire parent partition.

  • ShuffleDependency: multiple child partitions partially depend on a parent partition. "Partially depends" (i.e., PartialDependency) means that each child partition depends on only a part of the parent partition.
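
The chapter's "small number of partitions" wording seems to cover cases like cartesian, where a child partition fully consumes one partition from each of two parents yet Spark still models it as narrow. A hedged sketch (same `sc` assumption as above; the printed values are what I would expect, not verified output):

```scala
// Assumes the same spark-shell `sc` as above.
// cartesian: each child partition fully uses one partition of `a` and one of `b`,
// so more than one parent partition is involved, but no shuffle is needed.
val a = sc.parallelize(1 to 10, 2)
val b = sc.parallelize(Seq("x", "y", "z"), 3)
val cart = a.cartesian(b)

// Both dependencies are NarrowDependency subclasses; there is no ShuffleDependency here.
cart.dependencies.foreach { d =>
  println(d.isInstanceOf[org.apache.spark.NarrowDependency[_]])  // expected: true, true
}
```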

This really confuses me. Are ShuffleDependency and wide dependency the same thing?