JerryLead / SparkInternals

Notes talking about the design and implementation of Apache Spark

Narrow dependencies: the "FullDependency: N : N" figure in Chapter 2, Section 2

feitang0 opened this issue · comments

Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
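In the "FullDependency: N : N" figure in Chapter 2, Section 2, one partition of the parent RDD is depended on by two partitions of the child RDD. Surely that can't be called a narrow dependency? Why is FullDependency described as a NarrowDependency?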

commented

+1

commented

+1

commented

+1

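"Narrow" here means full dependency: the data in each partition of the parent RDD does not need to be partitioned again before it is sent to the child RDD. The cartesian(otherRDD) example further down in the chapter shows an N:N narrow dependency, and the whole computation needs no shuffle.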
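To make the N:N case concrete, here is a minimal plain-Python sketch (no Spark required) of how a cartesian product can assign parent partitions to child partitions. The names `cartesian_parents`, `cartesian`, `parts_a` and `parts_b` are hypothetical and exist only for this illustration. The index arithmetic is meant to mirror what Spark's cartesian implementation does as far as I recall, so treat it as an assumption rather than a quote of the source.

```python
def cartesian_parents(child_idx, num_parts_b):
    """Return the (a, b) parent partition indices read by one child partition.

    Assumption: child partition k pairs partition k // m of the first parent
    with partition k % m of the second, where m is the number of partitions
    of the second parent.
    """
    return child_idx // num_parts_b, child_idx % num_parts_b


def cartesian(parts_a, parts_b):
    """Build the child partitions; each reads exactly two whole parent partitions."""
    m = len(parts_b)
    child = []
    for k in range(len(parts_a) * m):
        i, j = cartesian_parents(k, m)
        # Parent partitions are consumed as-is: no record is re-partitioned by key.
        child.append([(x, y) for x in parts_a[i] for y in parts_b[j]])
    return child


parts_a = [[1, 2], [3]]        # hypothetical parent a with 2 partitions
parts_b = [["x"], ["y", "z"]]  # hypothetical parent b with 2 partitions
child = cartesian(parts_a, parts_b)

# Count how many child partitions read each partition of parent a.
fan_out_a = [
    sum(1 for k in range(len(child))
        if cartesian_parents(k, len(parts_b))[0] == i)
    for i in range(len(parts_a))
]
print(fan_out_a)  # [2, 2]
```

Each partition of `parts_a` is read by two child partitions, which is exactly the situation the question points at, yet every child partition reads its parents whole and no data is redistributed by key.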

@JerryLead
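Personally, I don't think the narrow vs. wide definition here is very clear. My impression is that the author's original intent was to separate the deterministic case from the random one: if there is a shuffle in between, the dependency is wide, otherwise it is narrow. The real distinction is between determined and undetermined (that is, given a child partition, its parent partitions are fully determined), not between full and partial. In particular, I suggest changing "essentially" to "typically"; otherwise the wording contradicts itself.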
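The "determined vs. undetermined" reading can be sketched in plain Python as follows. This is only an illustrative model of the interpretation in the comment above, not how Spark represents dependencies internally; `narrow_parents_one_to_one` and `hash_shuffle_parents` are hypothetical helper names, and integer keys are used so that `hash(key)` equals the key itself.

```python
def narrow_parents_one_to_one(child_idx):
    """Narrow case: the parents follow from the child index alone, before any data is seen."""
    return [child_idx]


def hash_shuffle_parents(parent_parts, num_child_parts, child_idx):
    """Shuffle case: list the parent partitions holding at least one record for child_idx.

    The answer depends on the keys in the data, so it cannot be derived
    from the partition index alone.
    """
    return [
        p for p, records in enumerate(parent_parts)
        if any(hash(key) % num_child_parts == child_idx for key, _ in records)
    ]


# Two hypothetical datasets with the same partition layout but different keys.
data_mixed = [[(0, "a"), (1, "b")], [(2, "c"), (3, "d")]]
data_sorted = [[(0, "a"), (2, "b")], [(1, "c"), (3, "d")]]

print(narrow_parents_one_to_one(1))             # [1]
print(hash_shuffle_parents(data_mixed, 2, 0))   # [0, 1]
print(hash_shuffle_parents(data_sorted, 2, 0))  # [0]
```

With the same partition layout, the set of parents contributing to child partition 0 changes when the keys change, whereas the narrow mapping stays fixed. In this model, that is the sense in which a shuffle makes the parent set undetermined.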