JerryLead / SparkInternals

Notes on the design and implementation of Apache Spark

A question about CoGroupedRDD, and a small question about dependencies

pzz2011 opened this issue · comments

commented

I've read the CoGroupedRDD implementation, but I don't understand how a NarrowDependency or a ShuffleDependency affects the partitions of the CoGroupedRDD... If we call a.cogroup(b), with a using a RangePartitioner in one case and a HashPartitioner in the other, does the intermediate CoGroupedRDD end up with as many partitions as RDD a? After all, the cogroup operator doesn't let you specify numPartitions, does it?
In the JobLogicalPlan chapter you split dependencies into 4 categories (or rather two broad classes), yet the way CoGroupedRDD handles its dependencies doesn't look nearly that complicated, and it seems to completely ignore the so-called N:1 NarrowDependency.

override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
  val sparkConf = SparkEnv.get.conf
  val externalSorting = sparkConf.getBoolean("spark.shuffle.spill", true)
  val split = s.asInstanceOf[CoGroupPartition]
  val numRdds = dependencies.length

  // A list of (rdd iterator, dependency number) pairs
  val rddIterators = new ArrayBuffer[(Iterator[Product2[K, Any]], Int)]
  for ((dep, depNum) <- dependencies.zipWithIndex) dep match {
    case oneToOneDependency: OneToOneDependency[Product2[K, Any]] @unchecked =>
      val dependencyPartition = split.narrowDeps(depNum).get.split
      // Read them from the parent
      val it = oneToOneDependency.rdd.iterator(dependencyPartition, context)
      rddIterators += ((it, depNum))

    case shuffleDependency: ShuffleDependency[_, _, _] =>
      // Read map outputs of shuffle
      val it = SparkEnv.get.shuffleManager
        .getReader(shuffleDependency.shuffleHandle, split.index, split.index + 1, context)
        .read()
      rddIterators += ((it, depNum))
  }
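
For reference, both the partition count and the dependency type are decided before compute ever runs: the number of output partitions comes from the Partitioner the CoGroupedRDD is constructed with, and getDependencies picks a dependency per parent RDD by comparing that partitioner with the parent's own. A paraphrased sketch of that logic (based on the Spark 1.x CoGroupedRDD source; exact type parameters and field names may differ between versions):

// Paraphrased sketch of CoGroupedRDD.getDependencies (Spark 1.x era).
// "part" is the partitioner of the CoGroupedRDD itself; "rdds" are its parents.
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      // Parent is already partitioned exactly like the output:
      // partition i of the parent feeds partition i of the CoGroupedRDD.
      new OneToOneDependency(rdd)
    } else {
      // Otherwise the parent's records must be re-shuffled by "part".
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}

Under this scheme the N:1 NarrowDependency described in the chapter simply never appears here: each parent either matches the output partitioning exactly (1:1) or is re-shuffled in full.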
commented

cogroup has an optional parameter that specifies the number of tasks; is that task count the same as the number of partitions?
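
For illustration, PairRDDFunctions exposes several cogroup overloads, and the numPartitions variant is just shorthand for passing a HashPartitioner. A minimal sketch (a and b stand for arbitrary pair RDDs, not specific values from this thread):

// Sketch of the cogroup overloads in PairRDDFunctions (a, b are hypothetical RDD[(K, V)] / RDD[(K, W)]):
//   a.cogroup(b)                 // partitioner chosen by Partitioner.defaultPartitioner
//   a.cogroup(b, numPartitions)  // shorthand for a.cogroup(b, new HashPartitioner(numPartitions))
//   a.cogroup(b, partitioner)    // the resulting CoGroupedRDD uses exactly this partitioner
val grouped = a.cogroup(b, 8)     // the CoGroupedRDD, and hence the reduce-side task count, has 8 partitions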

Hi, did you ever figure this out? I don't understand it either: why does having the same partitioner make it a one-to-one dependency, and why can a shuffle still be produced (given that a stage boundary is drawn)?
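
One way to see what actually happens is to pre-partition both parents and inspect the lineage. A minimal spark-shell sketch (names here are made up; sc is the shell's SparkContext, and the exact toDebugString output varies by Spark version):

import org.apache.spark.HashPartitioner

// The shuffles happen at partitionBy; the cogroup step itself is then narrow,
// because partition i of each parent already holds exactly the keys of output partition i.
val part = new HashPartitioner(4)
val x = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part)
val y = sc.parallelize(Seq((1, "u"), (3, "v"))).partitionBy(part)
val grouped = x.cogroup(y, part)
println(grouped.toDebugString)   // expect no new ShuffleDependency introduced by the cogroup itself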