Spark crasher

JohnMount opened this issue

I can reliably crash Spark using sparklyr/dplyr commands. The example is reproducible but long, so I am linking to it here (and source).

The symptom is that the Spark cluster becomes unresponsive (high CPU, and the web user interface no longer updates), and this sometimes causes RStudio to crash. The failure is not fully deterministic (it can take 1 to 3 passes through the loop to trigger).

It (unfortunately) involves a complicated calculation (originally from another package, but now completely stand-alone in the example).

I am attaching clean logs of just the failure here and here.

This is expensive for me to re-run as I have to re-boot to try and make sure I have clean interprocess communication state after each failure. I suggest a good first step is reproducibility: do you see the same effects?

Pasted version of the example (skipping execution of the crashing block, which is the only for-loop):


    ## [1] "Sat Jun  3 15:19:28 2017"


    ## [1] '0.5.0'


    ## [1] '0.5.5'


# dplyr supplies %>%, mutate(), copy_to(), left_join(), select(), and collect()
library("dplyr")

sc <- NULL

sc <- sparklyr::spark_connect(version='2.0.2', 
                              master = "local")


mtcars2 <- mtcars %>%
  mutate(car = row.names(mtcars)) 

frameList <- mtcars2 %>% 
  tidyr::gather(key='fact', value='value', -car) %>% 
  split(., .$fact) 

frameListS <- lapply(names(frameList), 
                     function(ni) {
                       copy_to(sc, frameList[[ni]], ni)
                     })
names(frameListS) <- names(frameList)

n1 <- names(frameListS)[[1]]
nrest <- setdiff(names(frameListS),n1)

#' Compute union_all of tables.  Cut down from \code{replyr::replyr_union_all()} for debugging.
#' @param sc remote data source tables are on (and where to copy-to and work), NULL for local tables.
#' @param tabA not-NULL table with at least 1 row on sc data source, and columns \code{c("car", "fact", "value")}.
#' @param tabB not-NULL table with at least 1 row on same data source as tabA and columns \code{c("car", "fact", "value")}.
#' @return table with all rows of tabA and tabB (union_all).
#' @export
example_union_all <- function(sc, tabA, tabB) {
  cols <- intersect(colnames(tabA), colnames(tabB))
  expectedCols <- c("car", "fact", "value")
  if((length(cols)!=length(expectedCols)) ||
     (!isTRUE(all.equal(cols, expectedCols)))) {
    stop(paste("example_union_all: column set must be exactly", 
               paste(expectedCols, collapse = ', ')))
  }
  mergeColName <- 'exampleunioncol'
  # build a 2-row table to control the union
  controlTable <- data.frame(exampleunioncol= c('a', 'b'),
                             stringsAsFactors = FALSE)
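  if(!is.null(sc)) {
    # NOTE: the trailing copy_to() arguments were lost in this paste;
    # overwrite = TRUE is an assumption so the default table name can be reused.
    controlTable <- copy_to(sc, controlTable,
                            overwrite = TRUE)
  }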
  # decorate left and right tables for the merge
  tabA <- tabA %>%
    select(one_of(cols)) %>%
    mutate(exampleunioncol = as.character('a'))
  tabB <- tabB %>%
    select(one_of(cols)) %>%
    mutate(exampleunioncol = as.character('b'))
  # do the merges
  joined <- controlTable %>%
    left_join(tabA, by=mergeColName) %>%
    left_join(tabB, by=mergeColName, suffix = c('_a', '_b'))
  # coalesce the values
  joined <- joined %>%
    mutate(car = ifelse(exampleunioncol=='a', car_a, car_b))
  joined <- joined %>%
    mutate(fact = ifelse(exampleunioncol=='a', fact_a, fact_b))
  joined <- joined %>%
    mutate(value = ifelse(exampleunioncol=='a', value_a, value_b))
  joined %>%
    select(one_of(cols)) %>%
    compute()
}

# skipped executing this.  This triggers the lock-up, crash.
for(i in seq_len(100)) { 
  # very crude binding of rows (actual code would always bind small bits)
  res <- frameListS[[n1]]
  for(fi in nrest) {
    print(paste(' start',i,fi,base::date()))
    oi <- frameListS[[fi]]
    res <- example_union_all(sc, res, oi)
    print(paste(' done',i,fi,base::date()))
  }
  # bring the accumulated result back to local R
  local <- res %>%
    collect()
  print(paste(' done',i,base::date()))
}

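if(!is.null(sc)) {
  # body lost in this paste; disconnecting is the assumed cleanup step
  sparklyr::spark_disconnect(sc)
}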

Note: I am deliberately leaking resources in this example (as I was worried the crasher might be premature resource collection). Also, this example originally used the package replyr to generate the dplyr/sparklyr commands; this is a "cleaned up" example that tries to imitate the issue I originally ran into. The replyr version of the issue can be found here.

The segfault version (intermittent failure) is here.

I found that in this use case, since compute() currently does not persist a snapshot of the data (the proposed PR fixes this), Spark ends up creating a giant graph of operations that starts looking like this:

[Screenshot of the Spark operation graph, 2017-06-21 3:45:42 PM]

Spark is not optimized for thousands of operations in the execution plan; GC becomes saturated and uncommon exceptions are thrown...

[Screenshot, 2017-06-21 3:35:56 PM]

The proposed PR mitigates this by making compute() persist the query. I am not sure it will fully satisfy this use case, but it certainly brings the behavior closer to what a dplyr user would expect from compute().
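The plan growth described in this comment can be observed without a cluster. The sketch below is only a rough analogue: it assumes the dbplyr package (the thread itself used the SQL translation built into dplyr 0.5.0), uses dbplyr's `lazy_frame()` and `simulate_dbi()` as a stand-in for a Spark table, and applies a hypothetical mutate/filter step rather than the union from the example. It records how long the generated SQL gets when lazy steps are chained without ever being materialized.

```r
library(dplyr)
library(dbplyr)

# Stand-in for a remote table: no database or Spark connection is needed.
tab <- lazy_frame(car = "a", fact = "mpg", value = 1, con = simulate_dbi())

res <- tab
sql_chars <- integer(0)
for (k in seq_len(5)) {
  # Hypothetical step; each pass wraps the previous query in another layer.
  res <- res %>%
    mutate(value = value + 1) %>%
    filter(value > 0)
  # Length of the SQL that would be sent if the result were requested now.
  sql_chars[k] <- nchar(as.character(sql_render(res)))
}
print(sql_chars)
```

Each iteration nests the previous query, so the rendered SQL keeps growing; persisting an intermediate result is what resets that growth.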

Haven't looked deeply at the reprex, but if this is an issue of lineage blowing up, consider calling sdf_checkpoint()?

Thanks. To use it I'd need to know what sdf_checkpoint() does and how to delete its results after I am done with them (to prevent leaks). It doesn't appear to take or return a name, so that is a problem.

@kevinykuo @javierluraschi

First: thanks for working on this! This sort of workload is probably going to evolve into my "Proof of Concept" small difficult example (run this before trying big projects on a new cluster).

Unfortunately, I do have a couple of follow-ups:

  1. Can you point me to some documentation for sdf_checkpoint()? Mostly: what does it do, and, given that it doesn't take a name argument, how does one delete its results and prevent resource leaks?

  2. I pulled the current development code (June 25, 2017) and still see the GC exhaustion behavior on a slightly modified example. I have taken the liberty of opening that as a new issue: sparklyr issue 783. Since it is a crasher, the example isn't formally produced by the reprex package, but it is code that can be pasted and run.

@JohnMount sdf_checkpoint() calls Spark's checkpoint, which is meant for iterative algorithms like yours. I am curious whether replacing compute() with sdf_checkpoint() fixes your scenario.

I'll take a look at (2), already posting a suggestion.

I'll try to take a look late tonight. Thank you very much for the suggestions and ideas. The real application is a set of steps I have control of, so a workaround would be just fine.

So how does one set the Checkpoint directory? help(sdf_checkpoint) did not seem to give me a pointer.

z = sdf_checkpoint(d)
Error: org.apache.spark.SparkException: Checkpoint directory has not been set in the SparkContext
	at org.apache.spark.rdd.RDD.checkpoint(RDD.scala:1543)
	at org.apache.spark.sql.Dataset.checkpoint(Dataset.scala:513)

@MZLABS @JohnMount See ?spark_set_checkpoint_dir

We'll make a note to include this when we update the documentation next release.

This was never added to the documentation for sdf_checkpoint.
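For readers who hit the same error, here is a minimal sketch of the sequence discussed above: set a checkpoint directory, then checkpoint inside an iterative loop. It is untested against the original crasher and rests on several assumptions: a local Spark install recent enough to provide `Dataset.checkpoint` (2.1.0 is used here, whereas the original example connected to 2.0.2), a scratch directory under `tempdir()` (a hypothetical choice; a cluster would normally use a shared path), and a toy `mutate()` step in place of `example_union_all()`.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark whose Dataset API has checkpoint().
sc <- spark_connect(master = "local", version = "2.1.0")

# Assumption: a throwaway local directory for checkpoint files.
checkpoint_dir <- file.path(tempdir(), "spark_checkpoints")
spark_set_checkpoint_dir(sc, checkpoint_dir)

d <- copy_to(sc, mtcars, "mtcars_chk", overwrite = TRUE)

res <- d
for (i in seq_len(3)) {
  # Toy iterative step; checkpointing after it truncates the lineage.
  res <- res %>%
    mutate(mpg = mpg + 1) %>%
    sdf_checkpoint(eager = TRUE)
}
n_rows <- sdf_nrow(res)
print(n_rows)

spark_disconnect(sc)
# Checkpoint files stay on disk; deleting the scratch directory by hand is
# this sketch's own cleanup, not something the thread documents.
unlink(checkpoint_dir, recursive = TRUE)
```

The thread does not say how checkpointed data should be released, so the `unlink()` call only addresses the local scratch directory used in this sketch.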