sparklyr / sparklyr

R interface for Apache Spark

Home Page: https://spark.rstudio.com/

spark_apply() error, "Can't query fields"

brianstamper opened this issue · comments

Attempting to use spark_apply() fails with an error. This is on a Cloudera Machine Learning environment; the same error appears with both Spark 2.4.7 and 3.2.1. On another machine with a local Spark install I do not see this issue, and the error I'm getting here gives little indication of what the problem is.

For a small demo I'll use the first spark_apply() example from "sparklyr - Distributing R Computations":

library(tidyverse)
library(sparklyr)

conf <- spark_config()
sc <- spark_connect(master = "yarn-client", config = conf)

# Apply an identity function to a one-partition, five-row Spark data frame
sdf_len(sc, 5, repartition = 1) %>%
  spark_apply(function(e) I(e))

The error I get looks like the following. Since the backtrace ends in dbplyr, it does make me wonder whether this is actually a dbplyr issue instead.

Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Caused by error in `value[[3L]]()`:
! Failed to fetch data: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 588.0 failed 4 times, most recent failure: Lost task 0.3 in stage 588.0 (TID 57382) (10.42.5.51 executor 112): org.apache.spark.SparkException: Process List(tar, -xf, packages.51962.tar) exited with code 2
Backtrace:
  1. sdf_len(sc, 5, repartition = 1) %>% ...
  4. sparklyr::sdf_len(sc, 5, repartition = 1)
  5. sparklyr::sdf_seq(sc, 1, length, repartition = repartition, type = type)
  7. sparklyr:::sdf_register.spark_jobj(sdf)
  9. sparklyr:::tbl.spark_connection(sc, name)
 10. sparklyr:::spark_tbl_sql(src = src, from)
 11. dbplyr::tbl_sql(...)
 13. dbplyr:::dbplyr_query_fields(src$con, from)
 14. dbplyr:::dbplyr_fallback(con, "db_query_fields", ...)
 16. dbplyr:::db_query_fields.DBIConnection(con, ...)
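For what it's worth, the underlying failure in the log is the executor unpacking the R package bundle (`Process List(tar, -xf, packages.51962.tar) exited with code 2`), not the dbplyr call that surfaces the error. One way to test that theory is spark_apply()'s `packages` argument, which controls whether sparklyr builds and ships a tarball of the local `.libPaths()` packages to the executors. A minimal diagnostic sketch, assuming the same connection `sc` as above (the closure here uses only base R, so it does not need the bundle):

```r
library(sparklyr)

sc <- spark_connect(master = "yarn-client", config = spark_config())

# packages = FALSE skips building and distributing the R package tarball,
# so the executors never run the failing `tar -xf` step; the closure runs
# against whatever R libraries are already installed on the workers.
sdf_len(sc, 5, repartition = 1) %>%
  spark_apply(function(e) I(e), packages = FALSE)
```

If this call succeeds, the problem is with creating or extracting the package tarball on the executors (e.g. disk space, tar version, or a corrupt archive) rather than with dbplyr.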

Cross-posting from https://community.rstudio.com/t/using-spark-apply-throws-a-sparkexception-process-list-tar-xf-packages-75547-tar-exited-with-code-2/180555