sparklyr / sparklyr

R interface for Apache Spark

Home Page:https://spark.rstudio.com/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Repeated calls to `mutate` and other functions take increasingly long to return

aguynamedryan opened this issue · comments

I'm suffering from a degradation of performance of sparklyr. I'm using sparklyr version 1.8.3 and Spark 3.4.0.

I'm building an ETL tool which ingests rows from a CSV file to determine what transformations are needed. Each source table is a tbl_spark from sparklyr and the tool uses dplyr to apply mutations/joins to that table for each tranformation. Essentially, for each source table, my ETL tool operates like this:

t <- tbl(sc, "source_table")

for(i in 1:nrow(csv_rows)) {
	t <- t |> left_join(other_table, ...)
	t <- t |> mutate(...)
}

t |> compute()

I'm finding that each subsequent call to mutate takes longer and longer to complete. In my ETL tool, it ends up taking 5-10 seconds for a call to mutate to complete. My ETL specifications contain thousands of transformations. Currently, my ETL tool takes ~8 hours to complete. Half of that time is Spark performing translations on the data. But the other half of that time is waiting for calls mutate/join to finish.

In profiling my code, it appears that mutate is spending most of its time rendering out the SQL for the tbl_spark and all the mutations/joins applied to it so far. I assume each subsequent call re-renders all of that SQL plus the additional SQL from the new call.

I have a simple script that illustrates the slow down. On my system, initial calls to mutate return in 0.1 seconds. By the end of 50 calls, mutate is taking 1.5 seconds to complete.

library(dplyr)
library(sparklyr)

sc <- spark_connect("local")
d <- data.frame(
  d01 = c(1), d02 = c(2), d03 = c(3), d04 = c(4), d05 = c(5), d06 = c(6), d07 = c(7), d08 = c(8), d09 = c(9), d10 = c(10)
  , d11 = c(1), d12 = c(2), d13 = c(3), d14 = c(4), d15 = c(5), d16 = c(6), d17 = c(7), d18 = c(8), d19 = c(9), d20 = c(10)
  , d21 = c(1), d22 = c(2), d23 = c(3), d24 = c(4), d25 = c(5), d26 = c(6), d27 = c(7), d28 = c(8), d29 = c(9), d30 = c(10)
)
copy_to(sc, d, name = "some_data", overwrite = TRUE)

ds <- tbl(sc, "some_data") |> rename_all(tolower)

for(i in 1:50) {
  new_col <- paste0("col", i)
  print(system.time({ds <- ds |> mutate("{new_col}" := i)}))
}
spark_disconnect(sc)

Output:

   user  system elapsed 
  0.059   0.000   0.120

...

   user  system elapsed 
  1.407   0.000   1.543

I have another script that illustrates the behavior I'm expecting. It uses RSQLite in place of sparklyr. On my system, initial calls to mutate return in 0.02 seconds. By the end of 50 calls, mutate is taking 0.02 seconds to complete. RSQLite does not seem to render out the SQL for each mutation applied.

library(dplyr)
library(RSQLite)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
d <- data.frame(
  d01 = c(1), d02 = c(2), d03 = c(3), d04 = c(4), d05 = c(5), d06 = c(6), d07 = c(7), d08 = c(8), d09 = c(9), d10 = c(10)
  , d11 = c(1), d12 = c(2), d13 = c(3), d14 = c(4), d15 = c(5), d16 = c(6), d17 = c(7), d18 = c(8), d19 = c(9), d20 = c(10)
  , d21 = c(1), d22 = c(2), d23 = c(3), d24 = c(4), d25 = c(5), d26 = c(6), d27 = c(7), d28 = c(8), d29 = c(9), d30 = c(10)
)
copy_to(con, d, name = "some_data", overwrite = TRUE)

ds <- tbl(con, "some_data")

for(i in 1:50) {
  new_col <- paste0("col", i)
  print(system.time({ds <- ds |> mutate("{new_col}" := i)}))
}

dbDisconnect(con)

Output:

   user  system elapsed 
  0.021   0.000   0.021

...

   user  system elapsed 
  0.020   0.000   0.021
> utils::sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RSQLite_2.3.1  DBI_1.1.3      sparklyr_1.8.3 dplyr_1.1.3   

loaded via a namespace (and not attached):
 [1] bit_4.0.5         jsonlite_1.8.7    compiler_4.3.0    renv_0.14.0       tidyselect_1.2.0  blob_1.2.4        stringr_1.5.0     parallel_4.3.0    tidyr_1.3.0       yaml_2.3.7        uuid_1.1-1       
[12] fastmap_1.1.1     R6_2.5.1          generics_0.1.3    tibble_3.2.1      openssl_2.1.0     pillar_1.9.0      rlang_1.1.1       utf8_1.2.3        cachem_1.0.8      stringi_1.7.12    config_0.3.2     
[23] bit64_4.0.5       memoise_2.0.1     cli_3.6.1         withr_2.5.0       magrittr_2.0.3    digest_0.6.33     rstudioapi_0.15.0 dbplyr_2.3.3      base64enc_0.1-3   rappdirs_0.3.3    askpass_1.2.0    
[34] lifecycle_1.0.3   vctrs_0.6.3       glue_1.6.2        fansi_1.0.4       purrr_1.0.2       httr_1.4.7        tools_4.3.0       pkgconfig_2.0.3   ellipsis_0.3.2

> devtools::session_info()
Error in loadNamespace(x) : there is no package called ‘devtools’