prestodb / RPresto

DBI-based adapter for Presto for the statistical programming language R.

fix translation of `fill()`

copernican opened this issue

fill() doesn't work with Presto and needs a translation; the problem is described at tidyverse/dbplyr#1026. I submitted tidyverse/dbplyr#1065 and was asked to submit it here instead. Is it okay if I open a PR for this?

Sure. I can take a look at the PR once you're ready. Thanks!

fill() now uses the newly gained na_rm argument of last() so that last_value_sql() isn't needed anymore.
You can look at https://github.com/tidyverse/dbplyr/blob/main/R/backend-.R#L386 and https://github.com/tidyverse/dbplyr/blob/main/R/translate-sql-window.R#L255 for the implementation in dbplyr.

@jarodmeng in light of this change in dbplyr, do you prefer a PR that addresses only fill(), i.e., adds a translation for last(), or a PR similar to the sql_nth() implementation in dbplyr that can also handle first() and nth()?

I generated a simple reprex of the problem. Although Presto supports IGNORE NULLS, its syntax differs from what dbplyr generates: Presto expects LAST_VALUE("x") IGNORE NULLS rather than LAST_VALUE("x" IGNORE NULLS).

library(RPresto)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

# creating a default connection
conn <- presto_default()

# prepare the sample data
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
squirrels <- tibble::tribble(
  ~group,    ~name,     ~role,     ~n_squirrels, ~ n_squirrels2,
  1,      "Sam",    "Observer",   NA,                 1,
  1,     "Mara", "Scorekeeper",    8,                NA,
  1,    "Jesse",    "Observer",   NA,                NA,
  1,      "Tom",    "Observer",   NA,                 4,
  2,     "Mike",    "Observer",   NA,                NA,
  2,  "Rachael",    "Observer",   NA,                 6,
  2,  "Sydekea", "Scorekeeper",   14,                NA,
  2, "Gabriela",    "Observer",   NA,                NA,
  3,  "Derrick",    "Observer",   NA,                NA,
  3,     "Kara", "Scorekeeper",    9,                 10,
  3,    "Emily",    "Observer",   NA,                NA,
  3, "Danielle",    "Observer",   NA,                NA
)
squirrels$id <- 1:12

# copy the sample data to connection
if (dbExistsTable(conn, "squirrels")) { dbRemoveTable(conn, "squirrels") }
tbl.squirrels <- copy_to(conn, squirrels, "squirrels")

# calling fill() works
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
tbl.fill <- tbl.squirrels |>
  window_order(id) |>
  tidyr::fill(
    n_squirrels,
    n_squirrels2,
  )

# getting the data returns an error
tbl.fill
#> Error: Query 20230315_025045_00031_rsvsy failed: line 5:28: mismatched input 'IGNORE'. Expecting: ',', <expression>

# the problem is that IGNORE NULLS should be outside the parentheses
tbl.fill |> show_query()
#> <SQL>
#> SELECT
#>   "group",
#>   "name",
#>   "role",
#>   LAST_VALUE("n_squirrels" IGNORE NULLS) OVER (ORDER BY "id") AS "n_squirrels",
#>   LAST_VALUE("n_squirrels2" IGNORE NULLS) OVER (ORDER BY "id") AS "n_squirrels2",
#>   "id"
#> FROM "squirrels"

Created on 2023-03-15 with reprex v2.0.2
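
To make the syntax difference concrete, here is a minimal base R sketch that builds both forms of the expression as plain strings. The helper name `last_value_ignore_nulls()` and its `placement` argument are hypothetical and exist only for illustration; this is not how RPresto or dbplyr generate SQL.

```r
# Hypothetical helper, not part of RPresto or dbplyr.
# Builds a LAST_VALUE(...) expression with IGNORE NULLS placed either
# outside the parentheses (the form Presto accepts, per the reprex) or
# inside them (the form that triggered the error).
# Note: the column name is wrapped in double quotes without escaping,
# so this is only suitable for simple identifiers.
last_value_ignore_nulls <- function(column, placement = c("outside", "inside")) {
  placement <- match.arg(placement)
  quoted <- paste0('"', column, '"')
  if (placement == "outside") {
    paste0("LAST_VALUE(", quoted, ") IGNORE NULLS")
  } else {
    paste0("LAST_VALUE(", quoted, " IGNORE NULLS)")
  }
}

presto_form <- last_value_ignore_nulls("n_squirrels", "outside")
failing_form <- last_value_ignore_nulls("n_squirrels", "inside")
print(presto_form)
print(failing_form)
```

The two strings differ only in where the closing parenthesis falls, which is why the Presto parser reports a mismatched input at IGNORE.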

I will add a PR to address first(), last(), and nth() translation for Presto and add unit tests. Thanks for flagging this!

I have a commit (jarodmeng@cc488fa) in my personal fork that's working now. I will wait until dbplyr 2.3.2 is officially released to create a PR on it.

library(RPresto)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

# creating a default connection
conn <- presto_default()

# prepare the sample data
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
squirrels <- tibble::tribble(
  ~group,    ~name,     ~role,     ~n_squirrels, ~ n_squirrels2,
  1,      "Sam",    "Observer",   NA,                 1,
  1,     "Mara", "Scorekeeper",    8,                NA,
  1,    "Jesse",    "Observer",   NA,                NA,
  1,      "Tom",    "Observer",   NA,                 4,
  2,     "Mike",    "Observer",   NA,                NA,
  2,  "Rachael",    "Observer",   NA,                 6,
  2,  "Sydekea", "Scorekeeper",   14,                NA,
  2, "Gabriela",    "Observer",   NA,                NA,
  3,  "Derrick",    "Observer",   NA,                NA,
  3,     "Kara", "Scorekeeper",    9,                 10,
  3,    "Emily",    "Observer",   NA,                NA,
  3, "Danielle",    "Observer",   NA,                NA
)
squirrels$id <- 1:12

# copy the sample data to connection
if (dbExistsTable(conn, "squirrels")) { dbRemoveTable(conn, "squirrels") }
tbl.squirrels <- copy_to(conn, squirrels, "squirrels")

# calling fill() works
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
tbl.fill <- tbl.squirrels |>
  window_order(id) |>
  tidyr::fill(
    n_squirrels,
    n_squirrels2,
  )

# getting the data now also works
tbl.fill
#> # Source:     SQL [?? x 6]
#> # Database:   PrestoConnection
#> # Ordered by: id
#>    group name     role        n_squirrels n_squirrels2    id
#>    <dbl> <chr>    <chr>             <dbl>        <dbl> <int>
#>  1     1 Sam      Observer             NA            1     1
#>  2     1 Mara     Scorekeeper           8            1     2
#>  3     1 Jesse    Observer              8            1     3
#>  4     1 Tom      Observer              8            4     4
#>  5     2 Mike     Observer              8            4     5
#>  6     2 Rachael  Observer              8            6     6
#>  7     2 Sydekea  Scorekeeper          14            6     7
#>  8     2 Gabriela Observer             14            6     8
#>  9     3 Derrick  Observer             14            6     9
#> 10     3 Kara     Scorekeeper           9           10    10
#> # … with more rows

# the query is correct
tbl.fill |> show_query()
#> <SQL>
#> SELECT
#>   "group",
#>   "name",
#>   "role",
#>   LAST_VALUE("n_squirrels") IGNORE NULLS OVER (ORDER BY "id" ROWS UNBOUNDED PRECEDING) AS "n_squirrels",
#>   LAST_VALUE("n_squirrels2") IGNORE NULLS OVER (ORDER BY "id" ROWS UNBOUNDED PRECEDING) AS "n_squirrels2",
#>   "id"
#> FROM "squirrels"

# debug info
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dbplyr_2.3.1.9000  dplyr_1.1.0        RPresto_1.4.4.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.8.1      compiler_4.2.1    highr_0.9         prettyunits_1.1.1
#>  [5] progress_1.2.2    R.methodsS3_1.8.2 R.utils_2.12.0    tools_4.2.1      
#>  [9] bit_4.0.5         digest_0.6.29     jsonlite_1.8.4    timechange_0.1.1 
#> [13] lubridate_1.9.0   evaluate_0.16     lifecycle_1.0.3   tibble_3.2.0     
#> [17] R.cache_0.16.0    pkgconfig_2.0.3   rlang_1.0.6       reprex_2.0.2     
#> [21] cli_3.6.0         DBI_1.1.3         rstudioapi_0.14   curl_4.3.3       
#> [25] yaml_2.3.5        xfun_0.33         fastmap_1.1.0     httr_1.4.4       
#> [29] withr_2.5.0       styler_1.7.0      stringr_1.5.0     knitr_1.40       
#> [33] hms_1.1.2         generics_0.1.3    fs_1.5.2          vctrs_0.5.2      
#> [37] bit64_4.0.5       tidyselect_1.2.0  glue_1.6.2        R6_2.5.1         
#> [41] fansi_1.0.4       rmarkdown_2.16    blob_1.2.3        tidyr_1.3.0      
#> [45] purrr_1.0.1       magrittr_2.0.3    ellipsis_0.3.2    htmltools_0.5.3  
#> [49] utf8_1.2.3        stringi_1.7.12    crayon_1.5.2      R.oo_1.25.0

Created on 2023-03-15 with reprex v2.0.2
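
As a sanity check on the result above, the following base R sketch mimics locally what LAST_VALUE(x) IGNORE NULLS over rows ordered by id with a ROWS UNBOUNDED PRECEDING frame computes: each row takes the most recent non-missing value seen so far. The function name `fill_down()` is hypothetical; it is an illustration of the semantics, not the RPresto or tidyr implementation.

```r
# Hypothetical helper illustrating the window semantics in plain R.
# For each position, return the last non-NA value at or before it;
# positions before the first non-NA value stay NA.
fill_down <- function(x) {
  last_seen <- NA
  out <- x
  for (i in seq_along(x)) {
    if (!is.na(x[i])) {
      last_seen <- x[i]
    }
    out[i] <- last_seen
  }
  out
}

# Columns from the squirrels sample data, already ordered by id
n_squirrels  <- c(NA, 8, NA, NA, NA, NA, 14, NA, NA, 9, NA, NA)
n_squirrels2 <- c(1, NA, NA, 4, NA, 6, NA, NA, NA, 10, NA, NA)

filled  <- fill_down(n_squirrels)
filled2 <- fill_down(n_squirrels2)
print(filled)
print(filled2)
```

The first ten values of each filled vector should agree with the n_squirrels and n_squirrels2 columns printed by the database query above. Note that, as in the query, no grouping is applied, so values carry across group boundaries.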

For folks who want to use this hotfix before dbplyr 2.3.2 and RPresto 1.4.5 are officially released, you can install the development versions.

devtools::install_github("tidyverse/dbplyr")
devtools::install_github("jarodmeng/RPresto", ref = "fix_fill")

This issue is fixed in RPresto 1.4.6.

library(RPresto)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

# creating a default connection
conn <- presto_default()

# prepare the sample data
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
squirrels <- tibble::tribble(
  ~group,    ~name,     ~role,     ~n_squirrels, ~ n_squirrels2,
  1,      "Sam",    "Observer",   NA,                 1,
  1,     "Mara", "Scorekeeper",    8,                NA,
  1,    "Jesse",    "Observer",   NA,                NA,
  1,      "Tom",    "Observer",   NA,                 4,
  2,     "Mike",    "Observer",   NA,                NA,
  2,  "Rachael",    "Observer",   NA,                 6,
  2,  "Sydekea", "Scorekeeper",   14,                NA,
  2, "Gabriela",    "Observer",   NA,                NA,
  3,  "Derrick",    "Observer",   NA,                NA,
  3,     "Kara", "Scorekeeper",    9,                 10,
  3,    "Emily",    "Observer",   NA,                NA,
  3, "Danielle",    "Observer",   NA,                NA
)
squirrels$id <- 1:12

# copy the sample data to connection
if (dbExistsTable(conn, "squirrels")) { dbRemoveTable(conn, "squirrels") }
tbl.squirrels <- copy_to(conn, squirrels, "squirrels")

# calling fill() works
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
tbl.fill <- tbl.squirrels |>
  window_order(id) |>
  tidyr::fill(
    n_squirrels,
    n_squirrels2,
  )

# getting the data now also works
tbl.fill
#> # Source:     SQL [?? x 6]
#> # Database:   PrestoConnection
#> # Ordered by: id
#>    group name     role        n_squirrels n_squirrels2    id
#>    <dbl> <chr>    <chr>             <dbl>        <dbl> <int>
#>  1     1 Sam      Observer             NA            1     1
#>  2     1 Mara     Scorekeeper           8            1     2
#>  3     1 Jesse    Observer              8            1     3
#>  4     1 Tom      Observer              8            4     4
#>  5     2 Mike     Observer              8            4     5
#>  6     2 Rachael  Observer              8            6     6
#>  7     2 Sydekea  Scorekeeper          14            6     7
#>  8     2 Gabriela Observer             14            6     8
#>  9     3 Derrick  Observer             14            6     9
#> 10     3 Kara     Scorekeeper           9           10    10
#> # ℹ more rows

# the query is correct
tbl.fill |> show_query()
#> <SQL>
#> SELECT
#>   "group",
#>   "name",
#>   "role",
#>   LAST_VALUE("n_squirrels") IGNORE NULLS OVER (ORDER BY "id" ROWS UNBOUNDED PRECEDING) AS "n_squirrels",
#>   LAST_VALUE("n_squirrels2") IGNORE NULLS OVER (ORDER BY "id" ROWS UNBOUNDED PRECEDING) AS "n_squirrels2",
#>   "id"
#> FROM "squirrels"

# debug info
sessionInfo()
#> R Under development (unstable) (2023-05-02 r84382)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS 14.2.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Asia/Singapore
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dbplyr_2.4.0  dplyr_1.1.2   RPresto_1.4.6
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5         jsonlite_1.8.4    compiler_4.4.0    crayon_1.5.2     
#>  [5] reprex_2.0.2      tidyselect_1.2.0  blob_1.2.4        tidyr_1.3.0      
#>  [9] progress_1.2.2    yaml_2.3.7        fastmap_1.1.1     R6_2.5.1         
#> [13] generics_0.1.3    curl_5.0.0        knitr_1.43        tibble_3.2.1     
#> [17] lubridate_1.9.2   DBI_1.1.3         pillar_1.9.0      rlang_1.1.1      
#> [21] utf8_1.2.3        xfun_0.39         fs_1.6.2          bit64_4.0.5      
#> [25] timechange_0.2.0  cli_3.6.1         withr_2.5.0       magrittr_2.0.3   
#> [29] digest_0.6.31     rstudioapi_0.14   hms_1.1.3         lifecycle_1.0.3  
#> [33] prettyunits_1.1.1 vctrs_0.6.5       evaluate_0.20     glue_1.6.2       
#> [37] fansi_1.0.4       rmarkdown_2.21    purrr_1.0.1       httr_1.4.5       
#> [41] tools_4.4.0       pkgconfig_2.0.3   htmltools_0.5.5

Created on 2024-01-26 with reprex v2.0.2
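
If you are unsure whether your installation already includes the fix, a small version comparison along these lines may help. The helper `fix_included()` is hypothetical; the 1.4.6 threshold comes from the comment above, and the comparison relies only on base R's `package_version()`.

```r
# Hypothetical helper: compare a version string against the release
# that this thread reports as containing the fix (RPresto 1.4.6).
fix_included <- function(installed, fixed_in = "1.4.6") {
  package_version(installed) >= package_version(fixed_in)
}

# Versions taken from the sessionInfo() outputs in this thread
old_dev  <- fix_included("1.4.4.9000")
released <- fix_included("1.4.6")
print(old_dev)
print(released)

# With RPresto installed, the same check could be run as:
# fix_included(as.character(utils::packageVersion("RPresto")))
```

Keep in mind that a development build, such as the 1.4.4.9000 one used earlier in the thread, can contain the fix despite carrying a lower version number, so this check only speaks to released versions.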