fix translation of `fill()`
copernican opened this issue
`fill()` doesn't work with Presto and needs a translation; the problem is described at tidyverse/dbplyr#1026. I submitted tidyverse/dbplyr#1065 and was asked to submit it here instead. Is it okay if I open a PR for this?
Sure. I can take a look at the PR once you're ready. Thanks!
`fill()` now uses the newly gained `na_rm` argument of `last()`, so `last_value_sql()` isn't needed anymore.
You can look at https://github.com/tidyverse/dbplyr/blob/main/R/backend-.R#L386 and https://github.com/tidyverse/dbplyr/blob/main/R/translate-sql-window.R#L255 for the implementation in dbplyr.
@jarodmeng in light of this change in dbplyr, do you prefer a PR that addresses only `fill()`, i.e., adds a translation for `last()`, or a PR similar to the `sql_nth()` implementation in dbplyr that can also handle `first()` and `nth()`?
I generated a simple reprex of the problem. Although Presto supports `IGNORE NULLS`, its syntax is different: it expects `LAST_VALUE("x") IGNORE NULLS` rather than `LAST_VALUE("x" IGNORE NULLS)`.
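To make the two placements concrete, here is a minimal string-building sketch. `render_last_value()` is a hypothetical helper written only for illustration; it is not part of dbplyr or RPresto, and its quoting is naive (no escaping).

```r
# Hypothetical helper for illustration; not part of dbplyr or RPresto.
# Renders LAST_VALUE with IGNORE NULLS either inside the parentheses
# (the form Presto rejects) or after them (the form Presto accepts).
render_last_value <- function(column, nulls_inside = TRUE) {
  quoted <- paste0('"', column, '"') # naive quoting, no escaping
  if (nulls_inside) {
    paste0("LAST_VALUE(", quoted, " IGNORE NULLS)")
  } else {
    paste0("LAST_VALUE(", quoted, ") IGNORE NULLS")
  }
}

render_last_value("x", nulls_inside = TRUE)  # rejected by Presto
render_last_value("x", nulls_inside = FALSE) # accepted by Presto
```

The only difference between the two strings is where the closing parenthesis sits relative to `IGNORE NULLS`.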
library(RPresto)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
# creating a default connection
conn <- presto_default()
# prepare the sample data
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
squirrels <- tibble::tribble(
~group, ~name, ~role, ~n_squirrels, ~ n_squirrels2,
1, "Sam", "Observer", NA, 1,
1, "Mara", "Scorekeeper", 8, NA,
1, "Jesse", "Observer", NA, NA,
1, "Tom", "Observer", NA, 4,
2, "Mike", "Observer", NA, NA,
2, "Rachael", "Observer", NA, 6,
2, "Sydekea", "Scorekeeper", 14, NA,
2, "Gabriela", "Observer", NA, NA,
3, "Derrick", "Observer", NA, NA,
3, "Kara", "Scorekeeper", 9, 10,
3, "Emily", "Observer", NA, NA,
3, "Danielle", "Observer", NA, NA
)
squirrels$id <- 1:12
# copy the sample data to connection
if (dbExistsTable(conn, "squirrels")) { dbRemoveTable(conn, "squirrels") }
tbl.squirrels <- copy_to(conn, squirrels, "squirrels")
# calling fill() works
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
tbl.fill <- tbl.squirrels |>
window_order(id) |>
tidyr::fill(
n_squirrels,
n_squirrels2,
)
# getting the data returns an error
tbl.fill
#> Error: Query 20230315_025045_00031_rsvsy failed: line 5:28: mismatched input 'IGNORE'. Expecting: ',', <expression>
# the problem is that IGNORE NULLS should be outside the parentheses
tbl.fill |> show_query()
#> <SQL>
#> SELECT
#> "group",
#> "name",
#> "role",
#> LAST_VALUE("n_squirrels" IGNORE NULLS) OVER (ORDER BY "id") AS "n_squirrels",
#> LAST_VALUE("n_squirrels2" IGNORE NULLS) OVER (ORDER BY "id") AS "n_squirrels2",
#> "id"
#> FROM "squirrels"
Created on 2023-03-15 with reprex v2.0.2
> @jarodmeng in light of this change in dbplyr, do you prefer a PR that addresses only `fill()`, i.e., adds a translation for `last()`, or a PR similar to the `sql_nth()` implementation in dbplyr that can also handle `first()` and `nth()`?
I will add a PR to address `first()`, `last()`, and `nth()` translation for Presto and add unit tests. Thanks for flagging this!
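For readers wondering what a combined translation might look like, below is a rough sketch that renders all three functions with the null-treatment clause placed after the call. `presto_nth_sketch()` is a hypothetical name; this is not the code in the RPresto fix or in dbplyr's `sql_nth()`. It also assumes Presto accepts `IGNORE NULLS` in the same position for `FIRST_VALUE` and `NTH_VALUE` as it does for `LAST_VALUE`, which this thread only confirms for `LAST_VALUE`.

```r
# Hypothetical sketch; NOT the actual RPresto or dbplyr implementation.
# Builds the function-call part of the SQL (without the OVER clause).
presto_nth_sketch <- function(column, n = NULL,
                              which = c("last", "first", "nth"),
                              na_rm = FALSE) {
  which <- match.arg(which)
  quoted <- paste0('"', column, '"') # naive quoting, no escaping
  fn_call <- switch(
    which,
    first = paste0("FIRST_VALUE(", quoted, ")"),
    last = paste0("LAST_VALUE(", quoted, ")"),
    nth = {
      # nth() needs a single positive position
      stopifnot(is.numeric(n), length(n) == 1, n >= 1)
      paste0("NTH_VALUE(", quoted, ", ", as.integer(n), ")")
    }
  )
  # Assumption: the null-treatment clause follows the closing parenthesis
  if (na_rm) paste(fn_call, "IGNORE NULLS") else fn_call
}

presto_nth_sketch("n_squirrels", which = "last", na_rm = TRUE)
presto_nth_sketch("n_squirrels", n = 2, which = "nth", na_rm = TRUE)
```

A real translation would also need to handle the window frame and ordering, which this sketch leaves out.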
I have a commit (jarodmeng@cc488fa) in my personal fork that's working now. I will wait until dbplyr 2.3.2 is officially released to create a PR on it.
library(RPresto)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
# creating a default connection
conn <- presto_default()
# prepare the sample data
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
squirrels <- tibble::tribble(
~group, ~name, ~role, ~n_squirrels, ~ n_squirrels2,
1, "Sam", "Observer", NA, 1,
1, "Mara", "Scorekeeper", 8, NA,
1, "Jesse", "Observer", NA, NA,
1, "Tom", "Observer", NA, 4,
2, "Mike", "Observer", NA, NA,
2, "Rachael", "Observer", NA, 6,
2, "Sydekea", "Scorekeeper", 14, NA,
2, "Gabriela", "Observer", NA, NA,
3, "Derrick", "Observer", NA, NA,
3, "Kara", "Scorekeeper", 9, 10,
3, "Emily", "Observer", NA, NA,
3, "Danielle", "Observer", NA, NA
)
squirrels$id <- 1:12
# copy the sample data to connection
if (dbExistsTable(conn, "squirrels")) { dbRemoveTable(conn, "squirrels") }
tbl.squirrels <- copy_to(conn, squirrels, "squirrels")
# calling fill() works
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
tbl.fill <- tbl.squirrels |>
window_order(id) |>
tidyr::fill(
n_squirrels,
n_squirrels2,
)
# getting the data now also works
tbl.fill
#> # Source: SQL [?? x 6]
#> # Database: PrestoConnection
#> # Ordered by: id
#> group name role n_squirrels n_squirrels2 id
#> <dbl> <chr> <chr> <dbl> <dbl> <int>
#> 1 1 Sam Observer NA 1 1
#> 2 1 Mara Scorekeeper 8 1 2
#> 3 1 Jesse Observer 8 1 3
#> 4 1 Tom Observer 8 4 4
#> 5 2 Mike Observer 8 4 5
#> 6 2 Rachael Observer 8 6 6
#> 7 2 Sydekea Scorekeeper 14 6 7
#> 8 2 Gabriela Observer 14 6 8
#> 9 3 Derrick Observer 14 6 9
#> 10 3 Kara Scorekeeper 9 10 10
#> # … with more rows
# the query is correct
tbl.fill |> show_query()
#> <SQL>
#> SELECT
#> "group",
#> "name",
#> "role",
#> LAST_VALUE("n_squirrels") IGNORE NULLS OVER (ORDER BY "id" ROWS UNBOUNDED PRECEDING) AS "n_squirrels",
#> LAST_VALUE("n_squirrels2") IGNORE NULLS OVER (ORDER BY "id" ROWS UNBOUNDED PRECEDING) AS "n_squirrels2",
#> "id"
#> FROM "squirrels"
# debug info
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dbplyr_2.3.1.9000 dplyr_1.1.0 RPresto_1.4.4.9000
#>
#> loaded via a namespace (and not attached):
#> [1] pillar_1.8.1 compiler_4.2.1 highr_0.9 prettyunits_1.1.1
#> [5] progress_1.2.2 R.methodsS3_1.8.2 R.utils_2.12.0 tools_4.2.1
#> [9] bit_4.0.5 digest_0.6.29 jsonlite_1.8.4 timechange_0.1.1
#> [13] lubridate_1.9.0 evaluate_0.16 lifecycle_1.0.3 tibble_3.2.0
#> [17] R.cache_0.16.0 pkgconfig_2.0.3 rlang_1.0.6 reprex_2.0.2
#> [21] cli_3.6.0 DBI_1.1.3 rstudioapi_0.14 curl_4.3.3
#> [25] yaml_2.3.5 xfun_0.33 fastmap_1.1.0 httr_1.4.4
#> [29] withr_2.5.0 styler_1.7.0 stringr_1.5.0 knitr_1.40
#> [33] hms_1.1.2 generics_0.1.3 fs_1.5.2 vctrs_0.5.2
#> [37] bit64_4.0.5 tidyselect_1.2.0 glue_1.6.2 R6_2.5.1
#> [41] fansi_1.0.4 rmarkdown_2.16 blob_1.2.3 tidyr_1.3.0
#> [45] purrr_1.0.1 magrittr_2.0.3 ellipsis_0.3.2 htmltools_0.5.3
#> [49] utf8_1.2.3 stringi_1.7.12 crayon_1.5.2 R.oo_1.25.0
Created on 2023-03-15 with reprex v2.0.2
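As a sanity check on the output above, the following base R sketch mimics what `LAST_VALUE(...) IGNORE NULLS` with a `ROWS UNBOUNDED PRECEDING` frame computes: for each row, the most recent non-missing value seen so far in `id` order. `fill_down_sketch()` is a hypothetical helper for illustration, not what tidyr or dbplyr run internally.

```r
# Hypothetical helper: carries the last non-NA value forward,
# leaving leading NAs untouched (nothing has been seen yet).
fill_down_sketch <- function(x) {
  last_seen <- NA
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      x[i] <- last_seen
    } else {
      last_seen <- x[i]
    }
  }
  x
}

# The two columns from the squirrels table, already in id order
n_squirrels <- c(NA, 8, NA, NA, NA, NA, 14, NA, NA, 9, NA, NA)
n_squirrels2 <- c(1, NA, NA, 4, NA, 6, NA, NA, NA, 10, NA, NA)

fill_down_sketch(n_squirrels)
fill_down_sketch(n_squirrels2)
```

Because the reprex does not group by `group`, values carry across group boundaries, which is why row 5 (the first row of group 2) is filled with 8.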
For folks who want to use this hotfix before dbplyr 2.3.2 and RPresto 1.4.5 are officially released, you can install the development versions.
devtools::install_github("tidyverse/dbplyr")
devtools::install_github("jarodmeng/RPresto", ref = "fix_fill")
This issue is fixed in RPresto 1.4.6.
library(RPresto)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
# creating a default connection
conn <- presto_default()
# prepare the sample data
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
squirrels <- tibble::tribble(
~group, ~name, ~role, ~n_squirrels, ~ n_squirrels2,
1, "Sam", "Observer", NA, 1,
1, "Mara", "Scorekeeper", 8, NA,
1, "Jesse", "Observer", NA, NA,
1, "Tom", "Observer", NA, 4,
2, "Mike", "Observer", NA, NA,
2, "Rachael", "Observer", NA, 6,
2, "Sydekea", "Scorekeeper", 14, NA,
2, "Gabriela", "Observer", NA, NA,
3, "Derrick", "Observer", NA, NA,
3, "Kara", "Scorekeeper", 9, 10,
3, "Emily", "Observer", NA, NA,
3, "Danielle", "Observer", NA, NA
)
squirrels$id <- 1:12
# copy the sample data to connection
if (dbExistsTable(conn, "squirrels")) { dbRemoveTable(conn, "squirrels") }
tbl.squirrels <- copy_to(conn, squirrels, "squirrels")
# calling fill() works
# see https://dbplyr.tidyverse.org/reference/fill.tbl_lazy.html
tbl.fill <- tbl.squirrels |>
window_order(id) |>
tidyr::fill(
n_squirrels,
n_squirrels2,
)
# getting the data now also works
tbl.fill
#> # Source: SQL [?? x 6]
#> # Database: PrestoConnection
#> # Ordered by: id
#> group name role n_squirrels n_squirrels2 id
#> <dbl> <chr> <chr> <dbl> <dbl> <int>
#> 1 1 Sam Observer NA 1 1
#> 2 1 Mara Scorekeeper 8 1 2
#> 3 1 Jesse Observer 8 1 3
#> 4 1 Tom Observer 8 4 4
#> 5 2 Mike Observer 8 4 5
#> 6 2 Rachael Observer 8 6 6
#> 7 2 Sydekea Scorekeeper 14 6 7
#> 8 2 Gabriela Observer 14 6 8
#> 9 3 Derrick Observer 14 6 9
#> 10 3 Kara Scorekeeper 9 10 10
#> # ℹ more rows
# the query is correct
tbl.fill |> show_query()
#> <SQL>
#> SELECT
#> "group",
#> "name",
#> "role",
#> LAST_VALUE("n_squirrels") IGNORE NULLS OVER (ORDER BY "id" ROWS UNBOUNDED PRECEDING) AS "n_squirrels",
#> LAST_VALUE("n_squirrels2") IGNORE NULLS OVER (ORDER BY "id" ROWS UNBOUNDED PRECEDING) AS "n_squirrels2",
#> "id"
#> FROM "squirrels"
# debug info
sessionInfo()
#> R Under development (unstable) (2023-05-02 r84382)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS 14.2.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Asia/Singapore
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dbplyr_2.4.0 dplyr_1.1.2 RPresto_1.4.6
#>
#> loaded via a namespace (and not attached):
#> [1] bit_4.0.5 jsonlite_1.8.4 compiler_4.4.0 crayon_1.5.2
#> [5] reprex_2.0.2 tidyselect_1.2.0 blob_1.2.4 tidyr_1.3.0
#> [9] progress_1.2.2 yaml_2.3.7 fastmap_1.1.1 R6_2.5.1
#> [13] generics_0.1.3 curl_5.0.0 knitr_1.43 tibble_3.2.1
#> [17] lubridate_1.9.2 DBI_1.1.3 pillar_1.9.0 rlang_1.1.1
#> [21] utf8_1.2.3 xfun_0.39 fs_1.6.2 bit64_4.0.5
#> [25] timechange_0.2.0 cli_3.6.1 withr_2.5.0 magrittr_2.0.3
#> [29] digest_0.6.31 rstudioapi_0.14 hms_1.1.3 lifecycle_1.0.3
#> [33] prettyunits_1.1.1 vctrs_0.6.5 evaluate_0.20 glue_1.6.2
#> [37] fansi_1.0.4 rmarkdown_2.21 purrr_1.0.1 httr_1.4.5
#> [41] tools_4.4.0 pkgconfig_2.0.3 htmltools_0.5.5
Created on 2024-01-26 with reprex v2.0.2
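If you are unsure whether your installation already includes the fix, a quick version check along these lines may help. `has_min_version()` is a hypothetical helper; the RPresto threshold comes from the comment above, while the dbplyr threshold of 2.3.2 is an assumption based on the release this thread was waiting for.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or later,
# FALSE if it is older or not installed at all.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

has_min_version("RPresto", "1.4.6")
has_min_version("dbplyr", "2.3.2") # assumed minimum, see note above
```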