prestodb / RPresto

DBI-based adapter for Presto for the statistical programming language R.

dplyr translation for approx_percentile()

copernican opened this issue

The default translation of the R function quantile() is incorrect for Presto:

> translate_sql(quantile(x, 0.9))
<SQL> PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY `x`) OVER ()

I realize that Presto does not implement exact quantiles, but it does provide approx_percentile(). This would enable a translation like

> translate_sql(quantile(x, 0.9))
<SQL> APPROX_PERCENTILE(`x`, 0.9)

This would also allow a translation of median() as a special case. There is an approach to this in dbplyr for Teradata, and it seems like it could be implemented with minimal effort for RPresto, possibly adding a once-per-session warning that the result is approximate.

Is this something we could consider doing? If so, I'd be happy to create a PR.

Thanks for identifying the issue. Actually, I'd much prefer disabling the quantile() translation instead. We want users to state their intent explicitly by calling approx_percentile() directly if they are OK with an approximation.

Okay, then I think there are two pieces of work.

  1. Implement an R function approx_percentile() that is translated to the matching Presto function. The benefit is that users can call it as they would quantile() without having to know about build_sql() and so on. So, something like mutate(presto_tbl, q = approx_percentile(x, 0.9)) where x is the unquoted name of a column in presto_tbl.

  2. Disable the quantile() default translation and throw an error that directs the user to the function described in (1).
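
For piece (2), here is a minimal sketch of what disabling the translation might look like, assuming the aggregate translations are built with dbplyr's sql_translator() on top of base_agg. The name presto_agg_sketch and the error text are hypothetical, not RPresto's actual code:

```r
library(dbplyr)

# Hypothetical aggregate translator: inherits dbplyr's defaults from
# base_agg, but replaces quantile() with a stub that fails loudly.
presto_agg_sketch <- sql_translator(
  .parent = base_agg,
  quantile = function(...) {
    stop(
      "quantile() is not supported on Presto; ",
      "call approx_percentile() explicitly if an approximation is acceptable.",
      call. = FALSE
    )
  }
)

# Calling the stub directly shows the message a user would see.
msg <- tryCatch(
  presto_agg_sketch$quantile("x", 0.9),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The stub takes `...` so that it fails with the same message however quantile() is called.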

How does that sound?

You don't need (1); unrecognized functions are passed through to the backend as-is:

(
  src_presto(...)
  %>% tbl(sql('select 1 as x'))
  %>% summarise(p=approx_percentile(x, 0.1))
  %>% show_query()
)
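
The same pass-through can be checked without a Presto connection. This is a sketch, assuming dbplyr's simulate_dbi() as a generic stand-in backend. It is not the RPresto translation itself, and the quoting and casing of the output may differ between dbplyr versions:

```r
library(dbplyr)

# simulate_dbi() is a generic stand-in connection, not a Presto one.
# approx_percentile() has no R-side definition or translation rule,
# so dbplyr emits the call unchanged and leaves it to the database.
out <- translate_sql(approx_percentile(x, 0.1), con = simulate_dbi())
print(out)
```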

I agree. Submitted #120.