dplyr translation for approx_percentile()

Question

copernican opened this issue 4 years ago

The default dbplyr translation of the R function quantile() is incorrect for Presto:

> translate_sql(quantile(x, 0.9))
<SQL> PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY `x`) OVER ()

I realize that Presto does not implement exact quantiles, but it does provide approx_percentile(). This would enable a translation like

> translate_sql(quantile(x, 0.9))
<SQL> APPROX_PERCENTILE(`x`, 0.9)

This would also allow a translation of median() as a special case. There is an approach to this in dbplyr for Teradata, and it seems like it could be implemented with minimal effort for RPresto, possibly adding a once-per-session warning that the result is approximate.
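
To make the proposal concrete, here is a minimal sketch of what such a translation might look like. It assumes dbplyr 2.0 or later and uses its exported helpers (simulate_dbi(), sql_variant(), sql_translator(), sql_expr()). The connection class "PrestoSketch", the helper warn_approximate(), and the warning text are hypothetical names made up for this illustration; this is not RPresto's actual code.

```r
library(dbplyr)

# Hypothetical connection class, used only so translate_sql() has a backend
# to dispatch on. No database is contacted.
con <- simulate_dbi("PrestoSketch")

# Hypothetical helper: warn once per session that the result is approximate.
warn_approximate <- function() {
  rlang::warn(
    "quantile() is translated to APPROX_PERCENTILE(); results are approximate.",
    .frequency = "once",
    .frequency_id = "presto_sketch_approx_quantile"
  )
}

# Translation set for the sketch backend: only the aggregate versions of
# quantile() and median() are overridden, everything else is inherited.
sketch_translation <- function(con) {
  sql_variant(
    scalar = base_scalar,
    aggregate = sql_translator(
      .parent = base_agg,
      quantile = function(x, probs) {
        warn_approximate()
        sql_expr(APPROX_PERCENTILE(!!x, !!probs))
      },
      median = function(x) {
        warn_approximate()
        sql_expr(APPROX_PERCENTILE(!!x, 0.5))
      }
    ),
    window = base_win
  )
}

# Register the method for dbplyr's sql_translation() generic.
registerS3method("sql_translation", "PrestoSketch", sketch_translation,
                 envir = asNamespace("dbplyr"))

# window = FALSE selects the aggregate translation, as summarise() would.
quantile_sql <- translate_sql(quantile(x, 0.9), con = con, window = FALSE)
median_sql <- translate_sql(median(x), con = con, window = FALSE)
print(quantile_sql)
print(median_sql)
```

The window variant is left untouched here, so quantile() inside mutate() would still produce the default PERCENTILE_CONT SQL; a real implementation would need to decide how to handle that case.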

Is this something we could consider doing? If so, I'd be happy to create a PR.

Ismail Onur Filiz · Answer 1 · Thu Mar 26 2020 03:21:35 GMT+0800 (China Standard Time)

Thanks for identifying the issue. Actually, I'd much prefer disabling the quantile translation instead. We'd want users to state their intention explicitly by calling the approx_percentile function directly if they are OK with an approximation.

Sean Wilson · Answer 2 · Thu Mar 26 2020 03:41:12 GMT+0800 (China Standard Time)

Okay, then I think there are maybe two pieces of work.

1. Implement an R function approx_percentile() that is translated to the matching Presto function. The benefit is that users can call it as they would quantile() without having to know about build_sql() and so on. So, something like mutate(presto_tbl, q = approx_percentile(x, 0.9)) where x is the unquoted name of a column in presto_tbl.
2. Disable the quantile() default translation and throw an error that directs the user to the function described in (1).
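
The second piece could be sketched roughly as follows, assuming dbplyr 2.0 or later. The connection class "PrestoStrictSketch", the function quantile_disabled(), and the error wording are hypothetical and only illustrate the idea; the actual change may look different.

```r
library(dbplyr)

# Hypothetical connection class for illustration; no database is contacted.
con <- simulate_dbi("PrestoStrictSketch")

# Hypothetical replacement for the quantile() translation: always fail and
# point the user at approx_percentile().
quantile_disabled <- function(...) {
  stop(
    "quantile() is not translated for this backend; ",
    "call approx_percentile() directly if an approximation is acceptable.",
    call. = FALSE
  )
}

# Override quantile() in both the aggregate and the window translators.
strict_translation <- function(con) {
  sql_variant(
    scalar = base_scalar,
    aggregate = sql_translator(.parent = base_agg, quantile = quantile_disabled),
    window = sql_translator(.parent = base_win, quantile = quantile_disabled)
  )
}

# Register the method for dbplyr's sql_translation() generic.
registerS3method("sql_translation", "PrestoStrictSketch", strict_translation,
                 envir = asNamespace("dbplyr"))

# Translating quantile() now signals an error instead of emitting SQL.
result <- tryCatch(
  translate_sql(quantile(x, 0.9), con = con, window = FALSE),
  error = function(e) e
)
print(inherits(result, "error"))
```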

How does that sound?

Ismail Onur Filiz · Answer 3 · Fri Mar 27 2020 00:47:45 GMT+0800 (China Standard Time)

You don't need (1); unrecognized functions get passed directly to the backend:

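library(dplyr)
library(RPresto)

# approx_percentile() has no dplyr translation, so it is passed to Presto as-is.
src_presto(...) %>%
  tbl(sql("select 1 as x")) %>%
  summarise(p = approx_percentile(x, 0.1)) %>%
  show_query()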
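
The same pass-through behavior can be checked without a Presto connection. The sketch below assumes dbplyr 2.0 or later and uses its simulate_dbi() test connection in place of a real backend, so it only shows what dbplyr does with an unknown function name, not what Presto returns.

```r
library(dbplyr)

# simulate_dbi() creates a stand-in connection; no database is contacted.
con <- simulate_dbi()

# approx_percentile() is not a function dbplyr knows, so the call is
# emitted into the SQL unchanged rather than translated.
passthrough_sql <- translate_sql(approx_percentile(x, 0.1), con = con)
print(passthrough_sql)
```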

Sean Wilson · Answer 4 · Fri Mar 27 2020 05:16:09 GMT+0800 (China Standard Time)

I agree. Submitted #120.