dplyr translation for approx_percentile()
copernican opened this issue
The default translation of the R function quantile() is incorrect:
> translate_sql(quantile(x, 0.9))
<SQL> PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY `x`) OVER ()
I realize that Presto does not implement exact quantiles, but it does provide approx_percentile(). This would enable a translation like
> translate_sql(quantile(x, 0.9))
<SQL> APPROX_PERCENTILE(`x`, 0.9)
This would also allow a translation of median() as a special case. There is an approach to this in dbplyr for Teradata, and it seems like it could be implemented with minimal effort for RPresto, possibly adding a once-per-session warning that the result is approximate.
Is this something we could consider doing? If so, I'd be happy to create a PR.
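For illustration, the once-per-session warning could be sketched in base R roughly as follows. The names .approx_state and warn_approx_once() are hypothetical, not part of RPresto or dbplyr; the idea is just to keep a flag in a private environment and to warn only the first time.

```r
# Hypothetical helper, not an existing RPresto or dbplyr function.
# A private environment holds the "already warned" flag for this R session.
.approx_state <- new.env(parent = emptyenv())

warn_approx_once <- function() {
  if (!isTRUE(.approx_state$warned)) {
    # Set the flag before signalling, so the warning fires only once even if
    # a handler (or options(warn = 2)) interrupts this call.
    .approx_state$warned <- TRUE
    warning(
      "quantile() is translated to APPROX_PERCENTILE(); the result is approximate.",
      call. = FALSE
    )
  }
  invisible(NULL)
}
```

A translation function for quantile() could call warn_approx_once() before building its SQL, so the first use in a session warns and later uses stay quiet.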
Thanks for identifying the issue. Actually, I'd much prefer to disable the quantile translation instead. We'd want users to state their intent explicitly by calling the approx_percentile function directly if they are OK with an approximation.
Okay, then I think there are maybe two pieces of work.
1. Implement an R function approx_percentile() that is translated to the matching Presto function. The benefit is that users can call it as they would quantile() without having to know about build_sql() and so on. So, something like mutate(presto_tbl, q = approx_percentile(x, 0.9)) where x is the unquoted name of a column in presto_tbl.
2. Disable the quantile() default translation and throw an error that directs the user to the function described in (1).
How does that sound?
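As a rough sketch of the second item, assuming the override is registered through dbplyr::sql_translator() on top of dbplyr::base_agg: the name quantile_not_supported() and the error wording are made up for illustration, and how the resulting translator gets wired into RPresto's translation environment is left open here.

```r
# Hypothetical replacement for the default quantile() translation.
# It builds no SQL and instead fails with a pointer to approx_percentile().
quantile_not_supported <- function(...) {
  stop(
    "quantile() is not translated for Presto because Presto has no exact ",
    "quantile function. Call approx_percentile() directly if an ",
    "approximation is acceptable.",
    call. = FALSE
  )
}

# Assumption: the override would sit on top of dbplyr's default aggregate
# translations. Guarded so the sketch still runs where dbplyr is not installed.
if (requireNamespace("dbplyr", quietly = TRUE)) {
  agg_translations <- dbplyr::sql_translator(
    .parent = dbplyr::base_agg,
    quantile = quantile_not_supported
  )
}
```

With something like this in place, translating quantile(x, 0.9) would stop with the message above instead of emitting PERCENTILE_CONT.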
You don't need (1); unrecognized functions get passed directly to the backend:
src_presto(...) %>%
  tbl(sql('select 1 as x')) %>%
  summarise(p = approx_percentile(x, 0.1)) %>%
  show_query()
I agree. Submitted #120.