tidyverse / dtplyr

Data table backend for dplyr

Home Page: https://dtplyr.tidyverse.org

Script runs with dplyr, but dtplyr shows it generates over 2^31 rows

vRadAdamBender opened this issue

We have a script that runs fine (but very slowly) using dplyr. When we convert it to dtplyr, it runs once successfully. However, the second time we try to run it, it reports that the join generates over 2^31 rows and fails.

We've also tried using between() instead of the three conditions in join_by() and get exactly the same result: success in dplyr, but only one run in dtplyr. After that it stops working for about 24 hours, then it will run once the following day before failing again with the same 2^31-row error.

This also only seems to happen after using the foreach/doParallel packages. Even after closing all sessions, logging out, etc., and coming back to the script that ran just a few minutes ago, it will fail.

Wondering whether there's a difference in the generated syntax that makes execution differ between dplyr and dtplyr, or whether something is being cached that causes the issue.

Happy to provide any additional details, or to meet with whoever wants to try diagnosing this.

====================================================

study_cap_demand_measures <-
  mrpa_dbo_vrad_transactions %>%
  arrange(rad_id) %>%
  left_join(
    ds_rad_schedule_summary,
    join_by(
      rad_id == rad_id,
      date_time_distributed_utc >= start_date_time_utc,
      date_time_distributed_utc < end_date_time_utc
    )
  )
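
For reference, a minimal self-contained sketch of the between() variant mentioned above. The tables and values here are made up for illustration (they are not the real mrpa/ds tables), and the bounds = "[)" argument, which reproduces the >= / < pair of conditions, is available in recent dplyr versions:

```r
library(dplyr)

# Illustrative stand-ins for the real tables (names and values invented)
transactions <- tibble(
  rad_id = c(1L, 1L, 2L),
  date_time_distributed_utc = as.POSIXct(
    c("2024-01-01 10:00", "2024-01-01 12:00", "2024-01-01 11:00"), tz = "UTC")
)
schedule <- tibble(
  rad_id = c(1L, 2L),
  start_date_time_utc = as.POSIXct(
    c("2024-01-01 09:00", "2024-01-01 09:00"), tz = "UTC"),
  end_date_time_utc = as.POSIXct(
    c("2024-01-01 11:00", "2024-01-01 12:00"), tz = "UTC")
)

# between() with half-open bounds: start <= date < end
result <- transactions %>%
  left_join(
    schedule,
    join_by(
      rad_id,
      between(date_time_distributed_utc,
              start_date_time_utc, end_date_time_utc,
              bounds = "[)")
    )
  )
```

Every transaction row is kept; the second row (12:00, outside its rad's window) gets NA schedule columns.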

join_by() isn't currently supported by dtplyr. It will be in the future (I'm not exactly sure of the timeline).

You can follow #409 if you want to track when it gets added, so I'm going to close this one. If you have any other questions, let me know!
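
Until join_by() support lands, one possible workaround is to perform the non-equi join directly in data.table and convert back afterwards. This is only a sketch under assumed data: the table names and the shift_id column are invented, and it is not how dtplyr will eventually translate join_by():

```r
library(data.table)

# Invented stand-ins for the real tables
transactions <- data.table(
  rad_id = c(1L, 1L, 2L),
  date_time_distributed_utc = as.POSIXct(
    c("2024-01-01 10:00", "2024-01-01 12:00", "2024-01-01 11:00"), tz = "UTC")
)
schedule <- data.table(
  rad_id = c(1L, 2L),
  shift_id = c("A", "B"),  # hypothetical payload column
  start_date_time_utc = as.POSIXct(
    c("2024-01-01 09:00", "2024-01-01 09:00"), tz = "UTC"),
  end_date_time_utc = as.POSIXct(
    c("2024-01-01 11:00", "2024-01-01 12:00"), tz = "UTC")
)

# Non-equi join keeping every transaction row (data.table's x[i] keeps
# all rows of i by default). Caveat: after a non-equi join, the x-side
# range columns in the result hold the i-side values.
result <- schedule[transactions,
  on = .(rad_id,
         start_date_time_utc <= date_time_distributed_utc,
         end_date_time_utc > date_time_distributed_utc)]
```

Unmatched transactions show up with NA in the schedule-only columns (here shift_id), mirroring the left_join behaviour in the original script.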