Add a `with_deduplication` helper to run a vectorized function after deduplicating the input.
orgadish opened this issue
I discovered that `fs::path_file()` and `fs::path_dir()` run very slowly on Windows (see `fs` issue 424). Most of my use of these functions comes after calling `readr::read_csv(files, id = "file_path")`, so most of the vector consists of duplicates. I found that I could save a significant amount of time by deduplicating the vector first (2x on Mac, 40x on Windows). This approach is helpful beyond the `fs::path_*()` functions: it applies to any vectorized function whose output for each element depends only on that element.
The most straightforward approach is:
    with_deduplication <- function(f) {
      function(x, ...) {
        ux <- unique(x)
        f(ux, ...)[match(x, ux)]
      }
    }
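As a rough usage sketch (not part of the original issue), the helper can be tried with base R's `basename()` standing in for `fs::path_file()`. The `paths` vector and the name `fast_basename` are made up for illustration. The second check illustrates an assumption the wrapper relies on: each output element must depend only on the matching input element, which is not true of functions such as `cumsum()`.

```r
# Helper as defined above, repeated so the example is self-contained.
with_deduplication <- function(f) {
  function(x, ...) {
    ux <- unique(x)
    f(ux, ...)[match(x, ux)]
  }
}

# Hypothetical input: a few distinct paths repeated many times, similar to
# an id column produced when reading several files into one data frame.
paths <- rep(c("data/a.csv", "data/b.csv", "other/c.csv"), times = 1000)

fast_basename <- with_deduplication(basename)

# The wrapped function returns the same result as the direct call.
stopifnot(identical(fast_basename(paths), basename(paths)))

# Caveat: for a function that is not element-wise, the wrapper changes
# the result. Here cumsum() only sees unique(x) = c(1, 2).
x <- c(1, 1, 2)
stopifnot(!identical(with_deduplication(cumsum)(x), cumsum(x)))
```

The wrapper calls `f()` on `unique(x)` only, so any saving grows with the share of duplicates in `x`; for a vector with no duplicates it only adds the cost of `unique()` and `match()`.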
I've also submitted a PR to `vctrs` to speed this up (see `vctrs` issue 1857 and PR 1858).
I'm not sure where this helper should live, but since it's an extension of functional programming, I think it would make sense for it to live in `purrr`.
IMO it's easier to solve this sort of problem with memoisation, i.e. with https://github.com/r-lib/memoise.
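As far as I can tell, `memoise` caches on the input to the function as a whole, not on the individual elements of that input, so a long vector with many repeated elements is still processed in full the first time it is seen. However, I do think it's a good idea to add this capability to `memoise` directly, and I will suggest the idea there. A benchmark comparing the three approaches: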
    TOTAL_N <- 1e6
    UNIQUE_N <- 10

    # Generate 5 * UNIQUE_N candidate strings so that at least UNIQUE_N
    # distinct values are (almost certainly) available.
    repeated_strs <- purrr::map_chr(
      seq_len(5 * UNIQUE_N),
      \(x) sample(LETTERS, 3) |> paste(collapse = "/")
    ) |>
      unique() |>
      head(UNIQUE_N) |>           # Keep UNIQUE_N unique items.
      rep(TOTAL_N / UNIQUE_N) |>  # Create TOTAL_N total items.
      sample()                    # Shuffle order.

    with_dedup <- function(f) {
      function(x) {
        ux <- unique(x)
        f(ux)[data.table::chmatch(x, ux)]
      }
    }

    bench::mark(
      direct = stringr::str_to_lower(repeated_strs),
      dedup = with_dedup(stringr::str_to_lower)(repeated_strs),
      memo = memoise::memoise(stringr::str_to_lower)(repeated_strs),
      iterations = 100
    )
    #> # A tibble: 3 × 6
    #>   expression      min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 direct       53.7ms     63ms      16.0    3.84MB     2.18
    #> 2 dedup        12.4ms   15.4ms      64.7   13.34MB    36.4
    #> 3 memo           84ms   99.5ms      10.0   10.37MB     4.93
Created on 2023-07-26 with reprex v2.0.2
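The difference between caching on the whole input and caching per element, as discussed above, can be sketched in base R. This is only an illustration of the idea: `memoise_elementwise()` is a hypothetical name, not a function in `memoise`, and the sketch assumes an atomic input vector and an element-wise `f`. The counter `n_processed` exists only to show how many elements actually reach the wrapped function.

```r
# Hypothetical helper (not a memoise function): remembers results per
# element across calls, so only unseen values are passed to f().
memoise_elementwise <- function(f) {
  keys <- NULL    # distinct inputs seen so far
  values <- NULL  # f(keys), in the same order as keys
  function(x) {
    ux <- unique(x)
    new_keys <- ux[!ux %in% keys]
    if (length(new_keys) > 0) {
      # One vectorized call covers all values not seen before.
      new_values <- f(new_keys)
      keys <<- c(keys, new_keys)
      values <<- c(values, new_values)
    }
    values[match(x, keys)]
  }
}

# Instrumented stand-in for a slow vectorized function.
n_processed <- 0
slow_lower <- function(x) {
  n_processed <<- n_processed + length(x)
  tolower(x)
}

lower_cached <- memoise_elementwise(slow_lower)

x1 <- c("A/B", "C/D", "A/B")
x2 <- c("C/D", "E/F", "E/F")

stopifnot(identical(lower_cached(x1), tolower(x1)))  # computes "A/B", "C/D"
stopifnot(identical(lower_cached(x2), tolower(x2)))  # computes only "E/F"

# Six elements were requested in total, but only three were computed.
stopifnot(n_processed == 3)
```

Unlike the stateless deduplicating wrapper, this version keeps its cache for the lifetime of the returned function, so memory use grows with the number of distinct inputs seen.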