tidyverse / purrr

A functional programming toolkit for R

Home Page: https://purrr.tidyverse.org/

Add a `with_deduplication` helper to run a vectorized function after deduplicating the input.

orgadish opened this issue

I discovered that fs::path_file and fs::path_dir run very slowly on Windows (see fs issue 424). Most of my use of these functions comes after readr::read_csv(files, .id="file_path"), so most of the vector consists of duplicates. I found that I could save a significant amount of time by deduplicating the vector first (2x on Mac, 40x on Windows). This approach is useful for more than just the fs::path_ functions.

The most straightforward approach is:

with_deduplication <- function(f) {
  function(x, ...) {
    ux <- unique(x)
    f(ux, ...)[match(x, ux)]
  }
}
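As a quick illustration of the helper above, here is a minimal base-R sketch. `slow_basename` and the `elements_seen` counter are hypothetical stand-ins for a slow vectorized function such as fs::path_file; they are not part of fs or purrr. The sketch assumes `f` is element-wise, meaning each output element depends only on the matching input element.

```r
# Helper as proposed above: apply `f` to the unique values, then expand back.
with_deduplication <- function(f) {
  function(x, ...) {
    ux <- unique(x)
    f(ux, ...)[match(x, ux)]
  }
}

# Hypothetical stand-in for a slow vectorized function. It records how many
# elements it was asked to process.
elements_seen <- 0
slow_basename <- function(paths) {
  elements_seen <<- elements_seen + length(paths)
  basename(paths)
}

# Nine paths, only three of them distinct.
paths <- rep(c("data/a.csv", "data/b.csv", "archive/c.csv"), times = c(4, 3, 2))

direct <- slow_basename(paths)
seen_direct <- elements_seen   # all nine elements were processed

elements_seen <- 0
deduped <- with_deduplication(slow_basename)(paths)
seen_dedup <- elements_seen    # only the three unique elements were processed

# Same result either way.
identical(direct, deduped)
```

The wrapper only pays off when `f` costs more per element than `unique()` and `match()` do, and it would give wrong answers for functions whose output depends on the whole vector (for example `cumsum` or `rank`).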

I've also submitted a PR to vctrs to speed this up (see vctrs issue 1857 and PR 1858).

I'm not sure where this helper should live, but since it's an extension of functional programming, I think it would make sense for it to live in purrr.

IMO it's easier to solve this sort of problem with memoisation, i.e. with https://github.com/r-lib/memoise.

As far as I can tell, memoise caches on the whole input to the function, not on the individual elements of that input. However, I do think it would be a good idea to add this capability to memoise directly, and I will suggest it there:

TOTAL_N = 1e6
UNIQUE_N = 10

repeated_strs <- purrr::map_chr(1:(5 * UNIQUE_N),
                                \(x) sample(LETTERS, 3) |> paste(collapse="/")) |> 
  unique() |> 
  head(UNIQUE_N) |> # Ensure UNIQUE_N unique items.
  rep(TOTAL_N / UNIQUE_N) |> # Create TOTAL_N total items.
  sample()  # Shuffle order

with_dedup <- function(f) {
  function(x) {
    ux <- unique(x)
    f(ux)[data.table::chmatch(x, ux)]
  }
}

bench::mark(
  direct = stringr::str_to_lower(repeated_strs),
  dedup = with_dedup(stringr::str_to_lower)(repeated_strs),
  memo = memoise::memoise(stringr::str_to_lower)(repeated_strs),
  iterations = 100
)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       53.7ms     63ms      16.0    3.84MB     2.18
#> 2 dedup        12.4ms   15.4ms      64.7   13.34MB    36.4 
#> 3 memo           84ms   99.5ms      10.0   10.37MB     4.93

Created on 2023-07-26 with reprex v2.0.2
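To make the difference between whole-input caching and per-element deduplication concrete, here is a small sketch. It assumes the memoise package is installed; `counting_lower` and `elements_processed` are hypothetical names used only to count how much work the wrapped function does.

```r
library(memoise)  # assumption: the memoise package is installed

# Hypothetical vectorized function that counts the elements it processes.
elements_processed <- 0
counting_lower <- function(x) {
  elements_processed <<- elements_processed + length(x)
  tolower(x)
}

memo_lower <- memoise(counting_lower)

x <- c("A/B", "A/B", "C/D", "A/B")

first <- memo_lower(x)                 # cache miss: all 4 elements processed
after_first <- elements_processed

second <- memo_lower(x)                # cache hit: same whole vector as before
after_second <- elements_processed

third <- memo_lower(c("A/B", "C/D"))   # cache miss: a different vector, even
after_third <- elements_processed      # though every element was seen before

c(after_first, after_second, after_third)
```

The cache key is the full argument, so duplicates inside a single vector are still processed one by one. In the benchmark above the memoised function is created fresh and called once per iteration, so it never gets a cache hit, which is consistent with the memo row being no faster than the direct call.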