The fastest delimited reader for R, 631.84 MB/sec.
But that’s impossible! How can it be so fast?
vroom doesn’t stop to actually read all of your data; it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on demand when it is accessed, so you only pay for what you use.
vroom also uses multiple threads for indexing and for materializing non-character vectors, which further improves performance.
However, it currently has no support for Windows newlines, quoted fields, comments, whitespace trimming, or other niceties, all of which slow down and complicate parsing.
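As a rough illustration of the index-first idea described above, here is a minimal base R sketch. It is not vroom's implementation and does not use Altrep; the toy file and the `read_record()` helper are hypothetical, and the sketch assumes Unix newlines, tab delimiters, and no quoted fields.

```r
# Conceptual sketch only: index where each line starts, parse a line later.
path <- tempfile(fileext = ".tsv")
con <- file(path, "wb")  # binary mode so lines always end in "\n"
writeLines(c("model\tmpg", "Mazda RX4\t21", "Datsun 710\t22.8"), con)
close(con)

# Step 1: index. Load the raw bytes and record the position of each newline.
bytes <- readBin(path, what = "raw", n = file.size(path))
newlines <- which(bytes == as.raw(0x0a))
starts <- c(1L, head(newlines, -1L) + 1L)  # first byte of each line
ends <- newlines - 1L                       # last byte before each newline

# Step 2: materialize one record only when it is requested.
read_record <- function(i) {
  line <- rawToChar(bytes[starts[i]:ends[i]])
  strsplit(line, "\t", fixed = TRUE)[[1]]
}

read_record(3)
```

Only the line that is requested gets split into fields; the rest of the file stays as unparsed bytes, which is the "pay for what you use" trade-off in miniature.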
package | time (sec) | speedup | throughput (MB/sec) |
---|---|---|---|
vroom | 2.64 | 42.12 | 631.84 |
data.table | 19.67 | 5.65 | 84.73 |
readr | 25.82 | 4.30 | 64.55 |
read.delim | 111.12 | 1.00 | 15.00 |
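The speedup column appears to be relative to `read.delim`, which is listed at 1.00. Under that assumption, it can be recomputed from the throughput column with a few lines of R; the variable names here are illustrative only.

```r
# Throughput values in MB/sec, copied from the table above.
throughput <- c(vroom = 631.84, data.table = 84.73,
                readr = 64.55, read.delim = 15.00)

# Assumption: speedup is each reader's throughput divided by read.delim's.
speedup <- round(throughput / throughput[["read.delim"]], 2)
speedup
```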
Install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("jimhester/vroom")
vroom::vroom("mtcars.tsv")
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 Duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 Merc … 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 Merc … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Merc … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
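The `mtcars.tsv` file read above is not included here. One plausible way to produce a file with the same shape, assuming it is the built-in `mtcars` data with the row names moved into a `model` column, is sketched below. The file is written without quotes and with Unix newlines, and the call to `vroom::vroom()` is guarded so the sketch still runs when the package is not installed.

```r
# Assumption: mtcars.tsv is the built-in mtcars data plus a "model" column.
cars <- cbind(model = rownames(mtcars), mtcars)

path <- file.path(tempdir(), "mtcars.tsv")
con <- file(path, "wb")  # binary mode keeps "\n" line endings on every platform
write.table(cars, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Read the file back with vroom if it is available.
if (requireNamespace("vroom", quietly = TRUE)) {
  cars_tbl <- vroom::vroom(path)
  print(cars_tbl)
}
```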
The speed quoted above is from a dataset with 14,776,615 rows and 11 columns; see the benchmark article for details.
- Gabe Becker, Luke Tierney and Tomas Kalibera for implementing and maintaining the Altrep framework.
- Romain François, whose Altrepisode package and related blog posts were a great guide for creating new Altrep objects in C++.
- Matt Dowle and the rest of the Rdatatable team; data.table::fread() is blazing fast and great motivation!