vroom vroom!

The fastest delimited reader for R, 631.84 MB/sec.

But that’s impossible! How can it be so fast?

vroom doesn’t stop to actually read all of your data; it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to load the data lazily, only when it is accessed, so you only pay for what you use.
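To make the indexing idea concrete, here is a minimal base R sketch. It is not vroom's implementation (which builds on Altrep); it only illustrates the "index first, parse later" approach. The file contents and the helper `read_record()` are made up for illustration.

```r
# Hypothetical demo data: a tiny tab-delimited file with Unix newlines.
path <- tempfile(fileext = ".tsv")
writeBin(charToRaw("model\tmpg\nMazda RX4\t21\nDatsun 710\t22.8\n"), path)

# Indexing pass: record only where each line starts and ends.
# No fields are split or converted at this point.
bytes <- readBin(path, what = "raw", n = file.size(path))
line_end <- which(bytes == as.raw(0x0a))
line_start <- c(1L, head(line_end, -1L) + 1L)

# Materialize a single record only when it is asked for.
read_record <- function(i) {
  # Take the bytes of line i without its trailing newline, then split on tabs.
  txt <- rawToChar(bytes[line_start[i]:(line_end[i] - 1L)])
  strsplit(txt, "\t", fixed = TRUE)[[1]]
}

read_record(3)
#> [1] "Datsun 710" "22.8"
```

Only the line that is requested gets split into fields; the other lines are never parsed, which is the sense in which you only pay for what you use.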

vroom also uses multiple threads, both for indexing and for materializing non-character vectors, which further improves performance.

However, it currently has no support for Windows newlines, quoted fields, comments, whitespace trimming, or other niceties, all of which slow down and complicate parsing.

package	time (sec)	speedup	throughput (MB/sec)
vroom	2.64	42.12	631.84
data.table	19.67	5.65	84.73
readr	25.82	4.30	64.55
read.delim	111.12	1.00	15.00

Installation

Install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("jimhester/vroom")

Example

vroom::vroom("mtcars.tsv")
#> # A tibble: 32 x 12
#>    model    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>  <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#>  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8 Merc …  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9 Merc …  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10 Merc …  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ... with 22 more rows
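
The example assumes a file named mtcars.tsv already exists. One way to create a compatible file is sketched below, on the assumption that the `model` column is simply the row names of R's built-in `mtcars` dataset. The file is written without quotes because quoted fields are not supported, and it goes to a temporary directory here to avoid cluttering the working directory.

```r
# Assumption: "model" holds the row names of the built-in mtcars data.
mt <- cbind(model = rownames(mtcars), mtcars)

# Write a tab-delimited file without quoting and without row names.
path <- file.path(tempdir(), "mtcars.tsv")
write.table(mt, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Sanity check with base R: 32 rows and 12 columns, as in the output above.
back <- read.delim(path, stringsAsFactors = FALSE)
dim(back)
#> [1] 32 12
```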

Benchmarks

The speeds quoted above were measured on a dataset with 14,776,615 rows and 11 columns; see the benchmark article for details.

Thanks

Gabe Becker, Luke Tierney and Tomas Kalibera, for implementing and maintaining the Altrep framework.
Romain François, whose Altrepisode package and related blog posts were a great guide for creating new Altrep objects in C++.
Matt Dowle and the rest of the Rdatatable team; data.table::fread() is blazing fast and great motivation!
