danielarantes / vroom

An experiment with lazily reading indexed files

Home Page: http://jimhester.github.io/vroom

vroom vroom!

Lifecycle: experimental

The fastest delimited-file reader for R, at 631.84 MB/sec.

But that’s impossible! How can it be so fast?

vroom doesn’t stop to read all of your data; it simply indexes where each record is located so the data can be read later. The vectors it returns use R’s ALTREP framework to load the data lazily, on demand, when it is accessed, so you only pay for what you use.
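To make the indexing idea concrete, here is a minimal base R sketch. It is not vroom's implementation and does not use ALTREP; `index_lines()` and `read_record()` are hypothetical helpers invented for illustration, and the sketch assumes a tab-delimited file with Unix newlines and no quoted fields.

```r
# Hypothetical helpers for illustration only; not part of vroom's API.

# Record where each line starts and ends, without parsing any fields.
index_lines <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  newlines <- which(bytes == as.raw(0x0a))  # 1-based positions of "\n"
  starts <- c(0, newlines)                  # 0-based offset of each line start
  ends <- c(newlines - 1, length(bytes))    # 0-based offset just past each line's content
  keep <- starts < length(bytes)            # drop the empty slot after a trailing newline
  list(path = path, start = starts[keep], end = ends[keep])
}

# Read and split a single record, only when it is asked for.
read_record <- function(index, i) {
  con <- file(index$path, open = "rb")
  on.exit(close(con))
  seek(con, where = index$start[i])
  n <- index$end[i] - index$start[i]
  line <- rawToChar(readBin(con, what = "raw", n = n))
  strsplit(line, "\t", fixed = TRUE)[[1]]
}

# Usage: write a tiny file, index it, then fetch one record on demand.
path <- tempfile(fileext = ".tsv")
writeBin(charToRaw("model\tmpg\nMazda RX4\t21\nDatsun 710\t22.8\n"), path)
idx <- index_lines(path)
length(idx$start)     # number of lines indexed; no fields parsed yet
read_record(idx, 3)   # only this line is read and split
```

The index is cheap to build because it stores only offsets; the cost of turning bytes into values is deferred until a record is requested.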

vroom uses multiple threads for indexing and materializing non-character vectors, to further improve performance.

However, it currently has no support for Windows newlines, quoted fields, comments, whitespace trimming, or other niceties, all of which slow down and complicate parsing.

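| package    | time (sec) | speedup | throughput (MB/sec) |
|:-----------|-----------:|--------:|--------------------:|
| vroom      |       2.64 |   42.12 |              631.84 |
| data.table |      19.67 |    5.65 |               84.73 |
| readr      |      25.82 |    4.30 |               64.55 |
| read.delim |     111.12 |    1.00 |               15.00 |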
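As a rough check on how the columns relate, the sketch below recomputes them from the figures in the table. It assumes that speedup is measured relative to `read.delim` (whose speedup is 1.00) and that throughput is file size divided by time; the variable names are made up for this example.

```r
# Figures copied from the table above.
time_sec   <- c(vroom = 2.64, data.table = 19.67, readr = 25.82, read.delim = 111.12)
throughput <- c(vroom = 631.84, data.table = 84.73, readr = 64.55, read.delim = 15.00)

# Speedup relative to the slowest reader, read.delim (assumed baseline).
speedup <- time_sec[["read.delim"]] / time_sec
round(speedup, 2)

# File size implied by each row: time * throughput, in MB.
implied_size_mb <- time_sec * throughput
round(implied_size_mb)
```

The recomputed values agree with the table to within rounding of the published timings, and every row implies an input of roughly 1,667 MB.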

Installation

Install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("jimhester/vroom")

Example

vroom::vroom("mtcars.tsv")
#> # A tibble: 32 x 12
#>    model    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>  <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#>  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8 Merc …  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9 Merc …  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10 Merc …  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ... with 22 more rows
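The call above assumes a tab-separated `mtcars.tsv` in the working directory. One way to produce such a file from R's built-in `mtcars` data set is sketched below; the `model` column name is taken from the printed output, and the binary connection is used so that the file gets Unix newlines and unquoted fields, matching the limitations listed earlier.

```r
# Turn the row names into a "model" column, as in the printed output.
cars <- cbind(model = rownames(mtcars), mtcars)

# Write through a binary connection so line endings are "\n" on every platform.
con <- file("mtcars.tsv", open = "wb")
write.table(cars, file = con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)
```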

Benchmarks

The speed quoted above is from a dataset with 14,776,615 rows and 11 columns; see the benchmark article for details.
