Yuri-M-Dias / fsttable

An interface to fast on-disk data tables stored with the fst format

fsttable

License: AGPL v3 | Lifecycle: maturing

The R package fsttable aims to provide a fully functional data.table interface to on-disk fst files. The focus of the package is on keeping memory usage as low as possible without sacrificing the features of in-memory data.table operations.

Installation

You can install the latest package version with:

devtools::install_github("fstpackage/fsttable")

Example

First, we create an on-disk fst file containing a medium-sized dataset:

library(fsttable)

# write some sample data to disk
nr_of_rows <- 1e6
x <- data.table::data.table(X = 1:nr_of_rows, Y = LETTERS[1 + (1:nr_of_rows) %% 26])
fst::write_fst(x, "1.fst")
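Before opening the file, its dimensions and column names can be checked from the header alone. The following is a minimal sketch that uses the fst package directly rather than fsttable; the temporary file path is an assumption made to keep the snippet self-contained (the example above writes to "1.fst").

```r
library(fst)

# recreate the sample data in a temporary file (hypothetical path,
# chosen so the snippet does not depend on "1.fst" existing)
path <- tempfile(fileext = ".fst")
nr_of_rows <- 1e6
x <- data.frame(
  X = 1:nr_of_rows,
  Y = LETTERS[1 + (1:nr_of_rows) %% 26],
  stringsAsFactors = FALSE
)
write_fst(x, path)

# read only the file metadata: no column data is loaded
meta <- metadata_fst(path)
meta$nrOfRows     # number of rows stored in the file
meta$columnNames  # names of the stored columns
```

Reading the metadata is cheap regardless of file size, which makes it a convenient sanity check before working with a large file.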

Then we define our fst_table by using:

ft <- fst_table("1.fst")

This fst_table can be used like a regular data.table object. For example, we can print it:

ft
#> <fst file>
#> 1e+06 rows, 2 columns
#> 
#>               X     Y
#>           <int> <chr>
#> 1             1     B
#> 2             2     C
#> 3             3     D
#> 4             4     E
#> 5             5     F
#> --           --    --
#> 999996   999996     K
#> 999997   999997     L
#> 999998   999998     M
#> 999999   999999     N
#> 1000000 1000000     O

We can select columns:

ft[, .(Y)]
#> <fst file>
#> 1e+06 rows, 1 columns
#> 
#>             Y
#>         <chr>
#> 1           B
#> 2           C
#> 3           D
#> 4           E
#> 5           F
#> --         --
#> 999996      K
#> 999997      L
#> 999998      M
#> 999999      N
#> 1000000     O

We can select rows:

ft[1:4,]
#> <fst file>
#> 4 rows, 2 columns
#> 
#>       X     Y
#>   <int> <chr>
#> 1     1     B
#> 2     2     C
#> 3     3     D
#> 4     4     E

Or both at the same time:

ft[1:4, .(X)]
#> <fst file>
#> 4 rows, 1 columns
#> 
#>       X
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
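For comparison, these are the in-memory data.table equivalents of the calls above. This sketch uses only data.table and the same sample data, so the whole table is held in RAM, which is exactly what fsttable is meant to avoid; the variable names `cols`, `rows` and `both` are illustrative.

```r
library(data.table)

# same sample data as above, but kept entirely in memory
nr_of_rows <- 1e6
x <- data.table(X = 1:nr_of_rows, Y = LETTERS[1 + (1:nr_of_rows) %% 26])

cols <- x[, .(Y)]     # column selection
rows <- x[1:4, ]      # row selection
both <- x[1:4, .(X)]  # rows and columns at the same time
both
```

Because the syntax is the same, results from an fst_table can be cross-checked against an in-memory data.table on a small sample of the data.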

Memory

During the operations shown above, the actual data was never fully loaded from the file. That’s because of fsttable’s philosophy of keeping RAM usage as low as possible. Printing a few lines of a table doesn’t require knowledge of the remaining lines, so fsttable never actually loads them.

Even when you create a subset:

ft2 <- ft[1:4, .(X)]

No data is loaded into RAM. The new object still refers to the original fst file, so the data stays on disk:

# small size because actual data is still on disk
object.size(ft2)
#> 5808 bytes
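Lazy access like this is feasible because the fst format supports reading a subset of columns and a range of rows without scanning the rest of the file. The sketch below shows such a partial read with the fst package itself; it is not a description of fsttable's internals, and the temporary file path is an assumption made to keep the snippet self-contained.

```r
library(fst)

# recreate the sample data in a temporary file (hypothetical path)
path <- tempfile(fileext = ".fst")
nr_of_rows <- 1e6
x <- data.frame(
  X = 1:nr_of_rows,
  Y = LETTERS[1 + (1:nr_of_rows) %% 26],
  stringsAsFactors = FALSE
)
write_fst(x, path)

# read only column X, rows 1 to 4; the rest of the file stays on disk
part <- read_fst(path, columns = "X", from = 1, to = 4)
part

# the partial result takes far less memory than the full table
object.size(part) < object.size(x)
```

Only the requested slice ends up in memory, so the cost of the read scales with the size of the selection rather than the size of the file.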

License: GNU Affero General Public License v3.0

