eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Rcpp Modules

knapply opened this issue · comments

The idea is to expose Rcpp Modules that essentially follow the simdjson C++ API via a set of wrapper classes inheriting from simdjson::dom::parser and simdjson::dom::element (with helpful twists/additions like converting to R objects).

  • Pros
    • lower-level bindings enabling simpler experimentation and testing
    • provides a stop-gap for users needing that extra bit of customization to gain functionality not otherwise provided
    • fills a gap in the R ecosystem; to my knowledge, R doesn't have anything like this for JSON data
  • Cons
    • something else to maintain
    • compilation time?
    • ?

Here's a prototype I've been using for exploration and experiments that has proven handy.

library(RcppSimdJson)

simplify_to <- list(data_frame = 0, matrix = 1, vector = 2, list = 3)
type_policy <- list(anyting_goes = 0, ints_as_dbls = 1, strict = 2)
int64_opt <- list(double = 0, string = 1, integer64 = 2)

toy_json <- c(
  '[{"id":12345789101112,"screen_name":"whoever","text":"#rstats4lyfe"},
    {"id":12345789101113,"screen_name":"whoever2","text":"#rstats5lyfe"}]'
)

parser <- new(Parser, max_capacity = 10)
parser
#> C++ object <0x5593401a9d80> of class 'Parser' <0x55933e8ee910>

parser$max_capacity()
#> [1] 10
parser$set_capacity(nchar(toy_json))
parser$max_capacity()
#> [1] 143

element <- parser$parse(toy_json)
element
#> C++ object <0x55933d95b9e0> of class 'Element' <0x55933fe903d0>

element$type()
#> [1] "ARRAY"
element$is_array()
#> [1] TRUE

element$deserialize(simplify_to$data_frame, type_policy$anyting_goes, 
                    int64_opt$integer64)
#>               id screen_name         text
#> 1 12345789101112     whoever #rstats4lyfe
#> 2 12345789101113    whoever2 #rstats5lyfe

file_path <- path.expand("~/Documents/rcppsimdjson/inst/jsonexamples/small/jsoniter_scala/twitter_api_compact_response.json")

parser2 <- new(Parser)
element2 <- parser2$load(file_path)

tweet_df <- element2$deserialize(simplify_to$data_frame, type_policy$anyting_goes, 
                                 int64_opt$integer64)
tibble::as_tibble(tweet_df)
#> # A tibble: 2 x 16
#>   created_at    id      id_str   text                   truncated entities source        user   retweeted_status is_quote_status retweet_count favorite_count favorited retweeted possibly_sensit… lang 
#>   <chr>         <int64> <chr>    <chr>                  <lgl>     <list>   <chr>         <list> <list>           <lgl>                   <int>          <int> <lgl>     <lgl>     <lgl>            <chr>
#> 1 Thu Apr 06 1… 850007… 8500073… "RT @TwitterDev: 1/ T… FALSE     <named … "<a href=\"h… <name… <named list [15… FALSE                     284              0 FALSE     FALSE     FALSE            en   
#> 2 Mon Apr 03 1… 848930… 8489305… "RT @TwitterMktg: Sta… FALSE     <named … "<a href=\"h… <name… <named list [15… FALSE                     111              0 FALSE     FALSE     FALSE            en
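
The integer option codes in the prototype are easy to mix up when passed positionally. One way to guard against that is a small name-to-code lookup, sketched here in base R only. `resolve_opt()` is a hypothetical helper, not part of RcppSimdJson or the prototype, and the two lookup tables are copied from the prototype so the snippet runs on its own.

```r
# Lookup tables copied from the prototype above.
simplify_to <- list(data_frame = 0, matrix = 1, vector = 2, list = 3)
int64_opt <- list(double = 0, string = 1, integer64 = 2)

# Hypothetical helper (an assumption, not an RcppSimdJson function):
# translate an option name into its integer code, failing loudly on
# anything that is not a known name.
resolve_opt <- function(choice, codes) {
  if (!is.character(choice) || length(choice) != 1L || is.na(choice)) {
    stop("`choice` must be a single non-NA string", call. = FALSE)
  }
  if (!choice %in% names(codes)) {
    stop(
      sprintf(
        "unknown option '%s'; expected one of: %s",
        choice, paste(names(codes), collapse = ", ")
      ),
      call. = FALSE
    )
  }
  # `[[` matches names exactly by default, so partial names are rejected above.
  codes[[choice]]
}

resolve_opt("data_frame", simplify_to)
resolve_opt("integer64", int64_opt)
```

With something like this, a misspelled option name stops with an error instead of silently evaluating to NULL, which is what `$` on a list returns for a name that does not exist.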

I am all for experimenting. We are not aiming to replace jsonlite for wide usage. But we won't mind kicking its a$#e in terms of speed :)

This isn't really a priority of mine at this point, so closing for now.