Rcpp Modules
knapply opened this issue · comments
It would essentially follow the C++ API via a set of wrapper classes inheriting from simdjson::dom::parser and simdjson::dom::element (with helpful twists/additions like converting to R objects).
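To make the wrapper-class idea concrete, here is a minimal sketch of the inheritance pattern. It rests on some loud assumptions: `LibraryParser` is a hypothetical stand-in written for this example and is not simdjson's API, and `top_level_type()` and `parse_fitting()` are invented names. In a real binding the derived class would inherit from simdjson::dom::parser and be registered with Rcpp Modules (`RCPP_MODULE` and `Rcpp::class_`), which is left out here so the snippet compiles without Rcpp or simdjson.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a library parser class. This is NOT simdjson's
// API; it only mimics a parser that enforces a maximum document size.
class LibraryParser {
public:
    explicit LibraryParser(std::size_t max_capacity = 1024)
        : max_capacity_(max_capacity) {}
    virtual ~LibraryParser() = default;

    std::size_t max_capacity() const { return max_capacity_; }
    void set_max_capacity(std::size_t n) { max_capacity_ = n; }

    // Toy "parse": rejects oversized input, then reports the top-level type
    // based on the first non-whitespace character.
    std::string top_level_type(const std::string& json) const {
        if (json.size() > max_capacity_) {
            throw std::length_error("document exceeds max_capacity");
        }
        const std::size_t pos = json.find_first_not_of(" \t\r\n");
        if (pos == std::string::npos) {
            throw std::invalid_argument("empty document");
        }
        switch (json[pos]) {
            case '[': return "ARRAY";
            case '{': return "OBJECT";
            default:  return "SCALAR";
        }
    }

private:
    std::size_t max_capacity_;
};

// Wrapper class: inherits the library interface unchanged and adds one
// convenience method (a hypothetical "helpful twist").
class Parser : public LibraryParser {
public:
    explicit Parser(std::size_t max_capacity = 1024)
        : LibraryParser(max_capacity) {}

    // Grow the capacity to fit the input if needed, then parse.
    std::string parse_fitting(const std::string& json) {
        if (json.size() > max_capacity()) {
            set_max_capacity(json.size());
        }
        return top_level_type(json);
    }
};
```

With this sketch, `Parser p(10)` rejects a 15-character document through the inherited `top_level_type()`, while `p.parse_fitting()` first raises the capacity to 15 and then succeeds. The inherited methods stay available as-is, which is what would let the R-facing class follow the C++ API closely.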
- Pros
  - lower-level bindings enabling simpler experimentation and testing
  - provides a stop-gap for users needing that extra bit of customization to gain functionality not otherwise provided
  - It fills a gap in the R ecosystem. To my knowledge, R doesn't have anything like this for JSON data.
- Cons
  - something else to maintain
  - compilation time?
  - ?
Here's a prototype I've been using for exploration and experiments that has proven handy.
library(RcppSimdJson)
# integer codes passed through to the C++ side
simplify_to <- list(data_frame = 0, matrix = 1, vector = 2, list = 3)
type_policy <- list(anything_goes = 0, ints_as_dbls = 1, strict = 2)
int64_opt <- list(double = 0, string = 1, integer64 = 2)
toy_json <- c(
  '[{"id":12345789101112,"screen_name":"whoever","text":"#rstats4lyfe"},
  {"id":12345789101113,"screen_name":"whoever2","text":"#rstats5lyfe"}]'
)
parser <- new(Parser, max_capacity = 10)
parser
#> C++ object <0x5593401a9d80> of class 'Parser' <0x55933e8ee910>
parser$max_capacity()
#> [1] 10
# grow the capacity so the document fits
parser$set_capacity(nchar(toy_json))
parser$max_capacity()
#> [1] 143
element <- parser$parse(toy_json)
element
#> C++ object <0x55933d95b9e0> of class 'Element' <0x55933fe903d0>
element$type()
#> [1] "ARRAY"
element$is_array()
#> [1] TRUE
element$deserialize(simplify_to$data_frame, type_policy$anything_goes,
                    int64_opt$integer64)
#>               id screen_name         text
#> 1 12345789101112     whoever #rstats4lyfe
#> 2 12345789101113    whoever2 #rstats5lyfe
# load a document from a file instead of a string
file_path <- path.expand("~/Documents/rcppsimdjson/inst/jsonexamples/small/jsoniter_scala/twitter_api_compact_response.json")
parser2 <- new(Parser)
element2 <- parser2$load(file_path)
tweet_df <- element2$deserialize(simplify_to$data_frame, type_policy$anything_goes,
                                 int64_opt$integer64)
tibble::as_tibble(tweet_df)
#> # A tibble: 2 x 16
#> created_at id id_str text truncated entities source user retweeted_status is_quote_status retweet_count favorite_count favorited retweeted possibly_sensit… lang
#> <chr> <int64> <chr> <chr> <lgl> <list> <chr> <list> <list> <lgl> <int> <int> <lgl> <lgl> <lgl> <chr>
#> 1 Thu Apr 06 1… 850007… 8500073… "RT @TwitterDev: 1/ T… FALSE <named … "<a href=\"h… <name… <named list [15… FALSE 284 0 FALSE FALSE FALSE en
#> 2 Mon Apr 03 1… 848930… 8489305… "RT @TwitterMktg: Sta… FALSE <named … "<a href=\"h… <name… <named list [15… FALSE 111 0 FALSE FALSE FALSE en
I am all for experimenting. We are not aiming to replace jsonlite
for wide usage. But we won't mind kicking its a$#e in terms of speed :)
This isn't really a priority of mine at this point, so closing for now.