eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

More Flexible Queries

knapply opened this issue

We can do better than using a single JSON Pointer for query=.

json <- c(
    json1 = '{"a":[1,2,3],"b":[4,5,6]}',
    json2 = '{"c":[4,5,6],"d":[7,8,9]}'
)

If you know exactly what values you want to pull into R, you could do the following:

RcppSimdJson::fparse(
    json, 
    query = "a/1",
    error_ok = TRUE
)
#> $json1
#> [1] 2
#> 
#> $json2
#> NULL

RcppSimdJson::fparse(
    json, 
    query = "b/2",
    error_ok = TRUE
)
#> $json1
#> [1] 6
#> 
#> $json2
#> NULL

RcppSimdJson::fparse(
    json, 
    query = "c/1",
    error_ok = TRUE
)
#> $json1
#> NULL
#> 
#> $json2
#> [1] 5

RcppSimdJson::fparse(
    json, 
    query = "d/2",
    error_ok = TRUE
)
#> $json1
#> NULL
#> 
#> $json2
#> [1] 9

But that's more tedious than just parsing the whole thing and then doing some post-parse processing (which you'd have to do anyway to clean up the NULLs).

It's a total waste of potential.

We can do this instead:

queries <- list(
    query_name_for_json1 = c("a/1", "b/2"),
    query_name_for_json2 = c("c/1", "d/2")
)

RcppSimdJson::fparse(json, queries)
#> $query_name_for_json1
#> $query_name_for_json1[[1]]
#> [1] 2
#> 
#> $query_name_for_json1[[2]]
#> [1] 6
#> 
#> 
#> $query_name_for_json2
#> $query_name_for_json2[[1]]
#> [1] 5
#> 
#> $query_name_for_json2[[2]]
#> [1] 9

This would dramatically increase the amount of work that can be done before any R objects materialize and minimize the amount of post-parse-processing a user might have to do in R (and potentially eliminate it for some sane and stable JSON schemata).

The code hygiene and performance benefits are absolutely worth it.

  • Proposed API:
    • single queries "recycle" (as they do now: this wouldn't break anyone's code)
    • the length of multiple queries must match the length of json=, and they are applied in a zip-like fashion
    • nesting queries inside a list (ListOf<CharacterVector>) like the example above provides a way to apply multiple queries to each element
      • results of named queries carry the query names (also in example)
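
To make the recycle/zip/nest rules above concrete, here is a minimal base R sketch of how queries could be matched to documents. resolve_queries() is a hypothetical helper written only for illustration; it is not part of RcppSimdJson and does no parsing. It also assumes that a length mismatch is an error and that unnamed queries fall back to the names of json=, neither of which the proposal spells out.

```r
# Hypothetical helper (NOT part of RcppSimdJson): returns, for each JSON
# document, the character vector of queries that would be applied to it.
resolve_queries <- function(json, query) {
    n <- length(json)
    if (is.character(query) && length(query) == 1L) {
        # a single query "recycles" across every document
        per_doc <- rep(list(query), n)
    } else if (length(query) == n) {
        # multiple queries are zipped with the documents; a list element
        # may carry several queries for the same document
        per_doc <- lapply(query, as.character)
    } else {
        # assumption: any other length is rejected
        stop("length(query) must be 1 or match length(json)", call. = FALSE)
    }
    # assumption: named queries keep their names, otherwise the
    # document names (if any) are used
    if (is.null(names(per_doc))) {
        names(per_doc) <- names(json)
    }
    per_doc
}

json <- c(
    json1 = '{"a":[1,2,3],"b":[4,5,6]}',
    json2 = '{"c":[4,5,6],"d":[7,8,9]}'
)

str(resolve_queries(json, "a/1"))             # recycled: same query for both
str(resolve_queries(json, c("a/1", "c/1")))   # zipped: one query per document
str(resolve_queries(json, list(               # nested: several per document
    query_name_for_json1 = c("a/1", "b/2"),
    query_name_for_json2 = c("c/1", "d/2")
)))
```

Resolving the queries up front like this would keep the matching logic separate from the parsing itself, so existing single-query calls would take the first branch unchanged.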