eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Edge cases and defaults

knapply opened this issue · comments

What are the edge cases {RcppSimdJson} needs to handle?

What should be the defaults (internal and user-level)?

  • {}
    • Current candidates:
      • [ ] named list(): structure(list(), .Names = character(0))
        • same as jsonlite::fromJSON()
      • [ ] NULL
        • same as jsonify::from_json()
    • Resolved by 94e571a (user passes any R object to replace empty objects with NULL as the defaut).
  • []
    • Current candidates:
      • [ ] vector(): logical(length = 0L)
      • [ ] vector(mode = "list"): list()
        • same as jsonify::from_json() and jsonlite::fromJSON()
    • Resolved by 94e571a (user passes any R object to replace empty arrays with NULL as the defaut).
  • null
    • like {} and [], allow user to pass any R object to replace single null, with NULL as the default.

@dcooley Moving the conversation from #17 over here:

  • []: vector()
  • {}: structure(list(), .Names = character(0))

Here's the rationale for going about {} and [] this way:

{}

If we say the R representation of {"a":1} is list(a = 1L), we're really saying that its R representation is structure(list(1L), .Names = "a").

If we follow that analogy, empty object {} is structure(list(), .Names = character(0)), which print()s as named list().

JSON R Representation equivalent/print()
{"a":1} structure(list(1L), .Names = "a") list(a = 1L)
{} structure(list(), .Names = character()) named list()

[]

And similarly you're returing [] as NA_LOGICAL?

See above: [] returns Rcpp::LogicalVector(0) (vector of length 0):

RcppSimdJson:::.deserialize_json('[]')
#> logical(0)

This is really a matter of deciding which of the following represents []:

  1. vector()
  2. vector(mode = "list").

But it does raise a few other questions:

  1. Are empty arrays recursive?
  2. Are empty arrays homogeneous?
  3. Are empty arrays "just arrays"... or are they really a special case?

{jsonify} and {jsonlite} use this pattern for simplifcation:

JSON R str()
["a",3.14,1,true,null] c("a","3.14","1","TRUE",NA) chr [1:5] "a" "3.14" "1" "TRUE" NA
[3.14,1,true,null] c(3.14, 1, 1, NA) num [1:4] 3.14 1 1 NA
[1,true,null]' c(1L, 1L, NA) int [1:3] 1 1 NA
[true,null]' c(TRUE, NA) logi [1:2] TRUE NA
[null] NA logi NA
[] list() list()

RcppSimdJson:::.deserialize_json()'s current default uses rcppsimdjson::Type_Policy::anything_goes, which follows nearly the same pattern of collapsing arrays of scalars into R vectors, no matter what.

Here's the same table with the proposed default {RcppSimdJson} pattern for simplification in bold:

JSON R str()
["a",3.14,1,true,null] c("a","3.14","1","TRUE",NA) chr [1:5] "a" "3.14" "1" "TRUE" NA
[3.14,1,true,null] c(3.14, 1, 1, NA) num [1:4] 3.14 1 1 NA
[1,true,null]' c(1L, 1L, NA) int [1:3] 1 1 NA
[true,null]' c(TRUE, NA) logi [1:2] TRUE NA
[null] NA logi NA
[] logical(0) logi(0)
[] list() list()

The idea behind the proposed {RcppSimdJson} approach is to just consistently following the same pattern.

I'd say the difference is that it treats [] as non-recursive and homogeneous (thus becoming a vector) instead of a special case like {jsonify} and {jsonlite} seem to do (potentially recursive and/or homogeneous?).

🤷‍♂️ Are there parsers / linters / specs / best practices?

I like how set user-drive policy options otherwise. That is simply the most flexible.

@lemire Thoughts?

It seems @dcooley and I don't quite agree with some of the default choices {jsonlite} makes. There are compromises and edge cases no matter how you do it, so a 100% consensus isn't likely.

Jeroen Ooms' The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects is a wonderful deep dive into the a lot of the gory, R-specific details.

commented

@knapply your logic all makes complete sense. I also tried to stay consistent with jsonlite, but I think I deviated away where round-trips didn't work because I wanted a consistent way to get to and from the same object. (I'm not saying I was right, but it's what I did).

toJSON( NULL )
# {}
fromJSON('{}')
# named list()
to_json( NULL )
# {}
from_json('{}')
NULL

So in summary, I don't have any issues with the defaults chosen for RcppSimdJson, but I think the round-trip is something to keep in mind if/when we can write the to json side of it.

I'll try and sort the user-passable options ASAP so we at least have maximum flexibility.

I think there are good arguments both ways to be honest.

I have a deployed shiny app that uses {jsonify} to serialize some crazy list columns so users can download spreadsheets, make modifications, and re-upload the changes. We're talking 100k+ row monstrosities that {jsonify} handles with ease.

The tinkering I had to do was all in excel-related packages to get round trips to pass identical() testing, not at all in {jsonify}. I don't think there's a perfect answer for what "right" looks here, but you can't really do better than that!

I swapped out the JSON parsing routine for an API client with .deserialize_json() for another real-world use case and realized that allowing the user to also replace single nulls with any R object would eliminate a whole step in cleaning up JSON responses from APIs that spam nulls in every object.

I probably have some variation of %||% defined in any code that touches a web API, so I don't know why I didn't think of this earlier.

  • This would
    • eliminate that the need for the user to do that at the R level
      • removing a commonly needed step in constructing data frames manually
    • provide a way to match jsonify::from_json(simplify = FALSE)'s behavior
    • be totally opt-in, as the default can just be set to NULL

So it's on the TODO list.

Anything actionable here now? I may have lost track of the main storyline...

The last open checkbox/comment before yours #18 (comment) (unless someone makes a case why it's a bad idea) is the only thing remaining.

It's a matter of adding a single_null = R_NilValue parameter (single_null= is just the first thing that comes to mind) to .deserialize_json() and .load_json() and all the rcppsimdjson::deserialize::simplify_*() functions and anything calling them.

Ultimately, the returned value is here:

case simdjson::dom::element_type::NULL_VALUE:
return R_NilValue;

so that would just need to change from R_NilValue to single_null.

It's the same pattern as empty_array= and empty_object=.

I left the issue open as a reminder to myself, but it can be closed.

Is what we have "good enough" to be added to master to form 0.0.6?

Yes.

Ok, no pending merge or pr. I look over what we have and roll it up this eve or tomorrow.

single nulls handled in 9957d60