eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Export `.deserialize_json()`

shikokuchuo opened this issue · comments

Please consider exporting .deserialize_json(), as fparse() performs a lot of safety checks, detracting needlessly from performance. Preferably with the 'simplify_to' argument defaulting to 3L, again for best performance. Thanks.

I am all ears. Can you possibly sketch it out in a PR (that we maybe iterate over) ?

You can already call it. Isn't that good enough?

> j <- jsonlite::toJSON(data.frame(a=c(42L, 88L), b=c("brown", "fox")))
> j
[{"a":42,"b":"brown"},{"a":88,"b":"fox"}] 
> RcppSimdJson:::.deserialize_json(j)
   a     b
1 42 brown
2 88   fox
> 

I have created the PR for your perusal.

The overhead from using ::: ...

  • it would allow me to call it from a CRAN package (although that is not a priority)

(See discussion at your PR #79)

commented

it would allow me to call it from a CRAN package (although that is not a priority)

If this is your goal, you can LinkTo this library, include the RcppSimdJson/deserialize.hpp and write your own version of deserialize in your own package?

@dcooley that's interesting I had not considered something like that. I have my own fork for my own use. I just thought other CRAN users may want to benefit from the same tweaks.

  1. It's LinkingTo and it exposes header files. Actually exporting (at the foreign function interface C level calls across packages is more powerful. We could even export the internal C++ function with its SEXP only C interface).
  2. But we still have not seen any rationalization for the rumoured speed increase and what your usage pattern: few large JSON blobs? Many small ones?
  3. You still can today workaround the ::: limitation in CRAN package checks. Many packages do.

I do not think we are spending time on worthwhile initiatives here:

> jsoninputfile <- system.file("jsonexamples", "twitter.json", package="RcppSimdJson")
> jsoninput <- paste(readLines(jsoninputfile), collapse="")
> res <- rbenchmark::benchmark(RcppSimdJson::fparse(jsoninput), RcppSimdJson:::.deserialize_json(jsoninput))
> res[,1:4]
                                         test replications elapsed relative
2 RcppSimdJson:::.deserialize_json(jsoninput)          100   0.240    1.000
1             RcppSimdJson::fparse(jsoninput)          100   0.248    1.033
> 

The difference is essentially the same as measurement noise.

As I am the one having to maintain an additional interface, which is one we actually kept unexposed for a reason, I am not in favor of the proposed change. There were a number of small things wrong in #79, and they were not addressed but argued over so I closed this.

For valid demonstrated reason I could consider adding another 'narrower' converter. So far, such a reason has not been provided.

It is your package so the design points are totally up to you. I merely addressed each of your comments in PR #79, I am sorry if they appeared argumentative to you that was never the intention. As mentioned above, I just wanted to share my tweaks. Thanks.

I am not against helping you. But I asked you three times to show me representative input and what timing difference you see on it. You still have not done it.

The worst outcome, clearly, is for you to run away and to shadow or vendor or re-package simdjson needlessly. Nobody is aided by that. But so far I am simply not seeing a reasonable speed gain from going directory to the currently internal-only function.

Right - by your own benchmark the difference is 8ms. 3% of the timing of your large chunk of json. Imagine lots of small chunks, then that 8ms becomes a large %.

What I'm trying to get at is that SimdJson can parse a LOT of json in 8 milliseconds, so for just a wrapper to take 8 milliseconds seems a bit of a waste.

That is 8 ms over 100 runs. Also, 8 / 240 is about 3.33 per cent -- exactly what the relative change column shows. And if you ever looked at benchmarking in any detail you would be aware that such small values are of the order that randomly changes, Morever, what we get with RcppSimdJson is an orders of magnitude difference to alternative. If all that is not good enough for you (in the context of an interpreted language) well then I guess I cannot stop you from reinventing that wheel you so desparately crave.

And for completeness a re-run with simplify_to=3l (and some more runs) ends up again in the same vicinity of a 2% improvement:

> res <- rbenchmark::benchmark(RcppSimdJson::fparse(jsoninput), RcppSimdJson:::.deserialize_json(jsoninput, simplify_to=3L), replications=500)
> res[,1:4]
                                                          test replications elapsed relative
2 RcppSimdJson:::.deserialize_json(jsoninput, simplify_to = 3)          500   1.264    1.000
1                              RcppSimdJson::fparse(jsoninput)          500   1.290    1.021
> 

(I currently have my hands full with another issue but I think I can help you by making the currently internal function available for you as a C function you can hit directly from C code. That way you save the R call overhead too. I am doing that a little bit more e.g. in package RcppSpdlog which exports a few C-level plain functions for logging which for example the (nicer to use R wrapper) spdl picks up. Let me know if that would be of interest. I am currently tied up with another package but I prepare a basic test case pretty quickly...)

If you make it available at the C level then I would definitely make use of it. Great idea if you are doing that more with your other packages. No rush at all, this is not urgent for me. Thanks.