eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Accept json=raw() directly to skip rawToChar step on JSON received from curl

MichaelChirico opened this issue

As identified in this Twitter thread:

https://twitter.com/michael_chirico/status/1280656819606548480

See this Gist:

https://gist.github.com/MichaelChirico/f5e09ab9f5f437bb0286e8a42941a3e1

The performance of fparse is already damn impressive, but let's see if we can't do a mite better 😎

JSON as raw can be retrieved like so:

gist = file.path(
  'https://gist.githubusercontent.com/MichaelChirico',
  'f5e09ab9f5f437bb0286e8a42941a3e1', 'raw',
  'ab5f767b54810b53b30841ffe7f614aa07a32be0', 'presto_json_return.R'
)
charToRaw(tail(readLines(gist), 1L))

IINM from C++ POV this raw vector should just be a subset of a character vector...
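For reference, here is a minimal sketch of the extra step this issue would like to skip. The payload is a made-up stand-in for the bytes a curl request would return, and the `fparse()` call is guarded so the snippet still runs without RcppSimdJson installed.

```r
# Hypothetical stand-in for the raw body of an HTTP response.
raw_json <- charToRaw('{"id": 1, "values": [1, 2, 3]}')

# Current workflow: copy the bytes into an R string first ...
json_text <- rawToChar(raw_json)

# ... and only then hand the string to the parser.
if (requireNamespace("RcppSimdJson", quietly = TRUE)) {
  parsed <- RcppSimdJson::fparse(json_text)
  str(parsed)
}
```

The request is for `fparse()` to take `raw_json` directly, so the intermediate `json_text` is never created.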

The potential kibosh for this would be encoding issues, though rawToChar also would not work for that case, so users with encoding issues can do iconv themselves I guess?
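A small base-R sketch of the encoding point: `rawToChar()` copies the bytes without converting them, so a non-UTF-8 payload needs an explicit `iconv()` step either way. The Latin-1 bytes are an invented example, not taken from the Gist.

```r
# Bytes for "café" in Latin-1 (0xe9 is "é"); a hypothetical non-UTF-8 response body.
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))

# rawToChar() copies the bytes verbatim and attaches no encoding mark.
txt <- rawToChar(latin1_bytes)
print(Encoding(txt))

# The user converts to UTF-8 explicitly before parsing.
utf8_txt <- iconv(txt, from = "latin1", to = "UTF-8")
print(charToRaw(utf8_txt))
```

If raw input were accepted directly, the same conversion would presumably have to be done on the user side before the bytes reach the parser.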

Still working on adapting my code to use RcppSimdJson & benchmarking, so another musing for the day --

I think a major choke point of my current code is some regular gc()s that are happening, which skipping rawToChar could potentially avert. IINM I am calling rawToChar on a huge string on every batch from GET, then parsing out my data & "discarding" the rather large strings (consisting of JSON objects with maybe 100s of rows and/or columns), which are now in the session's string cache (since rawToChar will do mkChar).

It's also something to keep in mind for benchmarking -- unless this phenomenon is captured, the benefit of dropping rawToChar might be understated.
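One rough way to bring that cost into a benchmark is to feed `rawToChar()` a different large payload on every iteration and compare `gc()` output before and after. This is only a sketch in base R: `make_batch()` and the batch sizes are invented for illustration, and the numbers will vary by machine and R version.

```r
# Hypothetical helper: builds one batch of JSON bytes, different for each i,
# so every rawToChar() call has to create a new string.
make_batch <- function(i, n = 5e4) {
  rows <- sprintf('{"batch":%d,"row":%d}', i, seq_len(n))
  charToRaw(paste0("[", paste(rows, collapse = ","), "]"))
}

batches <- lapply(1:5, make_batch)

gc_before <- gc(reset = TRUE)
elapsed <- system.time(
  for (b in batches) {
    txt <- rawToChar(b)  # one large string allocated per batch
    # parsing of txt would happen here
    rm(txt)
  }
)[["elapsed"]]
gc_after <- gc()

print(elapsed)
print(gc_before)
print(gc_after)
```

Reusing the same payload on every iteration would likely understate the effect, since an identical string can be served from R's string cache.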