eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Should/does RcppSimdJson::fparse accept connection object as input

statquant opened this issue · comments

Hello,
When I use RcppSimdJson, it is usually to ingest data output by another external process.
I would typically do RcppSimdJson::fparse(system(command = cmd, intern = TRUE))

example:

library(RcppSimdJson)
jsonfile <- system.file("jsonexamples", "twitter.json", package = "RcppSimdJson")
cmd <- sprintf("cat %s", jsonfile)
res <- fparse(paste(system(cmd, intern = TRUE), collapse = "\n"))
# obviously in this case one would use fload

My understanding (sorry if I am wrong) is that connections are not supported by fparse. I think it would be a pretty useful feature to be able to do something like RcppSimdJson::fparse(pipe(cmd)), "a la" data.table::fread(cmd), as per Jim Hester's blog post.

Is there anything that would make this impossible, and if not, would you consider a PR for it?

Please check simdjson upstream first: the issue is seekability, and I do not think we can assume it. That is a similar reason why, for example, data.table::fread does (sadly!!) not work on compressed files. (Now, IIRC upstream is also moving towards working on streams, so maybe "one day" we can do this, but I think we cannot right now.)

Thanks, I will check and return.
Re data.table::fread, you lost me here because it works on gz files. Would you mind elaborating a bit further?

I might be misreading the issue but I think it is all about convenience. You can always take the input stream, write the data in a buffer and pass this buffer to the parser. Instead of doing it manually, you can have a wrapper that does it for you. This could be done in pure R and is no different from, say, retrieving the data from the network.
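As a rough illustration of the buffer-then-parse idea above, here is a minimal sketch in pure R. The helper name fparse_con and its chunk_size argument are hypothetical and not part of RcppSimdJson; the sketch assumes fparse() accepts a raw vector as its JSON input, and it reads the whole document into memory before parsing rather than parsing incrementally.

```r
library(RcppSimdJson)

# Hypothetical helper (not part of RcppSimdJson): drain a connection into a
# raw buffer, then hand that buffer to fparse().
fparse_con <- function(con, chunk_size = 65536L, ...) {
  stopifnot(inherits(con, "connection"))
  # Open the connection in binary mode if the caller has not opened it,
  # and close it again once we are done.
  if (!isOpen(con)) {
    open(con, "rb")
    on.exit(close(con), add = TRUE)
  }
  chunks <- list()
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break          # end of input
    chunks[[length(chunks) + 1L]] <- chunk
  }
  # Concatenate the chunks into one raw vector and parse it in one go.
  fparse(do.call(c, chunks), ...)
}
```

Under those assumptions, something like fparse_con(pipe(cmd)) would replace the paste(system(cmd, intern = TRUE), collapse = "\n") step from the original example. A connection that the caller already opened in text mode would probably need different handling.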

The simdjson library supports what we call document streams, but the concept is different. It is the scenario where many JSON documents are stored in sequence, and you want to iterate through them quickly.

Daniel, all correct "in principle". In practice, "connections" are a special thing in R.

See help(connections) -- it is a powerful notion encompassing files, URL access, gz/bzip2/xz/zip compressed files, fifos, sockets, ... Are you sure you can play with all of those? That was the user question that brought us here...

Hello, yes, this is mainly for convenience, though maybe a bit more than that if you consider my use case fparse(system(cmd, intern = TRUE)): I have had cases where the output of system(cmd, intern = TRUE) generated a "Character" that breached R's limit.
I would also assume it to be faster, because instead of running the command and the parsing sequentially, they would be piped. In Unix parlance that would be named pipes: one process (mine) would write into a fifo and another (simdjson) would ingest it.

I stand corrected on fread() then. That was an open issue for a long time, as I recall. (And I myself saved simulation results to gzlib-opened streams in plain C/C++ back in the early 1990s during grad school, so I knew it was easy/feasible, but for some reason data.table farmed this out to R.utils for a long time.)

Are you sure you can play with all of those?

We take as an input a string of bytes representing a JSON document. Actually, we take a pointer to an address to these bytes. But we also do not care where it comes from.

I am sure I am misunderstanding the issue, but I do not see how it could differ from reading the JSON from the network.

Ok, so maybe it could work. Then again, if I were in @statquant's shoes (which I am not), I would probably just dump the content to a file and move on. JSON is, ultimately, a somewhat stupid and untyped (yet, as we know, ubiquitous) format, the performance wonders of simdjson notwithstanding. Binary interchange sans formatting and reparsing, no matter how fast, is better. Oh well.

Ok, so maybe it could work.

So maybe a PR is invited?

Well I never say no to well-executed PRs. As articulated elsewhere, with a preference for

  • discussing scope of changes rather than throwing unannounced PRs over the fence,
  • generally, smaller is better than larger,
  • added tests are always a plus,
  • and ChangeLog and NEWS entries are nice too.

The caveat being that @knapply and I were actually in that 'space' (as fparse() and fload() are different) and put a helper function into R/utils.R to actually download first. It may well be that we overlooked some magic, but we decided that ops on a local file would be best. YMMV.

Well I never say no to well-executed PRs

unfortunately I've never produced a well-executed PR...

But I'll revert if/when I have something worth sharing. Happy to close this issue/reopen if/when needed.
Many thanks