eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Simplification of nested JSON

shikokuchuo opened this issue

First off thanks for the incredible work on this – I have no doubt this will be a hugely popular package going forward.

I intend for RcppSimdJson to replace jsonlite in my package, but ran into the following issue. I have a function which retrieves data from an external API in the following format (converted from raw):

json <- "{\"instrument\":\"USD_JPY\",\"granularity\":\"D\",\"candles\":[{\"complete\":true,\"volume\":8353,\"time\":\"2020-01-01T22:00:00.000000000Z\",\"mid\":{\"o\":\"108.651\",\"h\":\"108.866\",\"l\":\"108.212\",\"c\":\"108.561\"}},{\"complete\":true,\"volume\":21241,\"time\":\"2020-01-02T22:00:00.000000000Z\",\"mid\":{\"o\":\"108.568\",\"h\":\"108.630\",\"l\":\"107.842\",\"c\":\"108.110\"}},{\"complete\":true,\"volume\":15195,\"time\":\"2020-01-05T22:00:00.000000000Z\",\"mid\":{\"o\":\"107.879\",\"h\":\"108.506\",\"l\":\"107.772\",\"c\":\"108.372\"}}]}"

# fromJSON: fjson$candles$mid is a dataframe
fjson <- jsonlite::fromJSON(json)
fjson
#> $instrument
#> [1] "USD_JPY"
#> 
#> $granularity
#> [1] "D"
#> 
#> $candles
#>   complete volume                           time   mid.o   mid.h   mid.l
#> 1     TRUE   8353 2020-01-01T22:00:00.000000000Z 108.651 108.866 108.212
#> 2     TRUE  21241 2020-01-02T22:00:00.000000000Z 108.568 108.630 107.842
#> 3     TRUE  15195 2020-01-05T22:00:00.000000000Z 107.879 108.506 107.772
#>     mid.c
#> 1 108.561
#> 2 108.110
#> 3 108.372
class(fjson$candles$mid)
#> [1] "data.frame"

# fparse: fpar$candles$mid is a list of lists
fpar <- RcppSimdJson::fparse(json)
fpar
#> $instrument
#> [1] "USD_JPY"
#> 
#> $granularity
#> [1] "D"
#> 
#> $candles
#>   complete volume                           time
#> 1     TRUE   8353 2020-01-01T22:00:00.000000000Z
#> 2     TRUE  21241 2020-01-02T22:00:00.000000000Z
#> 3     TRUE  15195 2020-01-05T22:00:00.000000000Z
#>                                  mid
#> 1 108.651, 108.866, 108.212, 108.561
#> 2 108.568, 108.630, 107.842, 108.110
#> 3 107.879, 108.506, 107.772, 108.372
class(fpar$candles$mid)
#> [1] "list"

# What I need to finally extract:
as.numeric(fjson$candles$mid[ ,1L])
#> [1] 108.651 108.568 107.879
as.numeric(do.call(rbind, fpar$candles$mid)[ ,1L])
#> [1] 108.651 108.568 107.879

Created on 2021-07-31 by the reprex package (v2.0.0)

Basically, the JSON for each observation contains a couple of metrics (time and volume) and then a nested list containing the price data (o, h, l, c). jsonlite::fromJSON renders the nested data as a dataframe, so it is easy to extract each column. fparse returns a list of lists, so I need a do.call(rbind, …), which still just returns a matrix of lists.

There is obviously additional overhead from doing this. It cannot be measured with this 3-observation example, but once it gets to 1000+ observations it does give back some of the performance gains.

I am aware that the jsonlite simplification code may well be on the R side, and hence part of the reason why it is slower in the first place, but if there is a way of implementing this in the C++ code then I think it is worth doing for similar cases.

I hope the above is clear enough, but if not just let me know. Thanks.
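
As a rough illustration of the shape difference described above, the following base-R sketch builds a stand-in for fpar$candles$mid by hand and extracts the fields with vapply(), which avoids an intermediate matrix of lists. The object name mid and its construction are assumptions for illustration; in practice the list would come from RcppSimdJson::fparse. This is one possible workaround, not something the package provides.

```r
# Hand-built stand-in for fpar$candles$mid (assumption: one named list of
# strings per candle, matching the printed fparse output above).
mid <- list(
  list(o = "108.651", h = "108.866", l = "108.212", c = "108.561"),
  list(o = "108.568", h = "108.630", l = "107.842", c = "108.110"),
  list(o = "107.879", h = "108.506", l = "107.772", c = "108.372")
)

# Pull one field from every candle as a character vector, then convert once.
open <- as.numeric(vapply(mid, function(x) x[["o"]], character(1L)))
open
#> [1] 108.651 108.568 107.879

# The same idea for all four fields gives a numeric matrix, one column per field.
ohlc <- vapply(
  c("o", "h", "l", "c"),
  function(k) as.numeric(vapply(mid, function(x) x[[k]], character(1L))),
  numeric(length(mid))
)
dim(ohlc)
#> [1] 3 4
```

Because vapply() declares the expected type and length of each result, a candle with a missing or non-string field raises an error instead of silently producing a ragged result.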

I am not sure I understand your question.

If you only want something nested within a larger JSON object you could

  • only submit the nested part (bad, R-level part on your end, not performant)
  • subset it from the returned result (works today, slower in R)
  • write performant code relying on our existing parser only returning the nested part

I have the feeling you want something performant that is not all that generalized, so maybe your best bet is simply a local variant as in the third option.

The package is on a bit of a hiatus and (effectively) frozen, due to a combination of upstream mostly moving faster than we have time for and the main developer driving this (@knapply) having moved on and no longer returning pings. So 🤷‍♂️ as the package does what it does well and fast.

Also, you can of course subset directly from fparse:

> do.call(cbind, fparse(json)$candles$mid[[1]])
     o         h         l         c        
[1,] "108.651" "108.866" "108.212" "108.561"
> as.numeric(do.call(cbind, fparse(json)$candles$mid[[1]]))
[1] 108.651 108.866 108.212 108.561
> 

The problem really seems to be that your data provider gives you OHLC in an awkward format.

Thanks for taking the time to reply so quickly and thoroughly. My question was just whether it is possible for RcppSimdJson to replicate what jsonlite does for these nested cases i.e. return dataframes within dataframes.

I totally understand if this is not to be implemented, as I would not want to slow down the code for what is possibly a specialised case. I just wanted to alert you to this behaviour on the off chance that this is a simple fix. It is a shame the main dev has moved on; I hope someone can at least keep this maintained against upstream going forward.

The problem (in my case) as you rightly say is that the data provider provides the json in an awkward format. Even with this niggle, your package is still orders of magnitude faster!

"whether it is possible for RcppSimdJson to replicate what jsonlite does"

Six months ago, yes. Now, not so much as nobody is coding actively on it. You are free to try it, or hire someone, or ...

@eddelbuettel We have @NicolasJiaxin working on a new prototype wrapper around the simdjson On Demand API. I doubt he can get around to such extensions this summer, however.

This being said, if there is enough genuine interest and enough fun problems, I can assuredly recruit a student for another term to work on this wrapper. Of course, we would need 'genuine' interest and people willing to provide descriptions, feedback and so forth.

Cheers!

"whether it is possible for RcppSimdJson to replicate what jsonlite does"

"Six months ago, yes. Now, not so much as nobody is coding actively on it. You are free to try it, or hire someone, or ..."

I wish I had the expertise... Thanks also for your suggestions. I was not clear, I do need all the data, just happens some of it is nested, so it is not as simple as working on part of the data.

I just wondered if a trick was being missed by not applying the simplification recursively when nesting is found. But of course this may introduce more problems than it solves - someone close to the code would know. I suspect this can be done (as it is implemented in jsonlite), but it may not be worth doing if there is a significant performance hit from checking.

I guess the current implementation just seems somewhat inelegant at the moment. With jsonlite, I can simply extract the data I need: fjson$candles[, 1L], fjson$candles[, 2L], fjson$candles[, 3L] and fjson$candles[, 4L]. OK, the last one is a nested dataframe, so fjson$candles[, 4L][, 1L], fjson$candles[, 4L][, 2L] etc.

Whereas currently with RcppSimdJson: fpar$candles[, 1L], fpar$candles[, 2L] and fpar$candles[, 3L] are all good, but fpar$candles[, 4L] is suddenly a 1,000-item list, each item holding 4 elements, each wrapped as a list (real-world example).

Anyhow, something (hopefully simple) for whoever takes on this project. I hope you get enough indication of interest from the R community. With the proliferation of APIs, plumber etc. I fail to see how this does not become more relevant.
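
For reference, one level of the simplification asked about above could be sketched in base R roughly as follows. The helper name simplify_records is hypothetical and is not part of RcppSimdJson or jsonlite. It only handles the case where every record is a named list of length-one atomic values with identical field names, and returns its input unchanged otherwise. Doing the same work inside the C++ parser would be a separate effort.

```r
# Hypothetical helper (not part of any package): collapse a list of records
# with identical scalar fields into a data frame; otherwise return x as is.
simplify_records <- function(x) {
  is_scalar <- function(v) is.atomic(v) && length(v) == 1L
  is_record <- function(r) {
    is.list(r) && !is.null(names(r)) && all(vapply(r, is_scalar, logical(1L)))
  }
  if (!is.list(x) || length(x) == 0L) return(x)
  if (!all(vapply(x, is_record, logical(1L)))) return(x)
  keys <- names(x[[1L]])
  same_keys <- vapply(x, function(r) identical(names(r), keys), logical(1L))
  if (!all(same_keys)) return(x)
  # Build each column by collecting one field across all records.
  cols <- lapply(keys, function(k) {
    unlist(lapply(x, function(r) r[[k]]), use.names = FALSE)
  })
  names(cols) <- keys
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Hand-built stand-in for fpar$candles$mid (assumed shape).
mid <- list(
  list(o = "108.651", h = "108.866", l = "108.212", c = "108.561"),
  list(o = "108.568", h = "108.630", l = "107.842", c = "108.110"),
  list(o = "107.879", h = "108.506", l = "107.772", c = "108.372")
)

mid_df <- simplify_records(mid)
class(mid_df)
#> [1] "data.frame"
as.numeric(mid_df$o)
#> [1] 108.651 108.568 107.879
```

The checks that make this safe (every record inspected for shape and field names) are also where the cost lies, which is the performance trade-off raised above.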

Try these two:

> do.call(rbind, lapply(fparse(json)$candles$mid, as.data.frame))
        o       h       l       c
1 108.651 108.866 108.212 108.561
2 108.568 108.630 107.842 108.110
3 107.879 108.506 107.772 108.372
> data.table::rbindlist(lapply(fparse(json)$candles$mid, as.data.frame))
         o       h       l       c
1: 108.651 108.866 108.212 108.561
2: 108.568 108.630 107.842 108.110
3: 107.879 108.506 107.772 108.372
> 

do.call(rbind, listobject) uses just base R; data.table::rbindlist() is a (much) faster alternative that gives you a data.table.

Thanks! I think I have benchmarked this to death now. See below where I am just extracting the first column of open prices.

The problem is that the outputs print fine, but each value is actually a character wrapped in a list, so if I want useful results, I need to apply as.numeric(), and I need to unlist() first, otherwise it throws an error. That is a massive performance hit. For some reason, as.numeric() just works when it is a matrix created by rbind, my original solution.

And this is just to extract 3 numbers. You see why I am seeking an elegant solution so I can just extract the values with next to no overhead?
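
One plausible reading of the as.numeric() behaviour described above (an interpretation, not something confirmed in the thread): as.numeric() converts a list element by element only when every element has length one. That holds for a column of a matrix of lists, but not for a one-column table, whose single element is the whole column. A minimal base-R sketch with hand-made values:

```r
# A list of length-one elements, like one column of do.call(rbind, ...).
cells <- list("108.651", "108.568", "107.879")
as.numeric(cells)
#> [1] 108.651 108.568 107.879

# A one-column data frame is a list with a single length-3 element,
# so direct coercion fails.
one_col <- data.frame(
  o = c("108.651", "108.568", "107.879"),
  stringsAsFactors = FALSE
)
coerced <- tryCatch(as.numeric(one_col), error = function(e) NULL)
is.null(coerced)
#> [1] TRUE

# Extracting the column as a plain vector avoids unlist() altogether.
as.numeric(one_col[["o"]])
#> [1] 108.651 108.568 107.879
```

With data.table, indexing with [[1L]] likewise returns the column as a plain vector, whereas [, 1L] returns a one-column data.table.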

library(RcppSimdJson)
json <- "{\"instrument\":\"USD_JPY\",\"granularity\":\"D\",\"candles\":[{\"complete\":true,\"volume\":8353,\"time\":\"2020-01-01T22:00:00.000000000Z\",\"mid\":{\"o\":\"108.651\",\"h\":\"108.866\",\"l\":\"108.212\",\"c\":\"108.561\"}},{\"complete\":true,\"volume\":21241,\"time\":\"2020-01-02T22:00:00.000000000Z\",\"mid\":{\"o\":\"108.568\",\"h\":\"108.630\",\"l\":\"107.842\",\"c\":\"108.110\"}},{\"complete\":true,\"volume\":15195,\"time\":\"2020-01-05T22:00:00.000000000Z\",\"mid\":{\"o\":\"107.879\",\"h\":\"108.506\",\"l\":\"107.772\",\"c\":\"108.372\"}}]}"

microbenchmark::microbenchmark(
  as.numeric(unlist(do.call(rbind, lapply(fparse(json)$candles$mid, as.data.frame))[, 1L])),
  as.numeric(unlist(data.table::rbindlist(fparse(json)$candles$mid)[, 1L])),
  as.numeric(unlist(structure(fparse(json)$candles$mid, class = "data.frame", row.names = 1:4)[1L, ])),
  as.numeric(do.call(rbind, fparse(json)$candles$mid)[, 1L])
  )
#> Unit: microseconds
#>                                                                                                       expr
#>             as.numeric(unlist(do.call(rbind, lapply(fparse(json)$candles$mid,      as.data.frame))[, 1L]))
#>                             as.numeric(unlist(data.table::rbindlist(fparse(json)$candles$mid)[,      1L]))
#>  as.numeric(unlist(structure(fparse(json)$candles$mid, class = "data.frame",      row.names = 1:4)[1L, ]))
#>                                                 as.numeric(do.call(rbind, fparse(json)$candles$mid)[, 1L])
#>      min       lq      mean   median       uq       max neval
#>  496.650 538.2260 615.27473 572.5505 640.0885  3489.100   100
#>  106.897 135.0925 448.67593 146.7150 159.2365 30327.728   100
#>   43.421  52.0390  62.04515  56.6950  65.8415   407.858   100
#>   19.909  25.1810  28.46955  28.3235  31.5000    40.150   100

Created on 2021-07-31 by the reprex package (v2.0.0)

As an aside, thanks for the data.table::rbindlist() suggestion, that is fast. But not quite as fast as my manual data.frame constructor :p

Jumping in here for a slight deviation:

For what it's worth, jsonify::from_json() should outperform jsonlite on nested structures because all the parsing & simplifying is done in C++, and it returns the same structure as jsonlite::fromJSON()

jsonlite::fromJSON(json)
$instrument
[1] "USD_JPY"

$granularity
[1] "D"

$candles
  complete volume                           time   mid.o   mid.h   mid.l   mid.c
1     TRUE   8353 2020-01-01T22:00:00.000000000Z 108.651 108.866 108.212 108.561
2     TRUE  21241 2020-01-02T22:00:00.000000000Z 108.568 108.630 107.842 108.110
3     TRUE  15195 2020-01-05T22:00:00.000000000Z 107.879 108.506 107.772 108.372

> jsonify::from_json(json)
$instrument
[1] "USD_JPY"

$granularity
[1] "D"

$candles
  complete volume                           time   mid.o   mid.h   mid.l   mid.c
1     TRUE   8353 2020-01-01T22:00:00.000000000Z 108.651 108.866 108.212 108.561
2     TRUE  21241 2020-01-02T22:00:00.000000000Z 108.568 108.630 107.842 108.110
3     TRUE  15195 2020-01-05T22:00:00.000000000Z 107.879 108.506 107.772 108.372

It can't outperform RcppSimdJson, but it might be useful for your use-case.

Thanks - I think this is further evidence that a solution is possible. The benefit from the performant parser is too great in real-life examples. For a typical run, RcppSimdJson takes roughly 1/5 of the time of jsonlite. To put that into perspective, the issue raised here adds back at the absolute most 1/10 of the original runtime. So it still ends up about 4x faster.

Unfortunately, the structure of the JSON is indeed dumb.

Since I don't currently have time to provide much meaningful feedback or focus on the package, I hope the following is a useful example of how to write a custom parser function in C++/Rcpp to deal with these sorts of scenarios when performance is a priority.

I'm not saying this is the best approach, but it should be somewhat straightforward to follow, refine as needed, and extend/customize for anyone looking for examples in the future.

This also leverages fastfloat (via RcppFastFloat) to convert the strings to doubles.

json <- "{\"instrument\":\"USD_JPY\",\"granularity\":\"D\",\"candles\":[{\"complete\":true,\"volume\":8353,\"time\":\"2020-01-01T22:00:00.000000000Z\",\"mid\":{\"o\":\"108.651\",\"h\":\"108.866\",\"l\":\"108.212\",\"c\":\"108.561\"}},{\"complete\":true,\"volume\":21241,\"time\":\"2020-01-02T22:00:00.000000000Z\",\"mid\":{\"o\":\"108.568\",\"h\":\"108.630\",\"l\":\"107.842\",\"c\":\"108.110\"}},{\"complete\":true,\"volume\":15195,\"time\":\"2020-01-05T22:00:00.000000000Z\",\"mid\":{\"o\":\"107.879\",\"h\":\"108.506\",\"l\":\"107.772\",\"c\":\"108.372\"}}]}"

cpp <- r"(
// [[Rcpp::plugins(cpp17)]]
// [[Rcpp::depends(RcppSimdJson)]]
// [[Rcpp::depends(RcppFastFloat)]]


#define STRICT_R_HEADERS

#include <Rcpp.h>

#include <simdjson.h>
#include <simdjson.cpp>

#include <fast_float/fast_float.h>


// [[Rcpp::export]]
SEXP
parse_custom(const Rcpp::CharacterVector json) {
  if (json.size() != 1 || json[0].get() == NA_STRING) {
    Rcpp::stop("not a single, non-NA string.");
  }
  
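  // View the input string without copying it.
  auto json_view = std::string_view(json[0]);
  
  simdjson::dom::parser p;
  
  // Parse the document and navigate straight to the "candles" array;
  // a failure anywhere along the chain is reported through `error`.
  auto [candles, error] =
    p.parse(json_view).get_object()["candles"].get_array();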
  
  if (!error) {
    const auto n_rows = std::size(candles);
    
    using Rcpp::_;
    auto out = Rcpp::DataFrame::create(
      
      _["o"] = Rcpp::DoubleVector(n_rows, NA_REAL),
      _["h"] = Rcpp::DoubleVector(n_rows, NA_REAL),
      _["l"] = Rcpp::DoubleVector(n_rows, NA_REAL),
      _["c"] = Rcpp::DoubleVector(n_rows, NA_REAL)
      
    );
    
    
    constexpr std::pair<R_xlen_t, const char*> cols[] = {
      {0, "o"}, {1, "h"}, {2, "l"}, {3, "c"}};
    
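    // Fill one column at a time; cells that are missing or fail to
    // parse as a number keep their NA_REAL default.
    for (auto&& col : cols) {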
      
      R_xlen_t i_row = 0;
      for (auto&& c : candles) {
        if (auto [mid, mid_error] = c["mid"].get_object(); !mid_error) {
          if (auto [sv, val_error] = mid[col.second].get_string(); !val_error) {
            double     d;
            const auto res =
              fast_float::from_chars(std::cbegin(sv), std::cend(sv), d);
            if (res.ec == std::errc()) {
              static_cast<Rcpp::DoubleVector>(out[col.first])[i_row] = d;
            }
          }
        }
        ++i_row;
      }
    }
    
    return out;
  }
  
  // The "candles" array could not be read: return NULL to R.
  return R_NilValue;
}


)"

Rcpp::sourceCpp(code = cpp)

parse_custom(json)
#>        o       h       l       c
#> 1 108.651 108.866 108.212 108.561
#> 2 108.568 108.630 107.842 108.110
#> 3 107.879 108.506 107.772 108.372

That's probably the best @shikokuchuo can hope for here as the lack of structure in the json result he gets makes it darn difficult to shoehorn anything directly into the parsing.

And with that I am going to close this. The post-processing is really the best one can do, thanks to @knapply there is a fast approach too, and there is nothing here where RcppSimdJson is at fault.

If you think there is something RcppSimdJson should do differently, feel free to reopen, preferably with a concrete proposal.