On Demand API

Question

knapply opened this issue 4 years ago

knapply commented 4 years ago

As discussed in #52 (review)

How best to leverage simdjson's On Demand API?

Daniel Lemire commented 4 years ago

:-)

knapply · Answer 1 · Wed Oct 28 2020 10:42:46 GMT+0800 (China Standard Time)

@jkeiser, @lemire Following up on your suggestion from here after I played with it a bit.

Are there any examples of iterating through arrays of numbers, strings, bools, etc.?

I can only seem to find examples accessing the scalar values in objects using their keys.

As an example (I was trying this out on some GeoJSON), how would we access the individual points in the following (after accessing the coordinates with something like auto coords = doc["coordinates"])?

{
  "type": "LineString", 
  "coordinates": [
    [1, 2], [3, 4], [5, 6]
  ]
}

For some R context, we would need to turn "coordinates" into a 3x2 matrix (no offense intended if you're R aficionados - I don't want to assume)...

geojson <- '{
  "type": "LineString", 
  "coordinates": [
    [1, 2], [3, 4], [5, 6]
  ]
}'

parsed <- RcppSimdJson::fparse(geojson)
parsed
#> $type
#> [1] "LineString"
#> 
#> $coordinates
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    3    4
#> [3,]    5    6

... and since R is column-major, we're really building the following (with attached dimension attributes):

as.vector(parsed$coordinates)
#> [1] 1 3 5 2 4 6
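To make the column-major target concrete, here is a minimal C++ sketch of the final conversion step, assuming the points have already been collected row by row into an STL container. The helper name `to_column_major` is hypothetical (it is not part of simdjson, Rcpp, or RcppSimdJson), and the sketch assumes every row has the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a simdjson or Rcpp API): flatten rows collected in
// row-major order into the column-major layout R uses for matrices.
std::vector<double> to_column_major(const std::vector<std::vector<double>>& rows) {
  const std::size_t n_rows = rows.size();
  const std::size_t n_cols = n_rows == 0 ? 0 : rows[0].size();
  std::vector<double> out(n_rows * n_cols);
  for (std::size_t i = 0; i < n_rows; ++i) {
    assert(rows[i].size() == n_cols);  // assumption: rectangular input
    for (std::size_t j = 0; j < n_cols; ++j) {
      out[j * n_rows + i] = rows[i][j];  // element (row i, column j)
    }
  }
  return out;
}
```

Called on the rows {1, 2}, {3, 4}, {5, 6}, this yields 1 3 5 2 4 6, matching the as.vector() output above; the dimension attributes would still need to be attached on the R side.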

John Keiser · Answer 2 · Wed Oct 28 2020 23:41:02 GMT+0800 (China Standard Time)

I'd expect it to look like:

int i=0;
for (auto point : doc["coordinates"]) {
  int j=0;
  for (double val : point) {
    matrix[i][j] = val;
    j++;
  }
  i++;
}

knapply · Answer 3 · Thu Oct 29 2020 06:54:59 GMT+0800 (China Standard Time)

Thank you. Is there a way to get an array's size (so we can tell how big matrix needs to be)?

Daniel Lemire · Answer 4 · Thu Oct 29 2020 07:01:22 GMT+0800 (China Standard Time)

@knapply rbind?

Dirk Eddelbuettel · Answer 5 · Thu Oct 29 2020 07:11:56 GMT+0800 (China Standard Time)

No, for two reasons. Those things tend to be performance killers as the underlying R data structures don't grow easily. So a general Rcpp pattern is to collect everything in C++ (hello old friend STL) and then convert at end. Plus, rbind is at the R level and we are not going 'up and down' all the time. (Unless I misread your suggestion ...)

Daniel Lemire · Answer 6 · Thu Oct 29 2020 07:13:25 GMT+0800 (China Standard Time)

I was half kidding.

Dirk Eddelbuettel · Answer 7 · Thu Oct 29 2020 07:13:30 GMT+0800 (China Standard Time)

(And @lemire, for example, one of my all-time fave little tools in R is data.table::rbindlist(someList), which scans your list, allocates properly and then inserts. Runs circles around the standard do.call(rbind, someList).)

Dirk Eddelbuettel · Answer 8 · Thu Oct 29 2020 07:13:46 GMT+0800 (China Standard Time)

Hah! Point taken with a grin!

knapply · Answer 9 · Thu Oct 29 2020 07:14:44 GMT+0800 (China Standard Time)

Edit... welp, I typed too slow 😬

If only. I think I overcomplicated the question with a matrix.

Let's say we have an array...

[1, 2, 3, 4, 5,  6]

... and we want to insert the data into a vector (R/Rcpp or STL) equivalent to this...

std::vector<double>{1, 2, 3, 4, 5,  6};

We would need to know the size of the array...

std::vector<double> out(std::size(array));     // how do we do this?

int i = 0;
for (double element : array) {
  out[i++] = element;
}

Daniel Lemire · Answer 10 · Thu Oct 29 2020 07:16:20 GMT+0800 (China Standard Time)

std::vector<double> out;
for (double element : array) { out.push_back( element); }

knapply · Answer 11 · Thu Oct 29 2020 07:18:25 GMT+0800 (China Standard Time)

... still half kidding? :D

(I can't tell if knowing the size would even be possible with the On Demand API, but not knowing it sorta makes this a non-starter).

Daniel Lemire · Answer 12 · Thu Oct 29 2020 07:21:01 GMT+0800 (China Standard Time)

The above code is standard C++, although there is room for optimizations....
https://lemire.me/blog/2012/06/20/do-not-waste-time-with-stl-vectors/

John Keiser · Answer 13 · Thu Oct 29 2020 09:13:29 GMT+0800 (China Standard Time)

Yeah, using push_back is necessary, though we could probably expose an upper bound on the array size if you are willing to risk overallocation... still, I'd test the push_back method first; the cost of vector growth may well be lower than the cost of materializing a DOM.

Daniel Lemire · Answer 14 · Thu Oct 29 2020 09:29:48 GMT+0800 (China Standard Time)

For performance, you almost always want to overallocate if only temporarily.
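A minimal sketch of the "overallocate temporarily" idea, using only the standard library: reserve an upper bound once, push_back each element, then ask the vector to trim the excess. The helper name `collect` and the `upper_bound` parameter are hypothetical; the thread only suggests that simdjson could expose such a bound, not that it does.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copy any iterable range of numbers into a vector,
// reserving `upper_bound` slots up front and trimming the excess afterwards.
template <typename Range>
std::vector<double> collect(Range&& range, std::size_t upper_bound) {
  std::vector<double> out;
  out.reserve(upper_bound);  // over-allocate once
  for (double element : range) {
    out.push_back(element);  // no reallocation until size exceeds capacity
  }
  out.shrink_to_fit();       // non-binding request to release unused capacity
  return out;
}
```

If the bound turns out to be too small, push_back still grows the vector as usual, so an underestimate costs time rather than correctness.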