eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

fload from URL still slower than jsonlite fromJSON

melsiddieg opened this issue

Dear RcppSimdJson Team,

First, thanks a ton for this amazing work, well done! I have been waiting for this package for months, and I am so happy it is finally working. I have been begrudgingly 😄 using pysimdjson while waiting for the R port, so thanks for making this happen. I am a biomedical data scientist, R package developer, and educator. I tested your package on the COVID-19 data (tons of JSON files) and it works perfectly and insanely fast: I could parse out paper abstracts at a rate of ~2750 files per second on my laptop. I will be putting up a blog post about this in the near future. However, when parsing JSON directly from a URL, RcppSimdJson's fload is slightly slower than {jsonlite}'s fromJSON, even for large JSON results. I wonder if there is a plan to accelerate this functionality; it would be another killer feature. Note that I am already satisfied by the fact that we can finally get just the data we want from a URL using the query parameter in fload.
Best regards
Mohammed O.E Abdallah

when parsing json directly from url,

You need to give us something reproducible. And even then, the download part of it will be constant for both, so we're really back to parsing from file, no?

When downloading from a network, it is entirely possible for the network component to dominate the cost, and for variations in download speeds to dominate the statistical dispersion.

@melsiddieg What are your network speeds? E.g., can you download much faster than you can parse... tens of thousands of documents per second?

@lemire Thanks again for this amazing piece of work; see my CORD-19 use case above and how you helped me parse tons of JSON files in record time.

I have a decent internet speed, about 5 MB per second. I guess download speed varies from time to time, but I have tested this many times, and I expected RcppSimdJson to be much faster if the underlying curl download implementation is equivalent between RcppSimdJson and jsonlite.

This will download a 200-by-12 data frame:

url<-"http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
System.time(res<-Rcppsimdjson::fload(url))
System.time(res2<-jsonlite::fromJson(url))

@melsiddieg By now both Daniel and I have tried to explain to you that network access is not part of what we control, and hence not something you should measure comparatively.

So, as politely as I can, now for a third time: can you please time local file access comparisons?

(Also: take a look at the extensive GitHub documentation on code formatting. It is really a two-minute read, and it will make your writing easier for us to read and thus more likely to get proper attention.)

@eddelbuettel Sorry for the bad formatting; I am typing from a phone. Regarding local file access, I can confirm that RcppSimdJson is at least 20 to 30 times faster than jsonlite.

I edited your last post for you; you can take a look to see how it's done.

You may also want to learn about the packages rbenchmark or microbenchmark, which allow you to compare multiple expressions at once.

We have now demonstrated that there is no parsing slowdown, and that the "problem was between keyboard and chair": you were primarily timing the (apparently long and slow) download by mistake.

For completeness (and because a 30x speed jump is worth celebrating):

url <- "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"

temp_file <- tempfile()
download.file(url, temp_file)

bench <- microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::fload(temp_file),
    jsonlite = jsonlite::fromJSON(temp_file),
    times = 3
)

print(bench)
#> Unit: milliseconds
#>      expr        min         lq      mean     median        uq       max neval
#>  simdjson   4.850144   4.935208  21.81623   5.020272  30.29927  55.57827     3
#>  jsonlite 143.865467 147.642903 149.77416 151.420340 152.72850 154.03666     3

print(bench, unit = "relative")
#> Unit: relative
#>      expr     min       lq     mean   median       uq      max neval
#>  simdjson  1.0000  1.00000 1.000000  1.00000 1.000000 1.000000     3
#>  jsonlite 29.6621 29.91625 6.865263 30.16178 5.040666 2.771527     3

@melsiddieg If you are downloading at, say, 5 MB/s, then almost all of the time, all of the latency, will be due to the network transfer. So when you are benchmarking parser A and parser B... here is what I think happens:

99% of the time: downloading the data

1% of the time: parsing the data with parser A or parser B

Now if this is approximately correct, and it should be because you report really high speeds when parsing from a local source... then if you compare parser A and parser B, and if they have the same download speed, then at most 1% of the total time difference can be explained away by the performance of the parser. Almost all of the performance difference will be due to network variations.
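
To make this concrete, here is the arithmetic with purely illustrative numbers (the payload size and parse time below are assumptions, not measurements):

# Illustrative numbers only: a ~5 MB response at ~5 MB/s vs. a fast local parse
download_s <- 1.0    # ~1 s spent on the network
parse_s    <- 0.005  # ~5 ms to parse, in line with the local benchmark above
parse_s / (download_s + parse_s)  # ~0.005, i.e. the parser is ~0.5% of the total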

Can parser A have better network speed than parser B?

How could it be?

Let us consider one reason why that might be: one could benefit from buffered input. Right? So if you ask for a network resource, and then ask again for the same resource, the second time should be (statistically) faster because of all the caching being done. For example, your local network stack, or your router, or whatever, could be caching the resource.
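
A quick way to observe this (just a sketch, reusing the url from earlier in the thread; your timings will vary):

system.time(curl::curl_fetch_memory(url))  # first (cold) fetch
system.time(curl::curl_fetch_memory(url))  # same resource again: often faster, thanks to caching along the path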

In fact, the very same thing is true of disk... but the effect can be larger with networks.

So what you want to do is benchmark these components separately. Grab the data from the network; benchmark that. Then parse it; benchmark that. The parsing should be independent of everything else.
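
A minimal sketch of that separation, using the URL from earlier in the thread (exact numbers will of course differ):

url <- "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"

# 1. Benchmark the network component on its own.
microbenchmark::microbenchmark(
    download = curl::curl_download(url, tempfile()),
    times = 3
)

# 2. Download once, then benchmark only the parsing of the local copy.
local_copy <- curl::curl_download(url, tempfile())
microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::fload(local_copy),
    jsonlite = jsonlite::fromJSON(local_copy),
    times = 100
)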

What if the parser and networking are integrated? It is possible for a parser to have integrated network access: your parser could speak HTTP and deal with packets directly. It is unlikely that such a beast exists, but if it did, and you wanted to run benchmarks, then you would have to control for network variations. You would probably run the benchmarks over a simulated network connection.

@melsiddieg @lemire @eddelbuettel

Some recent enhancements are extremely relevant here and worth noting.

The default branch's fload() now has a compressed_download= parameter that, when TRUE, downloads the JSON as gzip-compressed data for some extra speed.

The main takeaway here is that, with compressed_download=TRUE, fload() holds its own without any additional dependencies.
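
Usage is a one-liner; a minimal sketch, with url standing in for any JSON endpoint:

url <- "http://example.com/data.json"  # placeholder endpoint
res <- RcppSimdJson::fload(url, compressed_download = TRUE)  # fetch gzip-compressed, then parse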

It's also now extremely flexible, since fparse() can handle raw vectors directly, both individually and in bulk when provided in a list. This makes it easy to extend efficiently, as it removes the requirement to materialize R strings (rawToChar()) that you'll encounter in any of the alternatives (from what I can tell).

For example, the raw vectors returned from curl::curl_fetch_memory(URL)$content or httr::GET(REST_API)$content can be parsed directly.
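
For instance, a minimal sketch (assuming the curl package and any JSON endpoint url):

raw_payload <- curl::curl_fetch_memory(url)$content  # a raw vector, no rawToChar() needed
parsed <- RcppSimdJson::fparse(raw_payload)

# In bulk: a list of raw vectors parses in a single call.
payloads <- lapply(c(url, url), function(.x) curl::curl_fetch_memory(.x)$content)
parsed_all <- RcppSimdJson::fparse(payloads)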

What follows are most (maybe all?) of the relevant strategies to consider (with some benchmarks).

Original Data

original_url <- "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"

original_bench <- microbenchmark::microbenchmark(
    simdj_fload = RcppSimdJson::fload(original_url),
    simd_compressed = RcppSimdJson::fload(original_url, compressed_download = TRUE),
    #                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^ now in the default branch
    simd_curl_download = RcppSimdJson::fload(curl::curl_download(original_url, tempfile())),
    simd_curl_fetch_mem = RcppSimdJson::fparse(curl::curl_fetch_memory(original_url)$content)
    ,
    js_fromJSON = jsonlite::fromJSON(original_url),
    js_curl_download = jsonlite::fromJSON(curl::curl_download(original_url, tempfile())),
    js_curl_fetch_mem = jsonlite::fromJSON(rawToChar(curl::curl_fetch_memory(original_url)$content))
    ,
    times = 3
)

print(original_bench, order = "median")
## Unit: seconds
##                 expr      min       lq     mean   median       uq      max neval
##   simd_curl_download 1.117939 1.123318 1.181905 1.128697 1.213889 1.299081     3
##  simd_curl_fetch_mem 1.105317 1.134276 1.156490 1.163234 1.182076 1.200918     3
##      simd_compressed 1.218079 1.223796 1.226708 1.229513 1.231022 1.232532     3
##          js_fromJSON 1.014312 1.140381 1.209773 1.266449 1.307504 1.348559     3
##    js_curl_fetch_mem 1.322404 1.359337 1.398827 1.396269 1.437038 1.477807     3
##     js_curl_download 1.259240 1.362296 1.437664 1.465352 1.526877 1.588401     3
##          simdj_fload 1.580044 1.741765 1.833092 1.903485 1.959616 2.015747     3

Scaling Up

big_json_url <- paste0("http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=", 
                       2000,
                       "&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false")

big_json_bench <- microbenchmark::microbenchmark(
    simd_fload = RcppSimdJson::fload(big_json_url),
    simd_compressed = RcppSimdJson::fload(big_json_url, compressed_download = TRUE),
    simd_curl_download = RcppSimdJson::fload(curl::curl_download(big_json_url, tempfile())),
    simd_curl_fetch_mem = RcppSimdJson::fparse(curl::curl_fetch_memory(big_json_url)$content)
    ,
    js_fromJSON = jsonlite::fromJSON(big_json_url),
    js_curl_download = jsonlite::fromJSON(curl::curl_download(big_json_url, tempfile())),
    js_curl_fetch_mem = jsonlite::fromJSON(rawToChar(curl::curl_fetch_memory(big_json_url)$content))
    ,
    times = 3
)

print(big_json_bench, order = "median")
## Unit: seconds
##                 expr      min       lq     mean   median       uq      max neval
##   simd_curl_download 2.064851 2.067025 2.133874 2.069198 2.168385 2.267572     3
##  simd_curl_fetch_mem 2.162997 2.168197 2.393001 2.173397 2.508003 2.842610     3
##      simd_compressed 2.032601 2.126485 2.230130 2.220369 2.328895 2.437420     3
##           simd_fload 3.090080 3.225657 3.500251 3.361235 3.705336 4.049438     3
##          js_fromJSON 3.619076 3.678450 3.765256 3.737824 3.838345 3.938867     3
##     js_curl_download 3.654222 3.794303 4.018802 3.934384 4.201093 4.467802     3
##    js_curl_fetch_mem 3.800374 3.959049 4.284949 4.117724 4.527237 4.936750     3

Multiple URLs

curl_async <- function(urls, .parse) { # rather than cram this inline...
    out <- list()
    for (i in seq_along(urls)) {
        # schedule each request; `done` collects the raw payload as it arrives
        curl::curl_fetch_multi(urls[[i]],
                               done = function(res) out <<- c(out, list(res$content)),
                               fail = function(msg) stop(msg))
    }
    curl::multi_run()  # perform all scheduled requests concurrently
    .parse(out)        # hand the list of raw vectors to the supplied parser
}
multi_url <- paste0("http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=", 
                    c(100, 200, 300, 400),
                    "&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false")

multi_json_bench <- microbenchmark::microbenchmark(
    simd_fload = RcppSimdJson::fload(multi_url),
    
    simd_compressed = RcppSimdJson::fload(multi_url, compressed_download = TRUE),
    
    simd_curl_download = RcppSimdJson::fload(
        vapply(multi_url, curl::curl_download, character(1L), destfile = tempfile())
    ),
    
    simd_curl_fetch_mem = RcppSimdJson::fparse(
        lapply(multi_url, function(.x) curl::curl_fetch_memory(.x)$content)
    ),
    
    simd_curl_async = curl_async(multi_url, .parse = RcppSimdJson::fparse)
    
    ,
    
    js_fromJSON = lapply(multi_url, jsonlite::fromJSON),
    
    js_curl_download = lapply(
        vapply(multi_url, curl::curl_download, character(1L), destfile = tempfile()), 
        jsonlite::fromJSON
    ),
    
    js_curl_fetch_mem = lapply(
        multi_url, 
        function(.x) jsonlite::fromJSON(rawToChar(curl::curl_fetch_memory(.x)$content))
    ),
    
    js_curl_async = curl_async(
        multi_url, 
        .parse = function(.raw_list) lapply(.raw_list, function(.raw_vec) jsonlite::fromJSON(rawToChar(.raw_vec)))
    )
    
    ,
    
    times = 3
)

print(multi_json_bench, order = "median")
## Unit: seconds
##                 expr      min       lq     mean   median       uq      max neval
##      simd_curl_async 1.164452 1.183243 1.194229 1.202034 1.209117 1.216200     3
##      simd_compressed 1.241268 1.269233 1.512484 1.297198 1.648091 1.998985     3
##           simd_fload 1.905494 1.939829 2.020154 1.974165 2.077484 2.180803     3
##        js_curl_async 1.995319 2.041539 2.080569 2.087758 2.123195 2.158631     3
##   simd_curl_download 3.826150 3.934164 4.014818 4.042179 4.109153 4.176126     3
##  simd_curl_fetch_mem 3.944343 4.056987 4.390562 4.169632 4.613671 5.057710     3
##    js_curl_fetch_mem 4.531811 4.830516 6.326096 5.129222 7.223239 9.317256     3
##          js_fromJSON 5.249267 5.270221 5.323240 5.291175 5.360226 5.429277     3
##     js_curl_download 5.370540 5.816787 6.579072 6.263034 7.183338 8.103642     3

@knapply Many thanks, I am glad this has made it into RcppSimdJson. I had personally created my own package, https://github.com/melsiddieg/furly, to solve this problem using RcppSimdJson::fparse on async curl, and it is almost identical to your simd_curl_async function. I needed to do that as I do a lot of multiple-URL JSON parsing in my bioinformatics job. Anyway, this is really great news!

@knapply Adding the ability to handle raw vectors to fparse is fantastic news! I will update my package accordingly.