rcppsimdjson has inefficient network downloads
lemire opened this issue · comments
I think that @melsiddieg was correct to complain about the performance of RcppSimdJson::fload. Something does not add up. Given how fast RcppSimdJson is, it should be roughly as fast as curl::curl_download. But it is not!
> url<-"http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
> res <- microbenchmark::microbenchmark(straight = curl::curl_download(url, tempfile()),
jsonlite = jsonlite::fromJSON(url),
simdjson = RcppSimdJson::fload(url),
times = 5L)
> print(res)
Unit: milliseconds
expr min lq mean median uq max neval
straight 567.2850 568.8655 595.1786 570.3718 580.1068 689.2641 5
jsonlite 714.3094 721.5960 733.4962 737.1303 744.9391 749.5061 5
simdjson 2498.4776 2616.6448 2620.8897 2629.1768 2641.9882 2718.1610 5
The file is 784 KB. If you replace tempfile() with a file name and inspect the result, you will find that, indeed, curl is grabbing every single byte.
So it seems that RcppSimdJson could go faster by invoking curl_download and then parsing the resulting temporary file. It is also possible to load the file directly into memory (curl_fetch_memory), but I did not want to use that as a benchmark since you might argue (rightly so) that it would be cheating.
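The in-memory variant mentioned here can be sketched roughly as follows. This is an illustrative assumption about the wiring, not RcppSimdJson's actual code; the helper name is hypothetical, and it relies on curl_fetch_memory() returning the body as a raw vector in $content and on fparse() accepting raw input:

```r
## Sketch (hypothetical helper): skip the temporary file entirely by
## fetching the response into memory and handing the raw bytes to the parser.
fload_via_memory <- function(url, ...) {
  raw_json <- curl::curl_fetch_memory(url)$content  # raw vector of the body
  RcppSimdJson::fparse(raw_json, ...)               # fparse() accepts raw input
}
```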
I am not 100% clear on why there is such a difference, but it does warrant investigation.
@melsiddieg If you have done the leg work of changing the code and running benchmarks, maybe you could issue a PR and include benchmark results... this might help motivate @eddelbuettel.
Sure, no problem. It is actually a simple tweak.
Please please please please hold off.
I have a few decades of experience with R, expected something like this, and can tell you: it is complicated.
The code does the right thing by defaulting to method="auto". You yourself are free to choose other download methods; see help(download.file) and other resources.
@lemire and with all due respect, the benchmark above is silly. You are doing precisely what you belabored earlier.
Maybe this is fairer:
url <- "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
res <- microbenchmark::microbenchmark(straight = curl::curl_download(url, tempfile()),
baseR = download.file(url, tempfile()),
times = 5L)
(still needs a quiet= argument for both)
leading to
R> res
Unit: milliseconds
expr min lq mean median uq max neval cld
straight 520.735 525.681 569.826 547.477 577.775 677.463 5 a
baseR 1086.744 1184.752 5415.459 1336.059 1389.182 22080.559 5 a
R>
So select curl over auto and you get a speedup. But I will argue that this belongs in your ~/.Rprofile and not in our package.
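The ~/.Rprofile suggestion can be made concrete. A minimal sketch: download.file() reads its default method from the "download.file.method" option, so setting it once at startup affects every caller, including packages that call download.file() internally:

```r
## In ~/.Rprofile: make download.file() default to the external curl method
## for this user, rather than changing any package's code.
options(download.file.method = "curl")
```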
@eddelbuettel What about RcppSimdJson using curl_download by default?
@eddelbuettel @lemire I think someone else could implement it in a separate R package or something; still, it seems like a small imperfection in this otherwise brilliant package.
@melsiddieg Have you considered not benchmarking download and parse together? I really do not care who wins; that is like timing who boils an egg faster when you include a drive to the store to buy eggs.
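Separating the two steps can be sketched like this; the inline JSON is a stand-in for the real download, and the commented microbenchmark calls are illustrative, assuming those packages are installed:

```r
## Separate download from parse: fetch the file once, then time only parsing.
tf <- tempfile(fileext = ".json")
writeLines('{"gene":"TET1","snps":[1,2,3]}', tf)  # stand-in for curl::curl_download(url, tf)

## With the real file you would instead run, e.g.:
##   microbenchmark::microbenchmark(
##     simdjson = RcppSimdJson::fload(tf),
##     jsonlite = jsonlite::fromJSON(tf),
##     times = 5L)

## Base-R timing helper: mean elapsed seconds per parse over `reps` runs.
time_parse <- function(parse_fn, reps = 5L) {
  t0 <- proc.time()["elapsed"]
  for (i in seq_len(reps)) parse_fn(tf)
  unname((proc.time()["elapsed"] - t0) / reps)
}
```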
Here is mine, updated over all method arguments for download.file. curl wins. But I am the maintainer here, and I will not add a new dependency just because you cannot stop talking about a faster (and, clearly, see below, chunked-in-parallel) download method. I have limited time and would prefer to concentrate on things that matter. This one does not, because you can always separate download and parse.
Output
edd@rob:~/git/rcppsimdjson(master)$ Rscript demo/downloadBenchmark.R
Unit: milliseconds
expr min lq mean median uq max neval cld
straight 554.601 610.33 607.669 621.958 624.695 626.763 5 a
baseRwget 2339.938 2353.82 2557.452 2385.150 2808.722 2899.631 5 a
baseRinternal 2160.950 2395.10 5600.646 2429.357 2615.772 18402.050 5 a
baseRlibcurl 1407.968 2458.61 2582.007 2512.940 2811.564 3718.959 5 a
baseRauto 1399.441 2331.44 2405.339 2541.951 2781.279 2972.583 5 a
baseRcurl 2164.464 2221.29 2423.828 2547.877 2559.365 2626.139 5 a
Unit: relative
expr min lq mean median uq max neval cld
straight 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 5 a
baseRwget 4.21914 3.85664 4.20863 3.83491 4.49615 4.62636 5 a
baseRinternal 3.89641 3.92427 9.21660 3.90599 4.18728 29.36048 5 a
baseRlibcurl 2.53870 4.02832 4.24903 4.04037 4.50070 5.93360 5 a
baseRauto 2.52333 3.81996 3.95830 4.08702 4.45222 4.74276 5 a
baseRcurl 3.90274 3.63950 3.98873 4.09654 4.09698 4.19000 5 a
edd@rob:~/git/rcppsimdjson(master)$
Code
url <- "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
res <- microbenchmark::microbenchmark(straight = curl::curl_download(url, tempfile()),
baseRauto = download.file(url, tempfile(), quiet=TRUE),
baseRinternal = download.file(url, tempfile(), method="internal", quiet=TRUE),
baseRlibcurl = download.file(url, tempfile(), method="libcurl", quiet=TRUE),
baseRwget = download.file(url, tempfile(), method="wget", quiet=TRUE),
baseRcurl = download.file(url, tempfile(), method="curl", quiet=TRUE),
times = 5L)
print(res, order="median")
print(res, order="median", unit="relative")
@eddelbuettel I speak with the utmost respect and love for both of you and the whole RcppSimdJson team. My point is that fload does both download and parse, and that the parsing part is flawless and blazing fast; it is a matter of whether we care about optimizing the speed of downloads as well. In my opinion, it is a shame to have a blazing-fast parser sitting behind slow downloaders. The whole point of RcppSimdJson is speed, and with the default download.file that advantage is wasted when loading from a URL, which is a common, if not the most common, use case for JSON parsers.
myfload <- function(file, ...) {
  tf <- tempfile()
  curl::curl_download(file, tf)  # fast download to a temporary file
  RcppSimdJson::fload(tf, ...)   # fast parse of the local copy
}
Untested but you get the idea.
I have locked this, and I would appreciate it if you could respect that by not opening another issue on the same topic. The issue is closed. We can revisit it in two weeks or so, which gives us some time to look at things that may be more consequential than relitigating this question. Thanks for your patience.