rcppsimdjson has inefficient network downloads
lemire opened this issue · comments
I think that @melsiddieg was correct to complain about the performance of RcppSimdJson::fload. Something does not add up. Given how fast RcppSimdJson is, it should be roughly as fast as curl::curl_download. But it is not!
> url<-"http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
> res <- microbenchmark::microbenchmark(straight = curl::curl_download(url, tempfile()),
jsonlite = jsonlite::fromJSON(url),
simdjson = RcppSimdJson::fload(url),
times = 5L)
> print(res)
Unit: milliseconds
expr min lq mean median uq max neval
straight 567.2850 568.8655 595.1786 570.3718 580.1068 689.2641 5
jsonlite 714.3094 721.5960 733.4962 737.1303 744.9391 749.5061 5
simdjson 2498.4776 2616.6448 2620.8897 2629.1768 2641.9882 2718.1610 5
The file is 784 KB. If you replace tempfile() with a file name and inspect the result, you will find that, indeed, curl is grabbing every single byte.
So it seems that RcppSimdJson could go faster by invoking curl_download and then parsing the resulting temporary file. It is also possible to load the file directly into memory (curl_fetch_memory), but I did not want to use that as a benchmark since you might argue (rightly so) that it would be cheating.
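The in-memory variant mentioned here can be sketched roughly as follows. This is an illustrative assumption about the wiring, not RcppSimdJson's actual code; the helper name is hypothetical, and it relies on curl_fetch_memory() returning the body as a raw vector in $content and on fparse() accepting raw input:

```r
## Sketch (hypothetical helper): skip the temporary file entirely by
## fetching the response into memory and handing the raw bytes to the parser.
fload_via_memory <- function(url, ...) {
  raw_json <- curl::curl_fetch_memory(url)$content  # raw vector of the body
  RcppSimdJson::fparse(raw_json, ...)               # fparse() accepts raw input
}
```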
I am not 100% clear on why there is such a difference, but it does warrant investigation.
@melsiddieg If you have done the leg work of changing the code and running benchmarks, maybe you could issue a PR and include benchmark results... this might help motivate @eddelbuettel.
Sure, no problem. It is actually a simple tweak.
Please please please please hold off.
I have a few decades of experience with R, expected something like this, and can tell you: it is complicated.
The code does the right thing by defaulting to method="auto". You yourself are free to choose other download methods; see help(download.file) and other resources.
@lemire and with all due respect, the benchmark above is silly. You are doing precisely what you belabored earlier.
Maybe this is fairer:
url <- "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
res <- microbenchmark::microbenchmark(straight = curl::curl_download(url, tempfile()),
baseR = download.file(url, tempfile()),
times = 5L)
(still needs a quiet= argument for both)
leading to
R> res
Unit: milliseconds
expr min lq mean median uq max neval cld
straight 520.735 525.681 569.826 547.477 577.775 677.463 5 a
baseR 1086.744 1184.752 5415.459 1336.059 1389.182 22080.559 5 a
R>
So select curl over auto and you get a speedup. But I will argue that this belongs in your ~/.Rprofile and not in our package.
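The ~/.Rprofile suggestion can be made concrete. A minimal sketch: download.file() reads its default method from the "download.file.method" option, so setting it once at startup affects every caller, including packages that call download.file() internally:

```r
## In ~/.Rprofile: make download.file() default to the external curl method
## for this user, rather than changing any package's code.
options(download.file.method = "curl")
```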
@eddelbuettel What about RcppSimdJson using curl_download by default?
@eddelbuettel @lemire I think someone else could implement it in a separate R package or something; still, it seems like a small imperfection in this otherwise brilliant package.
@melsiddieg Have you considered not benchmarking download and parse together? I really do not care who wins; that is like timing who boils an egg faster when you include a drive to the store to buy eggs.
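Separating the two steps can be sketched like this; the inline JSON is a stand-in for the real download, and the commented microbenchmark calls are illustrative, assuming those packages are installed:

```r
## Separate download from parse: fetch the file once, then time only parsing.
tf <- tempfile(fileext = ".json")
writeLines('{"gene":"TET1","snps":[1,2,3]}', tf)  # stand-in for curl::curl_download(url, tf)

## With the real file you would instead run, e.g.:
##   microbenchmark::microbenchmark(
##     simdjson = RcppSimdJson::fload(tf),
##     jsonlite = jsonlite::fromJSON(tf),
##     times = 5L)

## Base-R timing helper: mean elapsed seconds per parse over `reps` runs.
time_parse <- function(parse_fn, reps = 5L) {
  t0 <- proc.time()["elapsed"]
  for (i in seq_len(reps)) parse_fn(tf)
  unname((proc.time()["elapsed"] - t0) / reps)
}
```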
Here is mine, updated over all method arguments for download.file. curl wins. But I am the maintainer here, and I will not add a new dependency just because you cannot stop talking about a faster (and, clearly, see below, chunked-in-parallel) download method. I have limited time and would prefer to concentrate on things that matter. This one does not, because you can always separate download and parse.
Output
edd@rob:~/git/rcppsimdjson(master)$ Rscript demo/downloadBenchmark.R
Unit: milliseconds
expr min lq mean median uq max neval cld
straight 554.601 610.33 607.669 621.958 624.695 626.763 5 a
baseRwget 2339.938 2353.82 2557.452 2385.150 2808.722 2899.631 5 a
baseRinternal 2160.950 2395.10 5600.646 2429.357 2615.772 18402.050 5 a
baseRlibcurl 1407.968 2458.61 2582.007 2512.940 2811.564 3718.959 5 a
baseRauto 1399.441 2331.44 2405.339 2541.951 2781.279 2972.583 5 a
baseRcurl 2164.464 2221.29 2423.828 2547.877 2559.365 2626.139 5 a
Unit: relative
expr min lq mean median uq max neval cld
straight 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 5 a
baseRwget 4.21914 3.85664 4.20863 3.83491 4.49615 4.62636 5 a
baseRinternal 3.89641 3.92427 9.21660 3.90599 4.18728 29.36048 5 a
baseRlibcurl 2.53870 4.02832 4.24903 4.04037 4.50070 5.93360 5 a
baseRauto 2.52333 3.81996 3.95830 4.08702 4.45222 4.74276 5 a
baseRcurl 3.90274 3.63950 3.98873 4.09654 4.09698 4.19000 5 a
edd@rob:~/git/rcppsimdjson(master)$
Code
url <- "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
res <- microbenchmark::microbenchmark(straight = curl::curl_download(url, tempfile()),
baseRauto = download.file(url, tempfile(), quiet=TRUE),
baseRinternal = download.file(url, tempfile(), method="internal", quiet=TRUE),
baseRlibcurl = download.file(url, tempfile(), method="libcurl", quiet=TRUE),
baseRwget = download.file(url, tempfile(), method="wget", quiet=TRUE),
baseRcurl = download.file(url, tempfile(), method="curl", quiet=TRUE),
times = 5L)
print(res, order="median")
print(res, order="median", unit="relative")
@eddelbuettel I speak with the utmost respect and love for both of you and the whole RcppSimdJson team. My point is that fload does both download and parse, and that the parsing part is flawless and blazing fast; it is a matter of whether we care about optimizing the speed of downloads as well. In my opinion, it is a shame to have a blazing-fast parser sitting behind slow downloaders. The whole point of RcppSimdJson is speed, and with the default download.file that advantage is wasted when loading from a URL, which is a common, if not the most common, use case for JSON parsers.
myfload <- function(file, ...) {
  tf <- tempfile()
  curl::curl_download(file, tf)  # fast download to a temporary file
  RcppSimdJson::fload(tf, ...)   # fast parse of the local copy
}
Untested but you get the idea.
I have locked this, and I would appreciate it if you could respect that by not opening another issue on the same topic. The issue is closed. We can revisit it in two weeks or so, which gives us some time to look at things that may be more consequential than relitigating this question. Thanks for your patience.