no JSON found for valid URL
nathaneastwood opened this issue · comments
When attempting to download the Fantasy Premier League data, I get an error.
Reproducible example:
RcppSimdJson::fload("https://fantasy.premierleague.com/api/element-summary/599/", verbose = TRUE)
# trying URL 'https://fantasy.premierleague.com/api/element-summary/599/'
# downloaded 0 bytes
#
# Error in .load_json(json = input, query = query, empty_array = empty_array, :
# Empty: no JSON found
(adjust the 599 as needed, though this is the last player in the game). You can visit the url and see that the data are in fact there.
FWIW, the data download fine with jsonlite (only the names of the output shown):
names(jsonlite::fromJSON("https://fantasy.premierleague.com/api/element-summary/599/"))
# [1] "fixtures" "history" "history_past"
You can also wget the data with ease:
wget https://fantasy.premierleague.com/api/element-summary/599/ -O football.json
Similar JSON files from the site such as this also do not work. Let me know if you need more.
It works in an interactive session, but I can reproduce by running in a non-interactive session (e.g. reprex::reprex()).
Interactive Session
interactive()
#> [1] TRUE
target_url <- "https://fantasy.premierleague.com/api/element-summary/599/"
RcppSimdJson::fload(target_url)
#> $fixtures
#> id code team_h team_h_score team_a team_a_score event finished minutes provisional_start_time kickoff_time event_name is_home difficulty
#> 1 82 2128369 10 NA 1 NA 9 FALSE 0 FALSE 2020-11-22T16:30:00Z Gameweek 9 TRUE 4
#> 2 93 2128380 7 NA 10 NA 10 FALSE 0 FALSE 2020-11-28T17:30:00Z Gameweek 10 FALSE 3
#> 3 102 2128389 5 NA 10 NA 11 FALSE 0 FALSE 2020-12-05T15:00:00Z Gameweek 11 FALSE 4
#> 4 113 2128400 10 NA 19 NA 12 FALSE 0 FALSE 2020-12-12T15:00:00Z Gameweek 12 TRUE 3
#> 5 122 2128409 10 NA 14 NA 13 FALSE 0 FALSE 2020-12-15T19:45:00Z Gameweek 13 TRUE 2
#> 6 134 2128421 13 NA 10 NA 14 FALSE 0 FALSE 2020-12-19T15:00:00Z Gameweek 14 FALSE 4
#> 7 142 2128429 10 NA 4 NA 15 FALSE 0 FALSE 2020-12-26T15:00:00Z Gameweek 15 TRUE 2
#> 8 158 2128445 18 NA 10 NA 16 FALSE 0 FALSE 2020-12-28T15:00:00Z Gameweek 16 FALSE 2
#> 9 167 2128454 17 NA 10 NA 17 FALSE 0 FALSE 2021-01-02T15:00:00Z Gameweek 17 FALSE 4
#> 10 172 2128459 10 NA 16 NA 18 FALSE 0 FALSE 2021-01-12T19:45:00Z Gameweek 18 TRUE 2
#> 11 182 2128469 10 NA 3 NA 19 FALSE 0 FALSE 2021-01-16T15:00:00Z Gameweek 19 TRUE 2
#> 12 196 2128483 14 NA 10 NA 20 FALSE 0 FALSE 2021-01-27T19:45:00Z Gameweek 20 FALSE 3
#> 13 204 2128491 9 NA 10 NA 21 FALSE 0 FALSE 2021-01-30T15:00:00Z Gameweek 21 FALSE 4
#> 14 212 2128499 10 NA 7 NA 22 FALSE 0 FALSE 2021-02-02T19:45:00Z Gameweek 22 TRUE 3
#> 15 222 2128509 10 NA 6 NA 23 FALSE 0 FALSE 2021-02-06T15:00:00Z Gameweek 23 TRUE 2
#> 16 229 2128516 1 NA 10 NA 24 FALSE 0 FALSE 2021-02-13T15:00:00Z Gameweek 24 FALSE 4
#> 17 248 2128535 20 NA 10 NA 25 FALSE 0 FALSE 2021-02-20T15:00:00Z Gameweek 25 FALSE 3
#> 18 252 2128539 10 NA 2 NA 26 FALSE 0 FALSE 2021-02-27T15:00:00Z Gameweek 26 TRUE 3
#> 19 268 2128555 19 NA 10 NA 27 FALSE 0 FALSE 2021-03-06T15:00:00Z Gameweek 27 FALSE 3
#> 20 273 2128560 10 NA 5 NA 28 FALSE 0 FALSE 2021-03-13T15:00:00Z Gameweek 28 TRUE 4
#> 21 282 2128569 8 NA 10 NA 29 FALSE 0 FALSE 2021-03-20T15:00:00Z Gameweek 29 FALSE 2
#> 22 293 2128580 10 NA 15 NA 30 FALSE 0 FALSE 2021-04-03T14:00:00Z Gameweek 30 TRUE 3
#> 23 304 2128591 12 NA 10 NA 31 FALSE 0 FALSE 2021-04-10T14:00:00Z Gameweek 31 FALSE 5
#> 24 313 2128600 10 NA 11 NA 32 FALSE 0 FALSE 2021-04-17T14:00:00Z Gameweek 32 TRUE 4
#> 25 322 2128609 10 NA 13 NA 33 FALSE 0 FALSE 2021-04-24T14:00:00Z Gameweek 33 TRUE 4
#> 26 329 2128616 3 NA 10 NA 34 FALSE 0 FALSE 2021-05-01T14:00:00Z Gameweek 34 FALSE 3
#> 27 342 2128629 10 NA 17 NA 35 FALSE 0 FALSE 2021-05-08T14:00:00Z Gameweek 35 TRUE 4
#> 28 350 2128637 4 NA 10 NA 36 FALSE 0 FALSE 2021-05-11T18:45:00Z Gameweek 36 FALSE 2
#> 29 366 2128653 16 NA 10 NA 37 FALSE 0 FALSE 2021-05-15T14:00:00Z Gameweek 37 FALSE 3
#> 30 372 2128659 10 NA 18 NA 38 FALSE 0 FALSE 2021-05-23T15:00:00Z Gameweek 38 TRUE 2
#>
#> $history
#> element fixture opponent_team total_points was_home kickoff_time team_h_score team_a_score round minutes goals_scored assists clean_sheets goals_conceded own_goals penalties_saved
#> 1 599 72 6 0 FALSE 2020-11-07T15:00:00Z 4 1 8 0 0 0 0 0 0 0
#> penalties_missed yellow_cards red_cards saves bonus bps influence creativity threat ict_index value transfers_balance selected transfers_in transfers_out
#> 1 0 0 0 0 0 0 0.0 0.0 0.0 0.0 45 0 0 0 0
#>
#> $history_past
#> NULL
Long story short: it turns out there's an issue with utils::download.file() when run non-interactively:
interactive()
#> [1] FALSE
target_url <- "http://fantasy.premierleague.com/api/element-summary/599/"
temp_file <- tempfile(fileext = ".json")
download.file(target_url, destfile = temp_file, method = "auto")
file.size(temp_file)
#> [1] 0
download.file(target_url, destfile = temp_file, method = "curl")
file.size(temp_file)
#> [1] 0
download.file(target_url, destfile = temp_file, method = "libcurl")
file.size(temp_file)
#> [1] 0
download.file(target_url, destfile = temp_file, method = "wget")
file.size(temp_file)
#> [1] 8226
@nathaneastwood, can you provide your sessionInfo() and verify this is happening in a non-interactive session on your end?
If that's the case, here's a temporary workaround until we find the fix (since you have wget available):
interactive()
#> [1] FALSE
fload_url_workaround <- function(URL) {
temp_file <- tempfile()
download.file(
url = URL,
destfile = temp_file,
method = "wget" # works
)
RcppSimdJson::fload(temp_file)
}
target_url <- "https://fantasy.premierleague.com/api/element-summary/599/"
fload_url_workaround(target_url)
#> $fixtures
#> id code team_h team_h_score team_a team_a_score event finished minutes provisional_start_time kickoff_time event_name is_home difficulty
#> 1 82 2128369 10 NA 1 NA 9 FALSE 0 FALSE 2020-11-22T16:30:00Z Gameweek 9 TRUE 4
#> 2 93 2128380 7 NA 10 NA 10 FALSE 0 FALSE 2020-11-28T17:30:00Z Gameweek 10 FALSE 3
#> 3 102 2128389 5 NA 10 NA 11 FALSE 0 FALSE 2020-12-05T15:00:00Z Gameweek 11 FALSE 4
#> 4 113 2128400 10 NA 19 NA 12 FALSE 0 FALSE 2020-12-12T15:00:00Z Gameweek 12 TRUE 3
#> 5 122 2128409 10 NA 14 NA 13 FALSE 0 FALSE 2020-12-15T19:45:00Z Gameweek 13 TRUE 2
#> 6 134 2128421 13 NA 10 NA 14 FALSE 0 FALSE 2020-12-19T15:00:00Z Gameweek 14 FALSE 4
#> 7 142 2128429 10 NA 4 NA 15 FALSE 0 FALSE 2020-12-26T15:00:00Z Gameweek 15 TRUE 2
#> 8 158 2128445 18 NA 10 NA 16 FALSE 0 FALSE 2020-12-28T15:00:00Z Gameweek 16 FALSE 2
#> 9 167 2128454 17 NA 10 NA 17 FALSE 0 FALSE 2021-01-02T15:00:00Z Gameweek 17 FALSE 4
#> 10 172 2128459 10 NA 16 NA 18 FALSE 0 FALSE 2021-01-12T19:45:00Z Gameweek 18 TRUE 2
#> 11 182 2128469 10 NA 3 NA 19 FALSE 0 FALSE 2021-01-16T15:00:00Z Gameweek 19 TRUE 2
#> 12 196 2128483 14 NA 10 NA 20 FALSE 0 FALSE 2021-01-27T19:45:00Z Gameweek 20 FALSE 3
#> 13 204 2128491 9 NA 10 NA 21 FALSE 0 FALSE 2021-01-30T15:00:00Z Gameweek 21 FALSE 4
#> 14 212 2128499 10 NA 7 NA 22 FALSE 0 FALSE 2021-02-02T19:45:00Z Gameweek 22 TRUE 3
#> 15 222 2128509 10 NA 6 NA 23 FALSE 0 FALSE 2021-02-06T15:00:00Z Gameweek 23 TRUE 2
#> 16 229 2128516 1 NA 10 NA 24 FALSE 0 FALSE 2021-02-13T15:00:00Z Gameweek 24 FALSE 4
#> 17 248 2128535 20 NA 10 NA 25 FALSE 0 FALSE 2021-02-20T15:00:00Z Gameweek 25 FALSE 3
#> 18 252 2128539 10 NA 2 NA 26 FALSE 0 FALSE 2021-02-27T15:00:00Z Gameweek 26 TRUE 3
#> 19 268 2128555 19 NA 10 NA 27 FALSE 0 FALSE 2021-03-06T15:00:00Z Gameweek 27 FALSE 3
#> 20 273 2128560 10 NA 5 NA 28 FALSE 0 FALSE 2021-03-13T15:00:00Z Gameweek 28 TRUE 4
#> 21 282 2128569 8 NA 10 NA 29 FALSE 0 FALSE 2021-03-20T15:00:00Z Gameweek 29 FALSE 2
#> 22 293 2128580 10 NA 15 NA 30 FALSE 0 FALSE 2021-04-03T14:00:00Z Gameweek 30 TRUE 3
#> 23 304 2128591 12 NA 10 NA 31 FALSE 0 FALSE 2021-04-10T14:00:00Z Gameweek 31 FALSE 5
#> 24 313 2128600 10 NA 11 NA 32 FALSE 0 FALSE 2021-04-17T14:00:00Z Gameweek 32 TRUE 4
#> 25 322 2128609 10 NA 13 NA 33 FALSE 0 FALSE 2021-04-24T14:00:00Z Gameweek 33 TRUE 4
#> 26 329 2128616 3 NA 10 NA 34 FALSE 0 FALSE 2021-05-01T14:00:00Z Gameweek 34 FALSE 3
#> 27 342 2128629 10 NA 17 NA 35 FALSE 0 FALSE 2021-05-08T14:00:00Z Gameweek 35 TRUE 4
#> 28 350 2128637 4 NA 10 NA 36 FALSE 0 FALSE 2021-05-11T18:45:00Z Gameweek 36 FALSE 2
#> 29 366 2128653 16 NA 10 NA 37 FALSE 0 FALSE 2021-05-15T14:00:00Z Gameweek 37 FALSE 3
#> 30 372 2128659 10 NA 18 NA 38 FALSE 0 FALSE 2021-05-23T15:00:00Z Gameweek 38 TRUE 2
#>
#> $history
#> element fixture opponent_team total_points was_home kickoff_time team_h_score team_a_score round minutes goals_scored assists clean_sheets goals_conceded own_goals penalties_saved
#> 1 599 72 6 0 FALSE 2020-11-07T15:00:00Z 4 1 8 0 0 0 0 0 0 0
#> penalties_missed yellow_cards red_cards saves bonus bps influence creativity threat ict_index value transfers_balance selected transfers_in transfers_out
#> 1 0 0 0 0 0 0 0.0 0.0 0.0 0.0 45 0 0 0 0
#>
#> $history_past
#> NULL
Wow. What the effing eff.
That's the type of host-system-dependent hell ("do we have wget, yes/no?") I'd rather not go down. Ditto with all things proxies and whatnot. Part of me just wants to document it ("if your remote URL is non-standard, expect surprises") and move on.
Do we understand the problem here? What is non-standard about this URI?
Standard URI: https://some.server.com/some/path/file.json
Non-standard: https://some.server.com/foo/bar
If you scroll up you see the URL that @nathaneastwood used. It is of the 2nd form. I had verified when he DMed me that it fails and suggested he file an issue. I had NOT grokked the interaction between interactive mode and however R implements download.file() [in short: many options, from internal to external to building with libcurl to ...], but @knapply was more rigorous here.
But in short, none of this has anything to do with simdjson. All this is inside baseball about how R tries (and fails) to be helpful as a shell and environment. Which is, after all, really hard. Across OSs. And builds.
Thanks for the suggestion @knapply but I think I'd rather stick to using jsonlite as, whilst your solution might work for me, it wouldn't be great inside a distributed package.
You could write a two-line helper function that downloads to a local tempfile, possibly even giving the tempfile a json extension (yay my patch to R years ago to enable a tempfile extension) and then point RcppSimdJson at the now-local file.
That's not a bad suggestion, however I was really hoping to take advantage of the speed of RcppSimdJson, as I need to grab all 599 URLs in one go, i.e.
https://fantasy.premierleague.com/api/element-summary/1/
...
https://fantasy.premierleague.com/api/element-summary/599/
I don't want to "download" multiple files at once since I will be restricted by network speeds and downloading that many files isn't ideal either.
FWIW:
r$> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RcppSimdJson_0.1.3
loaded via a namespace (and not attached):
[1] compiler_4.0.3 Rcpp_1.0.5 jsonlite_1.7.1
@eddelbuettel Now that you explain it, it does feel like documentation is the way to go... and possibly an informative error message when the download fails?
I don't want to "download" multiple files at once since I will be restricted by network speeds
I probably misunderstand but you will have to download the files no matter what.
If I had lots of files to parse, I’d be tempted to multithread the processing. Issue multiple file downloads at the same time (say 5 or 10)... and start parsing the first file that is done downloading. So have some kind of priority queue and just grab the next fully downloaded file.
I realize the multithreaded processing could be tricky in R, but if you did so... I’d expect good performance gains if you are clever.
but you will have to download the files no matter what.
I also fear that @nathaneastwood wished that part away. fload() does just that for you, but is being tricked by R, leading to this issue. One way or another you have to schlepp all those soccer-result bytes onto your machine. No way around it.
(@lemire Just FYI, R, while single-threaded, ships with the parallel package, which has standard workhorses like mclapply to apply a function in parallel. There are still issues between OSs that fork() and those that don't, but it is there...)
Fair enough, I guess today I learned something 👍
There is a Python library doing what I want to do and it uses asyncio under the hood. The resulting JSON files are downloaded virtually instantly, whereas looping over them (lapply() + jsonlite::fromJSON()) in R takes ~13 seconds on my machine. So I will try and parallelise it and give that a go. I shall report back.
Not the most robust test but...
r$> system.time({
on.exit(if (exists("cl")) parallel::stopCluster(cl = cl), add = TRUE)
cl <- parallel::makeCluster(parallel::detectCores())
res <- parallel::parLapply(
cl = cl,
X = paste0("https://fantasy.premierleague.com/api/element-summary/", 1:599, "/"),
fun = fload_url_workaround
)
})
user system elapsed
0.095 0.100 16.436
r$> system.time({
on.exit(if (exists("cl")) parallel::stopCluster(cl = cl), add = TRUE)
cl <- parallel::makeCluster(parallel::detectCores())
res <- parallel::parLapply(
cl = cl,
X = paste0("https://fantasy.premierleague.com/api/element-summary/", 1:599, "/"),
fun = jsonlite::fromJSON
)
})
user system elapsed
0.075 0.062 9.276
@eddelbuettel A "good-enough" alternative could just switch the method to "wget", like the following (maybe with a different warning())...
interactive()
#> [1] FALSE
.fload <- function(URL) {
temp_file <- tempfile()
if (interactive()) {
.method <- getOption("download.file.method", default = "auto")
} else if (nchar(Sys.which("wget"))) {
.method <- "wget"
} else {
.method <- NULL
warning("wtf base R?")
}
download.file(
url = URL,
destfile = temp_file,
method = .method
)
RcppSimdJson::fload(temp_file)
}
target_url <- "http://fantasy.premierleague.com/api/element-summary/599/"
names(.fload(target_url))
#> [1] "fixtures" "history" "history_past"
@nathaneastwood libcurl has async capabilities under the hood, but it looks like download.file() just isn't reliable if you use method="libcurl" (seriously... wtf?). {RcppSimdJson} leverages libcurl's async capabilities now; it's just news to me that it fails depending on the URL and whether your session is interactive().
For raw speed over many files in a non-interactive session, I'd probably leverage {curl} and do something like the following.
interactive()
#> [1] FALSE
fload_em_all <- function(seq_json = 1:599) {
stopifnot(is.integer(seq_json) && length(seq_json) && !anyNA(seq_json))
target_urls <- sprintf("https://fantasy.premierleague.com/api/element-summary/%i/",
seq_json)
.envir <- new.env(size = length(target_urls))
pool <- curl::new_pool()
sapply(target_urls, function(.url) {
curl::curl_fetch_multi(
url = .url,
done = function(.done) assign("data", c(.envir$data, list(.done)), envir = .envir),
fail = function(.fail) warning("problem w/ ", .url),
pool = pool
)
})
curl_details <- curl::multi_run(pool = pool)
structure(
RcppSimdJson::fparse(lapply(.envir$data, `[[`, "content")),
curl_details = curl_details
)
}
system.time(
res <- fload_em_all()
)
#> user system elapsed
#> 1.475 0.079 5.160
attr(res, "curl_details")
#> $success
#> [1] 599
#>
#> $error
#> [1] 0
#>
#> $pending
#> [1] 0
if (requireNamespace("data.table", quietly = TRUE)) {
row_bind <- data.table::rbindlist
} else {
row_bind <- function(l) do.call(rbind.data.frame, l)
}
df_names <- c("fixtures", "history", "history_past")
names(df_names) <- df_names
cleaned <- lapply(df_names, function(.x) row_bind(lapply(res, `[[`, .x)))
lapply(cleaned, head)
#> $fixtures
#> id code team_h team_h_score team_a team_a_score event finished minutes provisional_start_time kickoff_time event_name is_home difficulty
#> 1: 82 2128369 10 NA 1 NA 9 FALSE 0 FALSE 2020-11-22T16:30:00Z Gameweek 9 FALSE 3
#> 2: 89 2128376 1 NA 20 NA 10 FALSE 0 FALSE 2020-11-28T15:00:00Z Gameweek 10 TRUE 3
#> 3: 106 2128393 17 NA 1 NA 11 FALSE 0 FALSE 2020-12-05T15:00:00Z Gameweek 11 FALSE 4
#> 4: 109 2128396 1 NA 4 NA 12 FALSE 0 FALSE 2020-12-13T15:00:00Z Gameweek 12 TRUE 2
#> 5: 119 2128406 1 NA 16 NA 13 FALSE 0 FALSE 2020-12-15T19:45:00Z Gameweek 13 TRUE 2
#> 6: 133 2128420 7 NA 1 NA 14 FALSE 0 FALSE 2020-12-19T15:00:00Z Gameweek 14 FALSE 3
#>
#> $history
#> element fixture opponent_team total_points was_home kickoff_time team_h_score team_a_score round minutes goals_scored assists clean_sheets goals_conceded own_goals penalties_saved penalties_missed yellow_cards red_cards saves bonus bps influence creativity threat ict_index value transfers_balance selected transfers_in transfers_out
#> 1: 1 2 8 0 FALSE 2020-09-12T11:30:00Z 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 70 0 76656 0 0
#> 2: 1 9 19 0 TRUE 2020-09-19T19:00:00Z 2 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 69 -16828 68335 995 17823
#> 3: 1 23 11 0 FALSE 2020-09-28T19:00:00Z 3 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 69 -11451 59793 675 12126
#> 4: 1 29 15 0 TRUE 2020-10-04T13:00:00Z 2 1 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 68 -5324 56403 647 5971
#> 5: 1 44 12 0 FALSE 2020-10-17T16:30:00Z 1 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 68 -4224 53689 616 4840
#> 6: 1 49 9 0 TRUE 2020-10-25T19:15:00Z 0 1 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 68 -3184 51023 276 3460
#>
#> $history_past
#> season_name element_code start_cost end_cost total_points minutes goals_scored assists clean_sheets goals_conceded own_goals penalties_saved penalties_missed yellow_cards red_cards saves bonus bps influence creativity threat ict_index
#> 1: 2013/14 37605 100 96 137 2141 5 10 13 23 0 0 0 0 0 0 18 162 0.0 0.0 0.0 0.0
#> 2: 2014/15 37605 90 90 103 1857 4 6 9 16 0 0 0 0 0 0 13 511 0.0 0.0 0.0 0.0
#> 3: 2015/16 37605 85 92 200 3036 6 19 17 31 0 0 0 4 0 0 30 861 0.0 0.0 0.0 0.0
#> 4: 2016/17 37605 95 95 167 2841 8 11 12 30 0 0 0 2 0 0 19 729 834.4 1649.3 734.0 322.0
#> 5: 2017/18 37605 95 93 112 2161 4 9 8 36 0 0 0 4 0 0 11 506 640.6 1287.5 555.0 248.6
#> 6: 2018/19 37605 85 79 89 1732 5 3 5 24 0 0 0 2 0 0 9 363 429.6 714.2 254.0 140.0
FWIW, I was perusing #40 and came across the following which does work and is super fast!
multi_url <- paste0("https://fantasy.premierleague.com/api/element-summary/", 1:599, "/")
curl_async <- function(urls, .parse) { # rather than cram this inline...
out <- list()
for (i in seq_along(urls)) {
curl::curl_fetch_multi(
urls[[i]],
done = function(res) out <<- c(out, list(res$content)),
fail = function(msg) stop(msg)
)
}
curl::multi_run()
.parse(out)
}
system.time({
simd_curl_async <- curl_async(multi_url, .parse = RcppSimdJson::fparse)
})
user system elapsed
1.036 0.236 2.185
EDIT: Ha! Just seen your post above (posted seconds before I did) and it looks like I was on the right track! Thanks @knapply!
Nice work on both counts. Maybe we should start a 'gallery' or 'demos' vignette...
Yes, I think that would be helpful. Eventually, when I get my bum into gear, fpl will be used in a Shiny app. So I feel I need to learn more about this async stuff. Might be a good topic for my blog as I learn more...
Oh @knapply, FWIW re the interactive session, it doesn't work for me:
interactive()
# TRUE
target_url <- "https://fantasy.premierleague.com/api/element-summary/599/"
RcppSimdJson::fload(target_url)
# Error in .load_json(json = input, query = query, empty_array = empty_array, :
# Empty: no JSON found
This is in VSCode. However, if I switch to RStudio, then magically it works... I have no idea why that would be...
Looking into this further, multiple URLs do not work in RStudio either:
> target_url <- sprintf("https://fantasy.premierleague.com/api/element-summary/%i/", 1:599)
> res <- RcppSimdJson::fload(target_url)
Error in .load_json(json = input, query = query, empty_array = empty_array, :
Error reading the file.
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Where the warnings are the same as above. Of course, this is not the approach I will be taking now but I thought it was interesting.
re interactive session, it doesn't work for me:
🤔 weird... I feel like this can't be the first time this download.file() issue has come up, so I'm poking around.
I feel like this can't be the first time this download.file() issue
It is not the first time we discuss curl, wget and all that. I think curl had the advantage of automatically negotiating a gzip version or something, and so it was faster?
That's, btw, a factor in all this: the downloads can be compressed or not. Compressed will be faster on a slow network.
For our sanity I would recommend we stay away from this. If and when we have curl (which one can check, e.g., via capabilities()[["libcurl"]]) we can use it. If not, I think it might be easiest to let the user deal with that.
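Along those lines, a minimal method-selection sketch (pick_method() is a hypothetical helper; the capability check is the capabilities()[["libcurl"]] call just mentioned, plus a wget lookup on the PATH):

```r
# Sketch only: choose a download.file() method based on what this R build
# and host system actually support.
pick_method <- function() {
  if (isTRUE(capabilities()[["libcurl"]])) {
    "libcurl"                        # preferred when R was built with libcurl
  } else if (nzchar(Sys.which("wget"))) {
    "wget"                           # external fallback if wget is on the PATH
  } else {
    "auto"                           # otherwise let R's default dispatch decide
  }
}
```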
Just for one taste of how much this changed or was mentioned:
edd@rob:~/deb/r-base/doc(master)$ grep -c download.file NEWS{,.?}
NEWS:1
NEWS.0:2
NEWS.1:11
NEWS.2:7
NEWS.3:32
edd@rob:~/deb/r-base/doc(master)$
We do provide an option to download compressed files:
Lines 31 to 36 in 8826fad
And we already check for "libcurl" in order to exploit its asynchronous download capabilities, while still providing an escape hatch for users who set a "download.file.method" option:
Lines 65 to 72 in 8826fad
From what I can tell, this is an R bug that must be related to .Internal(curlDownload(...)) or something in src/modules/internet/internet.c.
Yes indeed, good and helpful reminder. We in general want to rely as much on (lib)curl as we can, as it is best of breed. That leaves us with maybe two items here:
- should we / could we just check the downloaded file and alert if size zero is seen?
- can we extract a bug report for R?
If you look at the download.file() docs, you'll see the following for the headers arg:
named character vector of HTTP headers to use in HTTP requests. It is ignored for non-HTTP URLs. The User-Agent header, coming from the HTTPUserAgent option (see options), is used as the first header, automatically.
What appears to be happening is that download.file() doesn't actually cooperate with R's default options("HTTPUserAgent") (unless method="wget"?) for "non-standard" URLs.
interactive()
#> [1] FALSE
target_url <- "https://fantasy.premierleague.com/api/element-summary/599/"
temp_file <- tempfile(fileext = ".json")
download.file(method = "libcurl") doesn't cooperate with R's default options("HTTPUserAgent"):
options("HTTPUserAgent") # "default" user agent
#> $HTTPUserAgent
#> [1] "R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
download.file(target_url, mode = "wb", method = "libcurl", destfile = temp_file)
file.size(temp_file) # nothing was downloaded
#> [1] 0
download.file(method = "libcurl") does work if we customize options("HTTPUserAgent"):
options(HTTPUserAgent = sprintf("RcppSimdJson; %s", options("HTTPUserAgent")))
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] "RcppSimdJson; R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
download.file(target_url, mode = "wb", destfile = temp_file)
file.size(temp_file) # actually downloads the data
#> [1] 8226
names(RcppSimdJson::fload(temp_file)) # good JSON
#> [1] "fixtures" "history" "history_past"
download.file(method = "libcurl") does work if we remove options("HTTPUserAgent") entirely (make it an empty string):
options(HTTPUserAgent = "")
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] ""
download.file(target_url, mode = "wb", method = "libcurl", destfile = temp_file)
file.size(temp_file) # actually downloads the data
#> [1] 8226
names(RcppSimdJson::fload(temp_file)) # good JSON
#> [1] "fixtures" "history" "history_past"
It works in RStudio because they set the "HTTPUserAgent" option.
interactive() # in RStudio
#> [1] TRUE
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] "RStudio Desktop (1.4.869); R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
Why isn't the default options("HTTPUserAgent") sufficient? No idea.
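For anyone adopting the empty-user-agent workaround shown above inside a package, a scoped override avoids clobbering the user's session options (a sketch only; download_with_ua() is a hypothetical helper, not part of any package):

```r
# Sketch only: temporarily override HTTPUserAgent for a single download,
# restoring the caller's original option afterwards.
download_with_ua <- function(url, destfile, ua = "") {
  old <- options(HTTPUserAgent = ua)
  on.exit(options(old), add = TRUE)  # restore even if download.file() errors
  utils::download.file(url, destfile = destfile, mode = "wb", method = "libcurl")
}
```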
I edited my last comment significantly to (hopefully) clarify.
@dcooley, it looks like {jsonify} is suffering from the same issue for non-standard URLs:
interactive()
#> [1] FALSE
options("HTTPUserAgent") # "default" user agent
#> $HTTPUserAgent
#> [1] "R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
target_url <- "http://fantasy.premierleague.com/api/element-summary/599/"
names(jsonify::from_json(target_url))
#> Error in rcpp_from_json(json, simplify, fill_na): json parse error
options(HTTPUserAgent = "") # modify HTTPUserAgent
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] ""
names(jsonify::from_json(target_url))
#> [1] "fixtures" "history" "history_past"
Do you have enough crystallized here for a bug report at R's tracker? I guess a patch would be even sweeter, but that is also a deeper rabbit hole...
Maybe, but I'm not even sure if this is really an R bug at this point. Have packages been accidentally depending on RStudio setting the options("HTTPUserAgent") this whole time?
I'm also not a "web" guy, so this may be really obvious to an R user who is...
@jeroen, if/when you can spare a moment, I'm guessing you have some insight into why download.file() doesn't necessarily cooperate with the default options("HTTPUserAgent"). Any insight you can share here would be really appreciated.
options("HTTPUserAgent") # "default" user agent
#> $HTTPUserAgent
#> [1] "R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
temp_file <- tempfile()
# URL that doesn't point at a file, per se
irregular_url <- "http://fantasy.premierleague.com/api/element-summary/599/"
download.file(irregular_url, temp_file, method = "auto") # doesn't download anything
file.size(temp_file)
#> [1] 0
download.file(irregular_url, temp_file, method = "libcurl") # doesn't download anything
file.size(temp_file)
#> [1] 0
download.file(irregular_url, temp_file, method = "wget") # works as expected
file.size(temp_file)
#> [1] 8226
# URL that points directly at a file
regular_url <- "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/small/smalldemo.json"
download.file(regular_url, temp_file, method = "auto") # works as expected
file.size(temp_file)
#> [1] 204
download.file(regular_url, temp_file, method = "libcurl") # works as expected
file.size(temp_file)
#> [1] 204
download.file(regular_url, temp_file, method = "wget") # works as expected
file.size(temp_file)
#> [1] 204
# modify HTTPUserAgent
options(HTTPUserAgent = "")
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] ""
download.file(irregular_url, temp_file, method = "auto") # works as expected
file.size(temp_file)
#> [1] 8226
download.file(irregular_url, temp_file, method = "libcurl") # works as expected
file.size(temp_file)
#> [1] 8226
download.file(irregular_url, temp_file, method = "wget") # still works as expected
file.size(temp_file)
#> [1] 8226
download.file(regular_url, temp_file, method = "auto") # still works as expected
file.size(temp_file)
#> [1] 204
download.file(regular_url, temp_file, method = "libcurl") # still works as expected
file.size(temp_file)
#> [1] 204
download.file(regular_url, temp_file, method = "wget") # still works as expected
file.size(temp_file)
#> [1] 204
This is unrelated to R; the same happens with command-line curl. My guess is that this server is trying to prevent scraping, and blocking any User-Agent that starts with curl/7 (curl's default):
curl -vL http://fantasy.premierleague.com/api/element-summary/599/                          # blocked: default User-Agent starts with curl/7
curl -vL http://fantasy.premierleague.com/api/element-summary/599/ -H "User-Agent: curl/7"  # blocked
curl -vL http://fantasy.premierleague.com/api/element-summary/599/ -H "User-Agent: dirk"    # works
In our case the failing default is "R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)" though.
But the point is taken. It may just be this server having a funky policy.
I've been testing this out a little bit today. I am using the curl_async() function from here. What I am finding is that if I query all 600 pages, it fails more often than it succeeds. The offending function is curl::multi_run(). I can easily put this in a while() loop, but I am wondering two things:
- why does it fail? (I get status 0)
- is there a way to add retries to this function without adding the while() loop?
Now this does feel more like a StackOverflow question, so I'm happy to post there. I'm just looking to learn a bit more as I'm not a "web guy".
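As far as I know, curl::multi_run() has no built-in retry argument, so a bounded-retry wrapper around the curl_async() helper from earlier is one option (a sketch; fetch_with_retries() is a hypothetical name):

```r
# Sketch only: retry curl_async() a bounded number of times.
# Assumes the curl_async() helper defined earlier in this thread.
fetch_with_retries <- function(urls, .parse, max_tries = 3L) {
  for (attempt in seq_len(max_tries)) {
    res <- tryCatch(curl_async(urls, .parse), error = function(e) NULL)
    if (!is.null(res)) return(res)
    Sys.sleep(2 ^ attempt)  # simple exponential backoff between attempts
  }
  stop("all ", max_tries, " attempts failed")
}
```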
Sorry, this fell by the wayside. On balance, this appears to be (as @jeroen noted) a backend issue with that service. It was very helpful for us to look at this, but I do not think there is anything at fault in the package, so I will close this now. Please feel free to reopen if you disagree.