eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

no JSON found for valid URL

nathaneastwood opened this issue · comments

Attempting to download the Fantasy Premier League data, I get an error.

Reproducible example:

RcppSimdJson::fload("https://fantasy.premierleague.com/api/element-summary/599/", verbose = TRUE)
# trying URL 'https://fantasy.premierleague.com/api/element-summary/599/'
# downloaded 0 bytes
# 
# Error in .load_json(json = input, query = query, empty_array = empty_array,  : 
#   Empty: no JSON found

(adjust the 599 as needed, though this is the last player in the game). You can visit the url and see that the data are in fact there.

FWIW, the data download fine with jsonlite (only the names() output included):

names(jsonlite::fromJSON("https://fantasy.premierleague.com/api/element-summary/599/"))                                                                      
# [1] "fixtures"     "history"      "history_past"

You can also wget the data with ease:

wget https://fantasy.premierleague.com/api/element-summary/599/ -O football.json

Similar JSON files from the site such as this also do not work. Let me know if you need more.

It works in an interactive session, but I can reproduce by running in a non-interactive session (e.g. reprex::reprex()).

Interactive Session
interactive()
#> [1] TRUE
target_url <- "https://fantasy.premierleague.com/api/element-summary/599/"
RcppSimdJson::fload(target_url)
#> $fixtures
#>     id    code team_h team_h_score team_a team_a_score event finished minutes provisional_start_time         kickoff_time  event_name is_home difficulty
#> 1   82 2128369     10           NA      1           NA     9    FALSE       0                  FALSE 2020-11-22T16:30:00Z  Gameweek 9    TRUE          4
#> 2   93 2128380      7           NA     10           NA    10    FALSE       0                  FALSE 2020-11-28T17:30:00Z Gameweek 10   FALSE          3
#> 3  102 2128389      5           NA     10           NA    11    FALSE       0                  FALSE 2020-12-05T15:00:00Z Gameweek 11   FALSE          4
#> 4  113 2128400     10           NA     19           NA    12    FALSE       0                  FALSE 2020-12-12T15:00:00Z Gameweek 12    TRUE          3
#> 5  122 2128409     10           NA     14           NA    13    FALSE       0                  FALSE 2020-12-15T19:45:00Z Gameweek 13    TRUE          2
#> 6  134 2128421     13           NA     10           NA    14    FALSE       0                  FALSE 2020-12-19T15:00:00Z Gameweek 14   FALSE          4
#> 7  142 2128429     10           NA      4           NA    15    FALSE       0                  FALSE 2020-12-26T15:00:00Z Gameweek 15    TRUE          2
#> 8  158 2128445     18           NA     10           NA    16    FALSE       0                  FALSE 2020-12-28T15:00:00Z Gameweek 16   FALSE          2
#> 9  167 2128454     17           NA     10           NA    17    FALSE       0                  FALSE 2021-01-02T15:00:00Z Gameweek 17   FALSE          4
#> 10 172 2128459     10           NA     16           NA    18    FALSE       0                  FALSE 2021-01-12T19:45:00Z Gameweek 18    TRUE          2
#> 11 182 2128469     10           NA      3           NA    19    FALSE       0                  FALSE 2021-01-16T15:00:00Z Gameweek 19    TRUE          2
#> 12 196 2128483     14           NA     10           NA    20    FALSE       0                  FALSE 2021-01-27T19:45:00Z Gameweek 20   FALSE          3
#> 13 204 2128491      9           NA     10           NA    21    FALSE       0                  FALSE 2021-01-30T15:00:00Z Gameweek 21   FALSE          4
#> 14 212 2128499     10           NA      7           NA    22    FALSE       0                  FALSE 2021-02-02T19:45:00Z Gameweek 22    TRUE          3
#> 15 222 2128509     10           NA      6           NA    23    FALSE       0                  FALSE 2021-02-06T15:00:00Z Gameweek 23    TRUE          2
#> 16 229 2128516      1           NA     10           NA    24    FALSE       0                  FALSE 2021-02-13T15:00:00Z Gameweek 24   FALSE          4
#> 17 248 2128535     20           NA     10           NA    25    FALSE       0                  FALSE 2021-02-20T15:00:00Z Gameweek 25   FALSE          3
#> 18 252 2128539     10           NA      2           NA    26    FALSE       0                  FALSE 2021-02-27T15:00:00Z Gameweek 26    TRUE          3
#> 19 268 2128555     19           NA     10           NA    27    FALSE       0                  FALSE 2021-03-06T15:00:00Z Gameweek 27   FALSE          3
#> 20 273 2128560     10           NA      5           NA    28    FALSE       0                  FALSE 2021-03-13T15:00:00Z Gameweek 28    TRUE          4
#> 21 282 2128569      8           NA     10           NA    29    FALSE       0                  FALSE 2021-03-20T15:00:00Z Gameweek 29   FALSE          2
#> 22 293 2128580     10           NA     15           NA    30    FALSE       0                  FALSE 2021-04-03T14:00:00Z Gameweek 30    TRUE          3
#> 23 304 2128591     12           NA     10           NA    31    FALSE       0                  FALSE 2021-04-10T14:00:00Z Gameweek 31   FALSE          5
#> 24 313 2128600     10           NA     11           NA    32    FALSE       0                  FALSE 2021-04-17T14:00:00Z Gameweek 32    TRUE          4
#> 25 322 2128609     10           NA     13           NA    33    FALSE       0                  FALSE 2021-04-24T14:00:00Z Gameweek 33    TRUE          4
#> 26 329 2128616      3           NA     10           NA    34    FALSE       0                  FALSE 2021-05-01T14:00:00Z Gameweek 34   FALSE          3
#> 27 342 2128629     10           NA     17           NA    35    FALSE       0                  FALSE 2021-05-08T14:00:00Z Gameweek 35    TRUE          4
#> 28 350 2128637      4           NA     10           NA    36    FALSE       0                  FALSE 2021-05-11T18:45:00Z Gameweek 36   FALSE          2
#> 29 366 2128653     16           NA     10           NA    37    FALSE       0                  FALSE 2021-05-15T14:00:00Z Gameweek 37   FALSE          3
#> 30 372 2128659     10           NA     18           NA    38    FALSE       0                  FALSE 2021-05-23T15:00:00Z Gameweek 38    TRUE          2
#> 
#> $history
#>   element fixture opponent_team total_points was_home         kickoff_time team_h_score team_a_score round minutes goals_scored assists clean_sheets goals_conceded own_goals penalties_saved
#> 1     599      72             6            0    FALSE 2020-11-07T15:00:00Z            4            1     8       0            0       0            0              0         0               0
#>   penalties_missed yellow_cards red_cards saves bonus bps influence creativity threat ict_index value transfers_balance selected transfers_in transfers_out
#> 1                0            0         0     0     0   0       0.0        0.0    0.0       0.0    45                 0        0            0             0
#> 
#> $history_past
#> NULL

Long-story-short: it turns out there's an issue with utils::download.file() when run non-interactively:

interactive() 
#> [1] FALSE

target_url <- "http://fantasy.premierleague.com/api/element-summary/599/"
temp_file <- tempfile(fileext = ".json")

download.file(target_url, destfile = temp_file, method = "auto")
file.size(temp_file)
#> [1] 0

download.file(target_url, destfile = temp_file, method = "curl")
file.size(temp_file)
#> [1] 0

download.file(target_url, destfile = temp_file, method = "libcurl")
file.size(temp_file)
#> [1] 0

download.file(target_url, destfile = temp_file, method = "wget")
file.size(temp_file)
#> [1] 8226

@nathaneastwood, can you provide your sessionInfo() and verify this is happening in a non-interactive session on your end?

If that's the case, here's a temporary workaround until we find the fix (since you have wget available):

interactive()
#> [1] FALSE


fload_url_workaround <- function(URL) {
    temp_file <- tempfile()
    download.file(
        url = URL,
        destfile = temp_file, 
        method = "wget" # works
    )
    
    RcppSimdJson::fload(temp_file)
}

target_url <- "https://fantasy.premierleague.com/api/element-summary/599/"

fload_url_workaround(target_url)
#> $fixtures
#>     id    code team_h team_h_score team_a team_a_score event finished minutes provisional_start_time         kickoff_time  event_name is_home difficulty
#> 1   82 2128369     10           NA      1           NA     9    FALSE       0                  FALSE 2020-11-22T16:30:00Z  Gameweek 9    TRUE          4
#> 2   93 2128380      7           NA     10           NA    10    FALSE       0                  FALSE 2020-11-28T17:30:00Z Gameweek 10   FALSE          3
#> 3  102 2128389      5           NA     10           NA    11    FALSE       0                  FALSE 2020-12-05T15:00:00Z Gameweek 11   FALSE          4
#> 4  113 2128400     10           NA     19           NA    12    FALSE       0                  FALSE 2020-12-12T15:00:00Z Gameweek 12    TRUE          3
#> 5  122 2128409     10           NA     14           NA    13    FALSE       0                  FALSE 2020-12-15T19:45:00Z Gameweek 13    TRUE          2
#> 6  134 2128421     13           NA     10           NA    14    FALSE       0                  FALSE 2020-12-19T15:00:00Z Gameweek 14   FALSE          4
#> 7  142 2128429     10           NA      4           NA    15    FALSE       0                  FALSE 2020-12-26T15:00:00Z Gameweek 15    TRUE          2
#> 8  158 2128445     18           NA     10           NA    16    FALSE       0                  FALSE 2020-12-28T15:00:00Z Gameweek 16   FALSE          2
#> 9  167 2128454     17           NA     10           NA    17    FALSE       0                  FALSE 2021-01-02T15:00:00Z Gameweek 17   FALSE          4
#> 10 172 2128459     10           NA     16           NA    18    FALSE       0                  FALSE 2021-01-12T19:45:00Z Gameweek 18    TRUE          2
#> 11 182 2128469     10           NA      3           NA    19    FALSE       0                  FALSE 2021-01-16T15:00:00Z Gameweek 19    TRUE          2
#> 12 196 2128483     14           NA     10           NA    20    FALSE       0                  FALSE 2021-01-27T19:45:00Z Gameweek 20   FALSE          3
#> 13 204 2128491      9           NA     10           NA    21    FALSE       0                  FALSE 2021-01-30T15:00:00Z Gameweek 21   FALSE          4
#> 14 212 2128499     10           NA      7           NA    22    FALSE       0                  FALSE 2021-02-02T19:45:00Z Gameweek 22    TRUE          3
#> 15 222 2128509     10           NA      6           NA    23    FALSE       0                  FALSE 2021-02-06T15:00:00Z Gameweek 23    TRUE          2
#> 16 229 2128516      1           NA     10           NA    24    FALSE       0                  FALSE 2021-02-13T15:00:00Z Gameweek 24   FALSE          4
#> 17 248 2128535     20           NA     10           NA    25    FALSE       0                  FALSE 2021-02-20T15:00:00Z Gameweek 25   FALSE          3
#> 18 252 2128539     10           NA      2           NA    26    FALSE       0                  FALSE 2021-02-27T15:00:00Z Gameweek 26    TRUE          3
#> 19 268 2128555     19           NA     10           NA    27    FALSE       0                  FALSE 2021-03-06T15:00:00Z Gameweek 27   FALSE          3
#> 20 273 2128560     10           NA      5           NA    28    FALSE       0                  FALSE 2021-03-13T15:00:00Z Gameweek 28    TRUE          4
#> 21 282 2128569      8           NA     10           NA    29    FALSE       0                  FALSE 2021-03-20T15:00:00Z Gameweek 29   FALSE          2
#> 22 293 2128580     10           NA     15           NA    30    FALSE       0                  FALSE 2021-04-03T14:00:00Z Gameweek 30    TRUE          3
#> 23 304 2128591     12           NA     10           NA    31    FALSE       0                  FALSE 2021-04-10T14:00:00Z Gameweek 31   FALSE          5
#> 24 313 2128600     10           NA     11           NA    32    FALSE       0                  FALSE 2021-04-17T14:00:00Z Gameweek 32    TRUE          4
#> 25 322 2128609     10           NA     13           NA    33    FALSE       0                  FALSE 2021-04-24T14:00:00Z Gameweek 33    TRUE          4
#> 26 329 2128616      3           NA     10           NA    34    FALSE       0                  FALSE 2021-05-01T14:00:00Z Gameweek 34   FALSE          3
#> 27 342 2128629     10           NA     17           NA    35    FALSE       0                  FALSE 2021-05-08T14:00:00Z Gameweek 35    TRUE          4
#> 28 350 2128637      4           NA     10           NA    36    FALSE       0                  FALSE 2021-05-11T18:45:00Z Gameweek 36   FALSE          2
#> 29 366 2128653     16           NA     10           NA    37    FALSE       0                  FALSE 2021-05-15T14:00:00Z Gameweek 37   FALSE          3
#> 30 372 2128659     10           NA     18           NA    38    FALSE       0                  FALSE 2021-05-23T15:00:00Z Gameweek 38    TRUE          2
#> 
#> $history
#>   element fixture opponent_team total_points was_home         kickoff_time team_h_score team_a_score round minutes goals_scored assists clean_sheets goals_conceded own_goals penalties_saved
#> 1     599      72             6            0    FALSE 2020-11-07T15:00:00Z            4            1     8       0            0       0            0              0         0               0
#>   penalties_missed yellow_cards red_cards saves bonus bps influence creativity threat ict_index value transfers_balance selected transfers_in transfers_out
#> 1                0            0         0     0     0   0       0.0        0.0    0.0       0.0    45                 0        0            0             0
#> 
#> $history_past
#> NULL

Wow. What the effing eff.

That's the type of host-system-dependent-hell ("do we have wget yes/no ?") I'd rather not go down. Ditto with all things proxies and what not. Part of me just wants to document it ("if your remote URL is non-standard expect surprises") and move on.

Do we understand the problem here? What is non-standard about this URI?

Standard URI: https://some.server.com/some/path/file.json
Non-standard: https://some.server.com/foo/bar

If you scroll up you see the URL that @nathaneastwood used. It is of the 2nd form. I had verified when he DMed me that it fails and suggested he file an issue. I had NOT grokked the interaction between interactive mode and however R implements download.file() [in short: many options, from internal to external to building with libcurl to ...], but @knapply was more rigorous here.

But in short, none of this has anything to do with simdjson. All this is inside baseball about how R tries (and fails) to be helpful as a shell and environment. Which is, after all, really hard. Across OSs. And builds.

Thanks for the suggestion @knapply but I think I'd rather stick to using jsonlite as, whilst your solution might work for me, it wouldn't be great inside a distributed package.

You could write a two-line helper function that downloads to a local tempfile, possibly even giving the tempfile a json extension (yay my patch to R years ago to enable a tempfile extension) and then point RcppSimdJson at the now-local file.
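A minimal sketch of such a helper (hypothetical name; assumes download.file() cooperates with the URL at hand):

```r
fetch_and_fload <- function(url) {
    tmp <- tempfile(fileext = ".json")  # tempfile with a .json extension
    on.exit(unlink(tmp), add = TRUE)    # tidy up once parsed
    utils::download.file(url, destfile = tmp, quiet = TRUE)
    RcppSimdJson::fload(tmp)            # point RcppSimdJson at the now-local file
}
```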

That's not a bad suggestion; however, I was really hoping to take advantage of the speed of RcppSimdJson, as I need to grab all 599 URLs in one go, i.e.

https://fantasy.premierleague.com/api/element-summary/1/
...
https://fantasy.premierleague.com/api/element-summary/599/

I don't want to "download" multiple files at once since I will be restricted by network speeds and downloading that many files isn't ideal either.

FWIW:

r$> sessionInfo()                                                                                                                                                
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RcppSimdJson_0.1.3

loaded via a namespace (and not attached):
[1] compiler_4.0.3 Rcpp_1.0.5     jsonlite_1.7.1

@eddelbuettel Now that you explain it, it does feel like documentation is the way to go... and possibly an informative error message when the download fails?

I don't want to "download" multiple files at once since I will be restricted by network speeds

I probably misunderstand but you will have to download the files no matter what.

If I had lots of files to parse, I’d be tempted to multithread the processing. Issue multiple file downloads at the same time (say 5 or 10)... and start parsing the first file that is done downloading. So have some kind of priority queue and just grab the next fully downloaded file.

I realize the multithreaded processing could be tricky in R, but if you did so... I’d expect good performance gains if you are clever.

but you will have to download the files no matter what.

I also fear that @nathaneastwood wished that part away. fload() does just that for you, but is being tricked by R, leading to this issue. One way or another you have to schlepp all those soccer result bytes onto your machine. No way around it.

(@lemire Just FYI, R, while single-threaded, ships with the parallel package which has standard workhorses like mclapply to apply a function in parallel. There are still issues with OSs that fork() and those that don't but it is there...)
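For illustration, a sketch of the mclapply route (fork-based, so effectively serial on Windows); fload_url_workaround is the helper defined earlier in this thread:

```r
urls <- sprintf("https://fantasy.premierleague.com/api/element-summary/%i/", 1:599)
# Forks one worker per core; each worker downloads and parses independently.
res <- parallel::mclapply(urls, fload_url_workaround,
                          mc.cores = parallel::detectCores())
```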

Fair enough, I guess today I learned something 👍

There is a Python library doing what I want to do and it uses asyncio under the hood. The resulting JSON files are downloaded virtually instantly, whereas looping over them (lapply() + jsonlite::fromJSON()) in R takes ~13 seconds on my machine. So I will try and parallelise it and give that a go. I shall report back.

Not the most robust test but...

r$> system.time({ 
    on.exit(if (exists("cl")) parallel::stopCluster(cl = cl), add = TRUE) 
    cl <- parallel::makeCluster(parallel::detectCores()) 
    res <- parallel::parLapply( 
      cl = cl, 
      X = paste0("https://fantasy.premierleague.com/api/element-summary/", 1:599, "/"), 
      fun = fload_url_workaround 
    ) 
    })                                                                                                                                                           
   user  system elapsed 
  0.095   0.100  16.436
r$> system.time({ 
      on.exit(if (exists("cl")) parallel::stopCluster(cl = cl), add = TRUE) 
      cl <- parallel::makeCluster(parallel::detectCores()) 
      res <- parallel::parLapply( 
        cl = cl, 
        X = paste0("https://fantasy.premierleague.com/api/element-summary/", 1:599, "/"), 
        fun = jsonlite::fromJSON 
      ) 
    })                                                                                                                                                           
   user  system elapsed 
  0.075   0.062   9.276

@eddelbuettel A "good-enough" alternative could just switch the method to "wget" like the following (maybe with a different warning())...

interactive()
#> [1] FALSE

.fload <- function(URL) {
    temp_file <- tempfile()
    
    if (interactive()) {
        .method <- getOption("download.file.method", default = "auto")
    } else if (nchar(Sys.which("wget"))) {
        .method <- "wget"
    } else {
        .method <- NULL
        warning("wtf base R?")
    }

    download.file(
        url = URL,
        destfile = temp_file, 
        method = .method
    )
    
    RcppSimdJson::fload(temp_file)
}


target_url <- "http://fantasy.premierleague.com/api/element-summary/599/"

names(.fload(target_url))
#> [1] "fixtures"     "history"      "history_past"

@nathaneastwood libcurl has async capabilities under-the-hood, but it looks like download.file() just isn't reliable if you use method="libcurl" (seriously... wtf?). {RcppSimdJson} leverages libcurl's async capabilities now, it's just news to me that it fails depending on the URL and whether your session is interactive().

For raw speed over many files in a non-interactive session, I'd probably leverage {curl} and do something like the following.

interactive()
#> [1] FALSE

fload_em_all <- function(seq_json = 1:599) {
    stopifnot(is.integer(seq_json) && length(seq_json) && !anyNA(seq_json))
    
    target_urls <- sprintf("https://fantasy.premierleague.com/api/element-summary/%i/", 
                           seq_json)

    .envir <- new.env(size = length(target_urls))
    
    pool <- curl::new_pool()
    sapply(target_urls, function(.url) {
        curl::curl_fetch_multi(
            url = .url, 
            done = function(.done) assign("data", c(.envir$data, list(.done)), envir = .envir),
            fail = function(.fail) warning("problem w/ ", .url),
            pool = pool
        )
    })
    
    curl_details <- curl::multi_run(pool = pool)
    
    structure(
        RcppSimdJson::fparse(lapply(.envir$data, `[[`, "content")),
        curl_details = curl_details
    )
}

system.time(
    res <- fload_em_all()
)
#>    user  system elapsed 
#>   1.475   0.079   5.160

attr(res, "curl_details")
#> $success
#> [1] 599
#> 
#> $error
#> [1] 0
#> 
#> $pending
#> [1] 0


if (requireNamespace("data.table", quietly = TRUE)) {
    row_bind <- data.table::rbindlist
} else {
    row_bind <- function(l) do.call(rbind.data.frame, l)
}

df_names <- c("fixtures", "history", "history_past")
names(df_names) <- df_names

cleaned <- lapply(df_names, function(.x) row_bind(lapply(res, `[[`, .x)))
lapply(cleaned, head)
#> $fixtures
#>     id    code team_h team_h_score team_a team_a_score event finished minutes provisional_start_time         kickoff_time  event_name is_home difficulty
#> 1:  82 2128369     10           NA      1           NA     9    FALSE       0                  FALSE 2020-11-22T16:30:00Z  Gameweek 9   FALSE          3
#> 2:  89 2128376      1           NA     20           NA    10    FALSE       0                  FALSE 2020-11-28T15:00:00Z Gameweek 10    TRUE          3
#> 3: 106 2128393     17           NA      1           NA    11    FALSE       0                  FALSE 2020-12-05T15:00:00Z Gameweek 11   FALSE          4
#> 4: 109 2128396      1           NA      4           NA    12    FALSE       0                  FALSE 2020-12-13T15:00:00Z Gameweek 12    TRUE          2
#> 5: 119 2128406      1           NA     16           NA    13    FALSE       0                  FALSE 2020-12-15T19:45:00Z Gameweek 13    TRUE          2
#> 6: 133 2128420      7           NA      1           NA    14    FALSE       0                  FALSE 2020-12-19T15:00:00Z Gameweek 14   FALSE          3
#> 
#> $history
#>    element fixture opponent_team total_points was_home         kickoff_time team_h_score team_a_score round minutes goals_scored assists clean_sheets goals_conceded own_goals penalties_saved penalties_missed yellow_cards red_cards saves bonus bps influence creativity threat ict_index value transfers_balance selected transfers_in transfers_out
#> 1:       1       2             8            0    FALSE 2020-09-12T11:30:00Z            0            3     1       0            0       0            0              0         0               0                0            0         0     0     0   0       0.0        0.0    0.0       0.0    70                 0    76656            0             0
#> 2:       1       9            19            0     TRUE 2020-09-19T19:00:00Z            2            1     2       0            0       0            0              0         0               0                0            0         0     0     0   0       0.0        0.0    0.0       0.0    69            -16828    68335          995         17823
#> 3:       1      23            11            0    FALSE 2020-09-28T19:00:00Z            3            1     3       0            0       0            0              0         0               0                0            0         0     0     0   0       0.0        0.0    0.0       0.0    69            -11451    59793          675         12126
#> 4:       1      29            15            0     TRUE 2020-10-04T13:00:00Z            2            1     4       0            0       0            0              0         0               0                0            0         0     0     0   0       0.0        0.0    0.0       0.0    68             -5324    56403          647          5971
#> 5:       1      44            12            0    FALSE 2020-10-17T16:30:00Z            1            0     5       0            0       0            0              0         0               0                0            0         0     0     0   0       0.0        0.0    0.0       0.0    68             -4224    53689          616          4840
#> 6:       1      49             9            0     TRUE 2020-10-25T19:15:00Z            0            1     6       0            0       0            0              0         0               0                0            0         0     0     0   0       0.0        0.0    0.0       0.0    68             -3184    51023          276          3460
#> 
#> $history_past
#>    season_name element_code start_cost end_cost total_points minutes goals_scored assists clean_sheets goals_conceded own_goals penalties_saved penalties_missed yellow_cards red_cards saves bonus bps influence creativity threat ict_index
#> 1:     2013/14        37605        100       96          137    2141            5      10           13             23         0               0                0            0         0     0    18 162       0.0        0.0    0.0       0.0
#> 2:     2014/15        37605         90       90          103    1857            4       6            9             16         0               0                0            0         0     0    13 511       0.0        0.0    0.0       0.0
#> 3:     2015/16        37605         85       92          200    3036            6      19           17             31         0               0                0            4         0     0    30 861       0.0        0.0    0.0       0.0
#> 4:     2016/17        37605         95       95          167    2841            8      11           12             30         0               0                0            2         0     0    19 729     834.4     1649.3  734.0     322.0
#> 5:     2017/18        37605         95       93          112    2161            4       9            8             36         0               0                0            4         0     0    11 506     640.6     1287.5  555.0     248.6
#> 6:     2018/19        37605         85       79           89    1732            5       3            5             24         0               0                0            2         0     0     9 363     429.6      714.2  254.0     140.0

FWIW, I was perusing #40 and came across the following which does work and is super fast!

multi_url <- paste0("https://fantasy.premierleague.com/api/element-summary/", 1:599, "/")

curl_async <- function(urls, .parse) { # rather than cram this inline...
  out <- list()
  for (i in seq_along(urls)) {
    curl::curl_fetch_multi(
      urls[[i]],
      done = function(res) out <<- c(out, list(res$content)),
      fail = function(msg) stop(msg)
    )
  }
  curl::multi_run()
  .parse(out)
}

system.time({
  simd_curl_async <- curl_async(multi_url, .parse = RcppSimdJson::fparse)
})
   user  system elapsed 
  1.036   0.236   2.185

EDIT: Ha! Just seen your post above (posted seconds before I did) and it looks like I was on the right track! Thanks @knapply!

Nice work on both counts. Maybe we should start a 'gallery' or 'demos' vignette...

Yes, I think that would be helpful. Eventually, when I get my bum into gear, fpl will be used in a Shiny app. So I need to learn more about this async stuff I feel. Might be a good topic for my blog as I learn more...

Oh @knapply FWIW re interactive session, it doesn't work for me:

interactive()
# TRUE
target_url <- "https://fantasy.premierleague.com/api/element-summary/599/"
RcppSimdJson::fload(target_url)
# Error in .load_json(json = input, query = query, empty_array = empty_array,  : 
#   Empty: no JSON found

This is in VSCode. However, if I switch to RStudio, then magically it works...I have no idea why that would be...

Looking into this further, multiple urls do not work in RStudio either

> target_url <- sprintf("https://fantasy.premierleague.com/api/element-summary/%i/", 1:599)
> res <- RcppSimdJson::fload(target_url)
Error in .load_json(json = input, query = query, empty_array = empty_array,  : 
  Error reading the file.
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Where the warnings are the same as above. Of course, this is not the approach I will be taking now but I thought it was interesting.

re interactive session, it doesn't work for me:

🤔 weird... I feel like this can't be the first time this download.file() issue has come up so I'm poking around.

I feel like this can't be the first time this download.file() issue

It is not the first time we discuss curl, wget and all that. I think curl had the advantage of automatically negotiating a gzip version or something, and so it was faster?

That's, btw, a factor in all this: the downloads can be compressed or not. Compressed will be faster on a slow network.

For our sanity I would recommend we stay away from this. If and when we have curl (which one can check e.g. via capabilities()[["libcurl"]]) we can use it. If not, I think it might be easiest to let the user deal with that.
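That check is cheap, e.g. (a sketch of the guard, not the package's actual code):

```r
if (isTRUE(capabilities()[["libcurl"]])) {
    # libcurl is compiled in: safe to ask for method = "libcurl"
    method <- "libcurl"
} else {
    # otherwise punt: let the user download and hand us a local file
    stop("No libcurl support; please download the file yourself and ",
         "pass the local path.", call. = FALSE)
}
```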

Just for one taste of how much this changed or was mentioned:

edd@rob:~/deb/r-base/doc(master)$ grep -c download.file NEWS{,.?}
NEWS:1
NEWS.0:2
NEWS.1:11
NEWS.2:7
NEWS.3:32
edd@rob:~/deb/r-base/doc(master)$ 

We do provide an option to download compressed files:

rcppsimdjson/R/utils.R

Lines 31 to 36 in 8826fad

if (compressed_download) {
.headers <- c(`Accept-Encoding` = "gzip")
# for local files, don't attach .gz
.fileext <- rep(NA_character_, nrow(diagnosis))
.fileext[diagnosis$is_local_file_url] <- diagnosis$file_ext[diagnosis$is_local_file_url]
.fileext[diagnosis$is_remote_file_url] <- sprintf("%s.gz", diagnosis$file_ext[diagnosis$is_remote_file_url])

And we already check for "libcurl" in order to exploit its asynchronous download capabilities, while still providing an escape hatch for users who set a "download.file.method" option:

rcppsimdjson/R/utils.R

Lines 65 to 72 in 8826fad

if ((.method <- getOption("download.file.method", default = "auto")) == "libcurl") {
download.file(
url = .url,
method = .method,
destfile = .destfile,
quiet = !verbose,
headers = .headers
)

From what I can tell, this is an R bug that must be related to .Internal(curlDownload(...)) or something in src/modules/internet/internet.c.

Yes indeed--good and helpful reminder. We in general want to rely as much on (lib)curl as we can as it is best of breed. That leaves us with maybe two items here:

  • should we / could we just check the downloaded file and alert if size zero is seen?
  • can we extract a bug report for R?
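The first item could be as simple as (a hypothetical check, not current package code):

```r
check_download <- function(path, url) {
    if (!file.exists(path) || file.size(path) == 0L) {
        stop("Downloaded 0 bytes from ", url,
             "; try setting options(HTTPUserAgent = ...) or a different ",
             "download.file() method.", call. = FALSE)
    }
    invisible(path)
}
```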

If you look at the download.file() docs, you'll see the following for the headers arg:

named character vector of HTTP headers to use in HTTP requests. It is ignored for non-HTTP URLs. The User-Agent header, coming from the HTTPUserAgent option (see options) is used as the first header, automatically.

What appears to be happening is that download.file() doesn't actually cooperate with R's default options("HTTPUserAgent") (unless method="wget"?) for "non-standard" URLs.

interactive() 
#> [1] FALSE

target_url <- "https://fantasy.premierleague.com/api/element-summary/599/"
temp_file <- tempfile(fileext = ".json")

download.file(method = "libcurl") doesn't cooperate with R's default options("HTTPUserAgent"):

options("HTTPUserAgent") # "default" user agent
#> $HTTPUserAgent
#> [1] "R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
download.file(target_url, mode = "wb", method = "libcurl", destfile = temp_file)
file.size(temp_file) # nothing was downloaded
#> [1] 0

download.file(method = "libcurl") does work if we customize options("HTTPUserAgent"):

options(HTTPUserAgent = sprintf("RcppSimdJson; %s", getOption("HTTPUserAgent")))
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] "RcppSimdJson; R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
download.file(target_url, mode = "wb", destfile = temp_file)
file.size(temp_file) # actually downloads the data
#> [1] 8226
names(RcppSimdJson::fload(temp_file)) # good JSON
#> [1] "fixtures"     "history"      "history_past"

download.file(method = "libcurl") also works if we blank out options("HTTPUserAgent") (i.e. set it to an empty string):

options(HTTPUserAgent = "")
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] ""
download.file(target_url, mode = "wb", method = "libcurl", destfile = temp_file)
file.size(temp_file) # actually downloads the data
#> [1] 8226
names(RcppSimdJson::fload(temp_file)) # good JSON
#> [1] "fixtures"     "history"      "history_past"

It works in RStudio because they set the "HTTPUserAgent" option.

interactive()  # in RStudio
#> [1] TRUE
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] "RStudio Desktop (1.4.869); R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"

Why the default options("HTTPUserAgent") isn't sufficient, I have no idea.

I edited my last comment significantly to (hopefully) clarify.

@dcooley , it looks like {jsonify} is suffering from the same issue for non-standard URLs:

interactive() 
#> [1] FALSE

options("HTTPUserAgent") # "default" user agent
#> $HTTPUserAgent
#> [1] "R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"

target_url <- "http://fantasy.premierleague.com/api/element-summary/599/"

names(jsonify::from_json(target_url))
#> Error in rcpp_from_json(json, simplify, fill_na): json parse error


options(HTTPUserAgent = "") # modify HTTPUserAgent
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] ""

names(jsonify::from_json(target_url))
#> [1] "fixtures"     "history"      "history_past"

Do you have enough crystallized here for a bug report at R's tracker? I guess a patch would be even sweeter, but that's also a deeper rabbit hole...

Maybe, but I'm not even sure this really is an R bug at this point. Have packages been accidentally depending on RStudio setting options("HTTPUserAgent") this whole time?

I'm also not a "web" guy, so this may be really obvious to an R user who is...

@jeroen, if/when you can spare a moment, I'm guessing you have some insight into why download.file() doesn't necessarily cooperate with the default options("HTTPUserAgent"). Any insight you can share here would be really appreciated.

options("HTTPUserAgent") # "default" user agent
#> $HTTPUserAgent
#> [1] "R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"

temp_file <- tempfile()


# URL that doesn't point at a file, per se
irregular_url <- "http://fantasy.premierleague.com/api/element-summary/599/"

download.file(irregular_url, temp_file, method = "auto")    # doesn't download anything
file.size(temp_file)
#> [1] 0
download.file(irregular_url, temp_file, method = "libcurl") # doesn't download anything
file.size(temp_file)
#> [1] 0
download.file(irregular_url, temp_file, method = "wget")    # works as expected
file.size(temp_file)
#> [1] 8226


# URL that points directly at a file
regular_url <- "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/small/smalldemo.json"

download.file(regular_url, temp_file, method = "auto")    # works as expected
file.size(temp_file)
#> [1] 204
download.file(regular_url, temp_file, method = "libcurl") # works as expected
file.size(temp_file)
#> [1] 204
download.file(regular_url, temp_file, method = "wget")    # works as expected
file.size(temp_file)
#> [1] 204


# modify HTTPUserAgent
options(HTTPUserAgent = "")
options("HTTPUserAgent")
#> $HTTPUserAgent
#> [1] ""

download.file(irregular_url, temp_file, method = "auto")    # works as expected
file.size(temp_file)
#> [1] 8226
download.file(irregular_url, temp_file, method = "libcurl") # works as expected
file.size(temp_file)
#> [1] 8226
download.file(irregular_url, temp_file, method = "wget")    # still works as expected
file.size(temp_file)
#> [1] 8226



download.file(regular_url, temp_file, method = "auto")    # still works as expected
file.size(temp_file)
#> [1] 204
download.file(regular_url, temp_file, method = "libcurl") # still works as expected
file.size(temp_file)
#> [1] 204
download.file(regular_url, temp_file, method = "wget")    # still works as expected
file.size(temp_file)
#> [1] 204

This is unrelated to R; the same happens with command line curl. My guess is that this server is trying to prevent scraping, and blocks any User-Agent that starts with curl/7 (the default):

curl -vL http://fantasy.premierleague.com/api/element-summary/599/
curl -vL http://fantasy.premierleague.com/api/element-summary/599/ -H "User-Agent: curl/7"
curl -vL http://fantasy.premierleague.com/api/element-summary/599/ -H "User-Agent: dirk"

In our case the failing default is "R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)" though.

But the point is taken. It may just be this server having a funky policy.

I've been testing this out a little bit today. I am using the curl_async() function here. What I am finding is that if I query all 600 pages, it fails more often than it succeeds. The offending function is curl::multi_run(). I can easily put this in a while() loop, but I am wondering two things:

  1. why does it fail? (I get status 0)
  2. is there a way to add retries to this function without adding the while() loop?
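As far as I know, curl::multi_run() doesn't expose a built-in retry count, so one option short of an inline while() loop is a small generic retry helper. A hedged sketch (the `retry` helper is hypothetical, not part of {curl}):

```r
# Sketch of a retry wrapper: re-invokes `fun` up to `times` attempts,
# treating a thrown error or a FALSE result from `ok` as a failed attempt.
retry <- function(fun, times = 3L, ok = function(res) TRUE) {
    for (attempt in seq_len(times)) {
        res <- tryCatch(fun(), error = function(e) NULL)
        if (!is.null(res) && isTRUE(ok(res))) return(res)
    }
    stop(sprintf("failed after %d attempts", times), call. = FALSE)
}
```

You could then wrap the multi_run() call, e.g. `retry(function() curl::multi_run(), ok = function(r) ...)`, with an `ok` predicate that inspects whatever success/error counts the return value exposes (I haven't verified the exact shape of that return value here).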

Now this does feel more like a StackOverflow question so I'm happy to post there. I'm just looking to learn a bit more as I'm not a "web guy".

Sorry, this fell by the wayside. On balance, this appears to be (as @jeroen noted) a backend issue with that service. It was very helpful for us to look at this, but I do not think there is anything at fault in the package -- so I will close this now. Please feel free to reopen if you disagree.