IQSS / dataverse-client-r

R Client for Dataverse Repositories

Home Page: https://iqss.github.io/dataverse-client-r

GET by file DOI errors out on demo dataset

kuriwaki opened this issue

For the demo datafile that we have been testing against from the beginning, fetching by file DOI errors out (even though fetching by name works fine). Another datafile, on the Harvard Dataverse, succeeds either way.

library(dataverse)
library(tibble)
packageVersion("dataverse")
#> [1] '0.3.7'


# Try to Read demo dataset by DOI
get_dataframe_by_doi(
  filedoi  = "doi:10.70122/FK2/HXJVJU/SA3Z2V", 
  original = TRUE,
  .f       = readr::read_csv,
  server = "demo.dataverse.org"
)
#> Error in get_file_by_id(fileid = fileid[i], dataset = dataset, format = format, : Not Found (HTTP 404). Failed to API endpoint does not exist on this server. Please check your code for typos, or consult our API guide at http://guides.dataverse.org..

# read the same file by name
get_dataframe_by_name(
  filename  = "roster-bulls-1996.tab", 
  dataset  = "doi:10.70122/FK2/HXJVJU",
  original = TRUE,
  .f       = readr::read_csv,
  server = "demo.dataverse.org"
) %>% dim()
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   number = col_double(),
#>   player = col_character(),
#>   position = col_character(),
#>   height = col_character(),
#>   weight = col_double(),
#>   dob = col_character(),
#>   country_birth = col_character(),
#>   experience_years = col_double(),
#>   college = col_character()
#> )
#> [1] 15  9


# Different datafile with file-level DOI
get_dataframe_by_doi(
  filedoi  = "doi:10.7910/DVN/ARKOTI/QJTSWV", 
  original = TRUE,
  .f       = haven::read_dta,
  server = "dataverse.harvard.edu"
) %>% dim()
#> [1] 60 10

# same but by name
get_dataframe_by_name(
  filename  = "comprehensiveJapanEnergy.tab", 
  dataset  = "doi:10.7910/DVN/ARKOTI",
  original = TRUE,
  .f       = haven::read_dta,
  server = "dataverse.harvard.edu"
) %>% dim()
#> [1] 60 10

Created on 2021-04-11 by the reprex package (v0.3.0)

The error occurs in the core httr GET function. Here is the request URL for the one that doesn't work:

https://demo.dataverse.org/api/access/datafile/:persistentId/?persistentId=doi%3A10.70122%2FFK2%2FHXJVJU%2FSA3Z2V&format=original

and here is the request URL for the other dataset that does work:

https://dataverse.harvard.edu/api/access/datafile/:persistentId/?persistentId=doi%3A10.7910%2FDVN%2FARKOTI%2FQJTSWV&format=original

@djbrooke do you notice anything different about the URLs (or something obviously wrong with the former?)

It looks like removing the final / before the ? resolves the issue on demo.dataverse.org.
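As a rough illustration of that workaround, the failing request URL from above can be rewritten by dropping the slash that sits directly before the query string. The helper name `strip_slash_before_query` is hypothetical and is not part of the dataverse package; this is only a sketch of the string manipulation, not how the package builds its URLs.

```r
# Hypothetical helper (not part of the dataverse package): remove the slash
# that sits directly before the "?" and leave the rest of the URL untouched.
strip_slash_before_query <- function(url) {
  # fixed = TRUE treats "/?" as a literal string rather than a regex
  sub("/?", "?", url, fixed = TRUE)
}

# The request URL that returned HTTP 404 on demo.dataverse.org
broken <- paste0(
  "https://demo.dataverse.org/api/access/datafile/:persistentId/",
  "?persistentId=doi%3A10.70122%2FFK2%2FHXJVJU%2FSA3Z2V&format=original"
)

fixed <- strip_slash_before_query(broken)
print(fixed)
```

A URL that has no slash before the ? passes through unchanged, so the helper could be applied to every request without harm.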

Does this have to do with the DOI itself not being resolvable on demo Dataverses? Datasets published on demo Dataverses don't get resolvable DOIs, so if the DOI doesn't resolve, get_dataframe_by_doi won't know where to look? I'm assuming this won't work on other datasets in demo Dataverses (I tried it with my own test dataverses on demo.dataverse.nl).

The Dataverse APIs that allow :persistentId as an option don't try to resolve the DOI at DataCite, so they should work fine with fake/unresolvable DOIs.

So yes, we broke the notation with the trailing slash (access/datafile/:persistentId/?persistentId=...) in v5.4. Which of course is a problem, since this is how it's documented in the guide, unfortunately.
I have made a fix (IQSS/dataverse#7789), and we are planning to release it as v5.4.1. (So the plan is for the Harvard production Dataverse to go from v5.3 to v5.4.1 directly.)

@qqmyers @landreev. Thank you, it all makes sense now.

Can v5.3 (the old one) accept URLs without the trailing slash? That seems to be the case for the Harvard Dataverse datafile above. If so, we can just hard-code the removal of the trailing slash in all our calls.

If not, we'll need to check the Dataverse version number for a given server value, and then remove the slash based on version >= 5.4.1.
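If a version check did turn out to be necessary, the comparison itself could be sketched with base R's `numeric_version`. The helper name `version_at_least` is hypothetical, and the sketch assumes the server's version has already been retrieved as a plain dotted string such as "5.4"; how it is retrieved is outside this example.

```r
# Hypothetical helper: TRUE when a reported version is at or above a cutoff.
# Assumes both arguments are plain dotted version strings, e.g. "5.4.1".
version_at_least <- function(reported, cutoff) {
  numeric_version(reported) >= numeric_version(cutoff)
}

# Versions mentioned in this thread, compared against the 5.4.1 cutoff
versions <- c("5.3", "5.4", "5.4.1")
result <- vapply(versions, version_at_least, logical(1), cutoff = "5.4.1")
print(result)
```

Using `numeric_version` rather than comparing strings keeps the ordering correct for multi-part versions, so "5.10" would sort above "5.4".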

I removed the trailing slash from all calls in PR #95, and so far I don't see a problem in any of our unit tests.

Can v5.3 (the old one) accept URLs without the trailing slash?

Yes, the notation without the trailing slash is supported in all versions, past and future.
We consider it to be the recommended form (in fact, all of our automated tests call the API without this trailing slash, which is why we hadn't noticed the problem until you reported it).
But, as I mentioned earlier, we realized that the API guide documents it with the trailing slash. So, even if you make this change in your code, we definitely want to have it fixed in future releases, because other people are likely using that notation too.

@landreev I definitely followed that section of the API guide when I wrote this part of the package. So, that's good to know.

@wibeasley the PR should fix most if not all the CRAN errors, although there may be places other than get_file that may fail due to the same issue.

Thank you for reporting it. I'm glad you noticed before we had a chance to deploy it in production.