IQSS / dataverse-client-r

R Client for Dataverse Repositories

Home Page: https://iqss.github.io/dataverse-client-r

Unable to retrieve an unpublished data file.

famuvie opened this issue

Please specify whether your issue is about:

  • a suggested code or documentation change, improvement to the code, or feature request

I'm having an issue while trying to access an unpublished data file, which requires an API token. Unfortunately, this makes the following code not reproducible. Let me know if there is a way of making a reproducible example in this case.

The problem is that I cannot use any of the get_dataframe_by_* functions, due to an issue with is_ingested(), which seems unable to find the target file.
However, if I work around is_ingested(), I can retrieve the data, as shown in the example below.

library(dataverse)
packageVersion("dataverse")
#> [1] '0.3.10'

get_dataframe_by_id(
  fileid = 12930,
  dataset = "https://doi.org/10.18167/DVN1/8Z1ZI9"
)
#> Error in is_ingested(fileid, ...): File information not found on Dataverse API

# A successful read, bypassing is_ingested(): build the file access URL
# manually and pass the API token as a request header
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
fileid <- 12930
query <- list(format = "original")  # request the original (pre-ingest) file
u_part <- "access/datafile/"
u <- paste0(dataverse:::api_url(server), u_part, fileid)
r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query)
httr::content(r)
#> Rows: 1347 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): ;Vache;Date;Commune;eleveur;Troupeau;R_bursa;H_marginatum;I_ricinus...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1,347 × 1
#>    `;Vache;Date;Commune;eleveur;Troupeau;R_bursa;H_marginatum;I_ricinus;H_scupe…
#>    <chr>                                                                        
#>  1 1;200532 2248;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                       
#>  2 2;200531 7320;26/02/2020;CASTIFAO;;;0;0;0;0;0;0;0;0;0                        
#>  3 3;200530 9555;26/02/2020;ZALANA;;;1;0;0;0;0;0;0;0;1                          
#>  4 4;200532 8365;26/02/2020;MOLTIFAO;;;0;0;0;0;0;0;0;0;0                        
#>  5 5;200533 1185;26/02/2020;PIEDIGRIGGIO;;;0;0;0;2;0;0;0;0;2                    
#>  6 6;200532 3312;26/02/2020;CORTE;;;0;0;0;0;0;0;0;0;0                           
#>  7 7;200531 0907;26/02/2020;BORGO;;;0;0;0;0;0;0;0;0;0                           
#>  8 8;200532 2246;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                       
#>  9 9;200530 8506;26/02/2020;CORTE;;;1;0;0;0;0;0;0;0;1                           
#> 10 10;200532 2245;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                      
#> # … with 1,337 more rows
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Linux Mint 20.1
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.13 knitr_1.37      magrittr_2.0.2  rlang_1.0.0    
#>  [5] fastmap_1.1.0   fansi_0.5.0     stringr_1.4.0   styler_1.4.1   
#>  [9] highr_0.9       tools_4.1.2     xfun_0.29       utf8_1.2.2     
#> [13] cli_3.1.0       withr_2.4.2     htmltools_0.5.2 ellipsis_0.3.2 
#> [17] yaml_2.2.2      digest_0.6.29   tibble_3.1.6    lifecycle_1.0.1
#> [21] crayon_1.4.2    purrr_0.3.4     vctrs_0.3.8     fs_1.5.2       
#> [25] glue_1.6.1      evaluate_0.14   rmarkdown_2.11  reprex_2.0.1   
#> [29] stringi_1.7.6   compiler_4.1.2  pillar_1.6.4    backports_1.4.1
#> [33] pkgconfig_2.0.3

Ultimately, the problem in is_ingested() boils down to dataverse_search() not finding the file:

library(dataverse)
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
dataverse_search(id = "datafile_12930", type = "file", server = server, key = key)
#> 0 of 0 results retrieved
#> list()

Created on 2022-02-04 by the reprex package (v2.0.1)

It is worth noting that I can find the file by keyword search in the web interface.

By contrast, dataverse_search() has no trouble finding a published file.
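For instance, a query of the same shape against a published file succeeds (a sketch; datafile_3371438 is an arbitrary public file on Harvard Dataverse, used purely as an illustration):

library(dataverse)
# A published file is indexed under its plain id; no token is needed
dataverse_search(id = "datafile_3371438", type = "file",
                 server = "dataverse.harvard.edu")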

Hmm, because the file is in draft, I bet _draft would need to be appended like this:

id = "datafile_12930_draft"

@famuvie do you want to see if you can find your draft file that way with curl? You'll have to pass your API token. Docs on this at https://guides.dataverse.org/en/5.9/api/search.html

@kuriwaki this might also work:

entityId:12930

An example: https://dataverse.harvard.edu/api/search?q=entityId:3371438

(I'm not sure why I suggested id instead of entityId at #113 (comment). The id changes (_draft is dropped on publish) but entityId stays the same.)
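To make the distinction concrete, here is a minimal sketch in R (assuming the same DATAVERSE_SERVER and DATAVERSE_KEY environment variables as above):

library(dataverse)
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")

# The id field embeds publication state: "_draft" is required while the
# file is unpublished, and is dropped once the file is published
dataverse_search(id = "datafile_12930_draft", type = "file",
                 server = server, key = key)

# entityId is stable across publication, so the same query works
# before and after publishing
dataverse_search(entityId = 12930, type = "file",
                 server = server, key = key)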

Not sure how to pass the API token with curl, but it works with dataverse_search():

library(dataverse)
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
dataverse_search(id = "datafile_12930_draft", type = "file", server = server, key = key)
#> 1 of 1 result retrieved
#>                   name type
#> 1 Bovine_2020_2021.tab file
#>                                                    url file_id
#> 1 https://dataverse.cirad.fr/api/access/datafile/12930   12930
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              description
#> 1 On this document, there is only the tick data collection between 2020 and 2021.\n\nSome information about variable :\n\n- Vache : Identifier of cow (10 digits)\n- Date : date of slaughterhouse visit\n- Commune : origin cow municipality\n- Eleveur : origin cow breeder\n- Troupeau : municipality and breeder\n- H_marginatus : number of *H_marginatus* collected\n- R_bursa : number of *R_bursa* collected\n- I_ricinus : number of *I_ricinus* collected\n- H_scupense : number of *H_scupense* collected\n- B_annulatus : number of *B_annulatus* collected\n- R_sanguineus : number of *R_sanguineus* collected\n- H_punctata : number of *H_punctata* collected.\n- D_marginatus : number of *D_marginatus* collected\n- Tiques ? : sum of ticks collected
#>       file_type         file_content_type size_in_bytes
#> 1 Tab-Delimited text/tab-separated-values         69648
#>                                md5 checksum.type
#> 1 688c6fc5f92e6526a3cd158854027e8b           MD5
#>                     checksum.value                            unf dataset_name
#> 1 688c6fc5f92e6526a3cd158854027e8b UNF:6:lt7ZJ1diuShhMCd8UWq5zQ==       Bovine
#>   dataset_id    dataset_persistent_id
#> 1      12928 doi:10.18167/DVN1/8Z1ZI9
#>                                                                                                                                         dataset_citation
#> 1 Bartholomee, Colombine, 2022, "Bovine", https://doi.org/10.18167/DVN1/8Z1ZI9, CIRAD Dataverse, DRAFT VERSION, UNF:6:ov2odYXNktsIbiuwc2MDJQ== [fileUNF]

Created on 2022-02-04 by the reprex package (v2.0.1)

I'm not sure how to pass the API token with curl. I'll check.

You can pass it as a header or a query parameter. Please see https://guides.dataverse.org/en/5.9/api/auth.html
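Both options are easy to mirror from R with httr (a sketch against the CIRAD server used above; the header form is what the earlier examples in this thread use):

library(httr)
token <- Sys.getenv("DATAVERSE_KEY")

# Option 1: pass the API token as an HTTP header
GET("https://dataverse.cirad.fr/api/search",
    add_headers(`X-Dataverse-key` = token),
    query = list(q = "id:datafile_12930_draft"))

# Option 2: pass the API token as a query parameter
GET("https://dataverse.cirad.fr/api/search",
    query = list(q = "id:datafile_12930_draft", key = token))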

Sorry, I made a mistake in the previous example and have just corrected it. It actually works!

Still, I can't find a hacky way of adding "_draft" to the file id through the package's functions. I guess that needs to be fixed in the package :) In the meantime, the pieces above can be combined by hand, as in the sketch below.
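A sketch of that manual route, assuming the same environment variables as above: look the draft file up with the "_draft" suffix, then fetch it directly from the access endpoint, bypassing is_ingested():

library(dataverse)
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")

# Locate the draft file; the search result carries its access URL
res <- dataverse_search(id = "datafile_12930_draft", type = "file",
                        server = server, key = key)

# Download the original file directly, passing the token as a header
r <- httr::GET(res$url,
               httr::add_headers(`X-Dataverse-key` = key),
               query = list(format = "original"))
dat <- httr::content(r)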

Thanks @famuvie for creating an issue. A partial fix is now on dev.
@pdurbin, thanks for pointing out entityId. I implemented it on dev as there seems to be no downside.

I created a test dataset on demo Dataverse that is intentionally left unpublished. The get commands seem to go OK, except that my unpublished test file does not have a UNF under the Search API even though it does under the File API. Have you seen this before?

Proper UNF detection matters because that is how the package currently determines whether a file is ingested.
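To illustrate the dependency, here is a sketch (not the package's actual code) of what UNF-based ingest detection amounts to; the unf field missing from the Search API response is exactly what breaks it:

# Illustrative only: treat the presence of a UNF in the search result as
# the marker of an ingested (.tab) file
is_ingested_sketch <- function(fileid, server, key = Sys.getenv("DATAVERSE_KEY")) {
  res <- dataverse_search(entityId = fileid, type = "file",
                          server = server, key = key)
  # ingested files carry a UNF fingerprint; originals do not
  !is.null(res$unf) && !all(is.na(res$unf))
}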

> str(dataset_files(dataset = "10.70122/FK2/4XHVAP", server = "demo.dataverse.org")[[1]]$dataFile)
List of 16
 $ id                 : int 1951382
 $ persistentId       : chr ""
 $ pidURL             : chr ""
 $ filename           : chr "mtcars.tab"
 $ contentType        : chr "text/tab-separated-values"
 $ filesize           : int 1713
 $ storageIdentifier  : chr "s3://demo-dataverse-org:17f75571af3-60325bcbb1f1"
 $ originalFileFormat : chr "text/csv"
 $ originalFormatLabel: chr "Comma Separated Values"
 $ originalFileSize   : int 1700
 $ originalFileName   : chr "mtcars.csv"
 $ UNF                : chr "UNF:6:KRE/AItWGJWd5tJ+bboN7A=="
 $ rootDataFileId     : int -1
 $ md5                : chr "c502359c26a0931eef53b2207b2344f9"
 $ checksum           :List of 2
  ..$ type : chr "MD5"
  ..$ value: chr "c502359c26a0931eef53b2207b2344f9"
 $ creationDate       : chr "2022-03-10"
> str(dataverse_search(entityId = 1951382, server = "demo.dataverse.org", key = Sys.getenv("DATAVERSE_KEY")))
1 of 1 result retrieved
'data.frame':	1 obs. of  13 variables:
 $ name                 : chr "mtcars.csv"
 $ type                 : chr "file"
 $ url                  : chr "https://demo.dataverse.org/api/access/datafile/1951382"
 $ file_id              : chr "1951382"
 $ file_type            : chr "Comma Separated Values"
 $ file_content_type    : chr "text/csv"
 $ size_in_bytes        : int 1700
 $ md5                  : chr "c502359c26a0931eef53b2207b2344f9"
 $ checksum             :'data.frame':	1 obs. of  2 variables:
  ..$ type : chr "MD5"
  ..$ value: chr "c502359c26a0931eef53b2207b2344f9"
 $ dataset_name         : chr "Permanent draft dataset for testing"
 $ dataset_id           : chr "1951381"
 $ dataset_persistent_id: chr "doi:10.70122/FK2/4XHVAP"
 $ dataset_citation     : chr "Kuriwaki, Shiro, 2022, \"Permanent draft dataset for testing\", https://doi.org/10.70122/FK2/4XHVAP, Demo Datav"| __truncated__

> The get commands seem to go OK, except that my unpublished test file does not have a UNF under the Search API even though it does under the File API.

Huh. This is news to me but I see what you mean.

No UNF from the Search API when I look at your unpublished file...

curl -H X-Dataverse-key:$API_TOKEN https://demo.dataverse.org/api/search?q=id:datafile_1951382_draft

{
  "status": "OK",
  "data": {
    "q": "id:datafile_1951382_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "mtcars.csv",
        "type": "file",
        "url": "https://demo.dataverse.org/api/access/datafile/1951382",
        "file_id": "1951382",
        "file_type": "Comma Separated Values",
        "file_content_type": "text/csv",
        "size_in_bytes": 1700,
        "md5": "c502359c26a0931eef53b2207b2344f9",
        "checksum": {
          "type": "MD5",
          "value": "c502359c26a0931eef53b2207b2344f9"
        },
        "dataset_name": "Permanent draft dataset for testing",
        "dataset_id": "1951381",
        "dataset_persistent_id": "doi:10.70122/FK2/4XHVAP",
        "dataset_citation": "Kuriwaki, Shiro, 2022, \"Permanent draft dataset for testing\", https://doi.org/10.70122/FK2/4XHVAP, Demo Dataverse, DRAFT VERSION"
      }
    ],
    "count_in_response": 1
  }
}

... but when I look at a published file (different server but shouldn't matter), I do see a UNF:

curl https://dataverse.harvard.edu/api/search?q=id:datafile_3371438
{
  "status": "OK",
  "data": {
    "q": "id:datafile_3371438",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2019-02-25.tab",
        "type": "file",
        "url": "https://dataverse.harvard.edu/api/access/datafile/3371438",
        "file_id": "3371438",
        "description": "",
        "published_at": "2019-02-26T03:03:13Z",
        "file_type": "Tab-Delimited",
        "file_content_type": "text/tab-separated-values",
        "size_in_bytes": 17232,
        "md5": "9bd94d028049c9a53bca9bb19d4fb57e",
        "checksum": {
          "type": "MD5",
          "value": "9bd94d028049c9a53bca9bb19d4fb57e"
        },
        "unf": "UNF:6:2MMoV8KKO8R7sb27Q5GXtA==",
        "file_persistent_id": "doi:10.7910/DVN/TJCLKP/3VSTKY",
        "dataset_name": "Open Source at Harvard",
        "dataset_id": "3035124",
        "dataset_persistent_id": "doi:10.7910/DVN/TJCLKP",
        "dataset_citation": "Durbin, Philip, 2017, \"Open Source at Harvard\", https://doi.org/10.7910/DVN/TJCLKP, Harvard Dataverse, DRAFT VERSION, UNF:6:2MMoV8KKO8R7sb27Q5GXtA== [fileUNF]"
      }
    ],
    "count_in_response": 1
  }
}

Perhaps we don't reindex the file after ingest is complete? I'm not sure. You could test this by making a change to your draft dataset metadata (add a keyword or something). This will reindex the dataset and its files.

Yes! It was sufficient to add a data description to the draft dataset, and it somehow updated. Thank you.

@kuriwaki hmm, I can replicate this on "develop" on my laptop (around 0d853b74e9). When I first upload a file to a draft, the UNF does not appear in search results...

$ curl -s -H X-Dataverse-key:$API_TOKEN http://localhost:8080/api/search?q=id:datafile_5_draft | jq .
{
  "status": "OK",
  "data": {
    "q": "id:datafile_5_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2016-06-29.csv",
        "type": "file",
        "url": "http://localhost:8080/api/access/datafile/5",
        "file_id": "5",
        "file_type": "Comma Separated Values",
        "file_content_type": "text/csv",
        "size_in_bytes": 58690,
        "md5": "d5de092a84304a9965c787b8dcd27c99",
        "checksum": {
          "type": "MD5",
          "value": "d5de092a84304a9965c787b8dcd27c99"
        },
        "dataset_name": "zzz",
        "dataset_id": "4",
        "dataset_persistent_id": "doi:10.5072/FK2/JJK8WY",
        "dataset_citation": "Admin, Dataverse, 2022, \"zzz\", https://doi.org/10.5072/FK2/JJK8WY, Root, DRAFT VERSION"
      }
    ],
    "count_in_response": 1
  }
}

... but if I edit the metadata of the draft dataset (forcing the file to be reindexed), the UNF appears:

$ curl -s -H X-Dataverse-key:$API_TOKEN http://localhost:8080/api/search?q=id:datafile_5_draft | jq .
{
  "status": "OK",
  "data": {
    "q": "id:datafile_5_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2016-06-29.tab",
        "type": "file",
        "url": "http://localhost:8080/api/access/datafile/5",
        "file_id": "5",
        "file_type": "Tab-Delimited",
        "file_content_type": "text/tab-separated-values",
        "size_in_bytes": 59208,
        "md5": "d5de092a84304a9965c787b8dcd27c99",
        "checksum": {
          "type": "MD5",
          "value": "d5de092a84304a9965c787b8dcd27c99"
        },
        "unf": "UNF:6:6YVg+pUWsYD52stDkZuzUA==",
        "dataset_name": "zzzyyy",
        "dataset_id": "4",
        "dataset_persistent_id": "doi:10.5072/FK2/JJK8WY",
        "dataset_citation": "Admin, Dataverse, 2022, \"zzzyyy\", https://doi.org/10.5072/FK2/JJK8WY, Root, DRAFT VERSION, UNF:6:6YVg+pUWsYD52stDkZuzUA== [fileUNF]"
      }
    ],
    "count_in_response": 1
  }
}

Please feel free to open an issue about this at https://github.com/IQSS/dataverse/issues if you'd like.

I will add a tip about this to the dataverse download vignette. I think this limitation will be common for people who try to download draft datasets, but the current workaround of editing the dataset metadata seems not too onerous.

Addressed by 0.3.11.