IQSS / dataverse-client-r

R Client for Dataverse Repositories

Home Page: https://iqss.github.io/dataverse-client-r

Error in prepend_doi(filedoi) : argument "filedoi" is missing, with no default

sjkiss opened this issue · comments

I would like to download the Stata tab file here, but I am getting this error. I'm running dataverse 0.3.7.

Note that I'm pretty sure I've set the server and the key properly.

Error in prepend_doi(filedoi) : 
  argument "filedoi" is missing, with no default
## code goes here

library(dataverse)
get_dataframe_by_doi(filename="2019 Canadian Election Study - Online Survey v1.0.tab", 
                     original=T, .f=haven::read_dta,dataset="doi:10.7910/DVN/DUS88V", 
                     server="dataverse.harvard.edu")



## session info for your system
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] dataverse_0.3.7 labelled_2.8.0  cesdata_0.1.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        knitr_1.31        magrittr_2.0.1   
 [4] hms_1.0.0         tidyselect_1.1.0  R6_2.5.0         
 [7] rlang_0.4.10      fansi_0.4.2       stringr_1.4.0    
[10] dplyr_1.0.5       tools_4.0.4       xfun_0.22        
[13] utf8_1.1.4        DBI_1.1.1         htmltools_0.5.1.1
[16] ellipsis_0.3.1    assertthat_0.2.1  yaml_2.2.1       
[19] digest_0.6.27     tibble_3.1.0      lifecycle_1.0.0  
[22] crayon_1.4.1      tidyr_1.1.3       purrr_0.3.4      
[25] vctrs_0.3.6       glue_1.4.2        evaluate_0.14    
[28] haven_2.3.1       rmarkdown_2.7     stringi_1.5.3    
[31] compiler_4.0.4    pillar_1.5.1      forcats_0.5.1    
[34] generics_0.1.0    pkgconfig_2.0.3 

@sjkiss, thanks for filing such a thorough issue. I think you can get your desired output with two changes.

  1. change dataverse::get_dataframe_by_doi() to dataverse::get_dataframe_by_name()
  2. change file extension from "tab" (the file created by Dataverse during the ingestion) to "dta" (the original Stata file extension), because you specified original = TRUE

@pdurbin, this dataset doesn't have a *.tab file, unlike a test dataset with a Stata file. I see this dataset was uploaded in May 2020. Did the server software start converting Stata files between then and Dec 2020?

dataverse::get_dataframe_by_name(
  filename   = "2019 Canadian Election Study - Online Survey v1.0.dta", 
  original   = TRUE, 
  .f         = haven::read_dta,
  dataset    = "doi:10.7910/DVN/DUS88V", 
  server     = "dataverse.harvard.edu"
)

output:

# A tibble: 37,822 x 620
   cps19_StartDate     cps19_EndDate       cps19_ResponseId         cps19_consent
   <dttm>              <dttm>              <chr>                        <dbl+lbl>
 1 2019-09-13 08:09:44 2019-09-13 08:36:19 R_1OpYXEFGzHRUpjM 1 [I consent to par~
 2 2019-09-13 08:39:09 2019-09-13 08:57:06 R_2qdrL3J618rxYW0 1 [I consent to par~
 3 2019-09-13 10:01:19 2019-09-13 10:27:29 R_USWDAPcQEQiMmNb 1 [I consent to par~
 4 2019-09-13 10:05:37 2019-09-13 10:50:53 R_3IQaeDXy0tBzEry 1 [I consent to par~
 5 2019-09-13 10:05:52 2019-09-13 10:32:53 R_27WeMQ1asip2cMD 1 [I consent to par~
 6 2019-09-13 10:10:20 2019-09-13 10:29:45 R_3LiGZcCWJEcWV4P 1 [I consent to par~
 7 2019-09-13 10:14:47 2019-09-13 10:32:32 R_1Iu8R1UlYzVMycz 1 [I consent to par~
 8 2019-09-13 10:15:39 2019-09-13 10:30:59 R_2EcS26hqrcVYlab 1 [I consent to par~
 9 2019-09-13 10:15:48 2019-09-13 10:37:45 R_3yrt44wqQ1d4VRn 1 [I consent to par~
10 2019-09-13 10:16:08 2019-09-13 10:40:14 R_10OBmXJyvn8feYQ 1 [I consent to par~
# ... with 37,812 more rows, and 616 more variables:
#   cps19_citizenship <dbl+lbl>, cps19_yob <dbl+lbl>,
#   cps19_yob_2001_age <dbl+lbl>, cps19_gender <dbl+lbl>,
#   cps19_province <dbl+lbl>, cps19_education <dbl+lbl>, cp
...

Dataverse source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DUS88V
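
The extension change in step 2 can be sketched as a small string helper. This is only an illustration: original_filename() is a hypothetical name, not a function in the dataverse package, and it assumes the ingested name differs from the original only in its extension.

```r
# Hypothetical helper (not part of the dataverse package): swap the
# ingested ".tab" extension for the original file's extension.
original_filename <- function(filename, original_ext = "dta") {
  # Replace only a trailing ".tab" (case-insensitive);
  # any other file name passes through unchanged.
  sub("\\.tab$", paste0(".", original_ext), filename, ignore.case = TRUE)
}

original_filename("2019 Canadian Election Study - Online Survey v1.0.tab")
```

The result is the name to pass as filename when original = TRUE, provided the dataset actually stores the original under that name.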

Yes, the main fix is using get_dataframe_by_name since you are specifying a name, as Will wrote. The datafile in question also seems to have its own DOI (https://doi.org/10.7910/DVN/DUS88V/RZFNOV), so get_dataframe_by_doi should work too if you pass that DOI as filedoi and drop the filename argument.

The lack of ingest is something I've spotted recently, especially for large files. If it is something systematic, it is probably a different issue.

Here's @kuriwaki's version using dataverse::get_dataframe_by_doi(). Notice two differences:

  1. the function name, and
  2. the filedoi parameter is used instead of filename (and the datafile-specific "RZFNOV" suffix is added to the DOI).

dataverse::get_dataframe_by_doi(
  filedoi    = "https://doi.org/10.7910/DVN/DUS88V/RZFNOV", 
  original   = TRUE,
  .f         = haven::read_dta,
  server     = "dataverse.harvard.edu"
)

output:

# A tibble: 37,822 x 620
   cps19_StartDate     cps19_EndDate       cps19_ResponseId         cps19_consent
   <dttm>              <dttm>              <chr>                        <dbl+lbl>
 1 2019-09-13 08:09:44 2019-09-13 08:36:19 R_1OpYXEFGzHRUpjM 1 [I consent to par~
 2 2019-09-13 08:39:09 2019-09-13 08:57:06 R_2qdrL3J618rxYW0 1 [I consent to par~
 3 2019-09-13 10:01:19 2019-09-13 10:27:29 R_USWDAPcQEQiMmNb 1 [I consent to par~
 4 2019-09-13 10:05:37 2019-09-13 10:50:53 R_3IQaeDXy0tBzEry 1 [I consent to par~
 5 2019-09-13 10:05:52 2019-09-13 10:32:53 R_27WeMQ1asip2cMD 1 [I consent to par~
 6 2019-09-13 10:10:20 2019-09-13 10:29:45 R_3LiGZcCWJEcWV4P 1 [I consent to par~
 7 2019-09-13 10:14:47 2019-09-13 10:32:32 R_1Iu8R1UlYzVMycz 1 [I consent to par~
 8 2019-09-13 10:15:39 2019-09-13 10:30:59 R_2EcS26hqrcVYlab 1 [I consent to par~
 9 2019-09-13 10:15:48 2019-09-13 10:37:45 R_3yrt44wqQ1d4VRn 1 [I consent to par~
10 2019-09-13 10:16:08 2019-09-13 10:40:14 R_10OBmXJyvn8feYQ 1 [I consent to par~
# ... with 37,812 more rows, and 616 more variables:
#   cps19_citizenship <dbl+lbl>, cps19_yob <dbl+lbl>,
#   cps19_yob_2001_age <dbl+lbl>, cps19_gender <dbl+lbl>,
#   cps19_province <dbl+lbl>, cps19_education <dbl+lbl>, cps19_demsat <dbl+lbl>,
#   cps19_imp_iss <chr>, cps19_imp_iss_party <dbl+lbl>,
...
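
To make the relationship between the two identifiers concrete, here is a minimal sketch that splits a file-level DOI into the dataset DOI and the file suffix. split_file_doi() is a hypothetical helper, not part of the dataverse package, and it assumes the file DOI is simply the dataset DOI plus one extra path segment, as in the example above.

```r
# Hypothetical helper (not part of the dataverse package): split a
# file-level DOI into the dataset DOI and the file-specific suffix.
split_file_doi <- function(filedoi) {
  # Drop a leading "https://doi.org/" or "doi:" so only "10.xxxx/..." remains.
  bare  <- sub("^(https?://doi\\.org/|doi:)", "", filedoi)
  parts <- strsplit(bare, "/", fixed = TRUE)[[1]]
  n     <- length(parts)
  list(
    dataset = paste0("doi:", paste(parts[-n], collapse = "/")),  # all but the last segment
    file    = parts[n]                                           # the last segment
  )
}

split_file_doi("https://doi.org/10.7910/DVN/DUS88V/RZFNOV")
```

For the DOI in this issue, the dataset element is "doi:10.7910/DVN/DUS88V" (the value passed as dataset to get_dataframe_by_name()) and the file element is "RZFNOV".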

> @pdurbin, this dataset doesn't have a *.tab file, unlike a test dataset with a Stata file. I see this dataset was uploaded in May 2020. Did the server software start converting Stata files between then and Dec 2020?

Like @kuriwaki said, it's probably because the Stata file is on the large side at 184 MB. Dataverse can be configured with different thresholds for Excel vs. CSV vs. Stata vs. SPSS vs. RData. Basically, if the file is too big, Dataverse won't even try to ingest it. I'm not sure what these thresholds are for Harvard Dataverse. The setting is called :TabularIngestSizeLimit if you're interested: https://guides.dataverse.org/en/5.3/installation/config.html#tabularingestsizelimit

It's also possible that the Stata file failed ingest, but you can only tell if you're logged in and have access.

A final possibility is that the file was saved by a newer version of Stata than the newest one Dataverse supports, which is Stata 15: https://guides.dataverse.org/en/5.3/user/tabulardataingest/supportedformats.html (I guess this would also count as a failed ingest).
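
Without logging in, a rough way to see whether a file was ingested is to look for a *.tab name in the dataset's file listing. The sketch below only does the name check; is_ingested() is a hypothetical helper and the file names are made up for illustration.

```r
# Hypothetical helper: TRUE for file names that look like ingested
# tabular files, i.e. names ending in ".tab" (case-insensitive).
is_ingested <- function(filenames) {
  grepl("\\.tab$", filenames, ignore.case = TRUE)
}

# Made-up file names for illustration; not the real listing of any dataset.
is_ingested(c("survey.tab", "survey.dta", "codebook.pdf"))
```

With the dataverse package, the names would presumably come from the files element returned by dataverse::get_dataset(), assuming that listing exposes a filename column. The check is only a heuristic: it cannot distinguish a file that was skipped for its size from one that failed ingest, and a file uploaded with a .tab extension would also match.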

@sjkiss, I believe this issue has addressed your main concern and provided answers to the auxiliary questions that popped up. Please reopen the issue if not.

> Dataverse can be configured with different thresholds for Excel vs. CSV vs. Stata vs. SPSS vs. RData. Basically, if the file is too big, Dataverse won't even try to ingest it. I'm not sure what these thresholds are for Harvard Dataverse.

I just wanted to point out that even large files can be ingested after the fact, as needed. Here's an example: IQSS/dataverse.harvard.edu#103

Thanks all!