Error in prepend_doi(filedoi) : argument "filedoi" is missing, with no default

Question

sjkiss opened this issue 3 years ago

I would like to download the Stata tab file here, but I am getting this error. I am running dataverse 0.3.7.

Note that I'm pretty sure I've set the server and the key properly.

Error in prepend_doi(filedoi) : 
  argument "filedoi" is missing, with no default

## code goes here

library(dataverse)
get_dataframe_by_doi(filename="2019 Canadian Election Study - Online Survey v1.0.tab", 
                     original=T, .f=haven::read_dta,dataset="doi:10.7910/DVN/DUS88V", 
                     server="dataverse.harvard.edu")



## session info for your system
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] dataverse_0.3.7 labelled_2.8.0  cesdata_0.1.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        knitr_1.31        magrittr_2.0.1   
 [4] hms_1.0.0         tidyselect_1.1.0  R6_2.5.0         
 [7] rlang_0.4.10      fansi_0.4.2       stringr_1.4.0    
[10] dplyr_1.0.5       tools_4.0.4       xfun_0.22        
[13] utf8_1.1.4        DBI_1.1.1         htmltools_0.5.1.1
[16] ellipsis_0.3.1    assertthat_0.2.1  yaml_2.2.1       
[19] digest_0.6.27     tibble_3.1.0      lifecycle_1.0.0  
[22] crayon_1.4.1      tidyr_1.1.3       purrr_0.3.4      
[25] vctrs_0.3.6       glue_1.4.2        evaluate_0.14    
[28] haven_2.3.1       rmarkdown_2.7     stringi_1.5.3    
[31] compiler_4.0.4    pillar_1.5.1      forcats_0.5.1    
[34] generics_0.1.0    pkgconfig_2.0.3

Will Beasley · Answer 1 · Tue Mar 30 2021 23:44:39 GMT+0800 (China Standard Time)

@sjkiss, thanks for filing such a thorough issue. I think you can get your desired output with two changes.

Change dataverse::get_dataframe_by_doi() to dataverse::get_dataframe_by_name().
Change the file extension from "tab" (the file created by Dataverse during ingestion) to "dta" (the original Stata file extension), because you specified original = TRUE.

@pdurbin, this dataset doesn't have a *.tab file, unlike a test dataset with a Stata file. I see this dataset was uploaded in May 2020. Did the server software start converting Stata files between then and Dec 2020?

dataverse::get_dataframe_by_name(
  filename   = "2019 Canadian Election Study - Online Survey v1.0.dta", 
  original   = TRUE, 
  .f         = haven::read_dta,
  dataset    = "doi:10.7910/DVN/DUS88V", 
  server     = "dataverse.harvard.edu"
)

output:

# A tibble: 37,822 x 620
   cps19_StartDate     cps19_EndDate       cps19_ResponseId         cps19_consent
   <dttm>              <dttm>              <chr>                        <dbl+lbl>
 1 2019-09-13 08:09:44 2019-09-13 08:36:19 R_1OpYXEFGzHRUpjM 1 [I consent to par~
 2 2019-09-13 08:39:09 2019-09-13 08:57:06 R_2qdrL3J618rxYW0 1 [I consent to par~
 3 2019-09-13 10:01:19 2019-09-13 10:27:29 R_USWDAPcQEQiMmNb 1 [I consent to par~
 4 2019-09-13 10:05:37 2019-09-13 10:50:53 R_3IQaeDXy0tBzEry 1 [I consent to par~
 5 2019-09-13 10:05:52 2019-09-13 10:32:53 R_27WeMQ1asip2cMD 1 [I consent to par~
 6 2019-09-13 10:10:20 2019-09-13 10:29:45 R_3LiGZcCWJEcWV4P 1 [I consent to par~
 7 2019-09-13 10:14:47 2019-09-13 10:32:32 R_1Iu8R1UlYzVMycz 1 [I consent to par~
 8 2019-09-13 10:15:39 2019-09-13 10:30:59 R_2EcS26hqrcVYlab 1 [I consent to par~
 9 2019-09-13 10:15:48 2019-09-13 10:37:45 R_3yrt44wqQ1d4VRn 1 [I consent to par~
10 2019-09-13 10:16:08 2019-09-13 10:40:14 R_10OBmXJyvn8feYQ 1 [I consent to par~
# ... with 37,812 more rows, and 616 more variables:
#   cps19_citizenship <dbl+lbl>, cps19_yob <dbl+lbl>,
#   cps19_yob_2001_age <dbl+lbl>, cps19_gender <dbl+lbl>,
#   cps19_province <dbl+lbl>, cps19_education <dbl+lbl>, cp
...

Dataverse source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DUS88V

Shiro Kuriwaki · Answer 2 · Tue Mar 30 2021 23:49:11 GMT+0800 (China Standard Time)

Yes, the main fix is to use get_dataframe_by_name, since you are specifying a name, as Will wrote. The datafile in question also seems to have its own DOI (https://doi.org/10.7910/DVN/DUS88V/RZFNOV), so get_dataframe_by_doi should work too if you pass that DOI as filedoi (with no filename argument).
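To make the relationship between the two identifiers concrete, here is a minimal base-R sketch. It assumes, based only on the DOIs in this thread, that a file-level DOI is the dataset DOI plus one extra path segment. `split_file_doi()` is a hypothetical helper, not a function from the dataverse package.

```r
# Hypothetical helper (not part of the dataverse package).
# Assumption: the file DOI is the dataset DOI followed by one extra
# "/"-separated segment, as in the example from this thread.
split_file_doi <- function(filedoi) {
  # Drop a resolver URL or a "doi:" prefix, leaving the bare identifier.
  bare  <- sub("^(https?://(dx\\.)?doi\\.org/|doi:)", "", filedoi)
  parts <- strsplit(bare, "/", fixed = TRUE)[[1]]
  n     <- length(parts)
  stopifnot(n >= 3)
  list(
    dataset = paste0("doi:", paste(parts[-n], collapse = "/")),
    file    = parts[n]
  )
}

ids <- split_file_doi("https://doi.org/10.7910/DVN/DUS88V/RZFNOV")
ids$dataset  # "doi:10.7910/DVN/DUS88V"
ids$file     # "RZFNOV"
```

The dataset part is what get_dataframe_by_name takes as dataset; the full file-level DOI is what get_dataframe_by_doi takes as filedoi.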

The lack of ingest is something I've spotted recently, especially for large files. If it is something systematic, it is probably a different issue.

Will Beasley · Answer 3 · Tue Mar 30 2021 23:56:40 GMT+0800 (China Standard Time)

Here's @kuriwaki's version using dataverse::get_dataframe_by_doi(). Notice two differences:

the function name, and
the filedoi parameter is used instead of filename (and the datafile-specific "RZFNOV" suffix is added to the doi).

dataverse::get_dataframe_by_doi(
  filedoi    = "https://doi.org/10.7910/DVN/DUS88V/RZFNOV", 
  original   = TRUE,
  .f         = haven::read_dta,
  server     = "dataverse.harvard.edu"
)

output:

# A tibble: 37,822 x 620
   cps19_StartDate     cps19_EndDate       cps19_ResponseId         cps19_consent
   <dttm>              <dttm>              <chr>                        <dbl+lbl>
 1 2019-09-13 08:09:44 2019-09-13 08:36:19 R_1OpYXEFGzHRUpjM 1 [I consent to par~
 2 2019-09-13 08:39:09 2019-09-13 08:57:06 R_2qdrL3J618rxYW0 1 [I consent to par~
 3 2019-09-13 10:01:19 2019-09-13 10:27:29 R_USWDAPcQEQiMmNb 1 [I consent to par~
 4 2019-09-13 10:05:37 2019-09-13 10:50:53 R_3IQaeDXy0tBzEry 1 [I consent to par~
 5 2019-09-13 10:05:52 2019-09-13 10:32:53 R_27WeMQ1asip2cMD 1 [I consent to par~
 6 2019-09-13 10:10:20 2019-09-13 10:29:45 R_3LiGZcCWJEcWV4P 1 [I consent to par~
 7 2019-09-13 10:14:47 2019-09-13 10:32:32 R_1Iu8R1UlYzVMycz 1 [I consent to par~
 8 2019-09-13 10:15:39 2019-09-13 10:30:59 R_2EcS26hqrcVYlab 1 [I consent to par~
 9 2019-09-13 10:15:48 2019-09-13 10:37:45 R_3yrt44wqQ1d4VRn 1 [I consent to par~
10 2019-09-13 10:16:08 2019-09-13 10:40:14 R_10OBmXJyvn8feYQ 1 [I consent to par~
# ... with 37,812 more rows, and 616 more variables:
#   cps19_citizenship <dbl+lbl>, cps19_yob <dbl+lbl>,
#   cps19_yob_2001_age <dbl+lbl>, cps19_gender <dbl+lbl>,
#   cps19_province <dbl+lbl>, cps19_education <dbl+lbl>, cps19_demsat <dbl+lbl>,
#   cps19_imp_iss <chr>, cps19_imp_iss_party <dbl+lbl>,
...
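As a rule of thumb for the two working calls in this thread, the choice of function follows from which identifier you have. The sketch below is only an illustration: `choose_getter()` is a hypothetical helper, not part of the dataverse package, and it only returns the name of the function to call rather than downloading anything.

```r
# Hypothetical helper (not part of the dataverse package): pick the
# retrieval function based on which identifier is available.
choose_getter <- function(filename = NULL, dataset = NULL, filedoi = NULL) {
  if (!is.null(filedoi)) {
    # A file-level DOI identifies the file on its own.
    return("get_dataframe_by_doi")
  }
  if (!is.null(filename) && !is.null(dataset)) {
    # A file name is only meaningful together with its dataset.
    return("get_dataframe_by_name")
  }
  stop("Supply either `filedoi`, or both `filename` and `dataset`.")
}

choose_getter(
  filename = "2019 Canadian Election Study - Online Survey v1.0.dta",
  dataset  = "doi:10.7910/DVN/DUS88V"
)
# "get_dataframe_by_name"

choose_getter(filedoi = "https://doi.org/10.7910/DVN/DUS88V/RZFNOV")
# "get_dataframe_by_doi"
```

The original error came from the third combination: a filename and dataset passed to get_dataframe_by_doi, which expects filedoi.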

Philip Durbin · Answer 4 · Wed Mar 31 2021 02:01:44 GMT+0800 (China Standard Time)

> @pdurbin, this dataset doesn't have a *.tab file, unlike a test dataset with a Stata file. I see this dataset was uploaded in May 2020. Did the server software start converting Stata files between then and Dec 2020?

Like @kuriwaki said, it's probably because the Stata file is on the large side at 184 MB. Dataverse can be configured with different thresholds for Excel vs. CSV vs. Stata vs. SPSS vs. RData. Basically, if the file is too big, Dataverse won't even try to ingest it. I'm not sure what these thresholds are for Harvard Dataverse. The setting is called :TabularIngestSizeLimit if you're interested: https://guides.dataverse.org/en/5.3/installation/config.html#tabularingestsizelimit

It's also possible that the Stata file failed ingest, but you can only tell that if you're logged in and have access.

A final possibility is that it's a newer version of Stata than Dataverse supports, which is Stata 15: https://guides.dataverse.org/en/5.3/user/tabulardataingest/supportedformats.html (I guess this would also be a failure).
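The size-threshold explanation can be pictured with a small model. This is not Dataverse code: `would_attempt_ingest()` and the limits below are made up for illustration, since the actual thresholds for Harvard Dataverse are not known here. It assumes a format-specific limit overrides the global one, and that a file larger than the applicable limit is skipped.

```r
# Illustrative model only; not how Dataverse is implemented.
# `limits` is a named list of byte thresholds: "global" plus optional
# per-format entries (for example "DTA" for Stata files).
would_attempt_ingest <- function(size_bytes, format, limits) {
  # Assumption: a format-specific limit takes precedence over the global one.
  limit <- if (!is.null(limits[[format]])) limits[[format]] else limits[["global"]]
  # Assumption: with no limit configured, size is not a reason to skip.
  if (is.null(limit)) {
    return(TRUE)
  }
  size_bytes <= limit
}

# Made-up thresholds, NOT the real Harvard Dataverse settings.
example_limits <- list(global = 2e9, DTA = 1e8)

would_attempt_ingest(184e6, "DTA", example_limits)  # FALSE: 184 MB exceeds the made-up Stata limit
would_attempt_ingest(184e6, "CSV", example_limits)  # TRUE: falls back to the made-up global limit
```

Under thresholds like these, a 184 MB Stata file would be stored as uploaded and never get a *.tab version, which matches what was observed for this dataset.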

Will Beasley · Answer 5 · Wed Mar 31 2021 02:13:48 GMT+0800 (China Standard Time)

@sjkiss, I believe this issue has addressed your main concern and provided answers to the auxiliary questions that popped up. Please reopen the issue if not.

Philip Durbin · Answer 6 · Thu Apr 01 2021 22:58:31 GMT+0800 (China Standard Time)

> Dataverse can be configured with different thresholds for Excel vs. CSV vs. Stata vs. SPSS vs. RData. Basically, if the file is too big, Dataverse won't even try to ingest it. I'm not sure what these thresholds are for Harvard Dataverse.

I just wanted to point out that, after the fact, even large files can be ingested as needed. Here's an example: IQSS/dataverse.harvard.edu#103

sjkiss · Answer 7 · Thu Apr 08 2021 23:03:20 GMT+0800 (China Standard Time)

Thanks all!