IQSS / dataverse-client-r

R Client for Dataverse Repositories

Home Page: https://iqss.github.io/dataverse-client-r

Parsing error for some large Stata files

kuriwaki opened this issue · comments

The download and parsing for this large file errors out, though I'm not sure why.

The get_file stage succeeds; it only fails at the parsing/reading stage. It is a large Stata .dta file, and the package treats it as not ingested even though in the GUI it appears to be ingested.

out <- get_dataframe_by_doi("https://doi.org/10.7910/DVN/XFXJVY/GQMQLG")
# Error in get_dataframe_by_id(fileid = filedoi, .f = .f, original = original,  : 
#   read-in function was left NULL, but the target file is not ingested or you asked for 
# the original version. Please supply a .f argument.

out <- get_dataframe_by_doi("https://doi.org/10.7910/DVN/XFXJVY/GQMQLG",  original = TRUE, .f = haven::read_dta)
# Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, 
# skip, name_repair = .name_repair) : 
#   Failed to parse /private/var/folders/58/bg0qcxrs7s7fgf3x_qlj8njw0000gn/T/Rtmpi5QTMl/foo128340268bdd: 
# This version of the file format is not supported.

# the raw version works.
out <- get_file_by_doi("https://doi.org/10.7910/DVN/XFXJVY/GQMQLG")
class(out)
# [1] "raw"

So is this a haven problem? Can haven convert it, independently of the dataverse package?
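
One quick check, as a sketch: write the raw vector from get_file_by_doi to a temporary file and hand that file to haven directly, independent of anything the dataverse package does after download (this assumes the raw vector holds the original Stata bytes):

tmp <- tempfile(fileext = ".dta")
writeBin(out, tmp)           # `out` is the raw vector from get_file_by_doi above
class(haven::read_dta(tmp))  # a failure here would implicate the bytes, not dataverse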

It's relevant for dataverse in the sense that this was not a problem before 0.3.0 (see before and after as separate reprexes below). I think it might have something to do with the defaults in httr::GET -- maybe they were inadvertently changed.
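To test that hypothesis, here is a sketch that bypasses httr entirely and pulls the original file straight from the public Dataverse file-access API with base R (the DOI is the file's persistent identifier):

tmp <- tempfile(fileext = ".dta")
download.file(
  paste0("https://dataverse.harvard.edu/api/access/datafile/:persistentId",
         "?persistentId=doi:10.7910/DVN/XFXJVY/GQMQLG&format=original"),
  destfile = tmp, mode = "wb"  # binary mode matters on Windows
)
class(haven::read_dta(tmp))    # if this succeeds, the httr download path is suspect
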

# Version BEFORE 0.3.0, but after a patch fixed some get_file errors 
# remotes::install_github("IQSS/dataverse-client-r", 
#                         ref = "0b3d67c70cb90dbaf499d65c59972512f4a8eb7a")

library(dataverse)
library(haven)

packageVersion("dataverse")
#> [1] '0.2.1.9001'

out <- get_file(file = "CCES14_Common_Content_Validated.tab", 
                dataset  = "10.7910/DVN/XFXJVY", 
                server = "dataverse.harvard.edu")

# render the raw binary -- should be a tibble
class(read_dta(out))
#> [1] "tbl_df"     "tbl"        "data.frame"
# Current CRAN Version

library(dataverse)
library(haven)

packageVersion("dataverse")
#> [1] '0.3.0'

out <- get_file(file = "CCES14_Common_Content_Validated.tab", 
                dataset  = "10.7910/DVN/XFXJVY", 
                server = "dataverse.harvard.edu")

# render the raw binary -- should be a tibble
class(read_dta(out))
#> Error in df_parse_dta_raw(spec, encoding, cols_skip, n_max, skip, name_repair = .name_repair): Failed to parse file: This version of the file format is not supported.

Created on 2021-02-04 by the reprex package (v0.3.0)

Hey all! I wanted to document some issues I've been having with the dataverse package. It is also related to CES data, though I do not think the problem is unique to it. The package seems to have difficulty parsing large DTA files that are stored as TAB files on Dataverse.

ces2016_dataverse <- get_dataframe_by_name(
     filename = "CCES16_Common_OUTPUT_Feb2018_VV.tab",
     dataset = "10.7910/DVN/GDF6Z0",
     original = TRUE,
     .f = readr::read_tsv,
     server = "dataverse.harvard.edu"
)

# Warning: 5559 parsing failures.
# See problems(...) for more details.
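
readr attaches its parse diagnostics to the returned tibble, so you can inspect exactly which rows and columns failed; a quick sketch:

# Show the rows/columns that hit parse failures
readr::problems(ces2016_dataverse)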

We use read_dta to obtain the original DTA file:

ces2012_dataverse <- get_dataframe_by_name(
  filename = "CCES12_Common_VV.tab",
  dataset = "10.7910/DVN/HQEVPK",
  original = TRUE,
  .f = haven::read_dta,
  server = "dataverse.harvard.edu"
)

But when I try to use read_dta with the 2016 file mentioned above, my code errors out completely. Instead of getting any parsing warnings, the file is not read in at all. Here is the error message I get:

ces2016_dataverse <- get_dataframe_by_name(
  filename = "CCES16_Common_OUTPUT_Feb2018_VV.tab",
  dataset = "10.7910/DVN/GDF6Z0",
  original = TRUE,
  .f = haven::read_dta,
  progress = TRUE,
  server = "dataverse.harvard.edu" )

# Error: Failed to parse /private/var/folders/xr/xb6wvj7d3yd29p9mf04syjkw0000gn/T/Rtmp1rMbOV/fooe2db5c47f2c4: This version of the file format is not supported.

The same happens for filename = "CCES14_Common_Content_Validated.tab", dataset = "doi:10.7910/DVN/XFXJVY".

If I download the original DTA file onto my local computer, I can then use read_dta on it with no problems. Let me know if I can provide any other documentation to help with this issue.

This is consistent with the first post. Your post is helpful: one possible explanation is that httr::GET is corrupting some complex parsing/delimiter details of the data, so when read_dta tries to read the downloaded tempfile, it cannot. So things do manage to download.
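
A sketch of one way to test that corruption hypothesis: checksum the bytes the package returns against a manually downloaded copy of the same original file (the local path is illustrative, and this assumes get_file_by_doi accepts an original argument like the get_dataframe_* functions do):

raw_pkg <- get_file_by_doi("https://doi.org/10.7910/DVN/XFXJVY/GQMQLG",
                           original = TRUE)
pkg_tmp <- tempfile(fileext = ".dta")
writeBin(raw_pkg, pkg_tmp)
# Matching hashes would rule out corruption in the download path
tools::md5sum(c(pkg_tmp, path.expand("~/Downloads/CCES14_original.dta")))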

The "Warning: 1581 parsing failures." are warnings so may not be relevant to this thread, which is about the error. Of course it might be that the parsing failures are what is driving this. Many parse failures here are tab delimiters not working in a few dozen places in the dataset. However, the CCES12_Common_VV.tab that does work also has parsing failure warnings as well.

Shiro, when you write "the CCES12_Common_VV.tab that does work also has parsing failure warnings as well," when do you see those warnings? When I pull this data with the code I provided in my post, my R console gives me no parsing warnings. My apologies if this is taking us too far afield; I just want to make sure I follow your logic.

To get the warning, I read in the 2012 file as a TSV, with original = FALSE (that is, the ingested version).
I see that your read_tsv code for 2014 and 2016 had original = TRUE. I'm surprised that worked at all: original = TRUE for an ingested dataset should download the original format (Stata .dta) as a binary, so read_tsv should not have been able to read that.

ces2012_dataverse <- get_dataframe_by_name(
    filename = "CCES12_Common_VV.tab",
    dataset = "10.7910/DVN/HQEVPK",
    original = FALSE,
    .f = read_tsv,
    server = "dataverse.harvard.edu")

# Warning message:
# One or more parsing issues, see `problems()` for details

For more on ingest, a Dataverse term, see here.

I've done a bit more testing by putting up various versions of the CCES data on a separate demo.dataverse.org account: https://demo.dataverse.org/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2FV14QR6

I haven't found why this fails, but it might be related to ingest. Interestingly, subsets of the problematic data (2014 and 2016) do work. I made 51 partitions of the data, one for each state, and uploaded them at the link above. They all ingested fine, and I could read them via get_dataframe_by_name with no problem. However, when I tried to upload the whole 2016 data again, demo.dataverse.org said it failed to ingest it and pushed it as an original .dta (see here). This is different from the original 2016 dataset.

I'll have to ask Dataverse folks for more details about their API in the new year.

@kuriwaki it looks like CCES16_Common_OUTPUT_Feb2018_VV.dta failed ingest. It's possibly due to the large size (546.7 MB).

In IQSS/dataverse#8271 (not merged yet) we're proposing to give the researcher more information about upload limits and ingest limits. Right now it looks like this:

[Screenshot: proposed display of upload and ingest limits]

(These limits are arbitrary and random, by the way, but I hope you get the idea.)

You are welcome to leave feedback on that pull request. Thanks!

It's possibly due to the large size (546.7 MB).

Me again. I tried it on demo and the size is allowed but I got a strange error:

Tabular data ingest failed. Ingest failed to produce Summary Statistics and/or UNF signatures; /tmp/tempTabfile.9853244723487760894.tab (No such file or directory)

Here it is in context:

[Screenshot: the ingest error message shown in context on demo.dataverse.org]

@pdurbin: great. If it's not failing because of size, that could be a hint as to why we are having trouble reading in the original version, uploaded in 2018 (here). For example, I think this is not a pure Stata file but instead a Stata file translated from an SPSS .sav file via StatTransfer or other software, which might explain the error. I'll look into it.

Does the Harvard Dataverse file that I linked to ^ show any traces of ingest issues -- especially ones that the similarly sized CCES 2012 (here) does not?

And thanks for mentioning the PR -- the updated error message would certainly be useful. As you probably know, the message I got in December read: "ingest process has finished with errors. ...Ingested files: CCES16_Common_OUTPUT_Feb2018_VV.dta (Error)"

Ok, I finally have an answer for the narrow issue. I believe it fails because the is_ingested detection is wrong. When is_ingested returns FALSE even though the file is ingested, we force format to NULL (instead of keeping format = "original"), which messes up the way we download the file.

If we can rewrite is_ingested, that would be great, but it may take a while -- I posed the question in #113.
Perhaps a quicker route for now is to change some defaults. For example, we might force format = "original" whenever original = TRUE (get_dataframe_* defaults to original = FALSE now), so that we'd get the original regardless of the accuracy of is_ingested.
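
Purely as an illustration of that second route (the function name and internals here are hypothetical, not the package's actual code):

# Hypothetical sketch: derive the API `format` query parameter from the
# caller's intent alone, so a misfiring is_ingested() can no longer
# nullify format = "original".
choose_format <- function(original) {
  if (original) "original" else NULL  # NULL -> server default (ingested .tab)
}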

This should be fixed in the new CRAN version now. The short answer is that the particular CCES files we were dealing with were somewhat mal-ingested: the ingest did go through, but it did not generate a metadata file, and we had been relying on that file to detect ingest status. By switching to a less hacky method in #113, we can now avoid this issue.
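
For anyone landing here later, the failing call from the top of the thread should now run under the updated release (assuming it is installed):

# With the fixed CRAN version, the original repro should succeed
out <- get_dataframe_by_doi(
  "https://doi.org/10.7910/DVN/XFXJVY/GQMQLG",
  original = TRUE,
  .f = haven::read_dta
)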

Thanks @pdurbin. And @pdeshlab, thanks for the examples and let me know what you find with the new CRAN version.