INCF / neuroscience-data-structure

Space for discussion of a standardized structure (directory layout + metadata) for experimental data in systems neuroscience, similar to the idea of BIDS in neuroimaging

NIH Common Fund - SPARC Dataset Structure

jgrethe opened this issue

There is an effort within the SPARC program to develop such a structure:
https://sparc.science/help/3FXikFXC8shPRd8xZqhjVT

A white paper is about to be published. There is also some tooling being developed to assist researchers in migrating files to the structure as well as tools for validation.

Thanks @jgrethe for this post!

I think this is the same initiative as the one @tgbugs described, e.g. here: #4 (comment), correct? (If so, I propose closing this issue to keep things centralized in a single thread... OK with you, @jgrethe?)

Cheers,

Sylvain

.xlsx files for seemingly trivial tabular data - yikes! Does anyone know what the motivation was for going with that beast instead of a simple .tsv?

👍
This sounds like something worth correcting

.tsv, .csv, and .json are all supported as well. The reason .xlsx is supported is that, originally, it was easier for non-technical users to work with in their existing workflows, and because we needed an additional layer in the files to communicate required vs. optional fields. Over time another reason emerged: it is next to impossible to get non-technical users to fix bad file encodings (e.g. latin-1). We have also found that users struggle with tsv vs. csv vs. semicolon-separated, and sticking to their defaults avoids many layers of confusion.
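To make the encoding/delimiter problem concrete, here is a minimal sketch (purely illustrative, not part of the SPARC tooling) of the guessing a validator has to do for plain-text deposits; an xlsx container fixes both the encoding and the cell boundaries, so none of this is needed there:

```python
import csv

def read_user_tabular(path):
    """Best-effort reader for a deposited 'csv' of unknown dialect."""
    for encoding in ("utf-8", "latin-1"):  # encodings seen in the wild
        try:
            with open(path, newline="", encoding=encoding) as f:
                sample = f.read(4096)
                # Guess comma vs. semicolon vs. tab; raises csv.Error if ambiguous.
                dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
                f.seek(0)
                return list(csv.reader(f, dialect))
        except (UnicodeDecodeError, csv.Error):
            continue
    raise ValueError(f"could not determine encoding/delimiter for {path}")
```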

There is some tension between the deposition format (xlsx) and the more interoperable formats that we might like to publish with the dataset. Right now we have only implemented functions that go from xlsx -> json, but we have plans to implement the other direction as well, so that the xlsx file could serve purely as a user interface and never actually appear in the published dataset.
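For illustration, a minimal xlsx -> json export can be as small as the sketch below, assuming openpyxl and a first-row-header convention (the function name is hypothetical, not the actual SPARC implementation):

```python
import json
from openpyxl import load_workbook

def xlsx_to_json(path):
    """Read the active sheet, treating row 1 as the header."""
    ws = load_workbook(path, read_only=True).active
    rows = ws.iter_rows(values_only=True)
    header = [str(h) for h in next(rows)]
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, indent=2, default=str)
```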

@SylvainTakerkart yes, same one I mention in #4 (comment).

So every tool supporting this format for output needs to be able to write xlsx and also ensure consistent dumps in all the other formats? In other words: a multiplicity of possible data representations IMHO just invites inconsistency and I/O difficulty, for unclear benefit, since Excel etc. open tsv just fine.

@yarikoptic no. Writing xlsx is only needed to make the life of the user easier if they are depositing data in xlsx format. In the minimal case writing xlsx would not be required, and for publication we might replace the xlsx files with tsv or json so that people who wanted to use the dataset did not have to deal with parsing the xlsx files.

In the minimal case a validator would just read the xlsx file in and tell the user "this is malformed." That validation is implemented at 3 levels: xlsx -> generic tabular, tabular -> json, and json. Only the xlsx -> generic tabular step needs work beyond what csv/tsv would require. In the maximal case it can be easier to show users their errors by writing out another xlsx file with all the bad fields marked in red. If you were doing this via a web interface there are other options, and of course the user might never interact with the underlying json structure at all.
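A hedged sketch of what those three levels might look like (stage names, rules, and messages are illustrative only, not the SPARC validator's API):

```python
def validate(cells, required=("subject_id",)):
    """cells: the generic tabular form already extracted from the xlsx."""
    errors = []
    if not cells:
        return [], ["tabular: empty table"]
    # Level 1 output sanity: the extracted grid should be rectangular.
    if len({len(row) for row in cells}) > 1:
        errors.append("tabular: ragged rows, not a rectangular table")
    # Level 2: tabular -> json, treating the first row as the header.
    header, *rows = cells
    records = [dict(zip(header, row)) for row in rows]
    # Level 3: json-level checks, e.g. required vs. optional fields.
    for i, rec in enumerate(records, start=2):  # spreadsheet row numbers
        for field in required:
            if not rec.get(field):
                errors.append(f"json: row {i} missing required field '{field}'")
    return records, errors
```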

edit: with regard to possible inconsistency, we have found that the more steps away from their defaults a user has to take, the more likely they are to produce inconsistent data. By supporting the defaults that 90% of our data depositors already use, we cut out a lot of steps that they can screw up.

In short, there are more human errors that can happen with tsv and csv, and they are significantly harder to fix than any of the implementation issues that might or might not be encountered with xlsx. Note that I think this is true despite the fact that the current implementation of the validation pipeline always runs two parsers over each xlsx file so that we can catch different sets of errors. Better to do that than to try to get 20 different labs to change how they save their files across 3 operating systems and 5 different localization defaults (probably more operating systems, actually, because some labs are likely still on Windows XP for some of their data acquisition computers).
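Running two parsers and pooling their complaints could look roughly like this (pairing openpyxl with pylightxl is my own choice for the sketch, not necessarily what the SPARC pipeline uses):

```python
import openpyxl
import pylightxl

def xlsx_complaints(path):
    """Collect errors from two independent xlsx parsers."""
    errors = []
    try:
        openpyxl.load_workbook(path)
    except Exception as e:   # one parser trips over one class of defects...
        errors.append(f"openpyxl: {e}")
    try:
        pylightxl.readxl(fn=path)
    except Exception as e:   # ...while the other catches a different set
        errors.append(f"pylightxl: {e}")
    return errors
```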

The paper on the SPARC Data Structure is here: https://doi.org/10.1101/2021.02.10.430563