Decisions on BIDS derivatives structure

Lestropie opened this issue 2 years ago · comments

While I have written a lot of text in various locations regarding core decisions that need to be made regarding the definitions of filesystem paths for DWI derivatives, they may be too verbose or DWI-specific and therefore not be appropriate for widespread community engagement.

It is my intention to first post what I believe to be the viable solutions to these issues. Others are free to comment and even make alternative suggestions. Once the set of viable solutions is established, I will then construct polls to evaluate the degree of community consensus.

The example

We have a hypothetical DWI model called ABC. This model is represented using parameters X and Y. X and Y are of fundamentally different data types, such that it is not possible to store both in a single NIfTI image, and they must be split across multiple images.

For metadata, there is information that is relevant to model ABC as a whole, and there is additionally information that is specific to parameter X and parameter Y separately.

Following fitting of the model to the empirical data, it is possible to derive from X and Y another parameter of interest Z. This may in and of itself require metadata to explain how it was calculated.

Decision 1: Directory structure

(For the sake of discussion of directory structure, I will assume the existence of a new entity with key "model", and two new suffixes: "model", and "mdp" (model-derived parameter). This corresponds to decision 2, option 1 "few suffixes", but is used for demonstrative purposes in the context of decision 1 only, and the two decisions should be considered independent)

See also: #32

Option 1: "Complex inheritance"

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-x_model.json
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-y_model.json
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_param-z_mdp.json
        sub-01_model-abc_model.json

Advantages:

No change to BIDS filesystem structure
Metadata relevant to ABC as a whole is centralised
Generalisation of inheritance principle has wider applicability
Supports even more advanced use cases. Eg. one could have, within a single model, multiple components, each of which has associated sidecar information; then, each of those components may themselves have multiple parameters necessitating their own individual sidecar information.

Disadvantages:

Requires modification to Inheritance Principle to be legal: bids-standard/bids-specification#1003
Number of files in modality-specific directories may increase to such an extent that manual navigation becomes difficult
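
To make the "complex inheritance" idea concrete, here is a minimal Python sketch of how a tool might decide which sidecars apply to a given image. It assumes a generalised rule in which a JSON applies whenever all of its entities also appear, with the same values, on the data file, and in which the suffix is ignored for matching (so that the whole-model "_model.json" also applies to "_mdp" images). That rule, and the helper names parse_name and applicable_sidecars, are illustrative assumptions and not text from the specification or from bids-standard/bids-specification#1003.

```python
def parse_name(filename):
    """Split a BIDS-style file name into (entities, suffix, extension)."""
    stem, _, ext = filename.partition(".")
    *pairs, suffix = stem.split("_")
    entities = dict(p.split("-", 1) for p in pairs)
    return entities, suffix, ext


def applicable_sidecars(data_file, json_files):
    """Return the JSON files whose entities are a subset of the data file's.

    Results are ordered from least to most specific, so that a caller
    could merge them in order and let more specific values win.
    Assumption: suffixes are deliberately ignored when matching.
    """
    data_entities, _, _ = parse_name(data_file)
    matches = []
    for candidate in json_files:
        entities, _, ext = parse_name(candidate)
        if ext != "json":
            continue
        if all(data_entities.get(k) == v for k, v in entities.items()):
            matches.append((len(entities), candidate))
    return [name for _, name in sorted(matches)]


files = [
    "sub-01_model-abc_param-x_model.json",
    "sub-01_model-abc_param-y_model.json",
    "sub-01_model-abc_param-z_mdp.json",
    "sub-01_model-abc_model.json",
]
print(applicable_sidecars("sub-01_model-abc_param-x_model.nii.gz", files))
```

For the x image this selects the whole-model JSON first and the paired JSON second; the y and z sidecars are excluded because their "param" entity differs.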

Option 2: "No inheritance"

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-x_model.json
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-y_model.json
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_param-z_mdp.json

Advantages:

No changes to specification whatsoever

Disadvantages:

Information regarding fit of model ABC (eg. fitting parameters, software references, publication URL) must be duplicated across multiple JSONs; any downstream application that needs to know these contents would ideally need to explicitly compare these contents across JSONs to verify consistency.
Same as option 1 in that number of files within any given directory could grow very large.
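
As a rough illustration of the consistency check that the first disadvantage implies, the following Python sketch compares a set of model-wide fields across the per-parameter sidecars. The field name "FitParameters" and the function inconsistent_keys are hypothetical; "ModelURL" is borrowed from the JSON example elsewhere in this post. Which fields would count as model-wide is itself an assumption here.

```python
import json


def inconsistent_keys(sidecars, shared_keys):
    """Report model-wide keys whose values differ between sidecars.

    sidecars:    dict mapping file name -> already-loaded metadata dict
    shared_keys: keys that are expected to be identical in every sidecar
    Returns a dict mapping each offending key to its per-file values.
    """
    problems = {}
    for key in shared_keys:
        values = {name: meta.get(key) for name, meta in sidecars.items()}
        # Serialise so that nested lists / dicts can be compared in a set
        distinct = {json.dumps(v, sort_keys=True) for v in values.values()}
        if len(distinct) > 1:
            problems[key] = values
    return problems


sidecars = {
    "sub-01_model-abc_param-x_model.json": {
        "ModelURL": "https://example.org/abc", "FitParameters": {"lmax": 8}},
    "sub-01_model-abc_param-y_model.json": {
        "ModelURL": "https://example.org/abc", "FitParameters": {"lmax": 6}},
}
print(inconsistent_keys(sidecars, ["ModelURL", "FitParameters"]))
```

Every downstream application would need some equivalent of this check, which is the duplication cost being described.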

Option 3: Directory hierarchy

sub-01/
    dwi/
        sub-01_model-abc_model/
            sub-01_model-abc_param-x_model.nii.gz
            sub-01_model-abc_param-x_model.json
            sub-01_model-abc_param-y_model.nii.gz
            sub-01_model-abc_param-y_model.json
            sub-01_model-abc_param-z_mdp.nii.gz
            sub-01_model-abc_param-z_mdp.json
        sub-01_model-abc_model.json

Advantages:

Natural exploitation of hierarchical nature of filesystem to reflect hierarchical nature of model data
Sets precedent for expanding modality directories to include sub-directories, which is a core component of TRX for tractography data and will therefore be requisite in the future

Disadvantages:

Requires modification of specification to permit sub-directories within modality directories
Breaks current implicit convention whereby sub-directory names don't bother duplicating entities corresponding to parents (eg. "sub-01/ses-01/dwi/"), whereas file names do (eg. "sub-01_ses-01_dwi.nii.gz"). This is impossible to resolve as long as the JSON file and corresponding sub-directory must have the same name.

Option 4: Tarballs

Option 4a: Tarball with separate JSON

sub-01/
    dwi/
        sub-01_model-abc_model.tar
        sub-01_model-abc_model.json

Contents of file sub-01_model-abc_model.tar:

sub-01_model-abc_param-x_model.nii.gz
sub-01_model-abc_param-x_model.json
sub-01_model-abc_param-y_model.nii.gz
sub-01_model-abc_param-y_model.json
sub-01_model-abc_param-z_mdp.nii.gz
sub-01_model-abc_param-z_mdp.json

Advantages:

Very compact storage; multi-resolution view of data
Tarballing is also an appealing solution for integration of non-conforming derivatives in a way that is trivially validator-compatible

Disadvantages:

BIDS Apps would need to have capability to work with tarballs (eg. unpacking and storing in scratch prior to feeding to underlying commands)
Model-derived parameters cannot be trivially added alongside the core model parameters.

PS. Apparently there's been a prior discussion regarding tarballing of non-conforming derivatives in BIDS datasets; can anyone provide a link?

Option 4b: Tarball with embedded json

sub-01/
    dwi/
        sub-01_model-abc_model.tar

Contents of file sub-01_model-abc_model.tar:

sub-01_model-abc_param-x_model.nii.gz
sub-01_model-abc_param-x_model.json
sub-01_model-abc_param-y_model.nii.gz
sub-01_model-abc_param-y_model.json
sub-01_model-abc_param-z_mdp.nii.gz
sub-01_model-abc_param-z_mdp.json
sub-01_model-abc_model.json

Advantages (relative to 4a):

Prevents potentially risky separation of model data in tarball and model sidecar data in JSON (similar to the pre-NIfTI Analyze .img / .hdr file pairs)

Disadvantages (relative to 4a):

Primary model sidecar information is not accessible without going into the tarball
Still requires more complex inheritance principle in a way; just it only applies to the contents of the tarball
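
For a sense of what "going into the tarball" involves in practice, here is a small Python sketch using the standard-library tarfile module to read one embedded JSON without unpacking the other members to disk. The helper names and the example.org URL are made up for illustration, and the demo archive is built in memory with a single placeholder member. This is only a sketch for an uncompressed .tar; no claim is made here about performance on large archives.

```python
import io
import json
import tarfile


def make_demo_tar():
    """Build a tiny in-memory tarball holding only a whole-model JSON."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        payload = json.dumps({"ModelURL": "https://example.org/abc"}).encode()
        info = tarfile.TarInfo("sub-01_model-abc_model.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def read_json_from_tar(fileobj, member_name):
    """Load a single JSON member from an uncompressed tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        member = tar.extractfile(member_name)  # KeyError if name is absent
        if member is None:
            raise ValueError(f"{member_name} is not a regular file")
        return json.load(member)


print(read_json_from_tar(make_demo_tar(), "sub-01_model-abc_model.json"))
```

The extra step relative to 4a is small in code terms, but every tool that wants the primary model metadata has to perform it.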

Option 5: Hierarchy restricted to JSON

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_model.json

Contents of file sub-01_model-abc_model.json:

{
    "param-x_model": {
        ...
    },
    "param-y_model": {
        ...
    },
    "param-z_mdp": {
        ...
    },
    "ModelURL": "...",
    ....
}

Advantages:

No complex inheritance necessary
All information relevant to a model is visible within a single file

Disadvantages:

Necessitates explicit cross-referencing between general model JSON and individual parameter files
If model-derived parameter is to be added, metadata relating to that parameter needs to be inserted into the whole-model JSON
Metadata specific to one parameter is not immediately visible via a paired JSON
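
The cross-referencing mentioned in the first disadvantage could look something like the following Python sketch. It assumes that the nested keys are exactly the trailing "param-<label>_<suffix>" part of each file name, as in the example JSON above, and that model-wide fields sit at the top level. The function name metadata_for and the "Units" field are hypothetical.

```python
def metadata_for(filename, model_json):
    """Combine model-wide fields with the nested block for one image.

    The lookup key is taken from the last entity and the suffix of the
    file name, eg. "param-x_model" for "..._param-x_model.nii.gz".
    """
    stem = filename.split(".", 1)[0]
    key = "_".join(stem.split("_")[-2:])
    # Treat every nested "param-*" object as parameter-specific
    nested = {k for k, v in model_json.items()
              if isinstance(v, dict) and k.startswith("param-")}
    merged = {k: v for k, v in model_json.items() if k not in nested}
    merged.update(model_json.get(key, {}))
    return merged


model_json = {
    "param-x_model": {"Units": "a"},
    "param-y_model": {"Units": "b"},
    "ModelURL": "https://example.org/abc",
}
print(metadata_for("sub-01_model-abc_param-x_model.nii.gz", model_json))
```

Adding a model-derived parameter later means writing a new nested block into this one shared file, which is the second disadvantage listed above.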

Decision 2: File names

See also: #46

(Note that for the sake of these examples, decision 1 option 1 "complex inheritance" is utilised; this is however purely for the sake of generation of examples, and the two decisions should be considered independent)

Option 1: "Few suffixes"

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-x_model.json
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-y_model.json
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_param-z_mdp.json
        sub-01_model-abc_model.json

"MDP": "Model-derived parameter" (exact nomenclature can be up for debate)

Advantages:

Validator does not need to have a large number of novel suffixes added
Easy to store yet-unseen models with BIDS conformity, provided the appropriate data representations are in the specification

Disadvantages:

Not as human-readable

Option 2: "Many suffixes"

sub-01/
    dwi/
        sub-01_model-abc_x.nii.gz
        sub-01_model-abc_x.json
        sub-01_model-abc_y.nii.gz
        sub-01_model-abc_y.json
        sub-01_model-abc_z.nii.gz
        sub-01_model-abc_z.json
        sub-01_model-abc_model.json

Advantages:

Information content of individual files easily human-readable from suffix

Disadvantages:

Data from model ABC can only be stored with BIDS compatibility if model ABC is explicitly added to the specification, and the validator is updated accordingly
Appropriate filesystem path for parameter-agnostic metadata (ie. sub-01_model-abc_model.json above) is uncertain (and could depend on decision 1 RE: directory structure)
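
The validator-side difference between the two options can be sketched in Python as follows. Under "few suffixes" a fixed set of suffixes is enough for any model; under "many suffixes" some per-model registry is needed. The registry contents and function names below are invented for this example and do not reflect how the actual validator is implemented.

```python
# Option 1: one small, model-independent set of suffixes
FEW_SUFFIXES = {"model", "mdp"}

# Option 2: each model must be registered with its own suffixes
MANY_SUFFIXES = {"abc": {"x", "y", "z", "model"}}


def split_name(filename):
    """Return (entities, suffix) for a BIDS-style file name."""
    stem = filename.split(".", 1)[0]
    *pairs, suffix = stem.split("_")
    return dict(p.split("-", 1) for p in pairs), suffix


def valid_few(filename):
    _, suffix = split_name(filename)
    return suffix in FEW_SUFFIXES


def valid_many(filename):
    entities, suffix = split_name(filename)
    return suffix in MANY_SUFFIXES.get(entities.get("model"), set())


# A yet-unseen model "def" passes under option 1 but not under option 2
print(valid_few("sub-01_model-def_param-q_model.nii.gz"))
print(valid_many("sub-01_model-def_q.nii.gz"))
```

Under option 2, the second check only starts passing once "def" and its suffixes have been added to the registry, which mirrors the first disadvantage above.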

Ariel Rokem · Answer 1 · Tue Jun 21 2022 19:54:20 GMT+0800 (China Standard Time)

This is great. Do I understand correctly that poll 2 assumes that poll 1 was already resolved and that option 2 was chosen in that poll? Do we worry that might bias poll 2 somehow? (option 2 is probably my least favorite option in poll 1, fwiw).

Robert Smith · Answer 2 · Tue Jun 21 2022 21:09:32 GMT+0800 (China Standard Time)

Added some comments to both decisions 1 and 2, to clarify that in both instances the generation of examples necessitates assuming that one option has been selected from the other decision, but that the two decisions are independent.

Franco Pestilli · Answer 3 · Thu Jun 23 2022 22:11:29 GMT+0800 (China Standard Time)

Option 5: Zipped/Tarball with complete folder

sub-01/
    dwi/
        sub-01_model-abc_model.tar

Contents of file sub-01_model-abc_model.tar:

sub-01_model-abc_param-x_model.nii.gz
sub-01_model-abc_param-x_model.json
sub-01_model-abc_param-y_model.nii.gz
sub-01_model-abc_param-y_model.json
sub-01_model-abc_param-z_mdp.nii.gz
sub-01_model-abc_param-z_mdp.json
sub-01_model-abc_model.json

Advantages:

similar to nii.gz
Very very compact storage; multi-resolution view of data
Tarballing is also an appealing solution for integration of non-conforming derivatives in a way that is trivially validator-compatible

Disadvantages:

Apps would need to have capability to work with tarballs

To clarify my thoughts.
Option 4 reminds me of the ANALYZE format which provided a .hdr and a .img set of files. After ANALYZE it became clear that unifying the header (.hdr) and the image (.img) into a single file .nii was convenient.

I wonder whether we will feel the same here. In other words, what is the cost of reading the .json inside the tarball?

Franco Pestilli · Answer 4 · Thu Jun 23 2022 22:14:29 GMT+0800 (China Standard Time)

I think Option 4 (or possibly Option 5, with some pros and cons) is the most convenient; it might provide some speed-ups by allowing search of the info in the top folder name only.

Ariel Rokem · Answer 5 · Thu Jun 23 2022 22:18:53 GMT+0800 (China Standard Time)

Just to see that I understand: the difference between option 4 and option 5 is that the tarball is not nested under DWI? How would we know that it's related to DWI, and discriminate it from models for FMRI or other modalities? Through the model name?

Robert Smith · Answer 6 · Thu Jun 23 2022 22:54:36 GMT+0800 (China Standard Time)

Difference between 4 and 5 is whether the JSON corresponding to model ABC as a whole is or is not embedded within the tarball. Both reside within the dwi/ modality directory. I'll probably reformat it as options 4a and 4b.

Robert Smith · Answer 7 · Sun Jul 03 2022 12:09:05 GMT+0800 (China Standard Time)

Option 5: Zipped/Tarball with complete folder

what is the cost of reading the .json inside the tarball?

Just clicked for me (and added to the dot points in the first post): This still really requires the more complex inheritance principle. If you read just one image (regardless of whether it's an intrinsic model output parameter or a model-derived parameter), both the contents of the paired sidecar JSON and the whole-model JSON are applicable.

Robert Smith · Answer 8 · Sun Jul 03 2022 12:11:07 GMT+0800 (China Standard Time)

@bids-standard/maintainers: Would very much appreciate any feedback on this thread. I can't keep up with everything happening in BIDS space, so it's possible that similar issues have been encountered elsewhere; also, any decisions made here may set a precedent for many other derivatives BEPs. After feedback from maintainers, if there's no clear consensus I'd like to open up discussion to the wider community.

Franco Pestilli · Answer 9 · Tue Aug 02 2022 11:02:47 GMT+0800 (China Standard Time)

@bids-maintenance We would like to make progress on this issue. We have made a proposal, and would like to kindly request attention so that we can move forward with the DWI-derivatives standard.

@bids-standard/maintainers: Would very much appreciate any feedback on this thread. I can't keep up with everything happening in BIDS space, so it's possible that similar issues have been encountered elsewhere; also, any decisions made here may set a precedent for many other derivatives BEPs. After feedback from maintainers, if there's no clear consensus I'd like to open up a discussion to the wider community.

@PeerHerholz @soichih @bids-standard/derivatives-mri-dwi @effigies

Stefan Appelhoff · Answer 10 · Tue Aug 02 2022 16:22:28 GMT+0800 (China Standard Time)

I find Decision 1 Option 4a appealing. Unfortunately I am not aware of the discussion that may have happened on tar balls. Maybe Chris knows, but he'll not be available for the next weeks AFAIK.

For Decision 1 Option 2 you say:

Same as option 1 in that number of files within any given directory could grow very large.

how large are we talking in the worst case?

Based on the discussion of revamping the Inheritance Principle, I am not very fond of Decision 1 Option 1.

Re: Decision 2 --> I think one of the principles in BIDS so far was to use as few suffixes as possible, as many as needed ... so that makes Option 1 appear more favorable for me.

Soichi Hayashi · Answer 11 · Tue Aug 02 2022 20:10:58 GMT+0800 (China Standard Time)

I am not too familiar with BIDS structure, but I'd like to vote for option 2 (or maybe 4a..) on decision 1 for its simplicity.

I feel that BIDS is becoming too complex already, with too many rules that I am not aware of.. I did write a few simple BIDS directory parsers for our BIDS data importer library that implement (probably incorrectly) a subset of all BIDS structure principles. I assume that I am not the only one who had to write such "broken" parsers, as not everyone has access to libraries such as pyBIDS or can use them for their use cases. I also assume that the point of BIDS structure is to make the data structure simple/visible so that it can be used without a dedicated library such as pyBIDS if one wanted to; otherwise why not just make the whole structure closed within a ".bids.tgz" type file format and provide canonical parsers for every programming language?

No comment on Decision 2.

Peer Herholz · Answer 12 · Tue Aug 02 2022 21:53:21 GMT+0800 (China Standard Time)

Hi folks,

here are my 2 cents.

Re decision 1:

I would either vote for option 1 or 4a, depending on the tarball handling. Did the discussion/link @Lestropie refer to reappear?

Re decision 2:

+1 on @sappelhoff's comment!

Robert Smith · Answer 13 · Wed Aug 03 2022 07:31:12 GMT+0800 (China Standard Time)

how large are we talking in the worst case?

Consider two experiments:

Fitting a large number of different diffusion models in order to do data-driven discovery. Might do eg. DTI / DKI / B&S / NODDI / MSMT CSD / SMT. Imprecisely, this might be of the order of 4 + 4 + 10 + 10 + 6 + 8 ~ 40?
Fitting one model but with a range of different input parameters. Just the product of parameters per model fit and the number of different fits.

The former is perhaps less "exotic".

I think one of the principles in BIDS so far was to use as few suffixes as possible, as many as needed

That's useful. I've been leaning in favour of that myself, hence #46. Would be even better if anyone knows of a link to an explicit statement of such.

Will give a bit more time for maintainers / developers to comment / guide / suggest alternatives, but still like the idea of a community poll.

Robert Smith · Answer 14 · Mon Aug 08 2022 14:00:50 GMT+0800 (China Standard Time)

Contra-indication of interest in 4a:

Imagine one fits a model and tarballs the core output model parameters. Now one wants to use those parameters to produce a model-derived parameter (eg. an FA map from a tensor model fit). Sidecar information relating to the model fit is still applicable to the model-derived parameter. Would one therefore be obliged to unpack the tarball, add the new file(s), and repackage?

(Added to the list of disadvantages in the original post)

Stefan Appelhoff · Answer 15 · Fri Aug 26 2022 22:15:51 GMT+0800 (China Standard Time)

Would one therefore be obliged to unpack the tarball, add the new file(s), and repackage?

as the dataset curator if you do this before finally sharing your dataset ... or when sharing a new version of the data: yes, you'd have to do that and it'd be a bit laborious.

as a user of the dataset, you wouldn't want to edit the dataset anyhow, would you? Wouldn't you save your new outputs elsewhere? For example in a new (derived?) dataset, which would bring us back to the situation above, which is "a bit laborious".

Franco Pestilli · Answer 16 · Mon Aug 29 2022 08:32:43 GMT+0800 (China Standard Time)

Wouldn't you save your new outputs elsewhere?

Good point. I would say yes. That would be preferable. Saved in a new derived dataset.

Robert Smith · Answer 17 · Mon Aug 29 2022 08:42:31 GMT+0800 (China Standard Time)

Hmmm, I'm maybe here thinking of a use case outside of that intended. Sometimes I will take a dataset that's been processed using a BIDS App, and do a little bit of subsequent tweaking after the fact, eg. calculating model-derived parameters that weren't calculated by the App, and I'll try to remain vaguely BIDS-compliant when doing so. But that means that the contents no longer reflect the output of that particular App. So tarballing may make such manipulation less convenient, but maybe it's actually a good thing, as such tweaking within the purported output directory of a BIDS App should be discouraged.

The outstanding question would be whether it is acceptable to have eg. the model parameters in one derivatives directory, and model-derived parameters in a different derivatives directory. I think that as long as the validator doesn't impose any requirement on model-derived parameters coexisting with the model parameters it should be fine, but there's some chance I'm overlooking something.

Robert Smith · Answer 18 · Tue Sep 13 2022 00:02:03 GMT+0800 (China Standard Time)

Originating from @oesteban

Option 5: Hierarchy restricted to JSON

sub-01/
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_model.json

Contents of file sub-01_model-abc_model.json:

{
    "param-x_model": {
        ...
    },
    "param-y_model": {
        ...
    },
    "param-z_mdp": {
        ...
    },
    "ModelURL": "...",
    ....
}

Advantages:

No complex inheritance necessary
All information relevant to a model is visible within a single file

Disadvantages:

Necessitates explicit cross-referencing between general model JSON and individual parameter files
If model-derived parameter is to be added, metadata relating to that parameter needs to be inserted into the whole-model JSON
Metadata specific to one parameter is not immediately visible via a paired JSON

Robert Smith · Answer 19 · Tue Sep 13 2022 00:45:27 GMT+0800 (China Standard Time)

Additional suggestion from @oesteban

This is described here as an augmentation of option 3; it does not in and of itself solve the complex inheritance problem.

sub-01/
    dwi/
        model-abc1/
            sub-01_model-abc1_param-x_model.nii.gz
            sub-01_model-abc1_param-x_model.json
            sub-01_model-abc1_param-y_model.nii.gz
            sub-01_model-abc1_param-y_model.json
            sub-01_model-abc1_param-z_mdp.nii.gz
            sub-01_model-abc1_param-z_mdp.json
        model-abc1.json
        model-abc2/
            sub-01_model-abc2_param-x_model.nii.gz
            sub-01_model-abc2_param-x_model.json
            sub-01_model-abc2_param-y_model.nii.gz
            sub-01_model-abc2_param-y_model.json
            sub-01_model-abc2_param-z_mdp.nii.gz
            sub-01_model-abc2_param-z_mdp.json
        model-abc2.json
        models.tsv

There must be content within file models.tsv that provides adequate identifying information for each model fit, both so that it is described in a human-readable way, and so that the validator can ensure correspondence between the unique identifiers in this file and the directory names.

Edit: models.tsv could be at the dataset root directory, and have a column encoding modality of each indexed model.
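
A minimal Python sketch of the correspondence check described above, assuming (purely for illustration) that models.tsv has a column named "model_id" whose values equal the directory names. Neither that column name nor the function check_models_index comes from any specification.

```python
import csv
import io


def check_models_index(tsv_text, directory_names, id_column="model_id"):
    """Compare identifiers in models.tsv against model directory names.

    Returns two sorted lists: identifiers indexed in the TSV with no
    matching directory, and directories missing from the TSV.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    indexed = {row[id_column] for row in reader}
    found = set(directory_names)
    return sorted(indexed - found), sorted(found - indexed)


tsv_text = (
    "model_id\tdescription\n"
    "model-abc1\tfirst fit of ABC\n"
    "model-abc2\tsecond fit of ABC\n"
)
print(check_models_index(tsv_text, ["model-abc1", "model-abc2"]))
```

Two empty lists would indicate that the index and the directories agree.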

Ross Blair · Answer 20 · Thu Sep 15 2022 01:56:13 GMT+0800 (China Standard Time)

For decision 1 is the following acceptable?:

sub-01/
    sub-01_model-abc_model.json
    dwi/
        sub-01_model-abc_param-x_model.nii.gz
        sub-01_model-abc_param-x_model.json
        sub-01_model-abc_param-y_model.nii.gz
        sub-01_model-abc_param-y_model.json
        sub-01_model-abc_param-z_mdp.nii.gz
        sub-01_model-abc_param-z_mdp.json

This does not address the issue of root level model files needing to use sidecar inheritance. It also does not deal with the large number of files in a directory, but should be valid in the current specification.

ah the fool I've been reading only this issue.

Robert Smith · Answer 21 · Thu Sep 15 2022 02:59:04 GMT+0800 (China Standard Time)

For decision 1 is the following acceptable?

That precisely replicates the proposed "solution" as it appears in the current specification. However as I argue in bids-standard/bids-specification#1003, it is to me unintuitive, as it involves placing datatype-specific data in a directory that has the specific purpose of disambiguating datatypes.

It could make more sense if alternatively "model-abc_model.json" could appear even higher in the filesystem tree such that it becomes applicable to that model as estimated for multiple subjects; that potentially comes with its own problems (eg. race conditions if parallel participant-level analyses attempt to write to it) but then again is a more advantageous utilisation of the inheritance principle if done properly.

Robert Smith · Answer 22 · Wed May 15 2024 11:18:22 GMT+0800 (China Standard Time)

Closing following merge of #92.