INCF / neuroscience-data-structure

Space for discussion of a standardized structure (directory layout + metadata) for experimental data in systems neuroscience, similar to the idea of BIDS in neuroimaging

Optimum directory structure for projects with multiple data types (e.g. ephys, behaviour, cameras)

JoeZiminski opened this issue

Hi Everyone,

I am working with the neuroinformatics unit at the Sainsbury Wellcome Centre (London) to build standardised ephys analysis pipelines. We are very interested in standardized project structuring and really appreciate the work you are doing on this BEP.

We are currently thinking about the best way to handle multiple data types within a project folder organisation, very similar to your discussion in #4. For example, for a project including a single ephys session we might have:

project
  ephys
    sub-001
      ses-001
        sub-001_ses-001_task-x_ephys.nwb
  

Often, researchers will also have many behavioural sessions, including training sessions (with no ephys) and test sessions (with simultaneous ephys).

As such, we may create a second data-type folder and fill it with training sessions:

.
└── project/
    ├── ephys/
    │   └── sub-001/
    │       └── ses-001/
    │           └── sub-001_ses-001_task-x_ephys.nwb
    └── behav/
        └── sub-001/
            ├── ses-001_train/
            │   ├── camera/
            │   │   └── video.mp4
            │   └── responses/
            │       └── responses.csv
            ├── ses-002_train/
            │   └── ...
            └── ses-003_train/
                └── ...

However, it is not immediately clear where best to put the behaviour for the ephys session. It is cleanest to place it in the behav folder (e.g. ses-004_train), and then include metadata linking it to the appropriate ephys session (and vice versa). Potential problems with this are that a) it creates additional overhead for researchers to input metadata, often when they are busy setting up the experimental session, and b) it requires additional overhead for researchers to link their data together during analysis (e.g. behav session 4 belongs with ephys session 1).
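To make the linking overhead concrete, here is a minimal sketch of what such metadata could look like, assuming a small JSON sidecar in each behavioural session folder. The file name `session_links.json` and the key `linked_ephys_session` are hypothetical; they are not part of BIDS or of any existing tool.

```python
import json
from pathlib import Path

# Hypothetical sidecar name and key, chosen only for illustration.
LINK_FILE = "session_links.json"
LINK_KEY = "linked_ephys_session"


def write_link(behav_session_dir, ephys_session):
    """Record which ephys session a behavioural session belongs to."""
    path = Path(behav_session_dir) / LINK_FILE
    path.write_text(json.dumps({LINK_KEY: ephys_session}, indent=2))
    return path


def linked_ephys_dir(project_root, subject, behav_session):
    """Resolve the ephys folder linked from a behavioural session, or None."""
    root = Path(project_root)
    sidecar = root / "behav" / subject / behav_session / LINK_FILE
    if not sidecar.exists():
        return None  # e.g. a training session recorded without ephys
    ephys_session = json.loads(sidecar.read_text())[LINK_KEY]
    return root / "ephys" / subject / ephys_session
```

With this in place, `linked_ephys_dir("project", "sub-001", "ses-004_train")` would point at `project/ephys/sub-001/ses-001`, but only if someone remembered to call `write_link` when the session was recorded, which is exactly the overhead described in (a).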

Alternatively, the behaviour folder could be placed in the ephys ses-001 folder (behav/...). This has the benefit of linking the data by location, is quite intuitive, and avoids linking disparate session names (i.e. behaviour for ephys session 1 is always in the ephys session 1 folder). The downside is that it becomes confusing what a behav session means (i.e. it might be necessary to write an empty behav/sub-001/ses-003_test/ folder in behav that links to the ephys folder, to avoid duplicate session naming).

Finally, it is nice to store all data types under the subject / session directory, e.g.:

.
└── project/
    └── sub-001/
        ├── ses-001/
        │   ├── ephys
        │   └── behav
        ├── ses-002/
        │   └── behav
        ├── ses-003/
        │   └── behav
        └── ses-004/
            ├── ephys
            └── behav

This is probably the nicest and most intuitive overall structure and matches what BIDS describes for neuroimaging. However, it mixes the data types, which is a bit of an issue for researchers who do not have much coding experience (more common outside of neuroimaging); for example, it is not possible to drag and drop all ephys sessions at once. It can also be difficult to find the session you are looking for when many behavioural training sessions are interspersed with a few ephys test sessions, although this could be ameliorated by session naming (e.g. ses-001_train, ses-002_test, etc.). Even so, mixing the data types can become confusing if you have many sessions that include various data types.
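For those who are comfortable with a little scripting, the "find all ephys sessions" step is short under this layout. The following is a minimal sketch using only the Python standard library, assuming the `sub-*/ses-*/<datatype>` folder pattern from the tree above; the function names are hypothetical.

```python
from pathlib import Path


def find_datatype_dirs(project_root, datatype):
    """Return every <datatype> folder under project/sub-*/ses-*/, sorted by path."""
    root = Path(project_root)
    # The "sub-" and "ses-" prefixes follow the example tree above.
    pattern = f"sub-*/ses-*/{datatype}"
    return sorted(p for p in root.glob(pattern) if p.is_dir())


def sessions_with(project_root, datatype):
    """Return (subject, session) folder-name pairs that contain the data type."""
    return [
        (p.parent.parent.name, p.parent.name)
        for p in find_datatype_dirs(project_root, datatype)
    ]
```

For the example tree, `sessions_with("project", "ephys")` would list ses-001 and ses-004 for sub-001. This does not help researchers who rely on drag and drop, but a small shared helper like this could be part of the local tooling.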

I was wondering if these issues have come up for you and what you think the best approach is in this case.

Hi Joe,

Your last structure indeed aligns best with the existing BIDS standard for non-ephys measurements, and also aligns with the BEP032 extension proposal for ephys.

Multiple data organizations are in principle technically equivalent; if you take a large enough group of people it is likely that they will have different preferences for organizing their data and argue for one over the other as in https://xkcd.com/927/. It is important to convey the (short and long term) value of adopting a standard that is shared with others internally and externally (and I think BIDS is a good one), to support people with local documentation, and to help them with tooling. One important aspect with tooling is that it is not only about software and libraries (like pybids), but also how you organize your data, whether "drag and drop all ephys sessions" is a valid operation or would result in wasteful data duplication, how the local network drives work, access permissions, which parts are read-only and which read-write, etc.

I hope this short reflection helps.
Robert

Hi Robert, thanks a lot for the response; it has been very useful and helped shape our approach. We will proceed with the latter structure that aligns with BIDS. Feedback from some of our researchers also indicates that they are already using a similar layout or would have no trouble switching to that directory organisation.

Thanks for the insights, these are good to keep in mind, as is https://xkcd.com/927/. 😄