Brainstorming: path to DataLad v2?

Question

mih opened this issue 4 months ago

This is not simply about a datalad v2. This is about a strategy to reorganize the DataLad ecosystem, in which datalad and its extensions are only one gear in the box.

The primary aim is to create more homogeneous modules with streamlined dependencies: modules that decouple code bases that evolve at different paces (more stable foundation, faster iteration on prototypes and focused applications), that have disjoint dependencies (not just for installation, but also in how much code needs to be imported to use a particular piece of DataLad), and that have different test demands (network operations with specific services vs. local code).

One (possibly more) scenario(s) will be posted below. They should be discussed regarding their individual merits and problems. This issue is about collecting ideas, not about making decisions.

Please do not use this issue for discussions -- github issues don't work well for that. Rather, post any alternative/derived ideas (longform) as a dedicated response. If we keep individual ideas self-contained, and also updated over time, it will be easier to refer to them and also refine them.

To communicate appreciation or opposition for an individual concept, please use the "reactions" interface.

Michael Hanke · Answer 1 · Thu Feb 08 2024 21:11:12 GMT+0800 (China Standard Time)

Factor out a foundational package (FP)

The purpose of such a package would be to serve as a foundation to build DataLad-powered libraries and apps -- implemented in Python. This package is:

a Python library only (no CLI)
only code with "broad" applicability (no support for particular services or formats; but rather structures to hook such extensions on)
no user interface assumptions
wide cross-platform compatibility
modular code organization following the current datalad-next model https://github.com/datalad/datalad-next/blob/main/CONTRIBUTING.md#code-organization

The development procedures should be suitable for creating a package that inspires enough confidence to build 3rd-party code on:

mandatory code-reviews by two or more people
release when "done"
benchmarks
mandatory "full" (something like >95%) test coverage
detailed documentation targeting developers
PRs need to be comprehensive (code, test, documentation), all at once

"Phase-in" process

The FP would be introduced gradually, by shifting and elevating code from other projects. From-scratch implementations would pretty much never be introduced to the FP directly.

This makes sure that code has seen some usage, and that some "application" code already exists downstream to illustrate concrete usage patterns and to immediately justify a code addition that serves dependent packages.

After being established, code can flow to the FP from any source. Once an FP release has been made, the source project sheds that code and adds a dependency on the FP.
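
One way a source project could "shed" code while keeping its traditional interface is a thin re-export that warns about the new location. The sketch below is an assumption about how that might look, not something prescribed by this proposal: `moved_to_fp`, `fp.itertools`, and `iter_chunks` are hypothetical names, and the FP implementation is replaced by a local stand-in so the example runs on its own.

```python
import warnings
from functools import wraps


def moved_to_fp(new_location: str):
    """Decorator factory: mark a re-exported callable as moved to the FP."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} moved to {new_location}; import it from there",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Stand-in for code that now lives in the FP (hypothetical).
def _fp_iter_chunks(data: bytes, size: int):
    return [data[i:i + size] for i in range(0, len(data), size)]


# What the source project would keep after shedding its own implementation:
# the old name still works, but points users to the new home.
iter_chunks = moved_to_fp("fp.itertools")(_fp_iter_chunks)
```

Whether such a shim is kept temporarily or forever would be a decision of the source project, not of the FP.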

Envisioned development trajectory for "datalad/datalad"

With respect to a v2 concept, code would flow out of the present main datalad package, and it would gain a dependency on the FP. It would continue to be the main entrypoint.

If and when we approach a modernization of the CLI, we would need to reevaluate its role. It could then become an application/meta package:

graph TD;
    FP-->datalad;
    FP-->datalad-cli;
    datalad-cli-->datalad;

or continue as a provider of assorted functionality that is exposed via different APIs (and hence have its own CLI implementation stripped).

graph TD;
    FP-->datalad;
    FP-->datalad-cli;
    datalad-->datalad-cli;
    datalad-->datalad-gooey;
    FP-->datalad-gooey;
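
The two graphs above can be compared by writing them down as dependency mappings. The sketch below assumes that an arrow `A-->B` means "B depends on A", which is an interpretation of the diagrams rather than something they state. `install_order` is a hypothetical helper built on the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def install_order(depends_on):
    """Return the packages in an order that satisfies every dependency.

    `depends_on` maps each package to the set of packages it requires.
    A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Scenario 1: datalad as an application/meta package on top of datalad-cli.
meta_package = {
    "FP": set(),
    "datalad-cli": {"FP"},
    "datalad": {"FP", "datalad-cli"},
}

# Scenario 2: datalad as a provider of functionality used by several frontends.
provider = {
    "FP": set(),
    "datalad": {"FP"},
    "datalad-cli": {"FP", "datalad"},
    "datalad-gooey": {"FP", "datalad"},
}

print(install_order(meta_package))
print(install_order(provider))
```

In both scenarios the FP comes first and depends on nothing else; the difference is whether datalad sits above or below the CLI.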

Pros

starting an FP from scratch has the benefit of laying out clear rules from the start that contributions have to follow, and all code matches them
people have expressed discomfort re the complexity of the datalad package, a bottleneck that can be avoided with a clean setup
zero impact forced onto present users of datalad. The main package can make independent decisions on how to deal with changes, whether or not to grease transitions, or to provide traditional interfaces (forever)

Cons

the two-reviewer rule is important for creating a useful (consensus) library. However, it will be hard to make it a reality. @yarikoptic and @mih can do that, but when they do development themselves, at least one additional qualified reviewer must be found.
introducing additions to the FP does not simultaneously improve the main package (just like with datalad-next). Demonstrations of impact (if applicable) would need to come as a companion PR to the main package (that diverts the dependency to a PR branch). This is cumbersome.

Discussion

...

Updates

the originally employed name datalad-core has been replaced by "foundational package" (FP) to reduce the ambiguity wrt the many purposes the label "core" has been used for in the past

Michael Hanke · Answer 2 · Tue May 14 2024 15:12:17 GMT+0800 (China Standard Time)

An effort towards a foundational library has started at https://github.com/datalad/datasalad