glotzerlab / signac

Manage large and heterogeneous data spaces on the file system.

Home Page: https://signac.io/

Generalize signac's data model

vyasr opened this issue

Feature description

Storing data with signac currently implies a strict data model: all data associated with a given Project lives in a single subdirectory, the workspace, which in turn contains one directory per Job. Because Jobs within a Project may be heterogeneous, this directory layout can be leveraged to support a wide range of use cases. However, doing so is not always easy and can require the user to make additional decisions about how data is organized. The difficulty is particularly evident when working with hierarchical data that requires some form of nested project organization, or with data whose linkages are not hierarchical but still relate two sets of data to each other. It would be helpful for signac to support these concepts natively, so that users do not need to define their own customized approaches to handling these linkages.

The two clearest use cases for this feature based on previous requests are:

  1. Nested projects: We have had multiple requests in the past where there has been a desire for each Job in a Project to itself also be a Project in order to support storage of hierarchical data. A nested layout would be one of the most obvious applications of this feature.
  2. signac-flow aggregation: Aggregation in signac-flow refers to running operations on subgroups of the data space, i.e. on sets of Jobs of size N>1. This approach is more general than a hierarchical data layout: nothing prevents a Job from belonging to multiple aggregates, so a simple hierarchy is insufficient. However, aggregation could build on this feature by using a data model that encodes linkages to another project. #96 would also be helpful in implementing this.
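
To illustrate why aggregation does not reduce to a hierarchy, here is a minimal sketch in plain Python. It does not use the signac or signac-flow APIs. The job ids, aggregate names, and helper functions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical aggregates: named groups of (hypothetical) job ids.
# Note that job "b2" belongs to two aggregates at once.
aggregates = {
    "low_temperature": ["a1", "b2"],
    "high_pressure": ["b2", "c3", "d4"],
}


def memberships(aggregates):
    """Invert the aggregate -> jobs mapping into job -> set of aggregates."""
    result = defaultdict(set)
    for name, members in aggregates.items():
        for job_id in members:
            result[job_id].add(name)
    return dict(result)


def is_hierarchy(aggregates):
    """Return True only if every job belongs to at most one aggregate,
    i.e. the grouping could be expressed as a nested directory tree."""
    return all(len(names) <= 1 for names in memberships(aggregates).values())


job_to_aggregates = memberships(aggregates)
print(sorted(job_to_aggregates["b2"]))
print(is_hierarchy(aggregates))
```

Because one job can appear under several aggregates, a data model that supports aggregation would need to store linkages (a many-to-many relation) rather than rely on directory nesting alone.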

In addition, implementing this feature would allow us to explore new avenues for optimization. The current data model assumes that the Project's index of Jobs is stored only in a distributed fashion across Jobs, so compiling it requires traversing the workspace. The flexibility to define a new data model would also allow us to define a layout where the index is centralized, e.g. in a sqlite database stored at the project root. While this model may be less well-suited to distributed HPC applications, it could be far more efficient for post-processing, when the same parallelization concerns do not apply. Supporting translation from one data layout to another could open up significant avenues for optimization in this regard as well. Alternatively, we could leverage the underlying flexibility of synced_collections to simply switch from encoding data in JSON files to using a more efficient backend like Redis.
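
As a rough sketch of what a centralized index could look like, the following standard-library Python walks a workspace once and stores every Job's state point in a SQLite file at the project root. This is not signac's implementation. The database file name `index.sqlite`, the table schema, the job ids, and the helpers `build_index` and `find_jobs` are hypothetical, and the per-job file name `signac_statepoint.json` is assumed here.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

STATEPOINT_FILE = "signac_statepoint.json"  # assumed per-job state point file


def build_index(project_root):
    """Traverse workspace/ once and store each job's state point in SQLite."""
    root = Path(project_root)
    conn = sqlite3.connect(str(root / "index.sqlite"))  # hypothetical file name
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, statepoint TEXT NOT NULL)"
    )
    rows = []
    for job_dir in sorted((root / "workspace").iterdir()):
        sp_file = job_dir / STATEPOINT_FILE
        if sp_file.is_file():
            rows.append((job_dir.name, sp_file.read_text()))
    with conn:  # commit all rows in a single transaction
        conn.executemany("INSERT OR REPLACE INTO jobs VALUES (?, ?)", rows)
    return conn


def find_jobs(conn, key, value):
    """Return ids of jobs whose state point has statepoint[key] == value."""
    matches = []
    for job_id, text in conn.execute("SELECT id, statepoint FROM jobs ORDER BY id"):
        if json.loads(text).get(key) == value:
            matches.append(job_id)
    return matches


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in workspace with hypothetical job ids.
    for job_id, sp in [("job_a", {"T": 1.0}), ("job_b", {"T": 2.0}), ("job_c", {"T": 1.0})]:
        job_dir = Path(tmp) / "workspace" / job_id
        job_dir.mkdir(parents=True)
        (job_dir / STATEPOINT_FILE).write_text(json.dumps(sp))

    conn = build_index(tmp)
    matching = find_jobs(conn, "T", 1.0)  # answered from the index, no directory walk
    conn.close()

print(matching)
```

The trade-off described above shows up directly: queries after `build_index` touch a single file, but many processes writing to that file concurrently would contend for it, which is why such a layout may suit post-processing better than distributed execution.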

Proposed solution

signac should consider adding information to a directory's configuration file that indicates the data layout used by that directory and its subdirectories. #922 must be implemented first so that signac can work with the concept of a directory without the assumed context of a Job or Project. The Directory should encode information about how data is stored in the files and subdirectories it contains, providing a complete description of all data within the system. More information may be found in the original signac 2 prototype's design document.
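
One way to picture this proposal is the sketch below, in which each directory may carry a small config file naming its layout, and iteration follows whatever layout each directory declares. Everything here is hypothetical: the config file name `layout.json`, the `"layout"` key, the `"workspace"` layout name, and the helper functions are illustrative only and are not part of signac.

```python
import json
import tempfile
from pathlib import Path

CONFIG_FILE = "layout.json"  # hypothetical per-directory config file


def write_config(directory, layout):
    """Mark a directory as using the named (hypothetical) layout."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / CONFIG_FILE).write_text(json.dumps({"layout": layout}))


def read_layout(directory):
    """Return the declared layout name, or None if the directory has no config."""
    config = directory / CONFIG_FILE
    if not config.is_file():
        return None
    return json.loads(config.read_text())["layout"]


def iter_jobs(directory):
    """Yield leaf job directories, following each directory's declared layout."""
    layout = read_layout(directory)
    if layout is None:
        # No config: treat the directory as a plain job (a leaf).
        yield directory
    elif layout == "workspace":
        # Children of workspace/ are jobs; each may declare its own layout.
        for child in sorted((directory / "workspace").iterdir()):
            if child.is_dir():
                yield from iter_jobs(child)
    else:
        raise ValueError(f"unknown layout: {layout!r}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_config(root, "workspace")
    # Job "a" is itself a nested project with two jobs; job "b" is a plain job.
    write_config(root / "workspace" / "a", "workspace")
    (root / "workspace" / "a" / "workspace" / "x").mkdir(parents=True)
    (root / "workspace" / "a" / "workspace" / "y").mkdir(parents=True)
    (root / "workspace" / "b").mkdir(parents=True)

    found = [p.relative_to(root).as_posix() for p in iter_jobs(root)]

print(found)
```

In this sketch the nested-project use case falls out of the recursion: a Job directory that declares a layout is traversed as a project in its own right. Other layouts would be added as further branches in `iter_jobs`.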

Additional context

This change would require updating many of the more advanced features of signac (sync, schema, diff, clone, etc.) to support arbitrary data models, because they currently hard-code assumptions about the existing data model.

This change would also have significant impacts on downstream projects like signac-flow, which would have to become "data model aware" in order to properly iterate over jobs.