The Cancer Data Aggregator (CDA) is a query engine that aggregates data across the National Cancer Institute's (NCI) Cancer Research Data Commons (CRDC). CDA will enable cancer researchers to discover, query, retrieve, and aggregate data through a single, interoperable, read-only API that queries across disparate data types in CRDC repositories.
CDA leverages the work and data model that is concurrently being developed by the Center for Cancer Data Harmonization (CCDH). CCDH will facilitate harmonization across CRDC nodes and data coordination centers by creating a harmonized data model (CRDC-H) and will provide vocabulary services and other tools.
To enable the CDA and the CRDC-H to advance quickly, CDA maintains a data model that meets phased CDA requirements while aligning as closely as possible to released CRDC-H iterations. The CDA Data Model is a subset of the CRDC-H abstract data model with extensions and/or simplifications required to implement CDA requirements.
The CCDH data model promises to be a specimen-centric model, whereas current CRDC nodes tend to use a case-centric approach. The diagrams below depict the shift from the respective GDC and PDC entity models (provided by CCDH - Figure 2) towards a specimen-centric model (Figure 3).
As the CCDH model develops, CDA leverages its harmonization work and extends the model only where necessary to support CDA functionality, for example by adding key search fields that are not yet included in the CCDH model. CDA periodically synchronizes with CCDH to maintain consistency between the CDA MVP data model and the developing CCDH model. The CDA MVP data model is expressed as JSON Schema.
In Figure 4, the entities rimmed in blue are not yet part of the CCDH model but are extensions to allow CDA to aggregate and deliver data as the CCDH model evolves. It may be helpful to think about your queries in terms of these entities (e.g. Specimen, Patient, Research Subject, Project, Diagnosis) and their attributes (e.g. derived_from_subject, ethnicity, reference_assembly).
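As a rough illustration of thinking in entities and attributes, the sketch below filters specimens by an attribute of the patient they were derived from. It is plain Python, not the CDA API. The entity and attribute names (Specimen, Patient, derived_from_subject, ethnicity) come from the text above, but which attribute belongs to which entity, the `id` field, the function name, and every value are assumptions made up for this example.

```python
# Hypothetical in-memory records. All ids and values are invented, and the
# placement of each attribute on an entity is an assumption, not the CDA model.
patients = [
    {"id": "patient-1", "ethnicity": "not hispanic or latino"},
    {"id": "patient-2", "ethnicity": "hispanic or latino"},
]
specimens = [
    {"id": "specimen-1", "derived_from_subject": "patient-1"},
    {"id": "specimen-2", "derived_from_subject": "patient-2"},
    {"id": "specimen-3", "derived_from_subject": "patient-1"},
]


def specimens_for_ethnicity(ethnicity):
    """Return ids of specimens whose source patient has the given ethnicity."""
    # Step 1: select the Patient entities by attribute value.
    matching_patients = {p["id"] for p in patients if p["ethnicity"] == ethnicity}
    # Step 2: follow the Specimen -> Patient link to collect specimens.
    return [
        s["id"]
        for s in specimens
        if s["derived_from_subject"] in matching_patients
    ]


print(specimens_for_ethnicity("not hispanic or latino"))
```

The point of the sketch is only the shape of the question: pick an entity, constrain one of its attributes, then follow a link to a related entity.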
Note that the JSON Schema describes the fields of each entity and how the data connect, but does not reflect how the data might be stored in a repository. The repository schema will be optimized based on the performance characteristics of the selected platform.
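To make the distinction concrete, here is a minimal sketch of what a JSON Schema fragment for one entity could look like, written as a Python dictionary, together with a small hand-rolled check. The fragment is hypothetical and is not taken from the CDA data model: the choice of fields, the `required` list, and the helper name `check_record` are all assumptions. The check covers only the `required` keyword and simple `type` keywords, so it is not a full JSON Schema validator.

```python
# Hypothetical schema fragment for a Specimen entity (illustration only;
# not copied from the CDA data model).
SPECIMEN_SCHEMA = {
    "title": "Specimen",
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        # Assumed here to hold the identifier of the related subject.
        "derived_from_subject": {"type": "string"},
    },
    "required": ["id"],
}

# Maps a few JSON Schema type names to the Python types used to check them.
JSON_TYPES = {"string": str, "object": dict, "array": list, "boolean": bool}


def check_record(record, schema):
    """Return a list of problems found in record.

    Only the 'required' keyword and simple 'type' keywords are checked.
    """
    problems = []
    for name in schema.get("required", []):
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if name in record and expected and not isinstance(record[name], expected):
            problems.append(f"wrong type for field: {name}")
    return problems


print(check_record({"id": "specimen-1", "derived_from_subject": "patient-1"},
                   SPECIMEN_SCHEMA))
```

Nothing in the fragment says whether specimens live in a table, a document store, or a nested record. It only constrains the shape of the data that is exchanged.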
The Cancer Data Aggregator Application Programming Interface (API) and its design are described in the CancerDataAggregator/api GitHub repository.
Work-in-progress
No released versions are available at this time.
Please use this repository's Issue Tracker to share comments or concerns related to the data model.