Technical input to VICC paper (draft notes)
bwalsh opened this issue · comments
@jgoecks @mayfielg @ahwagner : See my notes below. I simply wanted to capture them here and if it's appropriate, move them over to google docs.
G2P - Knowledgebase integration
Abstract
Background
The g2p-aggregator is a bioinformatics tool designed to integrate evidence from disparate data sources and to support interpretation, prioritization and report generation. It is implemented by Oregon Health & Science University (OHSU), and integrates evidence from:
- Jackson Lab Clinical Knowledge Base
- Jackson Lab Clinical Trials
- Washington University CIViC
- Precision Oncology Knowledge Base
- Cancer Genome Interpreter Cancer
- Cornell
- MolecularMatch
- BRCA Exchange
- Sage Bionetworks
The g2p-aggregator provides an open source suite based on the GA4GH schemas. The harmonized evidence can be efficiently interrogated through a search API to find sets of relevant evidence.
Methods
The g2p-aggregator integrates data coming from multiple knowledgebases and allows users to query a harmonized result set. The harmonization consists of structural and ontology mapping. Structural mapping transforms the input stream into a GA4GH genomic feature association. The content of the original data source is maintained for queries and is returned in result sets. Ontology mapping is targeted at variants, environment (drugs), phenotypes (diseases), and evidence metadata such as evidence strength and direction. The system then stores evidence in one or more of several possible stores [elastic search, kafka message queues, RDBMS or files]. Full text search, aggregations and a GA4GH beacon are provided via the elastic search store. Integration with downstream systems is provided by the kafka or file system store.
Our central theme was to provide a robust search facility, giving a focus to the conversation between the researcher and the aggregated evidence. Researchers should expect a rich search experience and should be able to make judgments about the applicability of the evidence set to their research question based on the quality of one or two sets of search results.
Results
The g2p-aggregator manages data of 9 knowledgebases with a total count of over 25K evidence and clinical trial items distributed over:
- 9 knowledge bases
- 513 Genes, 7221 unique Locations
- 383 Diseases, 185 unique Disease Ontologies
- 1119 Drugs, 898 unique pubchem identifiers
- 7905 unique publications
Conclusions
The g2p-aggregator demonstrates how web-scale, open source architectures and components can be applied to support translational research. The next steps of our project will extend its capabilities with new plug-ins devoted to bioinformatics data analysis and with a temporal query module. For researchers who need to investigate genomic events, g2p is a search tool that aggregates evidence from several knowledge bases. Unlike ad-hoc searches, it allows the researcher to focus on the evidence, not on the search. For informaticians who need to annotate genomic events, g2p is a search tool that provides a query API for any pipeline to gather evidence 'hits'. Unlike current practices, which have not focused on evidence, it allows the informatician to identify, filter and sort genomic events based on evidence.
Background
The GA4GH schema is intended to provide structures for unambiguous references to ontological concepts and/or controlled vocabularies defined by the GA4GH G2P schemas and the individual data sources. The system's harmonization process follows the intent expressed by the original G2P task team:
Where a G2P association is between the G(enotype) in the context of
some E(environment), which gives rise to a P(henotype). These
associations have further evidence, provenance, and attribution.
We leverage the GenomicFeature in the sequenceAnnotation schema here
as it can accommodate any genomic feature from a single nucleotide variation
(SNV), up through a gene, and/or complex rearrangements. Each can
be modeled as genomic features, and generally linked to a phenotype.
Collections of these features can represent a genotype at different levels
of completeness. Therefore, we can represent single allelic variation,
allelic complement, and multiple variants in a genotype that can each or
collectively be associated with a phenotype.
To enable standardized integration, this schema relies heavily on
OntologyTerms, for typing phenotype, genomic features, and levels
of evidence.
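The G, E, P structure described above can be pictured as a small record type. The following is a minimal sketch only, not the GA4GH schema itself: the class names, field names and identifier values are assumptions chosen for illustration, and the example values are placeholders rather than harvested evidence.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OntologyTerm:
    """A reference to a concept in an ontology or controlled vocabulary."""
    term_id: str  # placeholder for e.g. a disease ontology or pubchem identifier
    label: str    # human-readable name


@dataclass
class GenomicFeature:
    """Anything from a single nucleotide variation up to a gene or rearrangement."""
    name: str
    gene: Optional[str] = None
    chromosome: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None


@dataclass
class Association:
    """G (features) in the context of E (environments) gives rise to P (phenotype)."""
    features: List[GenomicFeature]
    environments: List[OntologyTerm]
    phenotype: OntologyTerm
    evidence_level: str  # harmonized label (hypothetical field name)
    source: str          # knowledgebase the item was harvested from
    publications: List[str] = field(default_factory=list)


# Placeholder record: identifiers and the evidence level are illustrative only.
example = Association(
    features=[GenomicFeature(name="KIT", gene="KIT")],
    environments=[OntologyTerm(term_id="PLACEHOLDER:drug", label="imatinib")],
    phenotype=OntologyTerm(term_id="PLACEHOLDER:disease", label="GIST"),
    evidence_level="NA",
    source="example",
)
```

Keeping features and environments as lists reflects the point made in the quoted text that several features can each or collectively be associated with a phenotype.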
Methods
Harvester
A harvester is a python module that implements this duck typing interface.
- harvest: A fairly straightforward mechanism to use the knowledgebase's access method (api, file download, etc.) to retrieve the evidence items in their native format.
- convert: Each harvester needs to map and harmonize the evidence presented to a GA4GH FeatureAssociation. This function is supported by several helper methods:
  - A normalized vocabulary for evidence_level which harmonizes the source to AMP/ASCO/CAP guidelines.
  - An alias and lookup service for phenotype that leverages a webservice provided by EBI to look up the human disease ontology.
  - A lookup service for environment that leverages a webservice provided by Biothings to look up pubchem, chebi or chembl identifiers as well as toxicity, taxonomy and approved countries.
  - The COSMIC variant table to parse and harmonize variant location.
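The harvester interface described above can be sketched as a plain module with two functions. This is an illustrative sketch under stated assumptions, not the project's actual code: the native record layout, the output field names, the `harvest_and_convert` helper and the `EVIDENCE_LEVELS` mapping are all hypothetical, and the stub yields a hard-coded record instead of calling a knowledgebase.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical normalization table: source-specific labels -> harmonized levels.
EVIDENCE_LEVELS = {"validated": "A", "clinical": "B", "case study": "C", "preclinical": "D"}


def harvest(genes: Iterable[str] = ()) -> Iterator[Dict[str, Any]]:
    """Yield evidence items in the knowledgebase's native format.

    A real harvester would call the source's API or read a downloaded file;
    this stub yields one made-up record so the sketch runs offline.
    """
    native_items = [
        {"gene": "KIT", "drug": "imatinib", "disease": "GIST", "level": "validated"},
    ]
    wanted = set(genes)
    for item in native_items:
        # An empty gene filter means "return everything".
        if not wanted or item["gene"] in wanted:
            yield item


def convert(item: Dict[str, Any]) -> Dict[str, Any]:
    """Map one native item onto a harmonized, association-like dict."""
    return {
        "genes": [item["gene"]],
        "association": {
            "phenotype": {"description": item["disease"]},
            "environmentalContexts": [{"description": item["drug"]}],
            # Unknown source labels fall back to "NA".
            "evidence_label": EVIDENCE_LEVELS.get(item["level"], "NA"),
        },
        # The original record is kept so it can be queried and returned as-is.
        "source_record": item,
    }


def harvest_and_convert(genes: Iterable[str] = ()) -> Iterator[Dict[str, Any]]:
    """Run both steps of the interface for every harvested item."""
    for item in harvest(genes):
        yield convert(item)
```

Because the interface is duck typed, any module exposing functions with these names and shapes could be plugged in without inheriting from a base class.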
Once deployed via standard docker containers, the system extends the value of the underlying data by enabling query via a GA4GH beacon or elastic search's API. The resulting system has a minimal footprint, and is currently deployed on Amazon's free tier.
Use cases
Use cases are divided into three categories: discovery, exploration and integration.
GA4GH Beacon
The Beacon project is a project to test the willingness of international sites to share genetic data in the simplest of all technical contexts.
Our implementation follows the beacon specification and returns meta information about the beacon and a simple evidence summary for a specific genomic location.
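A query for a specific genomic location might be assembled as in the sketch below. The parameter names are the ones used by the GA4GH Beacon query interface as we understand it (referenceName, start, referenceBases, alternateBases, assemblyId); the host name, the `/query` path, the default assembly and the coordinates are placeholders and assumptions, not the deployed service or a real variant.

```python
from urllib.parse import urlencode


def beacon_query_url(base_url, reference_name, start, reference_bases,
                     alternate_bases, assembly_id="GRCh37"):
    """Build a Beacon-style query URL for a single allele at one position."""
    params = {
        "referenceName": reference_name,
        "start": start,
        "referenceBases": reference_bases,
        "alternateBases": alternate_bases,
        "assemblyId": assembly_id,
    }
    # Tolerate a trailing slash on the base URL; the path is an assumption.
    return f"{base_url.rstrip('/')}/query?{urlencode(params)}"


# Placeholder host and coordinates, for illustration only.
url = beacon_query_url("https://beacon.example.org/", "4", 12345, "A", "T")
print(url)
```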
UI
Our current alpha UI allows the user to query using a 'google search' and then presents visualizations and the ability to drill down to the specific FeaturePhenotypeAssociation and associated evidence from the original source.
As a clinician or a genomics researcher, I may have a patient with gastrointestinal stromal tumor (GIST) and a proposed drug for treatment, imatinib. In order to identify whether the patient would respond well to treatment with the drug, I need a list of features (e.g. genes) which are associated with the sensitivity of GIST to imatinib. Suppose I am specifically interested in a gene, KIT, which is implicated in the pathogenesis of several cancer types. I could submit a query in the form GIST AND imatinib AND KIT.
In response, I will receive a list of associations involving GIST and KIT, which I can filter for instances where imatinib is mentioned. Additionally, the query could be extended by drilling down on the UI's widgets and/or by continuing to add full text search terms.
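The same full text query could also be expressed as an Elasticsearch request body. The sketch below only builds the body using the standard `query_string` query; the helper name `full_text_query` and the default result size are assumptions for illustration.

```python
import json


def full_text_query(terms, size=10):
    """Combine search terms with AND into a query_string request body."""
    # Quote multi-word terms so they are matched as phrases.
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return {
        "size": size,
        "query": {"query_string": {"query": " AND ".join(quoted)}},
    }


body = full_text_query(["GIST", "imatinib", "KIT"])
print(json.dumps(body, indent=2))
```

A body like this could be passed as the body argument of the search call shown in the API section.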
API
The /analysis folder contains a python notebook that leverages the entire knowledgebase for comparison with the GENIE database of clinical outcomes. We have used the Elasticsearch DSL to abstract the low level APIs (which are still available for use) to provide "a more convenient and idiomatic way to write and manipulate queries".
An example to simply retrieve all evidence items would be:
res = es.search(index="g2p", size=10000, body={"query": {"match_all": {}}})
Alternative APIs are available in most commonly used programming environments.
Results
Harmonization
Evidence Level
Our first challenge was to align the diverse "strength of evidence" fields presented by different knowledgebases.
Detail: Evidence Label by source
source | evidence label | count |
---|---|---|
cgi | A | 304 |
cgi | B | 49 |
cgi | C | 563 |
cgi | D | 515 |
cgi | NA | 2 |
civic | A | 62 |
civic | B | 1131 |
civic | C | 1003 |
civic | D | 980 |
civic | NA | 5 |
jax | A | 64 |
jax | B | 54 |
jax | C | 647 |
jax | D | 2884 |
jax | NA | 3 |
jax_trials | D | 1131 |
jax_trials | NA | 3 |
molecularmatch | A | 298 |
molecularmatch | B | 73 |
molecularmatch | C | 150 |
molecularmatch | D | 500 |
molecularmatch_trials | D | 64379 |
molecularmatch_trials | NA | 885 |
oncokb | A | 114 |
oncokb | B | 116 |
oncokb | C | 69 |
oncokb | D | 97 |
oncokb | NA | 185 |
pmkb | A | 414 |
pmkb | C | 160 |
pmkb | D | 35 |
pmkb | NA | 609 |
sage | C | 33 |
sage | D | 36 |
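A breakdown like the table above could be produced with a nested terms aggregation. The following is a sketch under assumptions: the field names `source` and `association.evidence_label` are hypothetical and depend on the index mapping, and the sample response is hand-built from two rows of the table rather than returned by a live cluster.

```python
def label_by_source_aggregation(source_field="source",
                                label_field="association.evidence_label"):
    """Request body that counts evidence labels within each source."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "by_source": {
                "terms": {"field": source_field, "size": 50},
                "aggs": {
                    "by_label": {
                        # Documents without a label are counted under "NA".
                        "terms": {"field": label_field, "missing": "NA"}
                    }
                },
            }
        },
    }


def flatten(response):
    """Turn the nested buckets into (source, label, count) rows."""
    rows = []
    for source_bucket in response["aggregations"]["by_source"]["buckets"]:
        for label_bucket in source_bucket["by_label"]["buckets"]:
            rows.append((source_bucket["key"], label_bucket["key"],
                         label_bucket["doc_count"]))
    return rows


# Hand-built response in the shape Elasticsearch returns for this request.
sample_response = {
    "aggregations": {"by_source": {"buckets": [
        {"key": "sage", "doc_count": 69, "by_label": {"buckets": [
            {"key": "C", "doc_count": 33},
            {"key": "D", "doc_count": 36},
        ]}},
    ]}}
}
print(flatten(sample_response))
```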
Phenotype & Environment
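In order to enable cross-knowledgebase queries, we needed uniform identifiers for Phenotype and Environment. The tables below list the most frequent terms per source that could not be harmonized.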
Detail: Exceptions to phenotype and environment harmonization
source | environment | count |
---|---|---|
molecularmatch_trials | Surgery | 930 |
molecularmatch_trials | HSCT | 745 |
molecularmatch_trials | Allotransplantation | 349 |
molecularmatch_trials | Cytotoxic T Lymphocytes | 138 |
molecularmatch_trials | RG7446 | 111 |
brca | | |
oncokb | Debio1347 | 12 |
oncokb | AP32788 | 5 |
oncokb | BAY1436032 | 5 |
oncokb | BGB659 | 2 |
jax | N/A | 39 |
jax | MRX-2843 | 11 |
jax | AZ8010 | 7 |
jax | TASIN-1 | 7 |
jax | BAY1187982 | 6 |
civic | AMGMDS3 | 8 |
civic | Chemotherapy | 4 |
civic | Adjuvant Chemotherapy | 2 |
civic | Adoptive T-cell Transfer | 2 |
civic | Antiangiogenic Therapy | 2 |
cgi | FGFR inhibitors | 25 |
cgi | PARP inhibitors | 21 |
cgi | MTOR inhibitors | 17 |
cgi | PI3K pathway inhibitors | 17 |
cgi | HDAC inhibitors | 6 |
jax_trials | IDH305 | 2 |
jax_trials | INCB054828 | 2 |
jax_trials | SYM004 | 2 |
jax_trials | AC0010MA | 1 |
jax_trials | ALRN-6924 | 1 |
molecularmatch | Sym004 | 5 |
molecularmatch | | 3 |
molecularmatch | RG7446 | 3 |
molecularmatch | ETC159 | 2 |
molecularmatch | MEDI6469 | 2 |
pmkb | | |
sage | mTOR inhibitors | 7 |
source | phenotype | count |
---|---|---|
molecularmatch_trials | Acute myeloid leukaemia, disease | 1069 |
molecularmatch_trials | HIV - Human immunodeficiency virus infection | 841 |
molecularmatch_trials | Myeloproliferative disorder | 735 |
molecularmatch_trials | Chronic lymphoid leukaemia, disease | 706 |
molecularmatch_trials | ALL - Acute lymphoblastic leukaemia | 632 |
brca | | |
oncokb | Soft Tissue Sarcoma | 6 |
oncokb | CNS Cancer | 2 |
oncokb | Embryonal Tumor | 2 |
oncokb | Esophagogastric Cancer | 2 |
oncokb | Esophageal/Stomach Cancer, NOS | 1 |
jax | Indication other than cancer | 1 |
civic | Desmoid Fibromatosis | 9 |
civic | T-cell Acute Lymphoblastic Leukemia | 8 |
civic | Epithelial Ovarian Cancer | 5 |
civic | Hepatocellular Fibrolamellar Carcinoma | 3 |
civic | Anaplastic Oligodendroglioma | 2 |
cgi | Renal | 17 |
cgi | Bladder BLCA | 10 |
cgi | Head an neck | 9 |
cgi | Head an neck squamous | 8 |
cgi | Myelodisplasic proliferative syndrome | 8 |
jax_trials | | |
molecularmatch | Metastasis from malignant melanoma of skin | 2 |
pmkb | MDS with Ring Sideroblasts | 10 |
pmkb | Glial Neoplasm | 9 |
pmkb | Histiocytic and Dendritic Cell Neoplasms | 7 |
pmkb | Langerhans Cell Histiocytosis | 7 |
pmkb | Other Tumor Type | 5 |
sage | mesothelioma | 2 |
sage | head and neck cancer | 1 |
Genotype
Breakdown of normalized variants/biomarkers by source.
source | genomic location status | count |
---|---|---|
brca | genomic location | 5733 |
brca | no genomic location | 0 |
cgi | genomic location | 589 |
cgi | no genomic location | 842 |
civic | genomic location | 2865 |
civic | no genomic location | 311 |
jax | genomic location | 3009 |
jax | no genomic location | 640 |
jax_trials | genomic location | 1051 |
jax_trials | no genomic location | 80 |
molecularmatch | genomic location | 804 |
molecularmatch | no genomic location | 217 |
molecularmatch_trials | genomic location | 0 |
molecularmatch_trials | no genomic location | 64379 |
oncokb | genomic location | 1811 |
oncokb | no genomic location | 2338 |
pmkb | genomic location | 609 |
pmkb | no genomic location | 0 |
sage | genomic location | 0 |
sage | no genomic location | 69 |
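The share of items per source that could be placed on the genome follows directly from the counts above. A short sketch of the arithmetic, using the table values as given; the helper name `located_fraction` is ours.

```python
# (with genomic location, without genomic location), copied from the table above
counts = {
    "brca": (5733, 0),
    "cgi": (589, 842),
    "civic": (2865, 311),
    "jax": (3009, 640),
    "jax_trials": (1051, 80),
    "molecularmatch": (804, 217),
    "molecularmatch_trials": (0, 64379),
    "oncokb": (1811, 2338),
    "pmkb": (609, 0),
    "sage": (0, 69),
}


def located_fraction(with_location, without_location):
    """Fraction of items that carry a genomic location."""
    total = with_location + without_location
    return with_location / total if total else 0.0


for source, (located, unlocated) in sorted(counts.items()):
    print(f"{source:22s} {located_fraction(located, unlocated):6.1%}")

total_located = sum(located for located, _ in counts.values())
total_unlocated = sum(unlocated for _, unlocated in counts.values())
print(f"{'all sources':22s} {located_fraction(total_located, total_unlocated):6.1%}")
```

The overall figure is dominated by molecularmatch_trials, so per-source fractions are the more informative view.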
Analysis
GENIE Analysis: Variant Level
The AACR Project Genomics Evidence Neoplasia Information Exchange (GENIE) provides clinical targeted sequencing panel data from 8 different cancer centers.
G2P Knowledge Base coverage of GENIE at the variant level:
- Total coverage of non-unique variants is 28%
- Adding more databases increases total coverage
- Different databases contribute different types of evidence
- OncoKB and MolecularMatch contribute guideline recommendations
- CIViC and JAX contribute substantial preclinical evidence
- 10% of variants are associated with A-level evidence and are highly actionable
- 22% of variants are associated with B-level evidence and are moderately actionable
GENIE Analysis: Donor Level
G2P coverage of GENIE donors is encouraging:
- 42% of donors have 1+ actionable variant
- 48% of donors with non-unique variant(s) also have 1+ actionable variant
- Adding more databases increases total coverage
- 25% of donors have a variant with A-level evidence; 42% of donors have a variant with B-level evidence
Discussion
//TODO
Conclusions
//TODO
List of abbreviations used
//TODO
Declarations
//TODO
Acknowledgements
//TODO
Electronic supplementary material
//TODO
References
//TODO
Follow up
From Malachi Griffith to Everyone: (08:11 AM)
VICC Paper Google Doc
https://docs.google.com/document/d/1_jA1B4G5YFo95rIwyDyc__Gx6uHfgU3FldMrF317fzU/edit
From Malachi Griffith to Everyone: (08:17 AM)
#69
Comments from call:
- Chromosome name by source
- Breakdown of disease names per disease ontology. Drug name per pubchem.
  383 Diseases, 185 unique Disease Ontologies
  1119 Drugs, 898 unique pubchem identifiers
- Are there conflicts between knowledge bases, e.g. one categorizes as level A, another says level B, etc.?
- Does Rodrigo D's presentation make sense to include in paper?
- Phenotype & Environment - show examples of no-doid, no-pubchem
- Composite biomarkers (ex. needed) (DT)
- Protein coordinates vs Genome coordinates reverse mapping (DT)
- Future work:
  - integrate with wiki data?
  - Allele registry vs COSMIC?
Updated input to paper:
Done:
- Chromosome name by source
- Breakdown of disease names per disease ontology. Drug name per pubchem.
- Phenotype & Environment - show examples of no-doid, no-pubchem
- Additional knowledgebase: molecularmatch clinical trials
In progress:
- Phenotype harmonization: molecularmatch clinical trials
- Allele registry