ohsu-comp-bio / g2p-aggregator

Associations of genomic features, drugs and diseases

Technical input to VICC paper (draft notes)

bwalsh opened this issue · comments

commented

@jgoecks @mayfielg @ahwagner : See my notes below. I simply wanted to capture them here and if it's appropriate, move them over to google docs.


G2P - Knowledgebase integration

Abstract

Background

The g2p-aggregator is a bioinformatics tool designed to integrate evidence from disparate data sources and to support interpretation, prioritization and report generation. It is implemented by Oregon Health & Science University (OHSU), and integrates evidence from multiple knowledgebases.

The g2p-aggregator provides an open source suite based on the GA4GH schemas, which can be efficiently interrogated through a search API to find sets of relevant evidence.

Methods

The g2p-aggregator integrates data from multiple knowledgebases and allows users to query a harmonized result set. The harmonization consists of structural and ontology mapping. Structural mapping transforms the input stream into a GA4GH genomic feature association. The content of the original data source is maintained for queries and is returned in result sets. Ontology mapping is targeted at variants, environment (drugs), phenotypes (diseases), and evidence metadata such as evidence strength and direction. The system then stores evidence in one or more of several possible stores [Elasticsearch, Kafka message queues, RDBMS or files]. Full text search, aggregations and a GA4GH beacon are provided via the Elasticsearch store. Integration with downstream systems is provided by the Kafka or file system store.

Our central theme was to provide a robust search facility, giving a focus to the conversation between the researcher and the aggregated evidence. Researchers should expect a rich search experience and should be able to make judgments about the applicability of the evidence set to their research question based on the quality of one or two sets of search results.

Results

The g2p-aggregator manages data from 9 knowledgebases, with a total of over 25K evidence and clinical trial items distributed over:

  • 9 knowledge bases
  • 513 Genes, 7221 unique Locations
  • 383 Diseases, 185 unique Disease Ontologies
  • 1119 Drugs, 898 unique pubchem identifiers
  • 7905 unique publications

Conclusions

G2p-aggregator demonstrates how web-scale, open source architectures and components can be applied to support translational research. The next steps of our project will extend its capabilities with new plug-ins devoted to bioinformatics data analysis, as well as a temporal query module. For researchers who need to investigate genomic events, g2p is a search tool that aggregates evidence from several knowledge bases. Unlike ad-hoc searches, it allows the researcher to focus on the evidence, not on the search. For informaticians who need to annotate genomic events, g2p is a search tool that provides a query API for any pipeline to gather evidence 'hits'. Unlike current practices, which have not focused on evidence, it allows the informatician to identify, filter and sort genomic events based on evidence.

Background

The GA4GH schema is intended to provide structures for unambiguous references to ontological concepts and/or controlled vocabularies defined by the GA4GH G2P schemas and the individual data sources. The system's harmonization process follows the intent expressed by the original G2P task team.

Where a G2P association is between the G(enotype) in the context of
some E(environment), which gives rise to a P(henotype). These
associations have further evidence, provenance, and attribution.
We leverage the GenomicFeature in the sequenceAnnotation schema here
as it can accommodate any genomic feature from a single nucleotide variation
(SNV), up through a gene, and/or complex rearrangements. Each can
be modeled as genomic features, and generally linked to a phenotype.
Collections of these features can represent a genotype at different levels
of completeness. Therefore, we can represent single allelic variation,
allelic complement, and multiple variants in a genotype that can each or
collectively be associated with a phenotype.
To enable standardized integration, this schema relies heavily on
OntologyTerms, for typing phenotype, genomic features, and levels
of evidence.
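
The association described above (G in the context of E giving rise to P, with evidence attached) can be pictured as a small data structure. The following is only a sketch: the class and field names are assumptions made for illustration and are not the GA4GH schema definitions, and the identifiers are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OntologyTerm:
    """A reference to a concept in an ontology or controlled vocabulary."""
    term_id: str  # placeholder identifier; real IDs would come from an ontology
    label: str


@dataclass
class GenomicFeature:
    """Anything from a single nucleotide variation up to a gene."""
    gene: Optional[str] = None
    chromosome: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None
    ref: Optional[str] = None
    alt: Optional[str] = None


@dataclass
class Association:
    """One or more features (G), in some environment (E), linked to a phenotype (P)."""
    features: List[GenomicFeature]
    environments: List[OntologyTerm] = field(default_factory=list)
    phenotype: Optional[OntologyTerm] = None
    evidence_label: Optional[str] = None  # left unset here; no real evidence is implied
    source: Optional[str] = None


# Illustrative instance only, using the gene, drug and disease named in the UI use case.
example = Association(
    features=[GenomicFeature(gene="KIT")],
    environments=[OntologyTerm(term_id="placeholder:drug", label="imatinib")],
    phenotype=OntologyTerm(term_id="placeholder:disease", label="GIST"),
    source="example_kb",
)
```

Because features is a list, the same shape covers a single allelic variation as well as several variants that are collectively associated with a phenotype.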

Methods

Harvester

A harvester is a Python module that implements the following duck-typed interface.

image

  • harvest: A fairly straightforward mechanism to use the knowledgebase's access method (api, file download, etc.) to retrieve the evidence items in their native format.

  • convert: Each harvester needs to map and harmonize the evidence presented to a GA4GH FeatureAssociation. This function is supported by several helper methods:

    • A normalized vocabulary for evidence_level, which harmonizes the source to AMP/ASCO/CAP guidelines.
    • An alias and lookup service for phenotype that leverages a web service provided by EBI to look up Human Disease Ontology terms.
    • A lookup service for environment that leverages a web service provided by Biothings to look up PubChem, ChEBI or ChEMBL identifiers as well as toxicity, taxonomy and approved countries.
    • The COSMIC variant table, used to parse and harmonize variant location.
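
As a rough illustration of the harvest and convert steps listed above, a harvester module might look like the sketch below. Everything here is an assumption for illustration: the in-memory records stand in for a real knowledgebase download, the harvest_and_convert helper and the output field names are hypothetical, and none of the lookup services are called.

```python
from typing import Dict, Iterator, List, Optional

# Stand-in for a knowledgebase's native payload (hypothetical records).
_NATIVE_RECORDS = [
    {"gene": "KIT", "drug": "imatinib", "disease": "GIST"},
    {"gene": "GENE_X", "drug": "drug_x", "disease": "disease_x"},
]


def harvest(genes: Optional[List[str]] = None) -> Iterator[Dict]:
    """Yield evidence items in the source's native format, optionally filtered by gene."""
    for record in _NATIVE_RECORDS:
        if genes is None or record["gene"] in genes:
            yield record


def convert(record: Dict) -> Iterator[Dict]:
    """Map one native record onto a harmonized association (field names are assumed)."""
    yield {
        "source": "example_kb",
        "genes": [record["gene"]],
        "association": {
            "environments": [{"description": record["drug"]}],
            "phenotype": {"description": record["disease"]},
        },
        # The original content is kept so it can be queried and returned.
        "raw": record,
    }


def harvest_and_convert(genes: Optional[List[str]] = None) -> Iterator[Dict]:
    """Chain the two steps; any module exposing these functions fits the interface."""
    for record in harvest(genes):
        for association in convert(record):
            yield association
```

Since the interface is duck typed, a new knowledgebase only needs a module exposing functions with these names; no base class is required.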

Once deployed via standard docker containers, the system extends the value of the underlying data by enabling queries via a GA4GH beacon or Elasticsearch's API. The resulting system has a minimal footprint, and is currently deployed on Amazon's free tier.

Use cases

Use cases are divided into three categories: discovery, exploration and integration.

image

GA4GH Beacon

The Beacon project tests the willingness of international sites to share genetic data in the simplest of all technical contexts.

Our implementation follows the beacon specification and returns meta information about the beacon and a simple evidence summary for a specific genomic location.
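
A query for a specific genomic location could be built roughly as follows. This is a sketch under assumptions: the parameter names follow the Beacon v1-style query (referenceName, start, referenceBases, alternateBases, assemblyId), the host name is hypothetical, and the coordinates are placeholders rather than a real variant. Which Beacon version the g2p endpoint implements is not stated here.

```python
from urllib.parse import urlencode


def beacon_query_url(base_url, reference_name, start, reference_bases,
                     alternate_bases, assembly_id):
    """Build a Beacon-style query URL for a single genomic location."""
    params = {
        "referenceName": reference_name,
        "start": start,
        "referenceBases": reference_bases,
        "alternateBases": alternate_bases,
        "assemblyId": assembly_id,
    }
    return f"{base_url.rstrip('/')}/query?{urlencode(params)}"


# Hypothetical host and placeholder coordinates.
url = beacon_query_url("https://g2p.example.org/beacon", "4", 12345, "A", "T", "GRCh37")
```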

UI

Our current alpha UI allows the user to query using a 'google search' and then presents visualizations and the ability to drill down to the specific FeaturePhenotypeAssociation and associated evidence from the original source.

As a clinician or a genomics researcher, I may have a patient with Gastrointestinal stromal tumor, GIST, and a proposed drug for treatment, imatinib. In order to identify whether the patient would respond well to treatment with the drug, I need a list of features (e.g. genes) which are associated with the sensitivity of GIST to imatinib. Suppose I am specifically interested in a gene, KIT, which is implicated in the pathogenesis of several cancer types. I could submit a query in the form GIST AND imatinib AND KIT.

In response, I will receive back a list of associations involving GIST and KIT, which I can filter for instances where imatinib is mentioned. Additionally, the query could be extended by either drilling down on the UI's widgets and/or continuing to add full text search terms.
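
The same full text query can be expressed as an Elasticsearch request body. The sketch below only builds the body using the standard query_string query; the helper name build_query is hypothetical, and whether the UI issues exactly this query is an assumption.

```python
def build_query(terms, size=10):
    """Combine search terms with AND into an Elasticsearch query_string body."""
    return {
        "size": size,
        "query": {"query_string": {"query": " AND ".join(terms)}},
    }


body = build_query(["GIST", "imatinib", "KIT"])
# With a connected client, the body could be submitted as:
#   res = es.search(index="g2p", body=body)
```

Adding a further full text term to narrow the result set amounts to appending it to the list passed to build_query.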

image

API

The /analysis folder contains a python notebook that leverages the entire knowledgebase for comparison with the GENIE database of clinical outcomes. We have used the Elasticsearch DSL to abstract the low level APIs (which are still available for use) to provide "a more convenient and idiomatic way to write and manipulate queries".

An example to simply retrieve all evidence items would be:

res = es.search(index="g2p", size=10000, body={"query": {"match_all": {}}})

Alternative APIs are available in most commonly used programming environments.

Results

Harmonization

Evidence Level

Our first challenge was to align the diverse "strength of evidence" fields presented by different knowledgebases.

image
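
One simple way to align source-specific evidence fields is a per-source lookup table onto the shared A to D labels, with NA for anything unmapped. The sketch below shows the idea only: the source names and raw labels in the table are placeholders, not the mappings actually used for any knowledgebase.

```python
# Placeholder mapping tables; real sources and labels would differ.
HYPOTHETICAL_LEVEL_MAP = {
    "example_kb_1": {"validated": "A", "clinical": "B", "case study": "C", "preclinical": "D"},
    "example_kb_2": {"1": "A", "2": "B", "3": "C", "4": "D"},
}


def normalize_evidence_label(source, raw_level):
    """Return the harmonized label (A-D), or 'NA' when no mapping is known."""
    table = HYPOTHETICAL_LEVEL_MAP.get(source, {})
    key = str(raw_level).strip().lower()
    return table.get(key, "NA")
```

Returning NA instead of raising keeps unmapped items visible in the counts, which is how the NA rows in a per-source breakdown can be read.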

Detail: Evidence Label by source

source evidence_label count
cgi A 304
cgi B 49
cgi C 563
cgi D 515
cgi NA 2
civic A 62
civic B 1131
civic C 1003
civic D 980
civic NA 5
jax A 64
jax B 54
jax C 647
jax D 2884
jax NA 3
jax_trials D 1131
jax_trials NA 3
molecularmatch A 298
molecularmatch B 73
molecularmatch C 150
molecularmatch D 500
molecularmatch_trials D 64379
molecularmatch_trials NA 885
oncokb A 114
oncokb B 116
oncokb C 69
oncokb D 97
oncokb NA 185
pmkb A 414
pmkb C 160
pmkb D 35
pmkb NA 609
sage C 33
sage D 36
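
Counts of this shape can be produced with a nested Elasticsearch terms aggregation. The following sketch builds such a request and flattens a response into rows; the field names source and evidence_label are assumptions about the index mapping, and the response used at the end is a hand-written stub that reuses the two sage rows from the table.

```python
def build_label_aggregation(source_field="source", label_field="evidence_label"):
    """Request body: bucket by source, then by evidence label, returning no hits."""
    return {
        "size": 0,
        "aggs": {
            "by_source": {
                "terms": {"field": source_field, "size": 50},
                "aggs": {
                    "by_label": {"terms": {"field": label_field, "missing": "NA"}}
                },
            }
        },
    }


def flatten_buckets(response):
    """Turn the nested aggregation buckets into (source, label, count) rows."""
    rows = []
    for source_bucket in response["aggregations"]["by_source"]["buckets"]:
        for label_bucket in source_bucket["by_label"]["buckets"]:
            rows.append((source_bucket["key"], label_bucket["key"], label_bucket["doc_count"]))
    return rows


# Stub response shaped like an Elasticsearch aggregation result.
stub = {"aggregations": {"by_source": {"buckets": [
    {"key": "sage", "doc_count": 69, "by_label": {"buckets": [
        {"key": "C", "doc_count": 33},
        {"key": "D", "doc_count": 36},
    ]}},
]}}}
rows = flatten_buckets(stub)
```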

Phenotype & Environment

In order to enable cross knowledgebase queries, we needed a uniform Phenotype and Environment.

image

Detail: Exceptions to phenotype and environment harmonization

source environment count
molecularmatch_trials Surgery 930
molecularmatch_trials HSCT 745
molecularmatch_trials Allotransplantation 349
molecularmatch_trials Cytotoxic T Lymphocytes 138
molecularmatch_trials RG7446 111
brca    
oncokb Debio1347 12
oncokb AP32788 5
oncokb BAY1436032 5
oncokb BGB659 2
jax N/A 39
jax MRX-2843 11
jax AZ8010 7
jax TASIN-1 7
jax BAY1187982 6
civic AMGMDS3 8
civic Chemotherapy 4
civic Adjuvant Chemotherapy 2
civic Adoptive T-cell Transfer 2
civic Antiangiogenic Therapy 2
cgi FGFR inhibitors 25
cgi PARP inhibitors 21
cgi MTOR inhibitors 17
cgi PI3K pathway inhibitors 17
cgi HDAC inhibitors 6
jax_trials IDH305 2
jax_trials INCB054828 2
jax_trials SYM004 2
jax_trials AC0010MA 1
jax_trials ALRN-6924 1
molecularmatch Sym004 5
molecularmatch 3
molecularmatch RG7446 3
molecularmatch ETC159 2
molecularmatch MEDI6469 2
pmkb    
sage mTOR inhibitors 7

source phenotype count
molecularmatch_trials Acute myeloid leukaemia, disease 1069
molecularmatch_trials HIV - Human immunodeficiency virus infection 841
molecularmatch_trials Myeloproliferative disorder 735
molecularmatch_trials Chronic lymphoid leukaemia, disease 706
molecularmatch_trials ALL - Acute lymphoblastic leukaemia 632
brca    
oncokb Soft Tissue Sarcoma 6
oncokb CNS Cancer 2
oncokb Embryonal Tumor 2
oncokb Esophagogastric Cancer 2
oncokb Esophageal/Stomach Cancer, NOS 1
jax Indication other than cancer 1
civic Desmoid Fibromatosis 9
civic T-cell Acute Lymphoblastic Leukemia 8
civic Epithelial Ovarian Cancer 5
civic Hepatocellular Fibrolamellar Carcinoma 3
civic Anaplastic Oligodendroglioma 2
cgi Renal 17
cgi Bladder BLCA 10
cgi Head an neck 9
cgi Head an neck squamous 8
cgi Myelodisplasic proliferative syndrome 8
jax_trials    
molecularmatch Metastasis from malignant melanoma of skin 2
pmkb MDS with Ring Sideroblasts 10
pmkb Glial Neoplasm 9
pmkb Histiocytic and Dendritic Cell Neoplasms 7
pmkb Langerhans Cell Histiocytosis 7
pmkb Other Tumor Type 5
sage mesothelioma 2
sage head and neck cancer 1

Genotype

Breakdown of normalized variants/biomarkers by source.
image

source filters count
brca genomic location 5733
brca no genomic location 0
cgi genomic location 589
cgi no genomic location 842
civic genomic location 2865
civic no genomic location 311
jax genomic location 3009
jax no genomic location 640
jax_trials genomic location 1051
jax_trials no genomic location 80
molecularmatch genomic location 804
molecularmatch no genomic location 217
molecularmatch_trials genomic location 0
molecularmatch_trials no genomic location 64379
oncokb genomic location 1811
oncokb no genomic location 2338
pmkb genomic location 609
pmkb no genomic location 0
sage genomic location 0
sage no genomic location 69
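
To compare sources, the two counts per source can be reduced to the share of items that carry a genomic location. This is a small worked sketch using three rows copied from the table above; the helper name location_share is hypothetical.

```python
def location_share(with_location, without_location):
    """Fraction of a source's items that have a genomic location."""
    total = with_location + without_location
    return with_location / total if total else 0.0


# (genomic location, no genomic location) counts taken from the table.
counts = {"cgi": (589, 842), "civic": (2865, 311), "sage": (0, 69)}
shares = {source: location_share(*pair) for source, pair in counts.items()}

for source, share in sorted(shares.items()):
    print(f"{source}: {share:.1%}")
```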

Analysis

GENIE Analysis: Variant Level

AACR Project GENIE (Genomics Evidence Neoplasia Information Exchange) provides clinical targeted sequencing panel data from 8 different cancer centers.

G2P Knowledge Base coverage of GENIE at the variant level:

  • Total coverage of non-unique variants is 28%
  • Adding more databases increases total coverage
  • Different databases contribute different types of evidence
  • OncoKB and MolecularMatch contribute guideline recommendations
  • CIViC and JAX contribute substantial preclinical evidence
  • 10% of variants are associated with A-level evidence and are highly actionable
  • 22% of variants are associated with B-level evidence and are moderately actionable

image

GENIE Analysis: Donor Level

G2P coverage of GENIE donors is encouraging:

  • 42% of donors have 1+ actionable variant
  • 48% of donors with non-unique variant(s) also have 1+ actionable variant
  • Adding more databases increases total coverage
  • 25% of donors have a variant with A-level evidence, 42% of donors have a variant with B-level evidence

image

Discussion

//TODO

Conclusions

//TODO

List of abbreviations used

//TODO

Declarations

//TODO

Acknowledgements

//TODO

Electronic supplementary material

//TODO

References

//TODO

commented

Follow up
From Malachi Griffith to Everyone: (08:11 AM)
VICC Paper Google Doc
https://docs.google.com/document/d/1_jA1B4G5YFo95rIwyDyc__Gx6uHfgU3FldMrF317fzU/edit
From Malachi Griffith to Everyone: (08:17 AM)
#69

commented

Comments from call:

  • chromosome name by source

  • Breakdown of disease names per disease ontology. Drug name per pubchem.

383 Diseases, 185 unique Disease Ontologies
1119 Drugs, 898 unique pubchem identifiers

  • Are there conflicts b/t knowledge bases, e.g. one categorizes as level A, another says level B, etc.?

  • Does Rodrigo D's presentation make sense to include in paper?

  • Phenotype & Environment - show examples of no-doid, no-pubchem

  • Composite biomarkers (ex. needed) (DT)

  • Protein coordinates vs Genome coordinates reverse mapping (DT)

  • Future work:

    • integrate with wiki data?
    • Allele registry vs COSMIC?
commented

Updated input to paper:

Done:

  • Chromosome name by source
  • Breakdown of disease names per disease ontology. Drug name per pubchem.
  • Phenotype & Environment - show examples of no-doid, no-pubchem
  • Additional knowledgebase molecularmatch clinical trials

In progress:

  • Phenotype harmonization: molecularmatch clinical trials
  • Allele registry