Technical input to VICC paper (draft notes)

Question

bwalsh opened this issue 7 years ago · comments

@jgoecks @mayfielg @ahwagner : See my notes below. I simply wanted to capture them here and if it's appropriate, move them over to google docs.

G2P - Knowledgebase integration

Abstract

Background

The g2p-aggregator is a bioinformatics tool designed to integrate evidence from disparate data sources and to support interpretation, prioritization and report generation. It is implemented by Oregon Health & Science University (OHSU) and integrates evidence from multiple knowledgebases, listed by source in the Results tables.

The g2p-aggregator provides an open source suite based on the GA4GH schemas, which can be efficiently interrogated through a search API to find sets of relevant evidence.

Methods

The g2p-aggregator integrates data from multiple knowledgebases and allows users to query a harmonized result set. The harmonization consists of structural and ontology mapping. Structural mapping transforms the input stream into a GA4GH genomic feature association. The content of the original data source is maintained for queries and is returned in result sets. Ontology mapping is targeted at variants, environment (drugs), phenotypes (diseases), and evidence metadata such as evidence strength and direction. The system then stores evidence in one or more of several possible stores [Elasticsearch, Kafka message queues, RDBMS or files]. Full text search, aggregations and a GA4GH beacon are provided via the Elasticsearch store. Integration with downstream systems is provided by the Kafka or file system store.

Our central theme was to provide a robust search facility, giving a focus to the conversation between the researcher and the aggregated evidence. Researchers should expect a rich search experience and should be able to make judgments about the applicability of the evidence set to their research question based on the quality of one or two sets of search results.

Results

The g2p-aggregator manages data from 9 knowledgebases, with a total of over 25K evidence and clinical trial items distributed over:

9 knowledge bases
513 Genes, 7221 unique Locations
383 Diseases, 185 unique Disease Ontologies
1119 Drugs, 898 unique pubchem identifiers
7905 unique publications

Conclusions

G2p-aggregator demonstrates how web-scale, open source architectures and components can be used to support translational research. The next steps of our project will extend its capabilities with new plug-ins devoted to bioinformatics data analysis, as well as a temporal query module. For researchers who need to investigate genomic events, g2p is a search tool that aggregates evidence from several knowledge bases. Unlike ad-hoc searches, it allows the researcher to focus on the evidence, not on the search. For informaticians who need to annotate genomic events, g2p is a search tool that provides a query API that any pipeline can use to gather evidence 'hits'. Unlike current practices, which have not focused on evidence, it allows the informatician to identify, filter and sort genomic events based on evidence.

Background

The GA4GH schema is intended to provide structures for unambiguous references to ontological concepts and/or controlled vocabularies defined by the GA4GH G2P schemas and the individual data sources. The system's harmonization process follows the intent expressed by the original G2P task team:

Where a G2P association is between the G(enotype) in the context of
some E(nvironment), which gives rise to a P(henotype). These
associations have further evidence, provenance, and attribution.
We leverage the GenomicFeature in the sequenceAnnotation schema here
as it can accommodate any genomic feature from a single nucleotide variation
(SNV), up through a gene, and/or complex rearrangements. Each can
be modeled as genomic features, and generally linked to a phenotype.
Collections of these features can represent a genotype at different levels
of completeness. Therefore, we can represent single allelic variation,
allelic complement, and multiple variants in a genotype that can each or
collectively be associated with a phenotype.
To enable standardized integration, this schema relies heavily on
OntologyTerms, for typing phenotype, genomic features, and levels
of evidence.

Methods

Harvester

A harvester is a Python module that implements the following duck-typed interface:

harvest: A fairly straightforward mechanism to use the knowledgebase's access method (api, file download, etc.) to retrieve the evidence items in their native format.
convert: Each harvester needs to map and harmonize the evidence presented to a GA4GH FeatureAssociation. This function is supported by several helper methods:
- A normalized vocabulary for evidence_level which harmonizes the source to AMP/ASCO/CAP guidelines.
- An alias and lookup service for phenotype that leverages a web service provided by EBI to look up Human Disease Ontology terms.
- A lookup service for environment that leverages a web service provided by Biothings to look up PubChem, ChEBI or ChEMBL identifiers as well as toxicity, taxonomy and approved countries.
- The COSMIC variant table to parse and harmonize variant location.
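
The interface above can be sketched as a module with two generator functions. This is a minimal sketch under stated assumptions, not the project's actual code: the source name `examplekb`, the sample records, the field names of the output dictionary and the `EVIDENCE_LEVELS` table are all hypothetical.

```python
"""Hypothetical harvester module exposing the duck-typed harvest/convert pair."""

# Stand-in for a knowledgebase's API or file download (invented records).
_NATIVE_RECORDS = [
    {"gene": "KIT", "drug": "imatinib", "disease": "GIST", "level": "guideline"},
    {"gene": "KIT", "drug": "imatinib", "disease": "GIST", "level": "unknown"},
]

# Hypothetical mapping from a source's own labels to normalized A-D levels.
EVIDENCE_LEVELS = {
    "guideline": "A",
    "clinical": "B",
    "case report": "C",
    "preclinical": "D",
}


def harvest(genes=None):
    """Yield evidence items in the source's native format."""
    for record in _NATIVE_RECORDS:
        if genes is None or record["gene"] in genes:
            yield record


def convert(record):
    """Yield one harmonized feature association per native record."""
    yield {
        "genes": [record["gene"]],
        "association": {
            "environment": [record["drug"]],
            "phenotype": record["disease"],
            # Labels that cannot be mapped fall back to "NA".
            "evidence_level": EVIDENCE_LEVELS.get(record["level"], "NA"),
        },
        "source": "examplekb",
        "raw": record,  # original content is kept so it can be returned in result sets
    }


def harvest_and_convert(genes=None):
    """Chain the two steps, as an aggregator loop might."""
    for record in harvest(genes):
        yield from convert(record)
```

Because the interface is duck-typed, an aggregator only needs the two function names to exist on whatever module it imports; no base class is required.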

Once deployed via standard Docker containers, the system extends the value of the underlying data by enabling queries via a GA4GH beacon or the Elasticsearch API. The resulting system has a minimal footprint, and is currently deployed on Amazon's free tier.

Use cases

Use cases are divided into three categories: discovery, exploration and integration.

GA4GH Beacon

The Beacon project tests the willingness of international sites to share genetic data in the simplest of all technical contexts.

Our implementation follows the beacon specification and returns meta information about the beacon and a simple evidence summary for a specific genomic location.
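
As an illustration of the kind of answer described above, the following sketch filters an in-memory evidence list by genomic position and summarizes what it finds. It is not the Beacon specification or the project's endpoint: the input keys (`chromosome`, `start`, `source`, `evidence_level`), the response layout and the toy coordinates are assumptions made for the example.

```python
# Toy evidence items with invented coordinates.
EVIDENCE = [
    {"chromosome": "4", "start": 100, "source": "civic", "evidence_level": "B"},
    {"chromosome": "4", "start": 100, "source": "civic", "evidence_level": "B"},
    {"chromosome": "4", "start": 100, "source": "oncokb", "evidence_level": "A"},
    {"chromosome": "7", "start": 200, "source": "jax", "evidence_level": "D"},
]


def beacon_query(evidence, reference_name, start):
    """Return a beacon-style yes/no answer plus a small evidence summary."""
    hits = [
        item
        for item in evidence
        if item["chromosome"] == reference_name and item["start"] == start
    ]
    # Count matching items per (source, evidence level).
    summary = {}
    for item in hits:
        key = (item["source"], item["evidence_level"])
        summary[key] = summary.get(key, 0) + 1
    return {
        "exists": bool(hits),
        "request": {"referenceName": reference_name, "start": start},
        "evidence_summary": [
            {"source": source, "evidence_level": level, "count": count}
            for (source, level), count in sorted(summary.items())
        ],
    }


answer = beacon_query(EVIDENCE, "4", 100)
```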

UI

Our current alpha UI allows the user to query using a 'google search' and then presents visualizations and the ability to drill down to the specific FeaturePhenotypeAssociation and associated evidence from the original source.

As a clinician or a genomics researcher, I may have a patient with gastrointestinal stromal tumor (GIST) and a proposed drug for treatment, imatinib. In order to identify whether the patient would respond well to treatment with the drug, I need a list of features (e.g. genes) which are associated with the sensitivity of GIST to imatinib. Suppose I am specifically interested in a gene, KIT, which is implicated in the pathogenesis of several cancer types. I could submit a query in the form GIST AND imatinib AND KIT.

In response, I will receive a list of associations involving GIST and KIT, which I can filter for instances where imatinib is mentioned. Additionally, the query could be extended by drilling down on the UI's widgets and/or continuing to add full text search terms.
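
The same search can be expressed as a request body for Elasticsearch's `query_string` query. The helper below is a sketch rather than the UI's actual code: the function names and the default page size are assumptions, and the commented call presumes a reachable cluster holding the `g2p` index.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are matched as phrases."""
    return '"{}"'.format(term) if " " in term else term


def build_query(terms, size=10):
    """Build a request body that requires every free-text term to match."""
    return {
        "size": size,
        "query": {
            "query_string": {
                "query": " AND ".join(quote(term) for term in terms),
            }
        },
    }


body = build_query(["GIST", "imatinib", "KIT"])
# With the elasticsearch-py client and a reachable cluster:
#   res = es.search(index="g2p", body=body)
```

Adding a further search term is then a matter of appending to the list, which mirrors how the query is refined interactively.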

API

The /analysis folder contains a python notebook that leverages the entire knowledgebase for comparison with the GENIE database of clinical outcomes. We have used the Elasticsearch DSL to abstract the low level APIs (which are still available for use) to provide "a more convenient and idiomatic way to write and manipulate queries".

An example that retrieves evidence items in a single request (up to the 10,000 requested by size) would be:

res = es.search(index="g2p", size=10000, body={"query": {"match_all": {}}})

Alternative APIs are available in most commonly used programming environments.
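
Since the knowledgebase holds more items than one request returns, reading every item needs paging. The sketch below uses the scroll API through the `search` and `scroll` methods of the elasticsearch-py client; `elasticsearch.helpers.scan` wraps the same pattern. The page size and keep-alive values are assumptions, and `FakeClient` is only an in-memory stand-in so the example runs without a cluster.

```python
def iter_all_hits(es, index="g2p", page_size=1000, keep_alive="2m"):
    """Yield every hit in `index`, one scroll page at a time."""
    page = es.search(
        index=index,
        scroll=keep_alive,
        size=page_size,
        body={"query": {"match_all": {}}},
    )
    # An empty page signals that the scroll is exhausted.
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        page = es.scroll(scroll_id=page["_scroll_id"], scroll=keep_alive)


class FakeClient:
    """In-memory stand-in for demonstration only (not a real client)."""

    def __init__(self, pages):
        self._pages = list(pages)

    def _next_page(self):
        hits = self._pages.pop(0) if self._pages else []
        return {"_scroll_id": "demo", "hits": {"hits": hits}}

    def search(self, **kwargs):
        return self._next_page()

    def scroll(self, **kwargs):
        return self._next_page()


demo = FakeClient([[{"_id": "1"}, {"_id": "2"}], [{"_id": "3"}]])
all_ids = [hit["_id"] for hit in iter_all_hits(demo)]
```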

Results

Harmonization

Evidence Level

Our first challenge was to align the diverse "strength of evidence" fields presented by different knowledgebases.

Detail: Evidence Label by source

source	evidence label	count
cgi	A	304
cgi	B	49
cgi	C	563
cgi	D	515
cgi	NA	2
civic	A	62
civic	B	1131
civic	C	1003
civic	D	980
civic	NA	5
jax	A	64
jax	B	54
jax	C	647
jax	D	2884
jax	NA	3
jax_trials	D	1131
jax_trials	NA	3
molecularmatch	A	298
molecularmatch	B	73
molecularmatch	C	150
molecularmatch	D	500
molecularmatch_trials	D	64379
molecularmatch_trials	NA	885
oncokb	A	114
oncokb	B	116
oncokb	C	69
oncokb	D	97
oncokb	NA	185
pmkb	A	414
pmkb	C	160
pmkb	D	35
pmkb	NA	609
sage	C	33
sage	D	36
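
Rows like those above can be pivoted per source to see how much of each knowledgebase ended up without a normalized label. This is a sketch that assumes the rows are tab-separated as shown and that `NA` marks items whose source label could not be mapped; the function names are made up for the example, and only the pmkb and sage rows are copied in.

```python
from collections import defaultdict

# A few rows copied from the table above (source, evidence label, count).
ROWS = """\
pmkb\tA\t414
pmkb\tC\t160
pmkb\tD\t35
pmkb\tNA\t609
sage\tC\t33
sage\tD\t36
"""


def pivot(tsv_text):
    """Turn tab-separated rows into {source: {label: count}}."""
    table = defaultdict(dict)
    for line in tsv_text.splitlines():
        source, label, count = line.split("\t")
        table[source][label] = int(count)
    return dict(table)


def unlabeled_share(table):
    """Fraction of each source's items that carry the NA label."""
    return {
        source: labels.get("NA", 0) / sum(labels.values())
        for source, labels in table.items()
    }


shares = unlabeled_share(pivot(ROWS))
```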

Phenotype & Environment

In order to enable cross knowledgebase queries, we needed a uniform Phenotype and Environment.

Detail: Exceptions to phenotype and environment harmonization

source	environment	count
molecularmatch_trials	Surgery	930
molecularmatch_trials	HSCT	745
molecularmatch_trials	Allotransplantation	349
molecularmatch_trials	Cytotoxic T Lymphocytes	138
molecularmatch_trials	RG7446	111
brca
oncokb	Debio1347	12
oncokb	AP32788	5
oncokb	BAY1436032	5
oncokb	BGB659	2
jax	N/A	39
jax	MRX-2843	11
jax	AZ8010	7
jax	TASIN-1	7
jax	BAY1187982	6
civic	AMGMDS3	8
civic	Chemotherapy	4
civic	Adjuvant Chemotherapy	2
civic	Adoptive T-cell Transfer	2
civic	Antiangiogenic Therapy	2
cgi	FGFR inhibitors	25
cgi	PARP inhibitors	21
cgi	MTOR inhibitors	17
cgi	PI3K pathway inhibitors	17
cgi	HDAC inhibitors	6
jax_trials	IDH305	2
jax_trials	INCB054828	2
jax_trials	SYM004	2
jax_trials	AC0010MA	1
jax_trials	ALRN-6924	1
molecularmatch	Sym004	5
molecularmatch	3
molecularmatch	RG7446	3
molecularmatch	ETC159	2
molecularmatch	MEDI6469	2
pmkb
sage	mTOR inhibitors	7

source	phenotype	count
molecularmatch_trials	Acute myeloid leukaemia, disease	1069
molecularmatch_trials	HIV - Human immunodeficiency virus infection	841
molecularmatch_trials	Myeloproliferative disorder	735
molecularmatch_trials	Chronic lymphoid leukaemia, disease	706
molecularmatch_trials	ALL - Acute lymphoblastic leukaemia	632
brca
oncokb	Soft Tissue Sarcoma	6
oncokb	CNS Cancer	2
oncokb	Embryonal Tumor	2
oncokb	Esophagogastric Cancer	2
oncokb	Esophageal/Stomach Cancer, NOS	1
jax	Indication other than cancer	1
civic	Desmoid Fibromatosis	9
civic	T-cell Acute Lymphoblastic Leukemia	8
civic	Epithelial Ovarian Cancer	5
civic	Hepatocellular Fibrolamellar Carcinoma	3
civic	Anaplastic Oligodendroglioma	2
cgi	Renal	17
cgi	Bladder BLCA	10
cgi	Head an neck	9
cgi	Head an neck squamous	8
cgi	Myelodisplasic proliferative syndrome	8
jax_trials
molecularmatch	Metastasis from malignant melanoma of skin	2
pmkb	MDS with Ring Sideroblasts	10
pmkb	Glial Neoplasm	9
pmkb	Histiocytic and Dendritic Cell Neoplasms	7
pmkb	Langerhans Cell Histiocytosis	7
pmkb	Other Tumor Type	5
sage	mesothelioma	2
sage	head and neck cancer	1
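
Exception tables like the two above can be produced by counting the names a lookup fails to resolve. The sketch below replaces the EBI and Biothings web services with a small local dictionary; the identifiers are placeholders rather than real ontology or PubChem IDs, and all function names are hypothetical.

```python
from collections import Counter

# Placeholder identifiers standing in for a disease-ontology or PubChem lookup.
KNOWN_TERMS = {
    "gastrointestinal stromal tumor": "EXAMPLE:0001",
    "imatinib": "EXAMPLE:0002",
}

# Alias table applied before the lookup.
ALIASES = {"gist": "gastrointestinal stromal tumor"}


def harmonize(name, lookup=KNOWN_TERMS.get):
    """Return an identifier for `name`, or None when it cannot be resolved."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    return lookup(key)


def collect_exceptions(records):
    """Count (source, name) pairs that the lookup could not harmonize."""
    misses = Counter()
    for source, name in records:
        if harmonize(name) is None:
            misses[(source, name)] += 1
    return misses


records = [
    ("cgi", "PARP inhibitors"),
    ("cgi", "PARP inhibitors"),
    ("civic", "Imatinib"),
    ("civic", "GIST"),
]
exceptions = collect_exceptions(records)
```

Drug classes such as "PARP inhibitors" fail here for the same reason they appear in the table: they name a group of compounds rather than a single identifiable one.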

Genotype

Breakdown of normalized variants/biomarkers by source.

source	location status	count
brca	genomic location	5733
brca	no genomic location	0
cgi	genomic location	589
cgi	no genomic location	842
civic	genomic location	2865
civic	no genomic location	311
jax	genomic location	3009
jax	no genomic location	640
jax_trials	genomic location	1051
jax_trials	no genomic location	80
molecularmatch	genomic location	804
molecularmatch	no genomic location	217
molecularmatch_trials	genomic location	0
molecularmatch_trials	no genomic location	64379
oncokb	genomic location	1811
oncokb	no genomic location	2338
pmkb	genomic location	609
pmkb	no genomic location	0
sage	genomic location	0
sage	no genomic location	69
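
A breakdown of this shape can be derived by checking whether each harmonized item carries coordinates. The sketch assumes a hypothetical item layout with an optional `feature` dictionary holding `chromosome` and `start`; the real document structure may differ.

```python
from collections import Counter


def location_status(feature):
    """Classify a feature by whether both chromosome and start are present."""
    has_location = (
        feature.get("chromosome") is not None and feature.get("start") is not None
    )
    return "genomic location" if has_location else "no genomic location"


def breakdown(items):
    """Count items per (source, location status)."""
    return Counter(
        (item["source"], location_status(item.get("feature", {}))) for item in items
    )


# Toy items with invented coordinates.
items = [
    {"source": "civic", "feature": {"chromosome": "4", "start": 100}},
    {"source": "civic", "feature": {"chromosome": "4"}},
    {"source": "sage"},
]
counts = breakdown(items)
```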

Analysis

GENIE Analysis: Variant Level

The AACR Project Genomics Evidence Neoplasia Information Exchange (GENIE) provides clinical targeted sequencing panel data from 8 different cancer centers.

G2P Knowledge Base coverage of GENIE at the variant level:

Total coverage of non-unique variants is 28%
Adding more databases increases total coverage
Different databases contribute different types of evidence:
- OncoKB and MolecularMatch contribute guideline recommendations
- CIViC and JAX contribute substantial preclinical evidence
10% of variants are associated with A-level evidence and are highly actionable
22% of variants are associated with B-level evidence and are moderately actionable

GENIE Analysis: Donor Level

G2P coverage of GENIE donors is encouraging:

42% of donors have 1+ actionable variant
48% of donors with non-unique variant(s) also have 1+ actionable variant
Adding more databases increases total coverage
25% of donors have a variant with A-level evidence, 42% of donors have a variant with B-level evidence
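
Both coverage figures reduce to set membership tests once variants are keyed consistently. The sketch below is not the notebook's code: the variant keys, donor IDs and function names are invented, and "actionable" is taken to mean "present in the aggregated evidence".

```python
def variant_coverage(donor_variants, actionable):
    """Fraction of all (non-unique) variant observations found in the evidence."""
    observed = [v for variants in donor_variants.values() for v in variants]
    if not observed:
        return 0.0
    return sum(1 for v in observed if v in actionable) / len(observed)


def donor_coverage(donor_variants, actionable):
    """Fraction of donors with at least one variant found in the evidence."""
    if not donor_variants:
        return 0.0
    covered = sum(
        1 for variants in donor_variants.values() if set(variants) & actionable
    )
    return covered / len(donor_variants)


# Toy data: donor id -> list of variant keys (gene, change).
donors = {
    "d1": [("KIT", "v1"), ("TP53", "v2")],
    "d2": [("KIT", "v1")],
    "d3": [("TP53", "v2")],
    "d4": [],
}
evidence_variants = {("KIT", "v1")}

per_variant = variant_coverage(donors, evidence_variants)
per_donor = donor_coverage(donors, evidence_variants)
```

Donor-level coverage is higher than or equal to variant-level coverage whenever covered variants are spread across donors, which is one reason the two sections report different percentages.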

Discussion

//TODO

Conclusions

//TODO

List of abbreviations used

//TODO

Declarations

//TODO

Acknowledgements

//TODO

Electronic supplementary material

//TODO

References

//TODO

Brian · Answer 1 · Tue Oct 31 2017 23:26:54 GMT+0800 (China Standard Time)

Follow up
From Malachi Griffith to Everyone: (08:11 AM)
VICC Paper Google Doc
https://docs.google.com/document/d/1_jA1B4G5YFo95rIwyDyc__Gx6uHfgU3FldMrF317fzU/edit
From Malachi Griffith to Everyone: (08:17 AM)
#69

Brian · Answer 2 · Thu Nov 02 2017 04:36:53 GMT+0800 (China Standard Time)

Comments from call:

chromosome name by source
Breakdown of disease names per disease ontology. Drug name per pubchem.

383 Diseases, 185 unique Disease Ontologies
1119 Drugs, 898 unique pubchem identifiers

Are there conflicts between knowledge bases, e.g. one categorizes an item as level A and another as level B?
Does Rodrigo D's presentation make sense to include in paper?
Phenotype & Environment - show examples of no-doid, no-pubchem
Composite biomarkers (ex. needed) (DT)
Protein coordinates vs Genome coordinates reverse mapping (DT)
Future work:
- integrate with Wikidata?
- Allele Registry vs COSMIC?

Brian · Answer 3 · Wed Nov 15 2017 11:04:14 GMT+0800 (China Standard Time)

Updated input to paper:

Done:

Chromosome name by source
Breakdown of disease names per disease ontology. Drug name per pubchem.
Phenotype & Environment - show examples of no-doid, no-pubchem
Additional knowledgebase molecularmatch clinical trials

In progress:

Phenotype harmonization: molecularmatch clinical trials
Allele registry