jax trials feature harmonization
bwalsh opened this issue · comments
Following up on our conversation re. jax trials.
Observations:
- It is unusual that all jax trials have harmonized features
- For example, all features on these evidence items have a genomic location
I've checked in a test to illustrate the problem: harvester/tests/integration/test_jax_trials_features.py
I've validated that the MM variant endpoint does not return a false hit.
There is a chain of potential challenges and fixes:
- _parse_profile() is returning an empty mut_index
- CosmicLookup.get_entries does not check the hgvs_p parameter
- _convert always writes the first match
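The last point can be illustrated with a small sketch. The real `_convert` is not shown in this thread, so `pick_match` and the shape of `matches` are hypothetical names used only for illustration; the idea is simply to refuse to guess when the lookup is ambiguous.

```python
def pick_match(matches):
    """Return the single unambiguous match, or None.

    Hypothetical helper: `matches` stands in for whatever list the
    lookup returns. Taking matches[0] blindly would hide ambiguity.
    """
    if len(matches) == 1:
        return matches[0]
    # Zero or several candidates: do not pick one arbitrarily.
    return None
```

With a check like this, a profile that matches hundreds of rows would yield no feature instead of the first row in the table.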
To run:
$ pytest -s tests/integration/test_jax_trials_features.py::test_profiles
profile >MET alterations< gene_index[i] >MET< mut_index[i] >< matches 418
profile >MET positive< gene_index[i] >MET< mut_index[i] >< matches 418
profile >MET amp< gene_index[i] >MET< mut_index[i] >< matches 418
The simplest approach is to bypass any normalization attempts if _parse_profile does not return any mutations. The key issue is that MM's API appears to return all mutations in a gene when given something like "MET amp", which is confusing and problematic in our case.
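A minimal sketch of that bypass, under stated assumptions: `parse_profile` is a toy stand-in for `_parse_profile`, the regular expression for what counts as a mutation is invented for this example, and `lookup` is any callable taking a gene and a mutation. None of these reflect the harvester's actual code.

```python
import re

# Assumption: a "mutation" looks like a protein change, e.g. V600E.
PROTEIN_CHANGE = re.compile(r"^[A-Z]\d+[A-Z*]$")


def parse_profile(profile, known_genes):
    """Toy stand-in for _parse_profile; returns (genes, mutations)."""
    genes, mutations = [], []
    tokens = profile.split()
    for i, token in enumerate(tokens):
        if token in known_genes:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            genes.append(token)
            # Words such as "amp" or "positive" leave the slot empty.
            mutations.append(nxt if PROTEIN_CHANGE.match(nxt) else "")
    return genes, mutations


def normalize(profile, known_genes, lookup):
    """Look up features only for genes that carry a parsed mutation."""
    genes, mutations = parse_profile(profile, known_genes)
    features = []
    for gene, mutation in zip(genes, mutations):
        if not mutation:
            # Nothing to match on: skip rather than query by gene alone.
            continue
        features.extend(lookup(gene, mutation))
    return features
```

Under these assumptions, `normalize("MET amp", {"MET"}, lookup)` never calls `lookup`, while `normalize("BRAF V600E", {"BRAF"}, lookup)` calls it once with `("BRAF", "V600E")`.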
Quick clarification: MM does properly return no mutations.
I've validated that the MM variant endpoint does not return a false hit.
I've made the following change
+++ b/harvester/cosmic_lookup_table.py
@@ -39,6 +39,9 @@ class CosmicLookup(object):
             # return null
             logging.warning('get_entries gene: %s, hgvs_p: %s', gene, hgvs_p)
             return []
+        # ensure caller passed a hgvs_p
+        if not hgvs_p or len(hgvs_p) == 0:
+            return []
         # Get lookup table.
         if gene in self.gene_df_cache:
             # Found gene-filtered lookup table in cache.
With this change, feature_normalization dropped from 85% to 43% for jax and from 93% to 11% for jax_trials.
Ah, so the issue was that the COSMIC lookup table was returning all mutations, not MM. Your change looks reasonable. Thanks!
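To make the failure mode concrete, here is a toy model of a gene-filtered lookup. The body of the real `CosmicLookup.get_entries` is not shown in this thread, so the class name, the row layout, the sample rows, and the `require_hgvs_p` switch are all assumptions made for illustration.

```python
class ToyCosmicLookup:
    """Minimal stand-in for CosmicLookup (hypothetical data and layout)."""

    def __init__(self, rows):
        # rows: list of dicts with 'gene' and 'hgvs_p' keys.
        self.rows = rows

    def get_entries(self, gene, hgvs_p, require_hgvs_p=True):
        if require_hgvs_p and not hgvs_p:
            # Same intent as the guard in the patch: no protein change
            # was supplied, so there is nothing to match.
            return []
        matches = [r for r in self.rows if r["gene"] == gene]
        if hgvs_p:
            matches = [r for r in matches if r["hgvs_p"] == hgvs_p]
        return matches


# Illustrative rows only; not taken from the real lookup table.
table = ToyCosmicLookup([
    {"gene": "MET", "hgvs_p": "p.T1010I"},
    {"gene": "MET", "hgvs_p": "p.Y1253D"},
    {"gene": "BRAF", "hgvs_p": "p.V600E"},
])

unguarded = table.get_entries("MET", "", require_hgvs_p=False)
guarded = table.get_entries("MET", "")
exact = table.get_entries("MET", "p.Y1253D")
```

In this sketch `unguarded` holds every MET row, `guarded` is empty, and `exact` holds the single matching row. Note that in Python `not hgvs_p` is already true for both `None` and the empty string, so the extra `len(hgvs_p) == 0` test in the patch is harmless but redundant.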
@bwalsh When I run this test I see:
profile >MET alterations< gene_index[i] >MET< mut_index[i] >< matches 0
profile >MET positive< gene_index[i] >MET< mut_index[i] >< matches 0
profile >MET amp< gene_index[i] >MET< mut_index[i] >< matches 0
And then the test fails with
E AssertionError: assert 3 == 0
E + where 3 = len([{'biomarker_type': 'mutant', 'geneSymbol': 'MET', 'name': 'MET '}, {'biomarker_type': 'polymorphism', 'geneSymbol': 'MET', 'name': 'MET positive'}, {'biomarker_type': 'polymorphism', 'geneSymbol': 'MET', 'name': 'MET amp'}])
Based on the way this test is written, I think it really should be 3. Why did we have 0 before?
thanks, will check tomorrow