biomarker type mapping
bwalsh opened this issue
re. biomarker_type
The biomarker_type list is legitimate as it comes directly from MM. They report a set of biomarker types per variant.
The lack of a CNA type seems to be purposeful, per this comment in the code:
# Note that 'CNA' is currently left out as it is relevant only in the case
# of CGI and directed in for loop below.
However, there may be a bug in the lookup logic.
I've checked in a new test, harvester/tests/integration/test_mutation_type.py, which currently fails because 'Copy Number Variant' is normalized to 'polymorphism':
def test_molecular_match_CNA():
    print '"{}"={}'.format('Copy Number Variant',
                           norm_biomarker('Copy Number Variant'))
>   assert 'copy number variant' == norm_biomarker('Copy Number Variant')
E   assert 'copy number variant' == 'polymorphism'
E     - copy number variant
E     + polymorphism
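The real `norm_biomarker` is not shown in this thread, so the cause of the fallthrough to 'polymorphism' is unknown. As a purely hypothetical sketch of the behavior the test expects, an exact, case-insensitive lookup that returns a sentinel for unknown inputs (rather than some other biomarker type) might look like the following. `BIOMARKER_ALIASES` and `norm_biomarker_sketch` are illustrative names, not the harvester's API, and the 'cna' alias is an assumption:

```python
# Hypothetical alias table; the real harvester mapping is not shown here.
BIOMARKER_ALIASES = {
    "copy number variant": "copy number variant",
    "cna": "copy number variant",  # assumed alias, for illustration only
}


def norm_biomarker_sketch(raw, default="unspecified"):
    """Normalize a source biomarker label by exact, case-insensitive match.

    Unknown labels map to `default` instead of silently becoming
    another biomarker type.
    """
    key = raw.strip().lower()
    return BIOMARKER_ALIASES.get(key, default)
```

With a lookup of this shape, an unmapped label would surface as 'unspecified' in the counts instead of inflating an unrelated category.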
biomarker_type
Compared to fig. D in https://www.biorxiv.org/content/biorxiv/early/2017/06/21/140475.full.pdf:
- We have many biomarker types; they have three: cna, fusion, and mutation.
- Which of ours is the equivalent of CNA?
features.biomarker_type.keyword: Descending | Count |
---|---|
mutant | 8894 |
snp | 4931 |
polymorphism | 1404 |
nonsense | 1160 |
frameshift | 1040 |
unspecified | 293 |
splice | 213 |
nonsense,insertion | 171 |
missense | 146 |
synonymous | 141 |
nonsense,deletion | 122 |
synonymous,snp | 114 |
deletion | 75 |
intervening sequence | 60 |
polymorphism,snp | 30 |
nonsense,splice | 22 |
insertion | 20 |
gain of function | 16 |
polymorphism,nonsense,insertion | 12 |
loss of heterozygosity | 8 |
polymorphism,nonsense,deletion | 8 |
snp,nonsense,insertion | 6 |
fusion,snp | 5 |
snp,polymorphism,nonsense,deletion | 5 |
indel | 4 |
splice,deletion | 4 |
splice,polymorphism,nonsense,insertion | 4 |
startloss | 3 |
5'UTR | 2 |
gain of function,snp | 2 |
insertion,snp | 2 |
polymorphism,splice | 2 |
snp,polymorphism,nonsense,insertion | 2 |
splice,insertion | 2 |
3'UTR | 1 |
nonsense,snp | 1 |
polymorphism,nonsense | 1 |
snp,nonsense,splice | 1 |
splice,deletion,nonsense | 1 |
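Many of the rows above are comma-joined combinations, which makes it hard to see how often each individual term occurs. A minimal sketch of tallying per-term counts, using five rows copied from the table (the `ROWS` list and the `tally_terms` helper are illustrative names, not part of the harvester):

```python
from collections import Counter

# A few (biomarker_type, count) rows copied from the table above.
ROWS = [
    ("snp", 4931),
    ("nonsense", 1160),
    ("nonsense,insertion", 171),
    ("insertion", 20),
    ("insertion,snp", 2),
]


def tally_terms(rows):
    """Split comma-joined biomarker types and sum counts per single term."""
    totals = Counter()
    for combined, count in rows:
        for term in combined.split(","):
            totals[term.strip()] += count
    return totals
```

For these five rows, `tally_terms(ROWS)["insertion"]` is 193 (171 + 20 + 2). Note that a combined row contributes to every term it contains, so the per-term totals add up to more than the number of records.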
@bwalsh Can you point me to this list on MM? I believe we can simplify this considerably; we could go down to 3 types (cnas, snps, and fusions), but ideally I'd like to see type and subtype attributes.
@jgoecks List of all known source biomarkers with current mapping.
I've also included a suggestion that we look at the Sequence Ontology (SO), https://bioportal.bioontology.org/ontologies/SO, which is already in use by CIViC. This would allow us to harmonize on a known standard, reduce errors, and reduce maintenance.
I've included a simple parent ontology. This still produces a significant number of categories (44). Further reduction might be accomplished by selecting a more distant ancestor.
For example,
We currently harmonize the 'Insertion' reported by molecularmatch as a 'nonsense'. The SO match is 'insertion' and its parent is 'sequence_alteration'.
source | source_biomarkers | existing | proposed | ontology_id | match_type | parent | parent_ontology_id |
---|---|---|---|---|---|---|---|
civic | Insertion | nonsense | insertion | SO:0000667 | exact | sequence_alteration | SO:0001059 |
We currently harmonize the 'Chromosome Arm' reported by molecularmatch as a 'splice'. The SO match is 'chromosome_arm' and its parent is 'chromosome_part'.
source | source_biomarkers | existing | proposed | ontology_id | match_type | parent | parent_ontology_id |
---|---|---|---|---|---|---|---|
molecularmatch | Chromosome Arm | splice | chromosome_arm | SO:0000105 | exact | chromosome_part | SO:0000830 |
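As a rough sketch of how such a mapping could be held in code, the two rows shown above can be keyed by (source, source biomarker). The `BiomarkerMapping` tuple, the `SO_MAPPINGS` dictionary, and the `harmonize` function are hypothetical names for illustration; only the row values come from the tables:

```python
from collections import namedtuple

# One row of the proposed mapping spreadsheet.
BiomarkerMapping = namedtuple(
    "BiomarkerMapping",
    "existing proposed ontology_id match_type parent parent_ontology_id",
)

# The two example rows from the tables above, keyed by (source, biomarker).
SO_MAPPINGS = {
    ("civic", "Insertion"): BiomarkerMapping(
        "nonsense", "insertion", "SO:0000667", "exact",
        "sequence_alteration", "SO:0001059",
    ),
    ("molecularmatch", "Chromosome Arm"): BiomarkerMapping(
        "splice", "chromosome_arm", "SO:0000105", "exact",
        "chromosome_part", "SO:0000830",
    ),
}


def harmonize(source, source_biomarker):
    """Return the mapping row for a source biomarker, or None if unmapped."""
    return SO_MAPPINGS.get((source.lower(), source_biomarker))
```

Returning `None` for unmapped pairs keeps missing entries visible to the caller instead of assigning them an arbitrary type.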
Regarding simplifying this list.
- We can tell from the spreadsheet above that consolidating by biomarker.parent as a 'type' will not reduce the number of terms enough.
- The images below show the ontology ancestors of a few randomly selected biomarkers.
  You can browse them here.
- One alternative for consolidation is to pick the Nth earliest ancestor (in this case 2). This would produce a consolidation on:
ancestor | ontology_id |
---|---|
region | SO:0000001 |
sequence_feature | SO:0000110 |
feature_attribute | SO:0000733 |
sequence_variant | SO:0001060 |
variant_collection | SO:0001507 |
functional_variant | SO:0001536 |
structural_variant | SO:0001537 |
transcript_variant | SO:0001576 |
variant_quality | SO:0001761 |
sequence_comparison | SO:0002072 |
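The "Nth earliest ancestor" idea can be sketched as walking from a term up to its root and then counting down from the root. This is a minimal sketch under several assumptions: the root counts as position 1, each term has a single parent, and terms with a shorter lineage map to themselves. Only the first two edges in `PARENT` come from the tables in this thread; the remaining edges are assumed for illustration and should be checked against SO before use. The `lineage` and `nth_earliest_ancestor` functions are hypothetical names:

```python
# Child -> parent edges. The first two come from the tables in this thread;
# the others are assumptions for illustration, not verified against SO.
PARENT = {
    "insertion": "sequence_alteration",
    "chromosome_arm": "chromosome_part",
    "sequence_alteration": "sequence_feature",  # assumed
    "chromosome_part": "biological_region",     # assumed
    "biological_region": "region",              # assumed
    "region": "sequence_feature",               # assumed
}


def lineage(term, parent=PARENT):
    """Return the path [root, ..., term] by following parent edges."""
    path = [term]
    seen = {term}
    while path[-1] in parent:
        nxt = parent[path[-1]]
        if nxt in seen:
            raise ValueError("cycle detected at %r" % nxt)
        seen.add(nxt)
        path.append(nxt)
    path.reverse()
    return path


def nth_earliest_ancestor(term, n=2, parent=PARENT):
    """Pick the nth term counting from the root (root is n=1).

    If the lineage is shorter than n, the term itself is returned.
    """
    path = lineage(term, parent)
    return path[min(n, len(path)) - 1]
```

Under these assumed edges, 'chromosome_arm' consolidates to 'region' and 'insertion' to 'sequence_alteration' for n=2. SO terms can have more than one parent, so a real implementation would need a rule for choosing among multiple lineages.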