ohsu-comp-bio / g2p-aggregator

Associations of genomic features, drugs and diseases

biomarker type mapping

bwalsh opened this issue · comments

commented

re. biomarker_type

The biomarker_type list is legitimate: it comes directly from MolecularMatch (MM), which reports a set of biomarker types per variant.

The lack of a CNA type seems to be purposeful:

    # Note that 'CNA' is currently left out as it is relevant only in the case
    # of CGI and directed in for loop below.

However, there may be a bug in the lookup logic.

I’ve checked in a new test, harvester/tests/integration/test_mutation_type.py, which currently fails:

    def test_molecular_match_CNA():
        print '"{}"={}'.format('Copy Number Variant',
                               norm_biomarker('Copy Number Variant'))
>       assert 'copy number variant' == norm_biomarker('Copy Number Variant')
E       assert 'copy number variant' == 'polymorphism'
E         - copy number variant
E         + polymorphism
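The failure above shows the lookup returning a different category for a term that should normalize to itself. As a minimal sketch, assuming a dictionary-based lookup (the real `norm_biomarker` in the harvester may be structured quite differently), the hypothetical `norm_biomarker_sketch` below matches exactly and case-insensitively, and passes unmapped terms through lowercased rather than falling back to an unrelated value. The `ALIASES` entries are illustrative assumptions, not the project's mapping.

```python
# Hypothetical alias table; these entries are illustrative assumptions,
# not the mapping used by the harvester.
ALIASES = {
    'cnv': 'copy number variant',
    'copy number variation': 'copy number variant',
}


def norm_biomarker_sketch(term):
    """Normalize a source biomarker term by exact, case-insensitive lookup."""
    key = term.strip().lower()
    # Unmapped terms pass through lowercased instead of defaulting to
    # some other category.
    return ALIASES.get(key, key)
```

With this shape, `norm_biomarker_sketch('Copy Number Variant')` returns `'copy number variant'`, which is what the test above expects from the real function.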

biomarker_type compared to Fig. D in https://www.biorxiv.org/content/biorxiv/early/2017/06/21/140475.full.pdf

  • We have many biomarker types; they have only three: CNA, fusion, and mutation.
  • Which is our equivalent to CNA?
features.biomarker_type.keyword: Descending Count
mutant 8894
snp 4931
polymorphism 1404
nonsense 1160
frameshift 1040
unspecified 293
splice 213
nonsense,insertion 171
missense 146
synonymous 141
nonsense,deletion 122
synonymous,snp 114
deletion 75
intervening sequence 60
polymorphism,snp 30
nonsense,splice 22
insertion 20
gain of function 16
polymorphism,nonsense,insertion 12
loss of heterozygosity 8
polymorphism,nonsense,deletion 8
snp,nonsense,insertion 6
fusion,snp 5
snp,polymorphism,nonsense,deletion 5
indel 4
splice,deletion 4
splice,polymorphism,nonsense,insertion 4
startloss 3
5'UTR 2
gain of function,snp 2
insertion,snp 2
polymorphism,splice 2
snp,polymorphism,nonsense,insertion 2
splice,insertion 2
3'UTR 1
nonsense,snp 1
polymorphism,nonsense 1
snp,nonsense,splice 1
splice,deletion,nonsense 1
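Many of the keys in this listing are comma-joined combinations, which makes it hard to see how often each individual term occurs. A minimal sketch of one way to split them and total the counts per term follows; the rows are a small subset copied from the listing, and the helper name `count_by_term` is hypothetical.

```python
from collections import Counter

# A few rows copied from the listing above: combined key -> count.
COUNTS = {
    'snp': 4931,
    'polymorphism': 1404,
    'synonymous,snp': 114,
    'polymorphism,snp': 30,
    'fusion,snp': 5,
}


def count_by_term(counts):
    """Sum counts per individual term, splitting comma-joined keys."""
    totals = Counter()
    for key, n in counts.items():
        # set() guards against a term repeated within a single key.
        for term in set(key.split(',')):
            totals[term] += n
    return totals


totals = count_by_term(COUNTS)
```

For this subset, `totals['snp']` is 5080 and `totals['polymorphism']` is 1434. A row such as `polymorphism,snp` contributes to both terms, so the per-term totals add up to more than the number of features.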

@bwalsh Can you point me to this list on MM? I believe we can simplify this considerably. We could go down to three types (CNAs, SNPs, and fusions), but ideally I'd like to see both type and subtype attributes.

commented

@jgoecks Here is a list of all known source biomarkers with their current mapping.

I've also included a suggestion that we look at the Sequence Ontology (SO), https://bioportal.bioontology.org/ontologies/SO, which is already in use by CIViC. This would let us harmonize on a known standard, reduce errors, and reduce maintenance.

I've included a simple parent ontology. This still produces a large number of categories (44). Further reduction might be achieved by selecting a more distant ancestor.

biomarker_ontologies.xlsx

For example,

We currently harmonize the 'Insertion' reported by MolecularMatch as 'nonsense'. The SO match is 'insertion' and its parent is 'sequence_alteration'.

source | source_biomarkers | existing | proposed | ontology_id | match_type | parent | parent_ontology_id
civic | Insertion | nonsense | insertion | SO:0000667 | exact | sequence_alteration | SO:0001059

We currently harmonize the 'Chromosome Arm' reported by MolecularMatch as 'splice'. The SO match is 'chromosome_arm' and its parent is 'chromosome_part'.

source | source_biomarkers | existing | proposed | ontology_id | match_type | parent | parent_ontology_id
molecularmatch | Chromosome Arm | splice | chromosome_arm | SO:0000105 | exact | chromosome_part | SO:0000830
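The two example rows can be loaded into a small lookup keyed on source and source biomarker, so that the proposed SO term and its parent are available alongside the existing mapping. This is only a sketch: the field names follow the spreadsheet header, while `Mapping`, `LOOKUP`, and `lookup` are hypothetical names that do not come from the harvester.

```python
from collections import namedtuple

# Field names follow the spreadsheet header shown above.
Mapping = namedtuple('Mapping', [
    'source', 'source_biomarker', 'existing', 'proposed',
    'ontology_id', 'match_type', 'parent', 'parent_ontology_id',
])

# The two example rows from this comment.
ROWS = [
    Mapping('civic', 'Insertion', 'nonsense', 'insertion',
            'SO:0000667', 'exact', 'sequence_alteration', 'SO:0001059'),
    Mapping('molecularmatch', 'Chromosome Arm', 'splice', 'chromosome_arm',
            'SO:0000105', 'exact', 'chromosome_part', 'SO:0000830'),
]

# Key on (source, lowercased biomarker) so the lookup ignores case.
LOOKUP = {(r.source, r.source_biomarker.lower()): r for r in ROWS}


def lookup(source, biomarker):
    """Return the Mapping row for a source biomarker, or None if unmapped."""
    return LOOKUP.get((source, biomarker.strip().lower()))
```

Returning `None` for an unmapped term, rather than a default category, keeps gaps in the mapping visible to the caller.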

commented

@jgoecks

Regarding simplifying this list.

  • We can tell from the spreadsheet above that consolidating by biomarker.parent as a 'type' will not reduce the number of terms enough.
  • The images below show the ontology ancestors of a few randomly selected biomarkers.
    You can browse them here
  • One alternative for consolidation is to pick the Nth earliest ancestor (in this case 2). This would produce a consolidation on:
ancestor ontology_id
region SO:0000001
sequence_feature SO:0000110
feature_attribute SO:0000733
sequence_variant SO:0001060
variant_collection SO:0001507
functional_variant SO:0001536
structural_variant SO:0001537
transcript_variant SO:0001576
variant_quality SO:0001761
sequence_comparison SO:0002072
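The "Nth earliest ancestor" idea can be sketched as a walk from a term up to its root, followed by picking the Nth entry counted from the root. This is a minimal sketch under stated assumptions: counting is 1-based from the root, each term has a single parent, and only the first two edges in `PARENT` come from this thread. The remaining edges are assumptions for illustration and should be checked against the SO release in use. The function names are hypothetical.

```python
# child -> parent edges. The first two come from the examples in this thread;
# the remaining edges are ASSUMED for illustration (intermediate SO terms may
# be missing) and should be verified against the ontology.
PARENT = {
    'insertion': 'sequence_alteration',          # from the thread
    'chromosome_arm': 'chromosome_part',         # from the thread
    'sequence_alteration': 'sequence_feature',   # assumed
    'chromosome_part': 'region',                 # assumed
    'region': 'sequence_feature',                # assumed
}


def path_from_root(term, parent=PARENT):
    """Return the list of terms from the root down to `term`."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def nth_earliest_ancestor(term, n, parent=PARENT):
    """Return the nth term on the root-to-term path (1 = root).

    If the path is shorter than n, the term itself is returned.
    """
    path = path_from_root(term, parent)
    return path[min(n, len(path)) - 1]
```

With these assumed edges, `nth_earliest_ancestor('chromosome_arm', 2)` returns `'region'`. Terms whose path is shorter than N keep their own name, which would explain a consolidated list that mixes terms from different depths.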

Sample ontology ancestors: [four images, not preserved]