Rostlab / nala

Text mining of natural language mutation mentions

Home Page: https://www.tagtog.net/-corpora/IDP4+

Improve NL Definer

juanmirocks opened this issue

Stats of changes (ST = standard, SS = semi-standard, NL = natural language mentions):

  • nala_training: 144 total changes: 69 from ST to something else (NL or SS), 75 from SS to NL
  • IDP4: 52 total changes: 23 from ST to something else (NL or SS), 29 from SS to NL

Stats before:

Corpus   #docs  #ann  #ST   %ST    #NL  %NL    #SS  %SS    #NL+SS  %NL+SS  #tokens
nala_tr  501    1769  1010  0.571  595  0.336  164  0.093  759     0.429   147975
IDP4     159    3337  3137  0.940  147  0.044  53   0.016  200     0.060   442515

Final stats after:

Corpus   #docs  #ann  #ST   %ST    #NL  %NL    #SS  %SS    #NL+SS  %NL+SS  #tokens
nala_tr  501    1769  941   0.532  691  0.391  137  0.077  828     0.468   -1
IDP4     159    3337  3114  0.933  192  0.058  31   0.009  223     0.067   -1
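
As a sanity check, the before and after counts can be reconciled with the change totals listed at the top of this issue. The sketch below is only illustrative: the Counts class and the reconcile function are hypothetical names, not part of nala, and all numbers are copied from the stats in this issue. It splits the "from ST" changes into ST -> NL and ST -> SS by solving from the class counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Annotation counts for one corpus (hypothetical helper, not nala code)."""
    ann: int  # total annotations (#ann)
    st: int   # #ST
    nl: int   # #NL
    ss: int   # #SS


def reconcile(before, after, total_changes, from_st, ss_to_nl):
    """Derive the ST->NL / ST->SS split and check that the totals agree."""
    # Every change either starts from ST or is an SS -> NL change.
    assert total_changes == from_st + ss_to_nl
    # Reclassification moves annotations between classes; it adds or removes none.
    assert before.ann == after.ann == after.st + after.nl + after.ss
    # ST only loses annotations.
    assert after.st == before.st - from_st
    # NL gains come from SS -> NL plus ST -> NL.
    st_to_nl = after.nl - before.nl - ss_to_nl
    # SS loses ss_to_nl annotations and gains the ST -> SS ones.
    st_to_ss = after.ss - before.ss + ss_to_nl
    assert st_to_nl + st_to_ss == from_st
    return {"ST->NL": st_to_nl, "ST->SS": st_to_ss, "SS->NL": ss_to_nl}


nala_split = reconcile(
    Counts(1769, 1010, 595, 164), Counts(1769, 941, 691, 137),
    total_changes=144, from_st=69, ss_to_nl=75,
)
idp4_split = reconcile(
    Counts(3337, 3137, 147, 53), Counts(3337, 3114, 192, 31),
    total_changes=52, from_st=23, ss_to_nl=29,
)
print(nala_split)  # {'ST->NL': 21, 'ST->SS': 48, 'SS->NL': 75}
print(idp4_split)  # {'ST->NL': 16, 'ST->SS': 7, 'SS->NL': 29}
```

The derived splits (21 + 48 = 69 for nala_tr, 16 + 7 = 23 for IDP4) match the "from ST" totals reported above.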

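The modified algorithm will produce the following changes. Each change line ends with its class transition; the closing "***" line gives the total number of changes followed by the number of changes from ST: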

Corpus  #docs   #ann    #ST %ST #NL %NL #SS %SS #NL+SS  %NL+SS  #tokens
4 2 codon 392 (Gly----Asp) mutation  +++++++++++++++++++++++++++  SS -> NL
4 2 Lys substitution for Glu102  +++++++++++++++++++++++++++  SS -> NL
5 4 Deletion of the C-terminal  +++++++++++++++++++++++++++  SS -> NL
5 3 (p.G2D) at the N-terminus  +++++++++++++++++++++++++++  SS -> NL
3 2 deletion of p75(NTR)  +++++++++++++++++++++++++++  SS -> NL
6 2 eliminating codons 487-489 (Asp-Ser-Phe)  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 2 mutation of Tyr-838  +++++++++++++++++++++++++++  SS -> NL
3 2 mutation at Tyr-615  +++++++++++++++++++++++++++  SS -> NL
3 2 total gene deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 2 arginine-141 to serine substitution  +++++++++++++++++++++++++++  SS -> NL
4 2 arginine-141 to serine substitution  +++++++++++++++++++++++++++  SS -> NL
3 2 mutations at Arg885  +++++++++++++++++++++++++++  SS -> NL
4 3 point mutation at Cys93  +++++++++++++++++++++++++++  SS -> NL
3 2 heterozygous missense 3035G>T  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
2 1 synonymous 696T>C  ***************************  ST -> SS
2 1 missense Glu285Ala  ***************************  ST -> SS
3 1 somatic 16-bp deletion  ***************************  ST -> SS
4 2 serine 749 is phosphorylated  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 2 Ser58 to Glu substitution  +++++++++++++++++++++++++++  SS -> NL
2 2 deletion of  +++++++++++++++++++++++++++  SS -> NL
2 2 deletion of  +++++++++++++++++++++++++++  SS -> NL
2 2 deletion of  +++++++++++++++++++++++++++  SS -> NL
2 2 deletion of  +++++++++++++++++++++++++++  SS -> NL
2 2 deletion of  +++++++++++++++++++++++++++  SS -> NL
4 3 Deletion of chromosome 11q23  +++++++++++++++++++++++++++  SS -> NL
4 1 codon 98 GAT-->GTT, Asp-->Val  ***************************  ST -> SS
3 1 codon 92, TAC-->TAT  ***************************  ST -> SS
4 2 codon 177, CAA-->TAA, Gln-->termination  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 1 codon 173 (CTA-->CGA)  ***************************  ST -> SS
3 1 Val/Met genotype  ***************************  ST -> SS
3 1 Val/Met genotype  ***************************  ST -> SS
4 2 156 in exon 5  +++++++++++++++++++++++++++  SS -> NL
4 1 one 1-bp insertion (251-insA-252)  ***************************  ST -> SS
4 1 9-bp deletion (del 192-200)  ***************************  ST -> SS
3 2 deletion of Phe52  +++++++++++++++++++++++++++  SS -> NL
4 2 deletion of aa 51/52  +++++++++++++++++++++++++++  SS -> NL
4 3 mutation at codon 692  +++++++++++++++++++++++++++  SS -> NL
4 3 Mutation in codon 713  +++++++++++++++++++++++++++  SS -> NL
3 2 mutation (codon 665Asp)  +++++++++++++++++++++++++++  SS -> NL
4 2 deletes a BglII site  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 1 intron 21, G(+1)A  ***************************  ST -> SS
3 1 intron 25, G(+5)C  ***************************  ST -> SS
3 1 intron 26, T(+2)A  ***************************  ST -> SS
4 3 Mutations at codons 717  +++++++++++++++++++++++++++  SS -> NL
4 3 mutation at codon 693  +++++++++++++++++++++++++++  SS -> NL
4 2 C to T substitution  +++++++++++++++++++++++++++  SS -> NL
4 3 mutations at codon 717  +++++++++++++++++++++++++++  SS -> NL
4 3 mutation at codon 717  +++++++++++++++++++++++++++  SS -> NL
4 3 variant at codon 717  +++++++++++++++++++++++++++  SS -> NL
4 3 mutation in codon 717  +++++++++++++++++++++++++++  SS -> NL
2 2 deletion of  +++++++++++++++++++++++++++  SS -> NL
2 1 knock-out  ***************************  ST -> SS
3 2 Uba domain-deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 1 HP1 box deletion  ***************************  ST -> SS
3 1 Asp 296 mutants  ***************************  ST -> SS
4 2 substitution of histidine 238  +++++++++++++++++++++++++++  SS -> NL
3 1 Ala 238 mutant  ***************************  ST -> SS
4 4 Truncating the intracellular tail  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 3 Deleting the carboxy terminus  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
5 0 polyglutamine (poly-Q) repeat expansion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
2 1 486-bp deletion  ***************************  ST -> SS
4 3 altering the COOH terminus  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
5 4 removal of the pro-domain  +++++++++++++++++++++++++++  SS -> NL
2 2 deletion of  +++++++++++++++++++++++++++  SS -> NL
3 1 introduced Cys 131  ***************************  ST -> SS
3 2 Mutation of Lys-269  +++++++++++++++++++++++++++  SS -> NL
3 2 N-terminal deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 2 Mutation of arginine 78  +++++++++++++++++++++++++++  SS -> NL
4 3 removal of the methionine  +++++++++++++++++++++++++++  SS -> NL
3 2 lacking exon 15  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 2 without exon 15  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
2 1 marker D7S636  ***************************  ST -> SS
3 2 Substitution to Lys  +++++++++++++++++++++++++++  SS -> NL
3 2 substitution to Glu  +++++++++++++++++++++++++++  SS -> NL
3 2 Mutation at Thr-668  +++++++++++++++++++++++++++  SS -> NL
2 1 ACO1 deletion  ***************************  ST -> SS
4 3 mutation at codon 178  +++++++++++++++++++++++++++  SS -> NL
4 2 deletion of 1.6 kb  +++++++++++++++++++++++++++  SS -> NL
4 2 mutation at base 1258  +++++++++++++++++++++++++++  SS -> NL
3 1 985A-to-G  ***************************  ST -> SS
3 1 985A-to-G  ***************************  ST -> SS
3 1 DG91/C92, 6-bp deletion  ***************************  ST -> SS
2 1 Glu137 deletion  ***************************  ST -> SS
5 2 G/A in codon 407  +++++++++++++++++++++++++++  SS -> NL
3 2 mutation at Ser557  +++++++++++++++++++++++++++  SS -> NL
5 4 Deletion of the N-terminal  +++++++++++++++++++++++++++  SS -> NL
4 3 Single mutation of Ser-28  +++++++++++++++++++++++++++  SS -> NL
4 1 Phe- or Gly-706 mutant  ***************************  ST -> SS
4 1 Phe- and Gly-706 mutant  ***************************  ST -> SS
2 1 Phe-706 mutant  ***************************  ST -> SS
2 1 Gly-706 mutant  ***************************  ST -> SS
2 1 Gly-807 mutant  ***************************  ST -> SS
2 1 Phe-807 mutant  ***************************  ST -> SS
2 1 Knock-down  ***************************  ST -> SS
2 1 knock-down  ***************************  ST -> SS
5 5 deletion of amino-terminal sequences  +++++++++++++++++++++++++++  SS -> NL
4 2 substitution at position 387  +++++++++++++++++++++++++++  SS -> NL
3 2 substitution of W-282  +++++++++++++++++++++++++++  SS -> NL
2 2 removal of  +++++++++++++++++++++++++++  SS -> NL
4 2 Mutation of FAK Tyr-925  +++++++++++++++++++++++++++  SS -> NL
4 2 lysine-304 to glutamate substitution  +++++++++++++++++++++++++++  SS -> NL
4 2 lys329 to glu mutation  +++++++++++++++++++++++++++  SS -> NL
2 1 missense ala305-->glu  ***************************  ST -> SS
4 1 4-bp deletions (32 deltaT  ***************************  ST -> SS
2 1 deltaT deletion  ***************************  ST -> SS
2 1 the 625G-->A  ***************************  ST -> SS
2 1 625A-1147T allele  ***************************  ST -> SS
2 1 511C-625A allele  ***************************  ST -> SS
3 2 511C-625A variant allele  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 2 single 778 C>T substitution  +++++++++++++++++++++++++++  SS -> NL
3 2 deletion of S232  +++++++++++++++++++++++++++  SS -> NL
4 2 Mutation of serine 253  +++++++++++++++++++++++++++  SS -> NL
4 2 substitution of cysteine 245  +++++++++++++++++++++++++++  SS -> NL
2 1 Glis2 deletion  ***************************  ST -> SS
3 2 Mutation of Lys-426  +++++++++++++++++++++++++++  SS -> NL
4 2 mutation of tyrosine 876  +++++++++++++++++++++++++++  SS -> NL
3 2 deleted C1 domain  +++++++++++++++++++++++++++  SS -> NL
2 1 UIM-deletion  ***************************  ST -> SS
4 2 deletion of this motif  +++++++++++++++++++++++++++  SS -> NL
4 2 deletion of aa 527-534  +++++++++++++++++++++++++++  SS -> NL
4 2 deletion and loss of  +++++++++++++++++++++++++++  SS -> NL
3 3 deletion of the  +++++++++++++++++++++++++++  SS -> NL
4 2 mutations at arginine 303  +++++++++++++++++++++++++++  SS -> NL
3 2 Deletion of Ile663  +++++++++++++++++++++++++++  SS -> NL
3 2 Deletion of MK5  +++++++++++++++++++++++++++  SS -> NL
3 2 Substitution of 3K  +++++++++++++++++++++++++++  SS -> NL
3 2 deletion of exon4  +++++++++++++++++++++++++++  SS -> NL
4 2 one insertion mutation (698insC)  +++++++++++++++++++++++++++  SS -> NL
5 4 lacking the C-terminal extension  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
2 1 Asn113 mutant  ***************************  ST -> SS
2 1 Asn113 mutant  ***************************  ST -> SS
2 1 Asn79 mutant  ***************************  ST -> SS
2 1 Asn79 mutant  ***************************  ST -> SS
2 1 Ala204 mutant  ***************************  ST -> SS
3 2 single phenylalanine deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
5 1 one-base deletion (C344fs/ter)  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 3 chromosome 2q37.3 terminal deletions  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 3 Absence of exon 5  +++++++++++++++++++++++++++  SS -> NL
4 2 skipping of exon 5  +++++++++++++++++++++++++++  SS -> NL
4 2 scrambled sequence [A beta(25-35)scram]  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 2 Mutation of Arg-115  +++++++++++++++++++++++++++  SS -> NL
4 1 Ala- and Asp-115 mutant  ***************************  ST -> SS
2 2 trinucleotide deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
2 1 3-bp deletion  ***************************  ST -> SS
*** 144 69
nala_tr 501 1769    941 0.532   691 0.391   137 0.077   828 0.468   -1

21:56:02|jmcejuela|nala$ python scripts/get_corpus_stats.py IDP4
Corpus  #docs   #ann    #ST %ST #NL %NL #SS %SS #NL+SS  %NL+SS  #tokens
3 1 15.32 Mb deletion  ***************************  ST -> SS
4 3 deletion of the entire  +++++++++++++++++++++++++++  SS -> NL
4 2 deletion of exons 10-16  +++++++++++++++++++++++++++  SS -> NL
5 2 heterozygous C-to-G transversion  +++++++++++++++++++++++++++  SS -> NL
5 2 heterozygous C-to-G change  +++++++++++++++++++++++++++  SS -> NL
4 2 alanine substitution of Asp-233  +++++++++++++++++++++++++++  SS -> NL
2 1 truncation 229-233Delta  ***************************  ST -> SS
4 2 mutation within codon 355  +++++++++++++++++++++++++++  SS -> NL
3 2 deletion of ARS607  +++++++++++++++++++++++++++  SS -> NL
3 2 Deletion of ARS607  +++++++++++++++++++++++++++  SS -> NL
3 2 Deletion of REV1  +++++++++++++++++++++++++++  SS -> NL
3 2 deletion of rev1  +++++++++++++++++++++++++++  SS -> NL
3 2 ARS607, was deleted  +++++++++++++++++++++++++++  SS -> NL
3 2 Deletion of ARS607  +++++++++++++++++++++++++++  SS -> NL
3 2 deletion of ARS607  +++++++++++++++++++++++++++  SS -> NL
3 2 deleted the REV1  +++++++++++++++++++++++++++  SS -> NL
6 2 C-terminal SeCys-Gly dipeptide deleted  +++++++++++++++++++++++++++  SS -> NL
4 2 loss of exon 13  +++++++++++++++++++++++++++  SS -> NL
4 2 loss of exon 12  +++++++++++++++++++++++++++  SS -> NL
4 2 Deletion of PxVxL motif  +++++++++++++++++++++++++++  SS -> NL
4 2 Lys9 to alanine substitution  +++++++++++++++++++++++++++  SS -> NL
4 2 Asp277 residue to Ala  +++++++++++++++++++++++++++  SS -> NL
3 2 Substitution of Arg29  +++++++++++++++++++++++++++  SS -> NL
3 2 Deletion of ZF1-2  +++++++++++++++++++++++++++  SS -> NL
4 3 deletion of the ZF2-4  +++++++++++++++++++++++++++  SS -> NL
3 2 deletion of ZF1-2  +++++++++++++++++++++++++++  SS -> NL
2 2 complete deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
2 2 complete deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 3 single large deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 3 deletions at chromosome 3p14.1  +++++++++++++++++++++++++++  SS -> NL
4 3 single large 3p14.1p13 deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
2 1 3p14.1 deletions  ***************************  ST -> SS
2 1 3p14.1p13 deletion  ***************************  ST -> SS
2 1 3p14.1p13 deletion  ***************************  ST -> SS
5 2 C-terminal 20 a.a. deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 2 C-terminal deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
5 2 20 a.a. N-terminal deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
5 2 42 a.a. N-terminal deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 3 4 amino acid deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 2 17p11.2 duplication or deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 3 amino acid substitution N71Y  +++++++++++++++++++++++++++  SS -> NL
4 3 amino acid substitution T164A  +++++++++++++++++++++++++++  SS -> NL
3 3 4-amino acid deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
4 2 12-bp cDNA fragment deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 3 4-amino acid deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 3 15-amino acid insertion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
3 3 4-amino acid deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
2 1 deletion p.G3891_Q4020  ***************************  ST -> SS
3 2 Trp153 to termination  +++++++++++++++++++++++++++  SS -> NL
3 3 54-amino acid deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL
2 1 S323-to-I  ***************************  ST -> SS
4 2 SNP in codon 24  +++++++++++++++++++++++++++  SS -> NL
*** 52 23
IDP4    159 3337    3114    0.933   192 0.058   31  0.009   223 0.067   -1
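
The change lines in the logs above can also be tallied mechanically. The following is a minimal sketch, assuming each change line has the shape seen in those logs: two integers, the mention text, a run of marker characters (+ for SS -> NL, @ for ST -> NL, * for ST -> SS), then the transition. CHANGE_LINE and tally_transitions are hypothetical helpers, not functions from nala, and since the meaning of the two leading integers is not stated in this issue, the sketch ignores them.

```python
import re
from collections import Counter

# Assumed line shape (inferred from the logs, not from nala's source):
#   "<int> <int> <mention text>  <marker run>  <FROM> -> <TO>"
CHANGE_LINE = re.compile(
    r"^\d+ \d+ (?P<text>.+?)\s+(?:\+{3,}|@{3,}|\*{3,})\s+"
    r"(?P<src>ST|SS|NL) -> (?P<dst>ST|SS|NL)$"
)


def tally_transitions(lines):
    """Count (source, target) class transitions; skip lines that do not match."""
    counts = Counter()
    for line in lines:
        match = CHANGE_LINE.match(line.strip())
        if match:
            counts[(match.group("src"), match.group("dst"))] += 1
    return counts


# A few lines copied from the logs, plus two non-change lines that are skipped.
sample = [
    "Corpus  #docs   #ann    #ST %ST #NL %NL #SS %SS #NL+SS  %NL+SS  #tokens",
    "4 2 codon 392 (Gly----Asp) mutation  +++++++++++++++++++++++++++  SS -> NL",
    "2 2 deletion of  +++++++++++++++++++++++++++  SS -> NL",
    "3 2 total gene deletion  @@@@@@@@@@@@@@@@@@@@@@@@@@@  ST -> NL",
    "2 1 synonymous 696T>C  ***************************  ST -> SS",
    "*** 144 69",
]
sample_counts = tally_transitions(sample)
for (src, dst), n in sorted(sample_counts.items()):
    print(f"{src} -> {dst}: {n}")
```

Run over a full log, the counts for ("ST", "NL") and ("ST", "SS") should add up to the second number on the "***" line, and all three together to the first.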