lasigeBioTM / MER

Minimal Named-Entity Recognizer (MER)

Home Page: http://labs.fc.ul.pt/mer/

MER (Minimal Named-Entity Recognizer)

MER is a Named-Entity Recognition tool that, given any lexicon and any input text, returns the list of terms recognized in the text, including their exact location (annotations).

Given an ontology (OWL file), MER can also link the entities to their classes.

A demo is also available at: http://labs.fc.ul.pt/mer/

Dependencies

awk

MER was developed and tested using GNU awk (gawk) and grep. If a different awk interpreter is installed on your machine, there is no assurance that the program will work.

For example, to install GNU awk on Ubuntu:

sudo apt-get install gawk

Basic Usage Example

Pre-Processing of a Lexicon

Let's walk through an example of adding a sample lexicon to MER.

First, we have to create the lexicon file in the data folder:

α-maltose
nicotinic acid
nicotinic acid D-ribonucleotide
nicotinic acid-adenine dinucleotide phosphate
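One way to create this file from the shell is a here-document. This is only a sketch: it assumes it is run from the MER root folder in a UTF-8 terminal, and it overwrites any existing data/lexicon.txt.

```shell
# Create the data folder if needed and write the four sample labels, one per line.
mkdir -p data
cat > data/lexicon.txt <<'EOF'
α-maltose
nicotinic acid
nicotinic acid D-ribonucleotide
nicotinic acid-adenine dinucleotide phosphate
EOF

# Sanity check: print the number of lines written.
grep -c '' data/lexicon.txt
```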

Assuming that the file is called lexicon.txt, you process it as follows:

(cd data; ../produce_data_files.sh lexicon.txt)

Some examples of labels are shown as output so you can check that it worked. This will create all the necessary files to use MER with the given lexicon.

Recognizing Entities

The script receives as input a text and a lexicon:

./get_entities.sh [text] [lexicon]

So, let's try to find mentions in a snippet of text:

./get_entities.sh 'α-maltose and nicotinic acid was found, but not nicotinic acid D-ribonucleotide' lexicon

The output will be a TSV looking like this:

0         9       α-maltose
14        28        nicotinic acid
48        62        nicotinic acid
48        79        nicotinic acid D-ribonucleotide

The first column corresponds to the start-index, the second to the end-index and the third to the annotated term.
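Judging from the sample output above, the start index is 0-based and the end index is exclusive, so end minus start gives the length of the term. The following sketch checks that reading by cutting each span out of the text and comparing it with the reported term. `check_offsets` is a hypothetical helper, not part of MER, and the sample lines are a hand-made ASCII variant of the output above (the α-maltose mention is left out so the check does not depend on how awk counts multi-byte characters).

```shell
# Hypothetical helper: verify that each annotation's offsets select its term.
# $1 = original text; stdin = tab-separated lines: start, end, term.
# Note: awk -v interprets backslash escapes in the text, so avoid backslashes.
check_offsets() {
  awk -F '\t' -v text="$1" '
    {
      # start is 0-based and end is exclusive; awk substr is 1-based
      found = substr(text, $1 + 1, $2 - $1)
      status = (found == $3) ? "ok" : "MISMATCH"
      printf "%s\t%s\t%s\t%s\n", $1, $2, $3, status
    }'
}

text='nicotinic acid was found, but not nicotinic acid D-ribonucleotide'
printf '0\t14\tnicotinic acid\n34\t48\tnicotinic acid\n34\t65\tnicotinic acid D-ribonucleotide\n' |
  check_offsets "$text"
```

Each input line is echoed with a fourth column that reads ok when the span and the term agree.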

Test

To check that the results match what is expected, run:

./test.sh

If something is wrong, please check that you are using UTF-8 encoding and that you have GNU awk and grep installed.
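These requirements can be inspected from the shell. The following is a minimal sketch; `check_env` is a hypothetical helper name, not part of MER, and it only tests that some grep is present rather than a particular implementation.

```shell
# Hypothetical helper: report on the awk flavour, grep, and the locale encoding.
check_env() {
  # GNU awk names itself on the first line of its --version output.
  if awk --version 2>/dev/null </dev/null | head -n 1 | grep -q 'GNU Awk'; then
    echo "awk: GNU awk found"
  else
    echo "awk: not GNU awk (or awk missing)"
  fi
  if command -v grep >/dev/null 2>&1; then
    echo "grep: found"
  else
    echo "grep: missing"
  fi
  # locale charmap prints the character encoding of the current locale.
  if [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]; then
    echo "encoding: UTF-8"
  else
    echo "encoding: not UTF-8"
  fi
}

check_env
```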

Linking Entities

If you create a links file named lexicon_links.tsv in the data folder associating each label (in lower case) with a URI:

α-maltose                                       http://purl.obolibrary.org/obo/CHEBI_18167
nicotinic acid                                  http://purl.obolibrary.org/obo/CHEBI_15940
nicotinic acid d-ribonucleotide                 http://purl.obolibrary.org/obo/CHEBI_15763
nicotinic acid-adenine dinucleotide phosphate   http://purl.obolibrary.org/obo/CHEBI_76072

Then the mentions in a snippet of text will be associated with the respective identifier:

./get_entities.sh 'α-maltose and nicotinic acid was found, but not nicotinic acid D-ribonucleotide' lexicon

The output will be a TSV looking like this:

0       9       α-maltose                       http://purl.obolibrary.org/obo/CHEBI_18167
14      28      nicotinic acid                  http://purl.obolibrary.org/obo/CHEBI_15940
48      62      nicotinic acid                  http://purl.obolibrary.org/obo/CHEBI_15940
48      79      nicotinic acid D-ribonucleotide http://purl.obolibrary.org/obo/CHEBI_15763
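Because the links file expects lower-case labels, a quick check before running MER can catch mistakes such as a stray capital letter. The sketch below assumes the file holds one label and one URI per line separated by a single tab (suggested by the .tsv extension, but an assumption here); `check_links` is a hypothetical helper, not part of MER.

```shell
# Hypothetical helper: flag lines that are not "label<TAB>URI" or whose label
# contains upper-case letters. Returns non-zero if any line is flagged.
check_links() {
  awk -F '\t' '
    NF != 2           { printf "line %d: expected 2 tab-separated fields, found %d\n", NR, NF; bad = 1; next }
    $1 != tolower($1) { printf "line %d: label is not lower case: %s\n", NR, $1; bad = 1 }
    END               { exit bad + 0 }
  ' "$1"
}

# Demo on a temporary file: the second label keeps the capital D on purpose.
tmp=$(mktemp)
printf 'nicotinic acid\thttp://purl.obolibrary.org/obo/CHEBI_15940\n' > "$tmp"
printf 'nicotinic acid D-ribonucleotide\thttp://purl.obolibrary.org/obo/CHEBI_15763\n' >> "$tmp"
check_links "$tmp" || echo "links file needs fixing"
rm -f "$tmp"
```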

PubMed

Download an abstract from PubMed, for example 31319702:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=31319702&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

You can add more abstracts by adding more PubMed ids separated by commas.
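As a sketch of how several ids could be combined, the hypothetical helper below (`efetch_url` is not part of MER) joins its arguments with commas and prints the same efetch URL used above. It only builds the string and does not contact the server.

```shell
# Hypothetical helper: build the efetch URL for one or more PubMed ids.
efetch_url() {
  # printf repeats the format for every argument, giving "id1,id2,...,"
  _ids=$(printf '%s,' "$@")
  _ids=${_ids%,}   # drop the trailing comma
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%s&retmode=text&rettype=xml\n' "$_ids"
}

efetch_url 31319702 31351426
```

The printed URL can then be passed, in double quotes, to the curl command shown above.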

Recognize the entities in the abstract:

./get_entities.sh "$text" lexicon

The output should be something like this:

1789      1803      nicotinic acid      http://purl.obolibrary.org/obo/CHEBI_15940
1984      1998      nicotinic acid      http://purl.obolibrary.org/obo/CHEBI_15940

Ontologies

Gene Ontology (GO) Example

Download the ontology:

(cd data; curl -L -O http://purl.obolibrary.org/obo/go.owl)

Process it:

(cd data; ../produce_data_files.sh go.owl)

Now, download an abstract from PubMed, for example 31351426:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=31351426&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" go

The output should be something like this:

185       198       fertilization          http://purl.obolibrary.org/obo/GO_0009566 
289       298       signaling              http://purl.obolibrary.org/obo/GO_0023052 
1162      1171      signaling              http://purl.obolibrary.org/obo/GO_0023052 
1285      1294      signaling              http://purl.obolibrary.org/obo/GO_0023052 
1552      1561      signaling              http://purl.obolibrary.org/obo/GO_0023052 
1649      1659      metabolism             http://purl.obolibrary.org/obo/GO_0008152 
1867      1876      signaling              http://purl.obolibrary.org/obo/GO_0023052 
1989      2001      pathogenesis           http://purl.obolibrary.org/obo/GO_0009405 
289       306       signaling pathway      http://purl.obolibrary.org/obo/GO_0007165 
1162      1179      signaling pathway      http://purl.obolibrary.org/obo/GO_0007165 
1285      1302      signaling pathway      http://purl.obolibrary.org/obo/GO_0007165 
1303      1318      gene expression        http://purl.obolibrary.org/obo/GO_0010467 
1552      1569      signaling pathway      http://purl.obolibrary.org/obo/GO_0007165 
1641      1659      glucose metabolism     http://purl.obolibrary.org/obo/GO_0006006 
1661      1682      inflammatory response  http://purl.obolibrary.org/obo/GO_0006954 
1867      1884      signaling pathway      http://purl.obolibrary.org/obo/GO_0007165 
284       306       PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357 
1157      1179      PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357 
1280      1302      PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357 
1547      1569      PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357 
1862      1884      PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357 
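The output above lists shorter terms that sit inside longer ones (for example, signaling inside PPAR signaling pathway). If only the longest mention of each span is wanted, a post-processing filter along the following lines could be used. This is a sketch, not a MER feature: `drop_nested` is a hypothetical helper, and it assumes tab-separated columns with the start and end offsets first.

```shell
# Hypothetical helper: drop every annotation that lies entirely inside a
# strictly longer annotation. stdin = tab-separated lines: start, end, term, ...
drop_nested() {
  awk -F '\t' '
    { s[NR] = $1 + 0; e[NR] = $2 + 0; line[NR] = $0 }
    END {
      for (i = 1; i <= NR; i++) {
        nested = 0
        for (j = 1; j <= NR; j++)
          if (s[j] <= s[i] && e[i] <= e[j] && e[j] - s[j] > e[i] - s[i]) { nested = 1; break }
        if (!nested) print line[i]
      }
    }'
}

# Four lines taken from the output above (URI column omitted for brevity).
printf '185\t198\tfertilization\n289\t298\tsignaling\n289\t306\tsignaling pathway\n284\t306\tPPAR signaling pathway\n' |
  drop_nested
```

The comparison is quadratic in the number of annotations, which should be acceptable for a single abstract.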

Chemical Entities of Biological Interest (ChEBI) Example

Download the ontology:

(cd data; curl -L -O  http://purl.obolibrary.org/obo/chebi/chebi_lite.owl)

Process it:

(cd data; ../produce_data_files.sh chebi_lite.owl)

Download an abstract from PubMed, for example 31319702:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=31319702&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" chebi_lite

The output should be something like this:

0         8         Electron  http://purl.obolibrary.org/obo/CHEBI_10545 
160       165       ester     http://purl.obolibrary.org/obo/CHEBI_35701 
213       218       ester     http://purl.obolibrary.org/obo/CHEBI_35701 
285       290       ester     http://purl.obolibrary.org/obo/CHEBI_35701 
342       347       ester     http://purl.obolibrary.org/obo/CHEBI_35701 
397       402       ester     http://purl.obolibrary.org/obo/CHEBI_35701 
475       480       ester     http://purl.obolibrary.org/obo/CHEBI_35701 
939       943       atom      http://purl.obolibrary.org/obo/CHEBI_33250 
963       968       ester     http://purl.obolibrary.org/obo/CHEBI_35701 
1016      1020      acid      http://purl.obolibrary.org/obo/CHEBI_37527 
1033      1040      propene   http://purl.obolibrary.org/obo/CHEBI_16052 
1094      1099      ester     http://purl.obolibrary.org/obo/CHEBI_35701 
1149      1153      acid      http://purl.obolibrary.org/obo/CHEBI_37527 
1172      1179      radical   http://purl.obolibrary.org/obo/CHEBI_26519 
1231      1237      methyl    http://purl.obolibrary.org/obo/CHEBI_29309 
1413      1419      methyl    http://purl.obolibrary.org/obo/CHEBI_29309 
1490      1496      proton    http://purl.obolibrary.org/obo/CHEBI_24636 
1544      1548      acid      http://purl.obolibrary.org/obo/CHEBI_37527 
1588      1592      acid      http://purl.obolibrary.org/obo/CHEBI_37527 
1642      1647      group     http://purl.obolibrary.org/obo/CHEBI_24433 
1705      1709      acid      http://purl.obolibrary.org/obo/CHEBI_37527 
1741      1745      acid      http://purl.obolibrary.org/obo/CHEBI_37527 
1841      1844      ion       http://purl.obolibrary.org/obo/CHEBI_24870 
1937      1940      ion       http://purl.obolibrary.org/obo/CHEBI_24870 
953       968       isopropyl ester     http://purl.obolibrary.org/obo/CHEBI_35725 
1536      1548      benzoic acid        http://purl.obolibrary.org/obo/CHEBI_30746 
1578      1592      nicotinic acid      http://purl.obolibrary.org/obo/CHEBI_15940 
1633      1647      carbonyl group      http://purl.obolibrary.org/obo/CHEBI_23019 
1697      1709      benzoic acid        http://purl.obolibrary.org/obo/CHEBI_30746 
1731      1745      nicotinic acid      http://purl.obolibrary.org/obo/CHEBI_15940 

Human Phenotype (HP) Example

Download the ontology:

(cd data; curl -L -O http://purl.obolibrary.org/obo/hp.owl)

Process it:

(cd data; ../produce_data_files.sh hp.owl)

Download an abstract from PubMed, for example 29490421:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=29490421&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" hp

The output should be something like this:

50        53        dry       http://purl.obolibrary.org/obo/PATO_0001801 
348       354       asthma    http://purl.obolibrary.org/obo/HP_0002099 
359       363       COPD      http://purl.obolibrary.org/obo/HP_0006510 
496       500       COPD      http://purl.obolibrary.org/obo/HP_0006510 
504       510       asthma    http://purl.obolibrary.org/obo/HP_0002099

Human Disease Ontology (HDO) Example

Download the ontology:

(cd data; curl -L -O http://purl.obolibrary.org/obo/doid.owl)

Process it:

(cd data; ../produce_data_files.sh doid.owl)

Download an abstract from PubMed, for example 29490421:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=29490421&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" doid

The output should be something like this:

348       354       asthma    http://purl.obolibrary.org/obo/DOID_2841 
359       363       COPD      http://purl.obolibrary.org/obo/DOID_3083 
496       500       COPD      http://purl.obolibrary.org/obo/DOID_3083 
504       510       asthma    http://purl.obolibrary.org/obo/DOID_2841

Ontology for Stem Cell Investigations (OSCI) Example

Download the ontology:

(cd data; curl -L -O https://raw.githubusercontent.com/stemcellontologyresource/OSCI/master/src/ontology/osci.owl)

Process it:

(cd data; ../produce_data_files.sh osci.owl)

Download an abstract from PubMed, for example 30053745:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=30053745&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" osci

The output should be something like this:

325     329     cell             http://purl.obolibrary.org/obo/CL_0000000 
447     452     human            http://purl.obolibrary.org/obo/NCBITaxon_9606 
475     480     human            http://purl.obolibrary.org/obo/NCBITaxon_9606 
531     536     human            http://purl.obolibrary.org/obo/NCBITaxon_9606 
545     556     pluripotent      http://purl.obolibrary.org/obo/PATO_0001403 
606     610     cell             http://purl.obolibrary.org/obo/CL_0000000 
691     702     pluripotent      http://purl.obolibrary.org/obo/PATO_0001403 
743     748     human            http://purl.obolibrary.org/obo/NCBITaxon_9606 
749     755     neuron           http://purl.obolibrary.org/obo/CL_0000540 
798     804     neuron           http://purl.obolibrary.org/obo/CL_0000540 
925     929     cell             http://purl.obolibrary.org/obo/CL_0000000 
980     984     cell             http://purl.obolibrary.org/obo/CL_0000000 
601     610     Stem cell        http://purl.obolibrary.org/obo/CL_0000034 
920     929     stem cell        http://purl.obolibrary.org/obo/CL_0000034 
975     984     stem cell        http://purl.obolibrary.org/obo/CL_0000034 
913     929     Neural stem cell http://purl.obolibrary.org/obo/CL_0000047 

Cell Ontology (CL) Example

Download the ontology:

(cd data; curl -L -O http://purl.obolibrary.org/obo/cl.owl)

Process it:

(cd data; ../produce_data_files.sh cl.owl)

Download an abstract from PubMed, for example 30053745:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=30053745&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" cl

The output should be something like this:

263     268     gyrus            http://purl.obolibrary.org/obo/UBERON_0000200 
276     287     hippocampus      http://purl.obolibrary.org/obo/UBERON_0001954 
325     329     cell             http://purl.obolibrary.org/obo/CARO_0000013 
447     452     human            http://purl.obolibrary.org/obo/NCBITaxon_9606 
475     480     human            http://purl.obolibrary.org/obo/NCBITaxon_9606 
531     536     human            http://purl.obolibrary.org/obo/NCBITaxon_9606 
606     610     cell             http://purl.obolibrary.org/obo/CARO_0000013 
743     748     human            http://purl.obolibrary.org/obo/NCBITaxon_9606 
749     755     neuron           http://purl.obolibrary.org/obo/CL_0000540 
798     804     neuron           http://purl.obolibrary.org/obo/CL_0000540 
925     929     cell             http://purl.obolibrary.org/obo/CARO_0000013 
980     984     cell             http://purl.obolibrary.org/obo/CARO_0000013 
1041    1046    great            http://purl.obolibrary.org/obo/PATO_0000586 
255	268	dentate gyrus	 http://purl.obolibrary.org/obo/UBERON_0001885 
315	329	ependymal cell	 http://purl.obolibrary.org/obo/CL_0000065 
601	610	Stem cell	 http://purl.obolibrary.org/obo/CL_0000034 
920	929	stem cell	 http://purl.obolibrary.org/obo/CL_0000034 
975	984	stem cell	 http://purl.obolibrary.org/obo/CL_0000034 
913	929	Neural stem cell http://purl.obolibrary.org/obo/CL_0000047 

Environmental conditions, treatments and exposures ontology (ECTO) Example

Download the ontology:

(cd data; curl -L -O http://purl.obolibrary.org/obo/ecto.owl)

Process it:

(cd data; ../produce_data_files.sh ecto.owl)

Download an abstract from PubMed, for example 34303912:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=34303912&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" ecto

The output should be something like this:

0	5	Water		http://purl.obolibrary.org/obo/CHEBI_15377 
19	25	liquid		http://purl.obolibrary.org/obo/PATO_0001735 
30	35	human		http://purl.obolibrary.org/obo/NCBITaxon_9606 
120	131	degradation	http://purl.obolibrary.org/obo/GO_0009056 
139	150	environment	http://purl.obolibrary.org/obo/ENVO_01000254 
177	182	water		http://purl.obolibrary.org/obo/CHEBI_15377 
200	206	planet		http://purl.obolibrary.org/obo/ENVO_01000800 
237	242	water		http://purl.obolibrary.org/obo/CHEBI_15377 
243	252	pollution	http://purl.obolibrary.org/obo/ENVO_02500036 
339	350	environment	http://purl.obolibrary.org/obo/ENVO_01000254 
397	404	process		http://purl.obolibrary.org/obo/BFO_0000015 
449	462	concentration	http://purl.obolibrary.org/obo/PATO_0000033 
542	553	agriculture	http://purl.obolibrary.org/obo/ENVO_01001246 
585	591	energy		http://purl.obolibrary.org/obo/ENVO_2000015 
661	665	role		http://purl.obolibrary.org/obo/BFO_0000023 
764	771	quality		http://purl.obolibrary.org/obo/BFO_0000019 
811	821	technology	http://purl.obolibrary.org/obo/NCIT_C17187 
1101    1110    behaviour       http://purl.obolibrary.org/obo/GO_0007610 
1167    1177    technology      http://purl.obolibrary.org/obo/NCIT_C17187 
1192    1198    energy          http://purl.obolibrary.org/obo/ENVO_2000015 
1412    1422    technology      http://purl.obolibrary.org/obo/NCIT_C17187 
237     252     water pollution http://purl.obolibrary.org/obo/ENVO_02500039

Environment Ontology (ENVO) Example

Download the ontology:

(cd data; curl -L -O http://purl.obolibrary.org/obo/envo.owl)

Process it:

(cd data; ../produce_data_files.sh envo.owl)

Download an abstract from PubMed, for example 34303912:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=34303912&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" envo

The output should be something like this:

0	5	Water		http://purl.obolibrary.org/obo/CHEBI_15377 
30	35	human		http://purl.obolibrary.org/obo/NCBITaxon_9606 
139	150	environment	http://purl.obolibrary.org/obo/ENVO_01000254 
177	182	water		http://purl.obolibrary.org/obo/CHEBI_15377 
200	206	planet		http://purl.obolibrary.org/obo/ENVO_01000800 
237	242	water		http://purl.obolibrary.org/obo/CHEBI_15377 
243	252	pollution	http://purl.obolibrary.org/obo/ENVO_02500036 
339	350	environment	http://purl.obolibrary.org/obo/ENVO_01000254 
397	404	process		http://purl.obolibrary.org/obo/BFO_0000015 
449	462	concentration	http://purl.obolibrary.org/obo/PATO_0000033 
542	553	agriculture	http://purl.obolibrary.org/obo/ENVO_01001246 
585	591	energy		http://purl.obolibrary.org/obo/ENVO_2000015 
661	665	role		http://purl.obolibrary.org/obo/BFO_0000023 
764	771	quality		http://purl.obolibrary.org/obo/BFO_0000019 
1192	1198	energy		http://purl.obolibrary.org/obo/ENVO_2000015 
237	252	water pollution	http://purl.obolibrary.org/obo/ENVO_02500039 

Radiology Lexicon (RadLex) Example

Find the link to the RDF/XML version from http://bioportal.bioontology.org/ontologies/RADLEX

Download it as radlex.rdf:

(cd data; curl -L -o radlex.rdf 'https://data.bioontology.org/ontologies/RADLEX/download?apikey=...&download_format=rdf')

Process it:

(cd data; ../produce_data_files.sh radlex.rdf)

Download an abstract from PubMed, for example 29490421:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=29490421&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" radlex

The output should be something like this:

337       344       therapy   http://radlex.org/RID/RID8 
348       354       asthma    http://radlex.org/RID/RID5327 
359       363       COPD      http://radlex.org/RID/RID5317 
468       475       therapy   http://radlex.org/RID/RID8 
496       500       COPD      http://radlex.org/RID/RID5317 
504       510       asthma    http://radlex.org/RID/RID5327 
511       518       patient   http://radlex.org/RID/RID49815 
587       594       patient   http://radlex.org/RID/RID49815 

DeCS Multilingual Example

Request the XML files of DeCS in Portuguese, Spanish and English from https://decs.bvsalud.org/

Process them:

(cd data; ../produce_data_files.sh bireme_decs_eng2020.xml)
(cd data; ../produce_data_files.sh bireme_decs_spa2020.xml)
(cd data; ../produce_data_files.sh bireme_decs_por2020.xml)

Download a multilingual corpus, e.g. from https://sites.google.com/view/felipe-soares/datasets

text_eng='This situation also contributes to respiratory aspiration and aspiration pneumonia, which is evidenced by a persistent cough with sputum or by other signs such as fever, tachypnea, or focal consolidation confirmed by radiographic imaging'
text_spa='Esa situación también contribuye para la aspiración respiratoria y para la neumonía espirativa, que puede ser evidenciada por tos persistente con expectoración o por otras señales como fiebre, taquipnea, consolidación focal, siendo confirmada por la imagen radiográfica'
text_por='Essa situação também contribui para a aspiração respiratória e para a pneumonia aspirativa, que pode ser evidenciada por tosse persistente com expectoração ou por outros sinais como: febre, taquipneia, consolidação focal, sendo confirmada pela imagem radiográfica'

Recognize the entities in each sentence:

./get_entities.sh "$text_eng" bireme_decs_eng2020
./get_entities.sh "$text_spa" bireme_decs_spa2020
./get_entities.sh "$text_por" bireme_decs_por2020

The output should be something like this:

73        82        pneumonia                     https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011014
119       124       cough                         https://decs.bvsalud.org/ths/?filter=ths_regid&q=D003371
130       136       sputum                        https://decs.bvsalud.org/ths/?filter=ths_regid&q=D013183
163       168       fever                         https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334
170       179       tachypnea                     https://decs.bvsalud.org/ths/?filter=ths_regid&q=D059246
35        57        respiratory aspiration        https://decs.bvsalud.org/ths/?filter=ths_regid&q=D053120
62        82        aspiration pneumonia          https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011015

75        83        neumonía                      https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011014
126       129       tos                           https://decs.bvsalud.org/ths/?filter=ths_regid&q=D003371
185       191       fiebre                        https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334
193       202       taquipnea                     https://decs.bvsalud.org/ths/?filter=ths_regid&q=D059246
41        64        aspiración respiratoria       https://decs.bvsalud.org/ths/?filter=ths_regid&q=D053120

70        79        pneumonia                     https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011014
121       126       tosse                         https://decs.bvsalud.org/ths/?filter=ths_regid&q=D003371
170       176       sinais                        https://decs.bvsalud.org/ths/?filter=ths_regid&q=D012816
183       188       febre                         https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334
190       200       taquipneia                    https://decs.bvsalud.org/ths/?filter=ths_regid&q=D059246
38        60        aspiração respiratória        https://decs.bvsalud.org/ths/?filter=ths_regid&q=D053120
70        90        pneumonia aspirativa          https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011015

WordNet Example

Download the ontology:

(cd data; curl -L -O http://www.w3.org/2006/03/wn/wn20/rdf/wordnet-hyponym.rdf)

Process it:

(cd data; ../produce_data_files.sh wordnet-hyponym.rdf)

Download an abstract from PubMed, for example 29490421:

text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=29490421&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)

Recognize the entities in the abstract:

./get_entities.sh "$text" wordnet-hyponym

The output should be something like this:

4       11      article
50      53      dry
54      60      powder
91      95      data
105     115     literature
192     196     well
288     291     may
292     301     influence
306     314     efficacy
319     325     safety
329     336     inhaler
337     344     therapy
348     354     asthma
381     384     use
396     404     addition
423     432     potential
448     456     efficacy
460     467     inhaler
468     475     therapy
485     491     doctor
504     510     asthma
511     518     patient
519     530     perspective
556     562     choice
563     572     algorithm
587     594     patient

Processed Lexicons

cd data
curl -L -O http://labs.rd.ciencias.ulisboa.pt/mer/data/lexicons202302.tgz
tar -xzf lexicons202302.tgz
cd ..

BioCreative V.5 Challenge Evaluation (BeCalm) Lexicons

cd data
curl -L -O http://labs.rd.ciencias.ulisboa.pt/mer/data/becalm2017.tgz
tar -xzf becalm2017.tgz
tar -tzf becalm2017.tgz | xargs -L 1 ../produce_data_files.sh
cd ..
./get_entities.sh 'heart' tissue_and_organ
./get_entities.sh 'histoglobin' protein
./get_entities.sh 'ame-miR-2b' mirna

Similarity

First install DiShIn (https://github.com/lasigeBioTM/DiShIn), or download a minimalist version:

curl -O http://labs.rd.ciencias.ulisboa.pt/dishin/dishin.py
curl -O http://labs.rd.ciencias.ulisboa.pt/dishin/ssm.py

Before executing the get_similarity script you need to select the following parameters:

  • Measure: Resnik, Lin or JC
  • Type: MICA or DiShIn
  • Path: DiShIn installation folder
  • Database: DiShIn db file with the ontology, e.g. chebi.db, go.db, hp.db, doid.db, radlex.db, or wordnet.db
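The four parameters can be sanity-checked before calling the script. The following is a minimal sketch: `check_similarity_args` is a hypothetical helper, not part of MER, and it treats the database as a path relative to the current folder, which is an assumption about how the script locates it.

```shell
# Hypothetical helper: validate the four get_similarity parameters.
# $1 = measure, $2 = type, $3 = DiShIn folder, $4 = DiShIn db file
check_similarity_args() {
  case "$1" in
    Resnik|Lin|JC) ;;
    *) echo "unknown measure: $1 (expected Resnik, Lin or JC)" >&2; return 1 ;;
  esac
  case "$2" in
    MICA|DiShIn) ;;
    *) echo "unknown type: $2 (expected MICA or DiShIn)" >&2; return 1 ;;
  esac
  [ -d "$3" ] || { echo "DiShIn folder not found: $3" >&2; return 1; }
  [ -f "$4" ] || { echo "database file not found: $4" >&2; return 1; }
}

# Prints a confirmation only if all four values pass the checks.
check_similarity_args Lin DiShIn . chebi.db && echo "arguments look fine"
```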

For example, download the database for ChEBI:

curl -O http://labs.rd.ciencias.ulisboa.pt/dishin/chebi202302.db.gz
gunzip -N chebi202302.db.gz

Then, just execute the get_similarity script using the output of the get_entities script:

./get_entities.sh "α-maltose and nicotinic acid was found, but not nicotinic acid D-ribonucleotide" lexicon | ./get_similarity.sh Lin DiShIn . chebi.db

The output now includes for each match the most similar term and its similarity:

0       9       α-maltose                       http://purl.obolibrary.org/obo/CHEBI_18167      CHEBI_15763     0.02834388514184269
14      28      nicotinic acid                  http://purl.obolibrary.org/obo/CHEBI_15940      CHEBI_15763     0.07402224403263755
48      62      nicotinic acid                  http://purl.obolibrary.org/obo/CHEBI_15940      CHEBI_15763     0.07402224403263755
48      79      nicotinic acid D-ribonucleotide http://purl.obolibrary.org/obo/CHEBI_15763      CHEBI_15940     0.07402224403263755

A multilingual example:

curl -O http://labs.rd.ciencias.ulisboa.pt/dishin/mesh202302.db.gz
gunzip -N mesh202302.db.gz
curl -L -O http://labs.rd.ciencias.ulisboa.pt/mer/data/lexicons202302.tgz
(cd data; tar -xzf ../lexicons202302.tgz --wildcards 'bireme_decs_por2020*')
./get_entities.sh "desmaio, tontura, pneumonia e tosse" bireme_decs_por2020 | ./get_similarity.sh Lin DiShIn . mesh.db

The output:

0         7         desmaio   https://decs.bvsalud.org/ths/?filter=ths_regid&q=D013575    D004244   0.34311804633118076
9         16        tontura   https://decs.bvsalud.org/ths/?filter=ths_regid&q=D004244    D013575   0.34311804633118076
18        27        pneumonia https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011014    D003371   0.429433074088733
30        35        tosse     https://decs.bvsalud.org/ths/?filter=ths_regid&q=D003371    D011014   0.429433074088733

As expected, fainting (desmaio) is closer to dizziness (tontura), and pneumonia is closer to cough (tosse).
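To keep only the matches whose best similarity reaches a chosen value, the sixth column can be filtered with awk. This is a sketch that assumes the output columns are tab-separated; `min_similarity` is a hypothetical helper, not part of MER, and the threshold 0.4 is an arbitrary example.

```shell
# Hypothetical helper: keep lines whose similarity (column 6) is at least $1.
# stdin = get_similarity output, assumed to be tab-separated.
min_similarity() {
  awk -F '\t' -v min="$1" '$6 + 0 >= min + 0'
}

# Two lines taken from the output above.
printf '0\t7\tdesmaio\thttps://decs.bvsalud.org/ths/?filter=ths_regid&q=D013575\tD004244\t0.34311804633118076\n18\t27\tpneumonia\thttps://decs.bvsalud.org/ths/?filter=ths_regid&q=D011014\tD003371\t0.429433074088733\n' |
  min_similarity 0.4
```

With this threshold only the pneumonia line passes, since 0.343 is below 0.4.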
