This project aims to collect a literature corpus of abstracts from the arthropod sciences, with automatically or manually labeled entities, to serve as our training and testing data.
We then use spaCy NLP and computational-linguistics algorithms to make inferences about the data and gain insights from it.
- TeamTat: a web-based, collaborative text annotation tool, also available as a local setup, equipped to manage team annotation projects engagingly and efficiently.
- spaCy: a free, open-source library for advanced Natural Language Processing (NLP) in Python. It is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
Python, Anaconda, and Jupyter Notebook
Python 3.3 or greater is required to install the Jupyter Notebook.
- Anaconda and Jupyter Notebook Install Instructions - Windows
- How to install Python 3.6 and run the Spyder Integrated Development Environment (IDE) or the Jupyter Notebook (video)
The BioC XML format of TeamTat annotations
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE collection SYSTEM "BioC.dtd">
<collection>
<source>PubTator</source>
<date/>
<key>BioC.key</key>
<document>
<id>3392027</id>
<infon key="tt_curatable">no</infon>
<infon key="tt_version">0</infon>
<infon key="tt_round">0</infon>
<passage>
<infon key="type">title</infon>
<offset>0</offset>
<text>Primary structure of apolipophorin-III from the migratory locust, Locusta migratoria. Potential amphipathic structures and molecular evolution of an insect apolipoprotein.</text>
<annotation id="1">
<infon key="identifier"></infon>
<infon key="type">Gene</infon>
<infon key="updated_at">1980-01-01T00:00:00Z</infon>
<location offset="21" length="17"/>
<text>apolipophorin-III</text>
</annotation>
<annotation id="2">
<infon key="identifier">7004</infon>
<infon key="type">Species</infon>
<infon key="updated_at">1980-01-01T00:00:00Z</infon>
<location offset="48" length="16"/>
<text>migratory locust</text>
</annotation>
</passage>
</document>
</collection>
The spaCy entity annotation format
# Each example has the form:
# ("Example text or content" (str), {"entities": [(start_position (int), end_position (int), "label_name" (str))]})
train_data = [
    ('Primary structure of ....', {'entities': [(21,38,'Gene'),(48,64,'Species'),(66,84,'Species'),(225,242,'Gene'),(248,266,'Species'),(423,440,'Gene'),(450,466,'Species'),(468,481,'Species'),(597,610,'Species'),(969,978,'Species'),(1159,1168,'Species'),(1234,1243,'Species')]}),
    # (ex2), (ex3): further examples in the same form
]
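Because the entity spans are character offsets into the example text (start inclusive, end exclusive), and spaCy does not accept overlapping entity spans, it can help to check the data before training. The following is a minimal sketch, not part of the repository; the helper name `check_entities` is hypothetical, and the text is the first sentence of the title from the sample document.

```python
def check_entities(text, entities):
    """Return a list of problems found in spaCy-style (start, end, label) spans."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities):
        if not (0 <= start < end <= len(text)):
            # The span does not fit inside the text.
            problems.append(f"{label} span ({start}, {end}) is outside the text")
        elif start < previous_end:
            # The span starts before an earlier span has ended.
            problems.append(f"{label} span ({start}, {end}) overlaps the previous span")
        previous_end = max(previous_end, end)
    return problems


text = "Primary structure of apolipophorin-III from the migratory locust, Locusta migratoria."
entities = [(21, 38, "Gene"), (48, 64, "Species"), (66, 84, "Species")]

for start, end, label in entities:
    print(label, repr(text[start:end]))
print(check_entities(text, entities))
```

An empty list means every span lies inside the text and no two spans overlap.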
- Python 3.x (x > 4, i.e. Python 3.5 or later)
Check with the command: python --version
- Pip (package manager)
Check with the command: pip --version
- lxml module
Install with the command: pip install lxml
Check with the command: pip list (to see whether lxml is listed)
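The same checks can also be run from inside Python. This is a small sketch under the assumption that Python 3.5 is the minimum version, as the requirement above implies; the function name `check_prerequisites` is hypothetical.

```python
import importlib.util
import sys


def check_prerequisites(min_python=(3, 5), modules=("lxml",)):
    """Report whether the interpreter is new enough and the modules are importable."""
    report = {"python": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(check_prerequisites())
```

A value of False for lxml means the module still needs to be installed with pip.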
- Step 1: Clone the repo to a local path
cd `Your path`
git clone https://github.com/ShangYuChiang/NER.git
- Step 2: Run BioCxml2spacy.py
cd NER/BioCxml2spacy
python BioCxml2spacy.py
- Step 3: The results are shown in the output file
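The internals of BioCxml2spacy.py are not reproduced here, so the following is only a minimal sketch of the kind of conversion these steps run, not the script itself. It assumes that each annotation's location offset counts from the start of the document and therefore subtracts the passage offset; the function name `bioc_to_spacy` and the trimmed `SAMPLE` string are hypothetical. It uses the standard library's xml.etree.ElementTree; lxml.etree should accept the same find, findtext, and iter calls.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (hypothetical test input).
SAMPLE = """<collection>
  <document>
    <id>3392027</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>Primary structure of apolipophorin-III from the migratory locust, Locusta migratoria.</text>
      <annotation id="1">
        <infon key="type">Gene</infon>
        <location offset="21" length="17"/>
        <text>apolipophorin-III</text>
      </annotation>
      <annotation id="2">
        <infon key="type">Species</infon>
        <location offset="48" length="16"/>
        <text>migratory locust</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def bioc_to_spacy(xml_string):
    """Turn each BioC passage into a (text, {"entities": [...]}) tuple."""
    root = ET.fromstring(xml_string)
    examples = []
    for passage in root.iter("passage"):
        text = passage.findtext("text", default="")
        base = int(passage.findtext("offset", default="0"))
        entities = []
        for annotation in passage.iter("annotation"):
            label = annotation.findtext("infon[@key='type']")
            location = annotation.find("location")
            if label is None or location is None:
                continue  # skip annotations without a type or a location
            # Assumption: offsets are document-level, so make them passage-relative.
            start = int(location.get("offset")) - base
            end = start + int(location.get("length"))
            entities.append((start, end, label))
        examples.append((text, {"entities": sorted(entities)}))
    return examples


train_data = bioc_to_spacy(SAMPLE)
print(train_data[0][1])
```

Producing one example per passage keeps the offsets simple; building one example per document would instead require joining the passage texts so that the document-level offsets still line up.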
- Follow the README.md tutorial in the NER_GS GitHub repository
- Use Jupyter Notebook to open the file spaCy_NER.ipynb
Convert annotation format from BioC XML to spaCy: GitHub link
Convert a text file into BioC XML format: GitHub link
Using the PySysrev Gene Annotations dataset: NER with PySysrev dataset.ipynb
XML Tutorial: w3schools.com
spaCy: Training spaCy's Statistical Models
GitHub: BioC-JSON, Spacy-ner-annotator