Python CLI processing pipeline for ontology data.
process.Process is the main entry point for the application. process.py is a convenience wrapper script for running the app from the source tree.
The processing pipeline implements the following steps:
- Pre-Processing

  This step involves transforming the data into a common format for triplifying. This will usually involve writing a custom PreProcessor for each project to be ingested. The preprocessor module contains an abstract class AbstractPreProcessor that can be inherited by the project preprocessor (see the sketch after this list).

- Triplifier

  This step provides basic data validation and, assuming validation passes, generates the RDF triples needed for the reasoning phase. Each project will need to contain a config directory with the files described in the configuration section below, which will be used to triplify the preprocessed data. NOTE: Wherever a uri is expressed in any of those files, you have the option of using ontology label substitution: if the uri is of the format {label name here}, the appropriate uri will be substituted from the provided ontology.

- Reasoning

  This step uses the ontopilot project to perform inferencing using the Plant Phenology Ontology.

- Rdf2Csv

  This step takes the provided sparql query and generates csv files for each file output by the Reasoning step. If no sparql query is found, this step is skipped.

- Data Loading

  This is a separate cli used for loading reasoned data into elasticsearch and/or blazegraph. loader.loader is the main entry point for the application. loader.py is a convenience wrapper script for running the app from the source tree.

  - Uploading
    - BlazeGraph
    - ElasticSearch
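As a rough sketch of what a project preprocessor looks like, the class below inherits AbstractPreProcessor and rewrites a project's raw input into the common csv format. This is only an illustration: the process_data hook, the input_dir/output_dir attributes, and the file and column names are placeholders, not the actual preprocessor API (consult the preprocessor module for the real interface).

```python
# Hypothetical project preprocessor sketch; method and attribute names are
# placeholders, not the real AbstractPreProcessor interface.
import csv
import os

from preprocessor import AbstractPreProcessor


class PreProcessor(AbstractPreProcessor):
    """Transform a project's raw input into the common format used for triplifying."""

    def process_data(self):  # placeholder hook name
        src_path = os.path.join(self.input_dir, 'raw_observations.csv')  # placeholder input file
        dst_path = os.path.join(self.output_dir, 'data.csv')             # placeholder output file

        with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
            reader = csv.DictReader(src)
            # The output headers must match those listed in config/headers.csv;
            # the names used here are made up for the example.
            writer = csv.DictWriter(
                dst, fieldnames=['record_id', 'phenophase_name', 'latitude', 'longitude'])
            writer.writeheader()
            for row in reader:
                writer.writerow({
                    'record_id': row['id'],
                    'phenophase_name': row['phenophase'],
                    'latitude': row['lat'],
                    'longitude': row['lon'],
                })
```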
-
Reccomended running python version 3.5.1 , installed using pyenv MacOSX installation instructions
The python dependencies are found in requirements.txt
. These can be installed by running
pip install -r requirements.txt --user
- Java 8
- ontopilot (Will be propted to download during cli exectuion if not found)
- query_fetcher (Will be propted to download during cli exectuion if not found)
Tests can be run with:
./pytest.sh
Before running the processing script, you will likely need to fetch data. Some projects provide an API from which data can be obtained; for NPN and NEON this is handled by a data_fetcher.py script. PEP725 data needs mysql tables to be extracted manually:
python ./projects/npn/data_fetcher.py data/npn/input/
python ./projects/neon/data_fetcher.py data/neon/input/
NOTE: when updating data, we currently need to manually update the citation and data usage policy file with the date of load, which is found at the following location (after updating this file we need to git pull changes in the ppo-data-server repository):
https://raw.githubusercontent.com/biocodellc/ppo-data-server/master/citation_and_data_use_policies.txt
Running from the process.py script:
$ python process.py --help
usage: process.py [-h] (--input_dir INPUT_DIR | --data_file DATA_FILE)
[--config_dir CONFIG_DIR] [--ontology ONTOLOGY]
[--preprocessor PREPROCESSOR] [--drop_invalid] [--log_file]
[--reasoner_config REASONER_CONFIG] [-v] [-c CHUNK_SIZE]
[--num_processes NUM_PROCESSES] [-s SPLIT_DATA_COLUMN]
project output_dir
PPO data pipeline cmd line application.
positional arguments:
project This is the name of the directory containing the
project specific files. All project config
directories must be placed in the `projects` directory.
output_dir path of the directory to place the processed data
optional arguments:
-h, --help show this help message and exit
--input_dir INPUT_DIR
path of the directory containing the data to process
--data_file DATA_FILE
optionally specify the data file to load. This will
skip the preprocessor step and use the supplied data
file instead
--config_dir CONFIG_DIR
optionally specify the path of the directory
containing the configuration files. defaults to
/Users/rjewing/code/biocode/ppo-data-
pipeline/process/../config
--ontology ONTOLOGY optionally specify a filepath/url of the ontology to
use for reasoning/triplifying
--preprocessor PREPROCESSOR
optionally specify the dotted python path of the
preprocessor class. This will be loaded instead of
looking for a PreProcessor in the supplied project
directory. Ex: projects.asu.preprocessor.PreProcessor
--drop_invalid Drop any data that does not pass validation, log the
results, and continue the process
--log_file log all output to a log.txt file in the output_dir.
default is to log output to the console
--reasoner_config REASONER_CONFIG
optionally specify the reasoner configuration file
-v, --verbose verbose logging output
-c CHUNK_SIZE, --chunk_size CHUNK_SIZE
chunk size to use when processing data. optimal
chunk_size for datasets with less than 200000
records can be determined with: num_records / num_cpus
--num_processes NUM_PROCESSES
number of processes to use for parallel processing of
data. Defaults to cpu_count of the machine
-s SPLIT_DATA_COLUMN, --split_data SPLIT_DATA_COLUMN
column to split the data on. This will split the data
file into many files with each file containing no more
records than the specified chunk_size, using the
specified column values as the filenames
As an alternative to the commandline, params can be placed in a file, one per
line, and specified on the commandline like 'process.py @params.conf'.
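As an example, a params.conf equivalent to the first nohup command below could contain (one argument per line; the paths and project name are only illustrative):

```
--ontology
file:/vol_d/ppo-data-pipeline/config/ppo.owl
--input_dir
data/npn/input/
--drop_invalid
npn
data/npn/output/
```

It would then be run as: python process.py @params.conf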
Examples of running the processing script (which runs the pre-processor and all subsequent steps), specifying a local copy of the PPO ontology and using nohup to run in the background:
nohup python process.py --ontology file:/vol_d/ppo-data-pipeline/config/ppo.owl --input_dir data/npn/input/ --drop_invalid npn data/npn/output/ &
nohup python process.py --ontology file:/vol_d/ppo-data-pipeline/config/ppo.owl --input_dir data/neon/input/ --drop_invalid neon data/neon/output/ &
nohup python process.py --ontology file:/vol_d/ppo-data-pipeline/config/ppo.owl --input_dir data/pep725/input/ --drop_invalid pep725 data/pep725/output/ &
Running the loader.py script:
$ python loader.py --help
usage: loader.py [-h] [--rdf_input_dir RDF_INPUT_DIR] [--endpoint ENDPOINT]
                 [--es_input_dir ES_INPUT_DIR] [--index INDEX]
                 [--drop-existing] [--alias ALIAS]
                 {both,blazegraph,elasticsearch}
data loading cmd line application for PPO data pipeline.
positional arguments:
{both,blazegraph,elasticsearch}
optional arguments:
-h, --help show this help message and exit
blazegraph:
blazegraph loading options
--rdf_input_dir RDF_INPUT_DIR
The path of the directory containing the rdf data to
upload to blazegraph
--endpoint ENDPOINT the blazegraph endpoint to upload to. The namespace
will be the name of the uploaded file minus the
extension
elastic_search:
elastic_search loading options
--es_input_dir ES_INPUT_DIR
The path of the directory containing the csv data to
upload to elasticsearch
--index INDEX The name of the elasticsearch index to upload to
--drop-existing this flag will drop all existing data with the same
"source" value.
--alias ALIAS optionally specify an elastic search alias. When
creating an index, it will be associated with this
alias
As an alternative to the commandline, params can be placed in a file, one per
line, and specified on the commandline like 'loader.py @params.conf'.
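Similarly, the loader arguments can be placed in a params file, one per line; a sketch mirroring the npn example below (values are illustrative) would be:

```
--es_input_dir
data/npn/output/output_reasoned_csv/
--index
npn
--drop-existing
--alias
ppo
--host
tarly.cyverse.org:80
elasticsearch
```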
Examples of running the loading script (ensure proper IP access to tarly.cyverse.org):
python loader.py --es_input_dir data/npn/output/output_reasoned_csv/ --index npn --drop-existing --alias ppo --host tarly.cyverse.org:80 elasticsearch
python loader.py --es_input_dir data/neon/output/output_reasoned_csv/ --index neon --drop-existing --alias ppo --host tarly.cyverse.org:80 elasticsearch
python loader.py --es_input_dir data/pep725/output/output_reasoned_csv/ --index pep725 --drop-existing --alias ppo --host tarly.cyverse.org:80 elasticsearch
We provide a set of default configuration files found under the config
directory. These are the base configuration files
we use for reasoning against the Plant Phenology Ontology. These files
configure the data validation, triplifying, reasoning, and rdf2csv converting.
The following files are required:
- entity.csv - This file specifies the entities to create when triplifying (see the example rows after this list). The file expects the following columns:
  - alias - The name used to refer to the entity
  - concept_uri - The uri which defines this entity
  - unique_key - The column used to uniquely identify the entity
  - identifier_root - The identifier root for each unique entity. This is typically a BCID identifier
- mapping.csv
  - column - The name of the column in the csv file to be used for triplifying
  - uri - The uri which defines this column
  - entity_alias - The alias of the entity this column is a property of
- relations.csv
  - subject_entity_alias - The alias of the entity which is the subject of this relationship
  - predicate - The uri which defines the relationship
  - object_entity_alias - The alias of the entity which is the object of this relationship
- phenophase_descriptions.csv
  - field - The name of the field in the input csv file
  - defined_by - The uri which defines the field
- excluded_types.csv - Used by ontopilot
- reasoner.conf - ontopilot inferencing configuration file
- headers.csv - Specifies the input data headers we expect to see after preprocessing the data
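To make the column descriptions above concrete, hand-written rows for entity.csv, mapping.csv, and relations.csv might look like the following. Every alias, label, uri, and identifier here is a made-up placeholder; the {label} values only illustrate the ontology label substitution syntax described earlier, and real values depend on the project and ontology.

```
# entity.csv
alias,concept_uri,unique_key,identifier_root
plantObservation,{plant phenological observation},record_id,ark:/99999/example1
wholePlant,{whole plant},record_id,ark:/99999/example2

# mapping.csv
column,uri,entity_alias
record_id,http://rs.tdwg.org/dwc/terms/occurrenceID,plantObservation
latitude,http://rs.tdwg.org/dwc/terms/decimalLatitude,plantObservation

# relations.csv
subject_entity_alias,predicate,object_entity_alias
plantObservation,{has participant},wholePlant
```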
The following files are optional:
- rules.csv - This file is used to set up basic validation rules for the data (see the example after this list). The file expects the following columns:
  - rule - The name of the validation rule to apply. See rule types below. Note: a default ControlledVocabulary rule will be applied to the phenophase_name column for the names found in the phenophase_descriptions.csv file
  - columns - Pipe (|) delimited list of columns to apply the rule to
  - level - Either WARNING or ERROR. ERROR will terminate the program after validation; WARNINGs will be logged. Case-insensitive. Defaults to WARNING
  - list - Only applicable for ControlledVocabulary rules. This refers to the name of the file that contains the list of controlled vocab terms
  The following rule types are available:
  - RequiredValue - Specifies columns which can not be empty
  - UniqueValue - Checks that the values in a column are unique
  - ControlledVocabulary - Checks columns against a list of controlled vocabulary. The name of the list is specified in the list column in rules.csv
  - Integer - Checks that all values are integers. Will coerce values to integers if possible
  - Float - Checks that all values are floating point numbers (ex. 1.00). Will coerce values to floats if possible
- Any file specified in the rules.csv list column is required. The file expects the following columns:
  - field - Specifies a valid value. This is the value expected in the input data file
  - defined_by - Optional value which will replace the field when writing triples
- fetch_reasoned.sparql - Sparql query used to convert reasoned data to csv
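For illustration only, a hypothetical rules.csv and a matching controlled-vocabulary list file could look like this; the column names follow the descriptions above, while the row values and the defined_by uris are made-up placeholders.

```
# rules.csv
rule,columns,level,list
RequiredValue,record_id|phenophase_name,ERROR,
UniqueValue,record_id,WARNING,
Float,latitude|longitude,ERROR,
ControlledVocabulary,phenophase_name,ERROR,phenophase_descriptions.csv

# phenophase_descriptions.csv (the file named in the list column)
field,defined_by
flowers present,http://example.org/ppo/flowers_present
leaves present,http://example.org/ppo/leaves_present
```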