cd $HOME
git clone https://<username>@github.com/SciCrunch/Foundry-ES.git
cd $HOME/Foundry-ES
This project requires Java 1.8 or higher to build.
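As a quick pre-flight check for the Java requirement, something like the following sketch could be run before building. The helper name java_version_ok is hypothetical (it is not part of Foundry-ES), and the parsing assumes the usual quoted version string printed by java -version.

```shell
# Hypothetical helper: succeeds when a Java version string
# (e.g. "1.8.0_292", "11.0.2", "17") is 1.8 or newer.
java_version_ok() {
  local v="$1" major
  major="${v%%.*}"
  if [ "$major" = "1" ]; then
    # Legacy scheme: 1.8.x means Java 8
    v="${v#1.}"
    major="${v%%.*}"
  fi
  # Drop suffixes such as "-ea"
  major="${major%%[!0-9]*}"
  [ "$major" -ge 8 ] 2>/dev/null
}

if command -v java >/dev/null 2>&1; then
  # java -version prints to stderr; the version is the first quoted token
  found=$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')
  if java_version_ok "$found"; then
    echo "Java $found is new enough"
  else
    echo "Java $found is too old; 1.8 or higher is required" >&2
  fi
else
  echo "java not found on PATH" >&2
fi
```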
Before you start the build process, you need to install three libraries from the dependencies directory into your local Maven repository:
cd $HOME/Foundry-ES/dependencies
./install_prov_xml_2mvn.sh
./install_prov_model_2mvn.sh
./install_prov_json__2mvn.sh
Afterwards, run the following Maven command in the $HOME/Foundry-ES directory:
mvn -Pdev clean install
Here the dev profile is used. There are separate dev and prod profiles with different configurations for the development and production environments.
Foundry uses several directories to dynamically load enhancer plugins, to cache ingestion data and to save its logs. These directories need to exist and be readable and writable before you run the system. You can create them via
mkdir -p /var/data/foundry-es/foundry_plugins/plugins /var/data/foundry-es/cache /var/data/logs /var/data/foundry-es/cache/data /var/data/foundry-es/cache/data/staging
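Because the system expects these directories to be both readable and writable, it may help to verify that after creating them. The following is a minimal sketch; the function name make_foundry_dirs and its optional root argument (defaulting to /var/data) are assumptions added for illustration, not part of Foundry-ES.

```shell
# Hypothetical helper: create the Foundry working directories under a root
# (default /var/data) and report any that the current user cannot read or write.
make_foundry_dirs() {
  local root="${1:-/var/data}" dir status=0
  # mkdir -p on the staging path also creates cache and cache/data
  for dir in \
    "$root/foundry-es/foundry_plugins/plugins" \
    "$root/foundry-es/cache/data/staging" \
    "$root/logs"; do
    mkdir -p "$dir" || status=1
    if [ ! -r "$dir" ] || [ ! -w "$dir" ]; then
      echo "not readable/writable: $dir" >&2
      status=1
    fi
  done
  return "$status"
}
```

Calling make_foundry_dirs with no argument targets the same paths as the mkdir command above; on most systems that requires sufficient permissions on /var/data.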
For the consumers subproject, you need to copy consumer.properties.example in $HOME/Foundry-ES/consumers/src/main/resources/dev (for the dev profile) or in $HOME/Foundry-ES/consumers/src/main/resources/prod (for the prod profile) to a file named consumer.properties in the same directory. Provided that the cache directories /var/data/foundry-es/cache/data and /var/data/foundry-es/cache/data/staging were created as instructed in the previous paragraph, you don't need to change the consumer.properties file for the common operation of the system.
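The copy step can be scripted so that an already customized consumer.properties is never overwritten. This is only a sketch: the function name install_consumer_props and its arguments (the Foundry-ES root directory and the profile name) are hypothetical.

```shell
# Hypothetical helper: copy consumer.properties.example to consumer.properties
# for one profile, leaving an existing consumer.properties untouched.
install_consumer_props() {
  local root="$1" profile="${2:-dev}"
  local dir="$root/consumers/src/main/resources/$profile"
  if [ ! -f "$dir/consumer.properties.example" ]; then
    echo "missing $dir/consumer.properties.example" >&2
    return 1
  fi
  if [ -f "$dir/consumer.properties" ]; then
    echo "keeping existing $dir/consumer.properties"
    return 0
  fi
  cp "$dir/consumer.properties.example" "$dir/consumer.properties"
}
```

For the dev profile this would be called as install_consumer_props "$HOME/Foundry-ES" dev.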
The configuration files are located under each sub-project. For example, the configuration files for the dispatcher component are located under $HOME/Foundry-ES/dispatcher/src/main/resources.
$HOME/Foundry-ES/dispatcher/src/main/resources
├── dev
│ └── dispatcher-cfg.xml
└── prod
└── dispatcher-cfg.xml
When you use the -Pdev argument, the configuration file from the dev directory is included in the jar file.
All subsystem configuration files are generated from a master configuration file in YAML format.
An example master configuration file can be found at $HOME/Foundry-ES/bin/config.yml.example.
Once you have created a master config file (named, say, config.yml), run the following to generate all configuration files for the subsystems (for the dev profile):
cd $HOME/Foundry-ES/bin
./config_gen.sh -c config.yml -f $HOME/Foundry-ES -p dev
./config_gen.sh -h
usage: ConfigGenerator
 -c <cfg-spec-file>          Full path to the Foundry-ES config spec YAML file
 -f <foundry-es-root-dir>
 -h                          print this message
 -p <profile>                Maven profile ([dev]|prod)
After each configuration file generation, you need to run Maven to move the configs to their target locations:
mvn -Pdev install
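Since the Maven step must follow every configuration generation, the two commands can be chained in one wrapper so the second is not forgotten. The sketch below assumes the same flags as shown above; the function name regen_configs and its argument order (root directory, profile, config file) are hypothetical.

```shell
# Hypothetical wrapper: regenerate subsystem configs, then run Maven
# so the generated files are moved to their target locations.
regen_configs() {
  local root="$1" profile="${2:-dev}" cfg="${3:-config.yml}"
  case "$profile" in
    dev|prod) ;;
    *) echo "unknown profile: $profile (expected dev or prod)" >&2; return 2 ;;
  esac
  # Run each step in a subshell so the caller's working directory is unchanged
  ( cd "$root/bin" && ./config_gen.sh -c "$cfg" -f "$root" -p "$profile" ) || return 1
  ( cd "$root" && mvn "-P$profile" install )
}
```

For example, regen_configs "$HOME/Foundry-ES" dev config.yml runs the same two commands as above and stops if config generation fails.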
The system uses MongoDB as its backend. Both the 2.x and 3.x versions of MongoDB have been tested with the system. If you are using MongoDB 3.x, the preferred storage engine is wiredTiger.
- Download and unpack the Apache ActiveMQ 5.10.0 release to a directory of your choosing ($MQ_HOME).
- To start the message queue server at the default port 61616, go to the $MQ_HOME/bin directory and run
activemq start
- To stop the ActiveMQ server, run
activemq stop
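Before starting the dispatcher or consumer head, it can be useful to confirm that the message queue is actually accepting connections on port 61616. The following sketch relies on bash's /dev/tcp redirection, so it assumes bash rather than a plain POSIX shell; the function names port_open and wait_for_port are hypothetical.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port can be opened.
# Uses the bash-specific /dev/tcp pseudo-device.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Poll once per second until the port opens or the retry budget (default 30) runs out.
wait_for_port() {
  local host="$1" port="$2" tries="${3:-30}"
  while [ "$tries" -gt 0 ]; do
    port_open "$host" "$port" && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

After activemq start, a call such as wait_for_port localhost 61616 returns successfully once the broker is reachable, and fails after roughly 30 seconds otherwise.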
The system consists of a dispatcher, a consumer head and a CLI manager interface. The dispatcher listens to MongoDB changes and, using its configured workflow, dispatches messages to the message queue for the listening consumer head(s). The consumer head coordinates a set of configured consumers and ingestors. Each consumer performs a predefined operation on the document indicated by the message it receives from the dispatcher. The ingestors are specialized consumers that are responsible for retrieving the original data as configured by the harvest descriptor JSON file of the corresponding source. They are triggered by the manager application.
Before any processing, MongoDB needs to be populated with the (re)source descriptors using the $HOME/Foundry-ES/bin/ingest_src_cli.sh script.
./ingest_src_cli.sh -h
usage: SourceIngestorCLI
-c <config-file> config-file e.g. ingestor-cfg.xml (default)
-d delete the source given by the source-json-file
-h print this message
-j <source-json-file> harvest source description file
-u update the source given by the source-json-file
Resource descriptors are generated from a source description configuration YAML file (see $HOME/Foundry-ES/bin/source-desc-cfg.yml.example for an example).
Once the transformation script is finalized for a resource, its resource descriptor JSON file needs to be regenerated. New resources also need a new resource descriptor JSON file. A resource descriptor JSON file is generated via the $HOME/Foundry-ES/bin/source_desc_gen.sh script.
./source_desc_gen.sh
usage: SourceDescFileGeneratorCLI
 -s <source>                       source name (top level element) in the source-descriptor-cfg-file [e.g. pdb, dryad]
 -c <source-descriptor-cfg-file>   Full path to the source descriptor config params YAML file
 -h                                print this message
The resource descriptor config params are read from a YAML file. There is an example YAML file ($HOME/Foundry-ES/bin/source-desc-cfg.yml.example). You can copy it to a file named source-desc-cfg.yml and change the paths there (for the transformation script files) to match your local system.
An example resource (VectorBase) with a sample of its raw data and the corresponding transformation file is included in the $HOME/Foundry-ES/example directory. The resource descriptor configuration for the example is in the file $HOME/Foundry-ES/bin/source-desc-example-cfg.yml.
You need to edit this file to adjust the absolute paths in the ingestURL and transformationScript fields for your Foundry-ES installation directory.
Afterwards, run the following in the $HOME/Foundry-ES/bin directory:
./source_desc_gen.sh -s vectorbase -c source-desc-example-cfg.yml
The generated resource descriptor file is written to the /tmp directory.
To insert the generated source descriptor document into the sources MongoDB collection, use the following:
./ingest_src_cli.sh -j /tmp/vectorbase.json
The script for the dispatcher component, dispatcher.sh, is located in $HOME/Foundry-ES/bin. By default it uses the dispatcher-cfg.xml file for the profile specified during the build. The dispatcher needs to run in its own process. To stop it, use Ctrl-C.
./dispatcher.sh -h
usage: Dispatcher
-c <config-file> config-file e.g. dispatcher-cfg.xml (default)
-h print this message
The script for the consumer head component, consumer_head.sh, is located in $HOME/Foundry-ES/bin. By default it uses the consumers-cfg.xml file for the profile specified during the build. For production use, you need to specify the -f option to process the full data set (by default only 100 documents are ingested). The consumer head needs to run in its own process. To stop it, use Ctrl-C.
./consumer_head.sh -h
usage: ConsumerCoordinator
-c <config-file> config-file e.g. consumers-cfg.xml (default)
-cm run in consumer mode (no ingestors)
-f full data set default is 100 documents
-h print this message
-n <max number of docs> Max number of documents to ingest
-p send provenance data to prov server
-t run ingestors in test mode
Manager is an interactive command line application for sending ingestion start messages for resources to the consumer head(s).
It also has some convenience functions to clean up MongoDB data for a given resource and to delete ElasticSearch indices.
By default the manager app uses the dispatcher-cfg.xml file for the profile specified during the build.
./manager.sh -h
usage: ManagementService
-c <config-file> config-file e.g. dispatcher-cfg.xml (default)
-h print this message
Foundry:>> help
Available commands
help - shows this message.
ingest <sourceID>
h - show all command history
delete <url> - [e.g. http://52.32.231.227:9200/geo_20151106]
dd <sourceID> - delete docs for a sourceID
cdup <sourceID> - clean duplicate files from GridFS for a sourceID
trigger <sourceID> <status-2-match> <queue-2-send> [<new-status> [<new-out-status>]] (e.g. trigger nif-0000-00135 new.1 foundry.uuid.1)
run <sourceID> status:<status-2-match> step:<step-name> [on|to_end] (e.g. run nif-0000-00135 status:new.1 step:transform)
index <sourceID> <status-2-match> <url> (e.g. index biocaddie-0006 transformed.1 http://52.32.231.227:9200/geo_20151106/dataset)
list - lists all of the existing sources.
status [<sourceID>] - show processing status of data source(s)
ws - show configured workflow(s)
exit - exits the management client.
Foundry:>>
To ingest the previously registered VectorBase resource, start manager.sh, the command line management interface to the Foundry system. You can list, ingest, delete and index resources through this interface.
./manager.sh
Foundry:>> list
vectorbase - VectorBase
Foundry:>> ingest vectorbase
Do you want to ingest records for vectorbase? (y/[n])? y
...
Foundry:>> status vectorbase
vectorbase in_process total: 99 finished: 0 error: 0
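If you copy status lines like the one above into a file or a pipe, they can be split into labelled fields for monitoring. The sketch below assumes the line layout shown in the sample transcript (source ID, state, then total:, finished: and error: counters); the function name parse_status is hypothetical and is not part of the manager.

```shell
# Hypothetical helper: turn manager status lines read from stdin into key=value pairs.
# Assumes the layout "<sourceID> <state> total: N finished: N error: N".
parse_status() {
  awk 'NF >= 2 {
    total = finished = error = "?"
    # Each counter value follows its "label:" token
    for (i = 3; i < NF; i++) {
      if ($i == "total:")    total = $(i + 1)
      if ($i == "finished:") finished = $(i + 1)
      if ($i == "error:")    error = $(i + 1)
    }
    printf "source=%s state=%s total=%s finished=%s error=%s\n", $1, $2, total, finished, error
  }'
}

echo "vectorbase in_process total: 99 finished: 0 error: 0" | parse_status
# prints: source=vectorbase state=in_process total=99 finished=0 error=0
```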