Rubra: a bioinformatics pipeline.
---------------------------------

https://github.com/bjpop/rubra

License:
--------

Rubra is licensed under the MIT license. See LICENSE.txt.

Description:
------------

Rubra is a pipeline system for bioinformatics workflows. It is built on top
of the Ruffus (http://www.ruffus.org.uk/) Python library, and adds support
for running pipeline stages on a distributed compute cluster.

Authors:
--------

Bernie Pope, Clare Sloggett, Gayle Philip, Matthew Wakefield

Installation:
-------------

To install, clone this repository and run `setup.py`:

    git clone https://github.com/bjpop/rubra
    cd rubra
    python setup.py install

If you are on a system where you do not have administrative privileges, we
suggest using virtualenv ( http://www.virtualenv.org/ ). On HPC systems you
may find virtualenv is already installed.

Usage:
------

    usage: rubra [-h] PIPELINE_FILE --config CONFIG_FILE [CONFIG_FILE ...]
                 [--verbose {0,1,2}]
                 [--style {print,run,touchfiles,flowchart}]
                 [--force TASKNAME] [--end TASKNAME]
                 [--rebuild {fromstart,fromend}]

    A bioinformatics pipeline system.

    optional arguments:
      -h, --help            show this help message and exit
      PIPELINE_FILE         Your Ruffus pipeline stages (a Python module)
      --config CONFIG_FILE [CONFIG_FILE ...]
                            One or more configuration files (Python modules)
      --verbose {0,1,2}     Output verbosity level: 0 = quiet; 1 = normal; 2 =
                            chatty (default is 1)
      --style {print,run,touchfiles,flowchart}
                            Pipeline behaviour: print; run; touchfiles; flowchart
                            (default is print)
      --force TASKNAME      tasks which are forced to be out of date regardless of
                            timestamps
      --end TASKNAME        end points (tasks) for the pipeline
      --rebuild {fromstart,fromend}
                            rebuild outputs by working back from end tasks or
                            forwards from start tasks (default is fromstart)

Example:
--------

Below is a little example pipeline which you can find in the Rubra source
tree. It counts the number of lines in two files (test/data1.txt and
test/data2.txt), and then sums the results together.

    rubra example_pipeline.py --config example_config.py --style run

There are 2 lines in the first file and 1 line in the second file. So the
result is 3, which is written to the output file test/total.txt.

The PIPELINE_FILE argument is a Python script which contains the actual code
for each pipeline stage (using Ruffus notation). The --config argument is a
Python script which contains configuration options for the whole pipeline,
plus options for each stage (including the shell command to run in the
stage). The --style argument says what to do with the pipeline: "run" means
"perform the out-of-date steps in the pipeline". The default style is "print",
which just displays what the pipeline would do if it were run. You can get a
diagram of the pipeline using the "flowchart" style. You can touch all files
in order using the "touchfiles" style, which is mostly useful for forcing
Ruffus to acknowledge that a set of steps is up to date.

Configuration:
--------------

Configuration options are written into one or more Python scripts, which are
passed to Rubra via the --config command line argument.

Some options are required, and some are, well, optional.

Options for the whole pipeline:
-------------------------------

    pipeline = {
        "logDir": "log",
        "logFile": "pipeline.log",
        "procs": 2,
        "end": ["total"],
    }

Options for each stage of the pipeline:
---------------------------------------

    stageDefaults = {
        "distributed": False,
        "walltime": "00:10:00",
        "memInGB": 1,
        "queue": "batch",
        "modules": ["python-gcc"]
    }

    stages = {
        "countLines": {
            "command": "wc -l %file > %out",
        },
        "total": {
            "command": "./test/total.py %files > %out",
        },
    }
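
The relationship between `stageDefaults` and `stages` can be illustrated with a short sketch. It assumes (this README does not spell it out) that a stage's own entry takes precedence over `stageDefaults`, and that `%name` placeholders in a command are replaced by plain text substitution. The helpers `stage_option` and `fill_command` and the output file name `test/data1.count` are hypothetical and are not part of Rubra; treat this as a way of reading the configuration, not as Rubra's actual implementation.

```python
# Illustrative only: stage_option and fill_command are hypothetical helpers,
# not Rubra functions. The dictionaries are copied from the example config.

stageDefaults = {
    "distributed": False,
    "walltime": "00:10:00",
    "memInGB": 1,
    "queue": "batch",
    "modules": ["python-gcc"],
}

stages = {
    "countLines": {
        "command": "wc -l %file > %out",
    },
    "total": {
        "command": "./test/total.py %files > %out",
    },
}


def stage_option(stage, option):
    """Return the stage's own setting if it has one, else the default.

    Assumption: per-stage options override stageDefaults.
    """
    stage_opts = stages[stage]
    if option in stage_opts:
        return stage_opts[option]
    return stageDefaults[option]


def fill_command(template, values):
    """Replace each %name placeholder in template with values[name].

    Longer names are substituted first so that "%files" is never
    partially rewritten by a value supplied for "%file".
    """
    for name in sorted(values, key=len, reverse=True):
        template = template.replace("%" + name, values[name])
    return template


# countLines defines no queue of its own, so the default applies.
print(stage_option("countLines", "queue"))  # batch

# test/data1.count is a made-up output name used only for this sketch.
print(fill_command(
    stage_option("countLines", "command"),
    {"file": "test/data1.txt", "out": "test/data1.count"},
))  # wc -l test/data1.txt > test/data1.count
```

Under these assumptions, an option such as "walltime" or "queue" only needs to appear in a stage's entry when that stage differs from the defaults.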