CWL Tutorial Short Read Assembler DRAFT

This tutorial shows how to create a cwl based biobox of your tool. The procedure consists of four steps.

1.Dockerize your Tool

2.Create a CWL file for your tool.

3.Implement the Bioboxes standard: Rename megahit output.

4.Implement the Bioboxes standard: Put your CWL files together.

Please install first cwl tool by following the installation instructions and clone this repository:

git clone https://github.com/pbelmann/cwl_tutorial.git

The first step is to integrate your tool into a Docker container. There are multiple tutorials online available for dockerizing your tool. The most common way is to write a Dockerfile and make it available on DockerHub.
In the second step you should write a cwl workflow just for your tool. In the following steps we will use the assembler interface as an example and megahit as a an example assembler (megahist_core.cwl):

cwlVersion: cwl:v1.0
class: CommandLineTool
hints:
 DockerRequirement:
   dockerPull: quay.io/biocontainers/megahit:1.1.1--py36_0

inputs:
 fastq:
   type: File[]
   label: interleaved & gzipped fasta/q paired-end files
   inputBinding:
     prefix: --12
     itemSeparator: ','
baseCommand: megahit

arguments:
 - prefix: --out-dir
   valueFrom: $(runtime.outdir)/output

outputs:
  megahit_contigs:
    type: File
    format: edam:format_1929  # FASTA
    outputBinding:
      glob: 'output/final.contigs.fa'
  stderr: stderr

The megahit assembler accepts a list of fastq as input and produces fasta as output. In this workflow we are using the megahit biocontainer which is a collection of containerized bioinformatics software. This workflow can be executed directly with the following command:

Example fastq file is also in this repository (reads.fq.gz).

cwltool  megahit_core.cwl --fastq reads.fq.gz

In the next step you will have to look up the input and output definitions for your workflow (https://github.com/bioboxes/rfc) As mentioned in the previous step will use the assembler interface that needs a fastq file as an input and fasta file as output. The rfc of the short read assembler demands a fasta file as output with the name contigs.fa. This means that the output file must be renamed. This can be done with the following cwl snippet (move.cwl):

cwlVersion: v1.0
class: CommandLineTool
baseCommand: mv
inputs:
  infile:
    type: File
    inputBinding:
      position: 1
  outfile:
    type: string
    inputBinding:
      position: 2
outputs:
  out:
    type: File
    outputBinding:
      glob: $(inputs.outfile)

This is again a cwl snippet which is indepent of other cwl files:

cwltool move.cwl  --infile test.txt --outfile test2.txt

Now you have to combine this two tool descriptions in the following workflow:

cwlVersion: cwl:v1.0
class: Workflow
requirements:
  - class: StepInputExpressionRequirement
inputs:
  dataset:
    type: File[]
outputs:
  result:
    type: File
    outputSource: rename/out
steps:
  megahit:
    run: megahit_core.cwl
    in:
      fastq: dataset
    out:
      - megahit_contigs
  rename:
    run: move.cwl
    in:
      infile: megahit/megahit_contigs
      outfile:
        valueFrom: "contigs.fa"
    out:
      - out

This cwl file references megahit cwl description and the cwl version of the move command. For megahit it just passes the fastq input parameter to the workflow and renames the megahit output to contigs.fa which is requested by the rfc.

cwltool  megahit.cwl --fastq reads.fq.gz

pbelmann / cwl_tutorial

CWL Tutorial Short Read Assembler DRAFT

About

Languages