BovReg / nf-core-faang-wksh20

nf-core training material for FAANG workshop February 2020

nf-core tutorial

Materials for the shared FAANG workshop taking place on the 26th of February 2020 at the Wellcome Genome Campus, Hinxton, Cambridge, UK: "nf-core: A community-driven collection of omics portable pipelines"

Material adapted from the nf-core tutorial

Duration: 1 hr 45 min

Table of Contents

Sections

  • Introduction
  • Installing the nf-core helper tools
  • Listing available nf-core pipelines
  • Running nf-core pipelines
  • Creating nf-core pipelines
  • Testing nf-core pipelines
  • Releasing nf-core pipelines
  • Conclusion

Note: the sections on creating, testing and releasing pipelines can simply be introduced at the end - you can either create new nf-core pipelines or contribute to existing ones - without working through those parts of the tutorial.

Exercises

  • Exercise 1 (installation)
  • Exercise 2 (listing pipelines)
  • Exercise 3 (using pipelines)
  • Exercise 4 (creating pipelines)
  • Exercise 5 (testing pipelines)
  • Exercise 6 (releasing pipelines)

Abstract

The nf-core community provides a range of tools to help new users get to grips with Nextflow - both by providing complete pipelines that can be used out of the box, and by helping developers follow best practices. Companion tools can create a bare-bones pipeline from a template, scattered with TODO pointers, and continuous-integration linting checks code quality. Guidelines and documentation help to get Nextflow newbies on their feet in no time. Best of all, the nf-core community is always on hand to help.

In this tutorial we discuss the best-practice guidelines developed by the nf-core community, explain why they're important, and share tips and tricks for budding Nextflow pipeline developers. ✨

Introduction

What is nf-core

nf-core is a community-led project to develop a set of best-practice pipelines built using Nextflow. Pipelines are governed by a set of guidelines, enforced by community code reviews and automatic linting (code testing). A suite of helper tools aims to help people run and develop pipelines.

What this tutorial will cover

This tutorial attempts to give an overview of how nf-core works: how to run nf-core pipelines, how to make new pipelines using the nf-core template and how nf-core pipelines are reviewed and ultimately released.

Where to get help

The beauty of nf-core is that there is lots of help on offer! The main place for this is Slack - an instant messaging service. The nf-core Slack organisation has channels dedicated to each pipeline, as well as to specific topics (e.g. #new-pipelines, #tools and #aws).

The nf-core Slack can be found at https://nfcore.slack.com (NB: no hyphen in nfcore!). To join you will need an invite, which you can get at https://nf-co.re/join/slack.

One additional tool which this author swears by is TLDR - it gives concise command-line reference through example commands for most Linux tools, including nextflow, docker, singularity, conda, git and more. There are many clients, but raylee/tldr is arguably the simplest - just a single bash script.
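
For example, once a client is installed, a quick reference page is only one command away:

tldr nextflow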

Installing the nf-core helper tools

Much of this tutorial will make use of the nf-core command line tool. This has been developed to provide a range of additional functionality for the project such as pipeline creation, testing and more.

The nf-core tool is written in Python and is available from the Python Package Index and Bioconda. You can install it from PyPI as follows:

pip install nf-core

If using conda, first set up Bioconda as described in the Bioconda documentation and then install nf-core:

conda install nf-core
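
The Bioconda setup referred to above boils down to adding the required channels in the following order (run these once, before installing):

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge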

The nf-core/tools source code is available at https://github.com/nf-core/tools

  • If you prefer, you can clone this repository and install the code locally:
git clone https://github.com/nf-core/tools.git nf-core-tools
cd nf-core-tools
python setup.py install

Once installed, you can check that everything is working by printing the help:

nf-core --help

Exercise 1 (installation)

  • Install nf-core/tools
  • Use the help flag to list the available commands

Listing available nf-core pipelines

As you saw from the --help output, the tool has a range of subcommands. The simplest is nf-core list, which lists all available nf-core pipelines. The output shows the latest version number and when that version was released. If the pipeline has been pulled locally using Nextflow, it tells you when that was and whether you have the latest version.

If you supply additional keywords after the command, the listed pipelines will be filtered. Note that this searches more than just the displayed output, including keywords and description text. The --sort flag allows you to sort the list (the default is by most recently released) and --json gives JSON output for programmatic use.
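
For example, to filter by a keyword, sort by GitHub stars, or get machine-readable output (flag values shown are illustrative):

nf-core list rna
nf-core list --sort stars
nf-core list rna --json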

Exercise 2 (listing pipelines)

  • Use the help flag to print the list command usage
  • List all pipelines
  • Sort pipelines alphabetically, then by popularity (stars)
  • Fetch one of the pipelines using nextflow pull
  • Use nf-core list to see if the pipeline you pulled is up to date
  • Filter pipelines for those that work with RNA
  • Save these pipeline details to a JSON file

Running nf-core pipelines

Software requirements for nf-core pipelines

In order to run nf-core pipelines, you will need to have Nextflow installed (https://www.nextflow.io). The only other requirement is a software packaging tool: Conda, Docker or Singularity. In theory it is possible to run the pipelines with software installed by other methods (e.g. environment modules, or manual installation), but this is not recommended. Most people find either Docker or Singularity the best options.

Fetching pipeline code

Unless you are actively developing pipeline code, we recommend using the Nextflow built-in functionality to fetch nf-core pipelines. Nextflow will automatically fetch the pipeline code when you run nextflow run nf-core/PIPELINE. For the best reproducibility, it is good to explicitly reference the pipeline version number that you wish to use with the -revision/-r flag. For example:

nextflow run nf-core/rnaseq -revision 1.3

If not specified, Nextflow will fetch the master branch - for nf-core pipelines this will be the latest release. If you would like to run the latest development code, use -r dev.

Note that once pulled, Nextflow will use the local cached version for subsequent runs. Use the -latest flag when running the pipeline to always fetch the latest version. Alternatively, you can force Nextflow to pull a pipeline again using the nextflow pull command:

nextflow pull nf-core/rnaseq

Usage instructions and documentation

You can find general documentation and instructions for Nextflow and nf-core on the nf-core website: https://nf-co.re/. Pipeline-specific documentation is bundled with each pipeline in the /docs folder. This can be read either locally, on GitHub, or on the nf-core website. Each pipeline has its own webpage at https://nf-co.re/PIPELINE.

In addition to this documentation, each pipeline comes with basic command line reference. This can be seen by running the pipeline with the --help flag, for example:

nextflow run nf-core/rnaseq --help

Config profiles

Nextflow can load pipeline configurations from multiple locations. To make it easy to apply a group of options on the command line, Nextflow uses the concept of config profiles. nf-core pipelines load configuration in the following order:

  1. Pipeline: Default 'base' config
    • Always loaded. Contains pipeline-specific parameters and "sensible defaults" for things like computational requirements
    • Does not specify any method for software packaging. If nothing else is specified, Nextflow will expect all software to be available on the command line.
  2. Pipeline: Core config profiles
    • All nf-core pipelines come with some generic config profiles. The most commonly used ones are for software packaging: docker, singularity and conda
    • Other core profiles are awsbatch, debug and test
  3. nf-core/configs: Server profiles
    • At run time, nf-core pipelines fetch configuration profiles from the configs remote repository. The profiles here are specific to clusters at different institutions.
    • Because this is loaded at run time, anyone can add a profile here for their system and it will be immediately available for all nf-core pipelines.
  4. Local config files given to Nextflow with the -c flag
  5. Command line configuration

Multiple comma-separated config profiles can be specified in one go, so the following commands are perfectly valid:

nextflow run nf-core/rnaseq -profile test,docker
nextflow run nf-core/hlatyping -profile singularity,debug

Note that the order in which config profiles are specified matters. Their priority increases from left to right.
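
For fully custom settings, a local config file passed with -c (step 4 above) can override anything set by the profiles. As a minimal sketch, a config that raises the memory for a single process might look like this (the file name, process name and value are illustrative):

// my_custom.config
process {
  withName: 'star' {
    memory = 64.GB
  }
}

nextflow run nf-core/rnaseq -profile docker -c my_custom.config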

Running pipelines with test data

The test config profile is a bit of a special case. Whereas all other config profiles tell Nextflow how to run on different computational systems, the test profile configures each nf-core pipeline to run without any other command line flags. It specifies URLs for test data and all required parameters. Because of this, you can test any nf-core pipeline with the following command:

nextflow run nf-core/PIPELINE -profile test

Note that you will typically still need to combine this with a configuration profile for your system - e.g. -profile test,docker. Running with the test profile is a great way to confirm that you have Nextflow configured properly for your system before attempting to run with real data.

The nf-core launch command

Most nf-core pipelines have a number of flags that need to be passed on the command line: some mandatory, some optional. To make it easier to launch pipelines, these parameters are described in a JSON file bundled with the pipeline. The nf-core launch command uses this to build an interactive command-line wizard which walks through the different options with descriptions of each, showing the default value and prompting for values.


NOTE

This is an experimental feature - the JSON file and rich parameter descriptions are not yet available for all pipelines.


Once all prompts have been answered, non-default values are saved to a params.json file which can be supplied to Nextflow to run the pipeline. Optionally, the Nextflow command can be launched there and then.

To use the launch feature, just specify the pipeline name:

nf-core launch <PIPELINE>
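
For example, to launch the rnaseq pipeline and then run it with the answers you gave (assuming they were saved to a file called params.json):

nf-core launch rnaseq
nextflow run nf-core/rnaseq -params-file params.json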

Using nf-core pipelines offline

Many of the techniques and resources described above require an active internet connection at run time - pipeline files, configuration profiles and software containers are all dynamically fetched when the pipeline is launched. This can be a problem for people using secure computing resources that do not have connections to the internet.

To help with this, the nf-core download command automates the fetching of required files for running nf-core pipelines offline. The command can download a specific release of a pipeline with -r/--release and fetch the Singularity container if --singularity is passed (this needs Singularity to be installed). All files are saved to a single directory, ready to be transferred to the cluster where the pipeline will be executed.
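
For example, to fetch a release of the rnaseq pipeline together with its Singularity image (the release number is just an example):

nf-core download rnaseq -r 1.3 --singularity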

Exercise 3 (using pipelines)

  • Install required dependencies (nextflow, docker)
  • Print the command-line usage instructions for the nf-core/rnaseq pipeline
  • In a new directory, run the nf-core/rnaseq pipeline with the provided test data
  • Try launching the RNA pipeline using the nf-core launch command
  • Download the nf-core/rnaseq pipeline for offline use using the nf-core download command

Creating nf-core pipelines

Using the nf-core template

The heart of nf-core is the standardisation of pipeline code structure. To achieve this, all pipelines adhere to a generalised pipeline template. The best way to build an nf-core pipeline is to start by using this template via the nf-core create command. This launches an interactive prompt on the command line which asks for things such as pipeline name, a short description and the author's name. These values are then propagated throughout the template files automatically.
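
For example (the prompts are interactive, and the resulting directory name below is illustrative):

nf-core create
# ...answer the prompts for pipeline name, description and author...
cd nf-core-<pipeline_name>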

TODO statements

Not everything can be completed from a template, and every new pipeline will need edits and additions to the resulting files in a similar set of locations. To make these easier to find, the nf-core template files contain numerous comment lines beginning with TODO nf-core:, followed by a description of what should be changed or added. These comment lines can be deleted once the required change has been made.

Most code editors have tools to automatically discover such TODO lines and the nf-core lint command will flag these. This makes it simple to systematically work through the new pipeline, editing all files where required.
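
For example, any recursive grep from the pipeline root will list the remaining TODO items:

grep -rn "TODO nf-core" .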

How nf-core software packaging works

The only hard requirement for all nf-core pipelines is that software must be available in Docker images. However, it is recommended that pipelines use the following methodology where possible:

  1. Software requirements are defined for Conda in environment.yml
  2. Docker images are automatically built on Docker Hub, using Conda
  3. Singularity images are generated from Docker Hub at run time for end users

This approach has the following merits:

  • A single file contains a list of all required software, making it easy to maintain
  • Identical (or as close as possible) software is available for users running with Conda, Docker or Singularity
  • Having a single container image for the pipeline uses disk space efficiently for Singularity images, and is simple to manage and transfer.

The reason that the above approach is not a hard requirement is that some issues can prevent it from working, such as:

  • It may not be possible to package software on conda due to software licensing limitations
  • Different packages may have dependency conflicts which are impossible to resolve

Alternative approaches are then decided upon on a case-by-case basis. We encourage you to discuss this on Slack early on as we have been able to resolve some such issues in the past.

Building environment.yml

The nf-core template will create a simple environment.yml file for you with an environment name, conda channels and one or two dependencies. You can then add additional required software to this file. Note that all software packages must have a specific version number pinned - the format is a single equals sign, e.g. package=version.

Where software packages are not already available on Bioconda or Conda-forge, we encourage developers to add them. This benefits the wider community, not just users of the nf-core pipeline.
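
As a rough illustration, an environment.yml with a couple of pinned Bioconda packages might look like the sketch below (the environment name, packages and versions are only examples - the template generates the name and channels for you):

name: nf-core-examplepipeline-1.0dev
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - fastqc=0.11.8
  - multiqc=1.7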

Running with Docker locally

You can use Docker for testing by building the image locally. The pipeline expects a container with a specific name, so you must tag the Docker image with this. You can build and tag an image in a single step with the following command:

docker build -t nfcore/PIPELINE:dev .

Note that it is nfcore without a hyphen (Docker Hub doesn't allow any punctuation). The . refers to the current working directory - if run in the root pipeline folder, this tells Docker to use the Dockerfile recipe found there.
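
Once the image is built and tagged like this, a quick sanity check (assuming you are still in the pipeline root folder) is to run the pipeline locally with the test and docker profiles, which will pick up the local dev image:

nextflow run . -profile test,docker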

Forks, branches and pull-requests

All nf-core pipelines use GitHub as their code repository, and git as their version control system. For newcomers to this world, it is helpful to know some of the basic terminology used:

  • A repository contains everything for a given project
  • Commits are code checkpoints.
  • A branch is a linear string of commits - multiple parallel branches can be created in a repository
  • Commits from one branch can be merged into another
  • Repositories can be forked from one GitHub user to another
  • Branches from different forks can be merged via a Pull Request (PR) on github.com

Typically, people will start developing a new pipeline under their own personal account on GitHub. When it is ready for its first release and has been discussed on Slack, this repository is forked to the nf-core organisation. All developers then maintain their own forks of this repository, contributing new code back to the nf-core fork via pull requests.
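
A typical contribution cycle with this setup looks something like the following (repository and branch names are placeholders):

# Clone your personal fork and add the nf-core repository as a second remote
git clone https://github.com/YOUR_USERNAME/PIPELINE.git
cd PIPELINE
git remote add upstream https://github.com/nf-core/PIPELINE.git

# Start a feature branch from an up-to-date dev branch
git checkout dev
git pull upstream dev
git checkout -b my-new-feature

# Commit and push to your fork, then open a pull request
# against the nf-core dev branch on github.com
git commit -am "Add my new feature"
git push origin my-new-feature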

All nf-core pipelines must have the following three branches:

  1. master - commits from stable releases only. Should always have code from the most recent release.
  2. dev - current development code. Merged into master for releases.
  3. TEMPLATE - used for template automation by the @nf-core-bot GitHub account. Should only contain commits with unmodified template code.

Pull requests to the nf-core fork have a number of automated steps that must pass before the PR can be merged. A few points to remember are:

  • The pipeline CHANGELOG.md must be updated
  • PRs must not be against the master branch (typically you want dev)
  • PRs should be reviewed by someone else before being merged

Setting up Docker and Travis CI

When you fork your pipeline repository to the nf-core organisation, one of the core team will set up Travis CI (automated testing) and Docker Hub (automated Docker image creation) for you. However, it can be helpful to set these up on your personal fork as well. That way, you can be confident that everything will work when you fork or open a PR on the nf-core organisation.

Both services are free to use. To set them up, visit https://travis-ci.com and https://hub.docker.com and link your personal GitHub repository.

Exercise 4 (creating pipelines)

  • Make a new pipeline using the template
  • Update the readme file to fill in the TODO statements
  • Add a new process to the pipeline in main.nf
  • Add the new software dependencies from this process to environment.yml

Testing nf-core pipelines

Linting nf-core pipelines

Manually checking that a pipeline adheres to all nf-core guidelines and requirements is a difficult job. Wherever possible, we automate such code checks with a code linter. This runs through a series of tests and reports failures, warnings and passed tests.

The linting code is closely tied to the nf-core template and both change over time. When we change something in the template, we often add a test to the linter to make sure that pipelines do not use the old method.

Each lint test has a number and is documented on the nf-core website. When warnings and failures are reported on the command line, a short description is printed along with a link to the documentation for that specific test on the website.

Code linting is run automatically every time you push commits to GitHub, open a pull request or make a release. You can also run these tests yourself locally with the following command:

nf-core lint /path/to/pipeline

When merging PRs from dev to master, the lint command will be run with the --release flag which includes a few additional tests.

nf-core/test-datasets

When adding a new pipeline, you must also set up the test config profile. To do this, we use the nf-core/test-datasets repository. Each pipeline has its own branch on this repository, meaning that the data can be cloned without having to fetch all test data for all pipelines:

git clone --single-branch --branch PIPELINE https://github.com/nf-core/test-datasets.git

To set up the test profile, make a new branch on the nf-core/test-datasets repo through the web page (see instructions). Fork the repository to your user and open a PR to your new branch with a really (really!) tiny dataset. Once merged, set up the conf/test.config file in your pipeline to refer to the URLs for your test data.

These test datasets are used by the automated continuous integration tests. The systems that run these tests are extremely limited in the resources that they have available. Typically, the pipeline should be able to complete in around 10 minutes and use no more than 6-7 GB memory. To achieve this, input files and reference genomes need to be very tiny. If possible, a good approach can be to use PhiX or Yeast as a reference genome. Alternatively, a single small chromosome (or part of a chromosome) can be used. If you are struggling to get the tests to run, ask for help on Slack.

When writing conf/test.config remember to define all required parameters so that the pipeline will run with only -profile test. Note that remote URLs cannot be traversed like a regular file system - so glob file expansions such as *.fa will not work.
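
To make this concrete, a minimal conf/test.config might look something like the sketch below (the parameter names and test-data URL are illustrative and will differ between pipelines):

// conf/test.config - illustrative sketch only
params {
  config_profile_name        = 'Test profile'
  config_profile_description = 'Minimal test dataset to check pipeline function'
  // Keep resource requests small enough for CI runners
  max_cpus   = 2
  max_memory = 6.GB
  max_time   = 48.h
  // Input data hosted on the pipeline's branch of nf-core/test-datasets
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/PIPELINE/testdata/sample_1.fastq.gz'
}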

Travis CI configuration

The automated tests with Travis CI are configured in the .travis.yml file that is generated by the template. The script block defines three tests: linting the code with nf-core lint, linting the syntax of all Markdown documentation and running the pipeline with the test data.

The env section sets the NXF_VER environment variable twice. This tells Travis to run the tests twice in parallel: once with the latest version of Nextflow (NXF_VER='') and once with the minimum version supported by the pipeline. Do not edit this version number manually - it appears in multiple locations throughout the pipeline code, so it's better to use nf-core bump-version --nextflow instead.

The provided tests may be sufficient for your pipeline. However, if it is possible to run the pipeline with significantly different options (for example, different alignment tools), then it is good to test all of these. You can do this by adding additional commands in the script block.
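
Putting the pieces described above together, a simplified sketch of the relevant parts of .travis.yml might look like this (the Nextflow version and markdownlint path are illustrative; the real template contains more):

env:
  - NXF_VER='19.10.0'  # minimum Nextflow version supported by the pipeline (example value)
  - NXF_VER=''         # empty value means the latest Nextflow release
script:
  # Lint the pipeline code against the nf-core guidelines
  - nf-core lint ${TRAVIS_BUILD_DIR}
  # Lint the Markdown documentation
  - markdownlint ${TRAVIS_BUILD_DIR} -c ${TRAVIS_BUILD_DIR}/.github/markdownlint.yml
  # Run the pipeline with the test data
  - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker
  # Additional test runs with different options can be added here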

Exercise 5 (testing pipelines)

  • Run nf-core lint on your pipeline and make note of any test warnings / failures
  • Read up on one or two of the linting rules on the nf-core website and see if you can fix some.
  • Take a look at conf/test.config and switch the test data for another dataset on nf-core/test-datasets.

Releasing nf-core pipelines

Your pipeline is written and ready to go! Before you can release it with nf-core there are a few steps that need to be done. First, tell everyone about it on Slack in the #new-pipelines channel. Hopefully you've already done this before you spent lots of time on your pipeline, to check that there aren't other similar efforts happening elsewhere. Next, you need to be a member of the nf-core GitHub organisation. You can find instructions for how to do this at https://nf-co.re/join.

Forking to nf-core

Once you're ready to go, you can fork your repository to nf-core. A lot of stuff happens automatically when you do this: the website will update itself to include your new pipeline, complete with rendered documentation pages and usage statistics. Your pipeline will also appear in the nf-core list command output and in various other locations.

Unfortunately, at the time of writing, the Travis CI, Docker Hub and Zenodo (automated DOI assignment for releases) integrations are not set up automatically. These can only be configured by nf-core administrators, so please ask someone to do this for you on Slack.

Initial community review

Once everything is set up and all tests are passing on the dev branch, let us know on Slack and we will do a large community review. This is a one-off process that is done before the first release for all pipelines. In order to give a nice interface to review all pipeline code, we create a "pseudo pull request" comparing dev against the first commit in the pipeline (hopefully the template creation). This PR will never be merged, but gives the GitHub review web pages where people can comment on specific lines in the code.

These first community reviews can take quite a long time and typically result in a lot of comments and suggestions (nf-core/deepvariant famously had 156 comments before it was approved). Try not to be intimidated - this is the main step where the community attempts to standardise and suggest improvements for your code. Your pipeline will come out the other side stronger than ever!

Making the first release

Once the pseudo-PR is approved, you're ready to make the release. To do this, first bump the pipeline version to a stable tag using nf-core bump-version, then open a pull request from the dev branch to master. Once tests are passing and two nf-core members have approved this PR, it can be merged to master. Then a GitHub release is made, using the contents of the changelog as a description.

Pipeline version numbers (release tags) should be numerical only, using semantic versioning. For example, with release version 1.4.3: bumping 1 would be a major release, where results are no longer backwards compatible; changing 4 would be a minor release, for example adding new features; changing 3 would be a patch release for minor things such as bug fixes.

Template updates

Over time, new versions of nf-core/tools will be released with changes to the template. In order to keep all nf-core pipelines in sync, we have developed an automated synchronisation procedure. A GitHub bot account, @nf-core-bot, is scripted to run nf-core create with the new template whenever a new version of tools is released, using the input values you originally supplied for your pipeline. The result is committed to the TEMPLATE branch and a pull request is created to incorporate these changes into dev.

Note that these PRs can sometimes create git merge conflicts which will need to be resolved manually. There are plugins for most code editors to help with this process. Once resolved and checked, this PR can be merged and a new pipeline release created.

Exercise 6 (releasing pipelines)

  • Use nf-core bump-version to update the required version of Nextflow in your pipeline
  • Bump your pipeline's version to 1.0, ready for its first release!
  • Make sure that you're signed up to the nf-core Slack (get an invite on nf-co.re) and drop us a line about your latest and greatest pipeline plans!
  • Ask to be a member of the nf-core GitHub organisation by commenting on this GitHub issue
  • If you're a Twitter user, make sure to follow the @nf_core account

Conclusion

I hope that this nf-core tutorial has been helpful! Remember that there is more in-depth documentation on many of these topics available on the nf-core website. If in doubt, please ask for help on Slack.

If you have any suggestions for how to improve this tutorial, or spot any mistakes, please create an issue or pull request on the nf-core/nf-co.re repository.

About

nf-core training material for FAANG workshop February 2020

License: Apache License 2.0