bioinformatics bioinformatics-analysis bioinformatics-pipeline computational-biology data-science reproducible-research reproducible-science

Guidelines for accountable and reproducible computational research

Why?

Computational projects should contain a description sufficient to understand the data, processing steps, and results.

It should be possible to reproduce the results from the data by running the scripts.

Core principles:

Describe data sources, versions, formats and necessary metadata
Describe your environment, workflow, tools, versions, and parameters
Track changes with meaningful comments and version tags
Document your code with testable examples, let others review your code
Execute workflows without manual intervention
Figures and tables should be reproducible by running scripts

README.md

Have a readme file that serves as the project home page. It should describe the project scope and provide an inventory of code, data, and other files. In many cases, a single readme file is sufficient. Markdown is a convenient format for writing documentation.

Version control

Track the changes with git.

Avoid creating multiple copies of files.

Git repositories can be local: using git does not mean your project is open to anyone. But if someone else needs to edit the same code or the same files, git makes it much easier.

Git is easier than you might think. For basic usage, you only need three commands: git init, git add, and git commit. Besides, most development environments have built-in support for git, for example RStudio.

Do not add large data files to git. Use .gitignore to exclude files and directories. If needed, use Git Large File Storage (LFS), a git extension that can handle large files.

Data provisioning

Keep track of how you obtained the data, how it was generated, and whether it was already preprocessed.

Use accession numbers or timestamps to version the dataset.

If multiple versions are used, keep the old versions whenever possible.

Make sure your data is backed up and that the frequency of backups is sufficient.

Always follow privacy guidelines for protected data, such as dbGaP.

Minimize moving the data between computers.

If data was obtained from a repository, keep a snapshot whenever possible.

It is best to have scripts that download the data rather than steps that require manual intervention.
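
A download step of this kind could be sketched as follows, assuming a plain URL download. The function name `fetch`, the `.provenance.json` sidecar file, and its field names are illustrative choices, not part of these guidelines.

```python
import hashlib
import json
import shutil
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url, dest, accession):
    """Download url to dest and write a provenance record next to it.

    The record layout is a hypothetical example: source URL, accession
    or version label, retrieval time (UTC), and a checksum of the file.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    record = {
        "file": dest.name,
        "source": url,
        "accession": accession,
        "retrieved": datetime.now(timezone.utc).isoformat(),
        "sha256": sha256_of(dest),
    }
    sidecar = Path(str(dest) + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return record
```

The checksum lets you confirm later that the file you are analyzing is the one you downloaded, and the sidecar file is small enough to track with git even when the data itself is not.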

Data formats evolve and often have versions. Make a copy of the data format description from the website or repository.

Workflows

A computational workflow describes the sequence of steps necessary to produce the results. Run the workflow as a script. Avoid manual intervention.

Even if you are just experimenting, look up the commands that you typed (run history) and write a script.

Names and versions of tools as well as their exact parameters are extremely important for reproducibility.

If you obtained code or packages directly from GitHub, remember to note the commit hash or tag name, because when you check out the same repository again it may already be at a different version.

It is also necessary to document the environment, for instance with conda, R's sessionInfo(), or MRAN snapshots. You don't need admin permissions to install conda. Most of the popular genomics tools are available in conda (Bioconda); if a tool is not, it is easy to create a "recipe" for it. A conda environment can be stored as an environment.yml file, which you can keep with your code and track with git.

Set the random seed in probabilistic algorithms (those using permutations, bootstrap, or cross-validation).
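
As a minimal sketch of seeding, assuming a bootstrap of the mean using only the Python standard library (the function name and default values are illustrative):

```python
import random
import statistics


def bootstrap_means(values, n_resamples=1000, seed=42):
    """Return bootstrap estimates of the mean, reproducible for a given seed."""
    # A local generator keeps the seed explicit and leaves the global
    # random state untouched.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original data.
        sample = rng.choices(values, k=len(values))
        means.append(statistics.fmean(sample))
    return means
```

Calling the function twice with the same seed returns identical lists. Record the seed together with the other workflow parameters.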

The same workflow may give different results depending on the computer platform, OS version, and system libraries. The best case is to have a Docker/Singularity/AWS container, but realistically that is not always possible. If you don't have a container, make notes about the computer and OS version.
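
When no container is available, the notes about the computer can themselves be produced by a script. The sketch below uses Python's standard platform module; the choice of fields is an assumption about what is worth recording, and it does not capture system library versions.

```python
import json
import platform


def describe_platform():
    """Collect basic facts about the interpreter and operating system."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),    # e.g. Linux, Darwin, Windows
        "release": platform.release(),  # OS or kernel release string
        "machine": platform.machine(),  # e.g. x86_64, arm64
    }


if __name__ == "__main__":
    # Print as JSON so the output can be saved next to the results.
    print(json.dumps(describe_platform(), indent=2))
```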

Make sure your workflow uses the intended version or snapshot of the data. If you update the dataset, for instance for a paper revision, add a new directory whenever possible. Do not just replace the old files, or your previous results will no longer be reproducible.

There are numerous ways to create and document a workflow. The simplest is to list the commands, specifying input, output, and parameters, in a shell script, Python script, or R script. More advanced options include Snakemake, CWL, and Nextflow. Store the workflow scripts with your code and track them with git, so that you can retrieve an older revision of the workflow.

The readme file should describe the steps necessary to install the environment and run the workflow.

Code

Use git to track changes in code.

Git can also help you to synchronize code across the computers, e.g. between desktop, laptop, and Biowulf.

Make small changes at a time that can be described with one sentence (git commit). It will be much easier to find changes that broke something and roll them back.

Sprinkle your code with assertions (assert in Python, stopifnot() in R) to check that your data, parameters, and variables are within the expected ranges. It is the easiest way to catch bugs early on.

Use small test files to check that your program works as intended. Save test files in git alongside your code. Write a short script that runs the test and checks the results. It could be one line of code. Make it a habit to run this test every time you introduce any changes. It will also serve as an example for someone who is trying to figure out how to use your program.

If you find a bug, make a test example out of it. No extra work is needed: you probably already have a case where the program fails or returns incorrect results. This way you will never have to face the same bug again. A collection of tests is called a test suite. Run it every time you make changes in the code.
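
For illustration, a hypothetical function that normalizes read counts might check its input and output like this:

```python
def normalize_counts(counts):
    """Convert raw counts to fractions that sum to 1."""
    # Check the input before doing any work.
    assert len(counts) > 0, "counts must not be empty"
    assert all(c >= 0 for c in counts), "counts must be non-negative"
    total = sum(counts)
    assert total > 0, "at least one count must be positive"

    fractions = [c / total for c in counts]

    # Check the result: fractions should sum to 1 up to rounding error.
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    return fractions
```

Note that Python skips assert statements when run with the -O option, so assertions suit sanity checks during analysis rather than validation that must always run.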
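
A small test suite can live in the same file as the code or next to it. In this sketch, gc_content and the empty-sequence bug are invented for illustration; the second test is the kind of regression test you would add after finding such a bug.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        # Hypothetical past bug: an empty input caused a division by zero.
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def test_gc_content_basic():
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("acgt") == 0.5


def test_gc_content_empty_sequence():
    # Regression test: keeps the fixed bug from coming back unnoticed.
    assert gc_content("") == 0.0


if __name__ == "__main__":
    test_gc_content_basic()
    test_gc_content_empty_sequence()
    print("all tests passed")
```

Running the file directly executes both tests. A test runner such as pytest would also discover functions whose names start with test_.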

Show your code to others in the lab. When your project reaches a milestone, for instance when a paper is submitted or about to be submitted, initiate a code review. A code review is similar to asking others to read your paper draft. During the review, someone else examines the project's structure, reads the documentation, and walks through the critical sections of the analysis code. Not only is it an effective quality control measure, it also greatly increases the chance that other lab members will reuse (parts of) your code in their projects. Also, if or when you leave the lab, people will be able to figure things out. If the project code is intended to be published as a stand-alone software package, a review will help to wrap it up. The ultimate test is whether someone else can reproduce your results using your workflow and your code.

Figures and tables

Figures and tables should be generated by code, and the code that generates them should be tracked with git.

You don't want to generate a plot in an R session, save the plot, and move on with your analysis: the plot will not be reproducible.

It is highly recommended to save tables with the data shown in the figures. You will then be able to change how a figure looks without rerunning the whole workflow. Often you will need these tables for supplemental materials in any case.

Note that plotting sometimes involves randomization techniques, so to reproduce the exact plot you will need to set the random number generator seed in the code.
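
One way to do this, sketched with Python's standard csv module (the function names and the CSV format are illustrative choices), is to write the plotted values to a file in the same script that draws the figure, and to have the plotting code read from that file:

```python
import csv


def save_figure_data(rows, fieldnames, path):
    """Write the values shown in a figure to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_figure_data(path):
    """Read the saved values back; every value is returned as a string."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

Restyling the figure then only requires load_figure_data and the plotting call, not the upstream analysis.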

Jupyter (Python, R, shell) and RStudio (R Markdown) notebooks are the best way to integrate text, workflow, code, figures, and tables. In a single file, you can have a complete report or even a paper draft. Moreover, you will be able to change and rerun a single section of the report as many times as you need without re-running the whole thing. You can export the report in HTML, PDF, DOC or print it out. You can even run a notebook on a remote (more powerful) computer. Some notebooks allow you to mix sections written in different programming languages.

About

Creative Commons Zero v1.0 Universal

neksa / reproducible-research-guidelines