fg6 / YeastStrainsStudy

Scripts to download the PacBio, ONT and MiSeq datasets used in https://www.nature.com/articles/s41598-017-03996-z and run the pipelines as described in the paper or simply download the final assemblies as generated by the authors.

Instructions

Download repository:

git clone https://github.com/fg6/YeastStrainsStudy.git

Usage:

$ ./launchme.sh <command> <strain>
  command: command to be run. Options: install, download, check, deepcheck, clean, nanoclean,
           finalfastas, findassembly
  strain:  download data for this strain (or all strains), only for command=download, check or deepcheck
           Options: s288c, sk1, cbs, n44, all [s288c]
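
The strain options listed above can be checked before launchme.sh is called. A minimal sketch in shell, where valid_strain is a hypothetical helper that is not part of the repository:

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): accept only the strain
# names listed in the usage text.
valid_strain() {
    case "$1" in
        s288c|sk1|cbs|n44|all) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (commented out so the snippet has no side effects):
# strain="${1:-s288c}"          # s288c is the documented default
# valid_strain "$strain" && ./launchme.sh download "$strain"
```

Rejecting an unknown strain name up front avoids starting a long download with a mistyped argument.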

Download data and utilities

With the script launchme.sh you can download the complete datasets used in the analysis in the paper (https://www.nature.com/articles/s41598-017-03996-z) to run the pipelines yourself, or download only the final assemblies generated by the authors of the paper.

!!! Warning !!!: due to a recent protocol change in the EBI database, this script fails to export
	 MiSeq cram files to fastq files. If you experience this problem, please use scramble
	 (https://www.biorxiv.org/content/early/2014/03/28/003640) to export to fastq,
	 or download the fastq files directly from ENA.
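
If the export fails, one possible workaround is sketched below. It swaps in samtools (listed in this README under the additional software) rather than scramble, and assumes that the cram file holds paired-end reads and that samtools can locate the reference the cram file was encoded against. The helper names and the file name are hypothetical:

```shell
#!/bin/sh
# Hypothetical helpers, not part of the repository.

# Derive an output prefix from the cram file name: /data/run1.cram -> run1
fastq_prefix() {
    basename "$1" .cram
}

# Group read pairs together, then write them to <prefix>_1.fastq and
# <prefix>_2.fastq; unpaired and other reads are discarded.
cram_to_fastq() {
    prefix="$(fastq_prefix "$1")"
    samtools collate -u -O "$1" |
        samtools fastq -1 "${prefix}_1.fastq" -2 "${prefix}_2.fastq" \
            -0 /dev/null -s /dev/null -n -
}

# Example (commented out; requires samtools and a downloaded cram file):
# cram_to_fastq miseq_run.cram
```

The resulting file names may differ from what the pipelines expect, so check them against the fastq files that launchme.sh would normally prepare.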

Download only the final assemblies

To just look at the assemblies generated by the pipelines:

Step 1. Download the assemblies:
$ ./launchme.sh finalfastas
Step 2. List the assemblies, selecting strain, assembler and/or platform:
$ ./launchme.sh findassembly

!!!!!   Warning  !!!!! 
This command is interactive: it will ask you which strain, assembler or platform you want to focus on.

Download all the data to run the pipelines:

Step 1. Download and install needed codes and scripts:
$ ./launchme.sh install
Step 2. Download data and prepare the fastq files:
$ ./launchme.sh download <strain> 

    strain= s288c, sk1, n44, cbs or all  [s288c]

Step 3. Once the data have been downloaded and the fastq files prepared, check the fastq files:
$ ./launchme.sh check <strain> 

    strain= s288c, sk1, n44, cbs or all  [s288c]

If the check gives warnings, some files probably failed to download properly;
follow the instructions given in the output.
If the instructions do not help, try:

$ ./launchme.sh deepcheck <strain>

Step 4/A. If everything looks OK and there are no warnings from Step 3, you can clean up the data folders, deleting all intermediate files and folders:
    $ ./launchme.sh clean <strain>

!!!!!   Warning  !!!!! 
1. Please run this only after Step 3 and only if Step 3 showed no errors or warnings, 
	otherwise you will have to download everything again!
2. Please do not run this if you intend to run Nanopolish,
        as Nanopolish needs the s288c fast5 files; run Step 4/B instead.

Step 4/B. If everything looks OK and there are no warnings from Step 3, you can clean up the data folders, deleting all intermediate files and folders not needed by Nanopolish:
    $ ./launchme.sh nanoclean <strain>

    !!!!!   Warning  !!!!!
    Please run this only after Step 3 and only if Step 3 showed no errors or warnings,
      otherwise you will have to download everything again!
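
Steps 2 and 3 above can be chained for one strain. A minimal sketch, assuming that launchme.sh returns a non-zero exit status when the download fails (the README does not state this); prepare_strain and the LAUNCHME variable are hypothetical names:

```shell
#!/bin/sh
# Path to the launcher; overridable so the function can be dry-run.
LAUNCHME="${LAUNCHME:-./launchme.sh}"

# Hypothetical wrapper: download the data for a strain, then check the fastqs.
# The cleaning steps (4/A, 4/B) are deliberately left to be run by hand
# after the check output has been read.
prepare_strain() {
    strain="${1:-s288c}"    # s288c is the documented default
    "$LAUNCHME" download "$strain" || return 1
    "$LAUNCHME" check "$strain"
}

# Example (commented out; downloads a large amount of data):
# prepare_strain sk1
```

Setting LAUNCHME=echo prints the two commands instead of running them, which is a cheap way to confirm the arguments first.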

Disk space required:

If not cleaning up: 1.7TB

After cleaning all (clean): < 30GB.

After cleaning all except files for Nanopolish (nanoclean): ~700GB
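
Given these figures, it can help to confirm the free space before downloading. A minimal sketch using POSIX df; has_free_kb is a hypothetical helper, and the 1.7TB figure is approximated here as 1700000000 KB:

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $1 has at least $2 KB available.
has_free_kb() {
    # df -P prints one line per filesystem; -k reports 1024-byte blocks.
    avail="$(df -Pk "$1" | awk 'NR == 2 { print $4 }')"
    [ "$avail" -ge "$2" ]
}

# Example (commented out): warn before a full, uncleaned download.
# has_free_kb . 1700000000 || echo "Less than ~1.7TB free in $(pwd)" >&2
```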

Requirements for installing and preparing data:

Python >= 2.7 is required. Please make sure it is available in your PATH, together with virtualenv. A C++11-capable compiler is also required.
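
These requirements can be verified before running the install command. A minimal sketch; version_ge and check_requirements are hypothetical helpers, and the compiler test assumes a GCC- or Clang-style driver reachable as $CXX or c++:

```shell
#!/bin/sh
# Hypothetical helpers, not part of the repository.

# Succeed if dotted version $1 is >= dotted version $2 (up to 3 fields).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Report each missing requirement; returns non-zero if any is missing.
check_requirements() {
    status=0
    pyver="$(python -c 'import sys; print("%d.%d.%d" % sys.version_info[:3])' 2>/dev/null)"
    version_ge "${pyver:-0}" 2.7 ||
        { echo "python >= 2.7 not found in PATH" >&2; status=1; }
    command -v virtualenv >/dev/null 2>&1 ||
        { echo "virtualenv not found in PATH" >&2; status=1; }
    # Assumption: the compiler accepts -std=c++11, -x c++ and -fsyntax-only.
    echo 'int main() { auto x = 0; return x; }' |
        "${CXX:-c++}" -std=c++11 -x c++ -fsyntax-only - 2>/dev/null ||
        { echo "no C++11-capable compiler found" >&2; status=1; }
    return "$status"
}

# Example (commented out):
# check_requirements && ./launchme.sh install
```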

Pipelines

After running launchme.sh, you can run the various pipelines from the 'pipelines' folder.

Example:

cd pipelines
./canu.sh <canu_location> <strain> <platform> <cov>

For details on the pipelines, see pipelines/README.md or launch each script with the option "-h".

Warning! Please note that the assemblers and scaffolders (except for smis) are not installed by the launchme.sh script. To run the pipelines you need installations of:

ABruijn (https://github.com/fenderglass/ABruijn)

Canu (https://github.com/marbl/canu)

PBcR (http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR)

Falcon-integrate (https://github.com/PacificBiosciences/FALCON-integrate)

Smartdenovo (https://github.com/ruanjue/smartdenovo)

MiniAsm and MiniMap (https://github.com/lh3/miniasm, https://github.com/lh3/minimap/)

Racon (https://github.com/isovic/racon)

Nanopolish (https://github.com/jts/nanopolish)

SPAdes (http://bioinf.spbau.ru/spades)

npScarf (https://github.com/mdcao/npScarf)

Additional software needed: bwa (https://github.com/lh3/bwa), samtools (https://github.com/samtools/samtools), bamtools (https://github.com/pezmaster31/bamtools)
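
Before launching a pipeline, the additional tools can be looked up in PATH. A minimal sketch; missing_tools is a hypothetical helper, and whether the pipeline scripts find these tools through PATH (rather than through an explicit location argument, as with canu.sh) is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: print every named tool that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example (commented out): check the additional software named above.
# missing="$(missing_tools bwa samtools bamtools)"
# [ -z "$missing" ] || echo "Missing from PATH: $missing" >&2
```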
