WGS_TCGA_Pipeline

Snakemake Pipeline to Mine WGS Data for Contaminant Reads

This pipeline is not actually TCGA-specific, but it does require, as input, WGS short reads mapped against a human host genome, in .bam format.
Some code (namely, the sample parsing in samples.smk and decontamination in fastp.smk) has been borrowed and modified from hecatomb.
The only required input is the set of relevant BAM files, all of which must be placed in the directory specified with the 'Bams' config option.
The only software requirements are conda and a snakemake executable in the $PATH. The rest of the required programs should install via conda.
extract_reads.py has been taken from KrakenTools.
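
If you want to confirm the two software requirements before starting, a small pre-flight check along these lines may help. This is only a sketch and is not part of the pipeline; the function name missing_tools is made up for illustration.

```python
import shutil

# Tools the README says must already be available; everything else is
# expected to be installed by conda when the workflow runs.
REQUIRED_TOOLS = ("conda", "snakemake")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling missing_tools() returns an empty list when both conda and snakemake are found on the $PATH, and otherwise lists whichever ones are missing.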

Usage

snakemake -c 1 -s DownloadDB.smk

The extraction of unaligned reads was split from the main pipeline because of the massive size of the input data files. It can therefore be run independently of the main pipeline, allowing you to delete the raw BAMs once the unaligned reads have been extracted.
All you need to specify is the Bams directory and an output directory.

snakemake -c 16 -s extract_unaligned_fastq --use-conda --config Bams=Bams/ Output=TCGA_Output/
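
Before running the extraction, it can be useful to check which BAM files the Bams directory actually contains. The sketch below is an illustration only: it is not the sample parsing used in samples.smk, and it assumes (hypothetically) that a sample name is simply the file name without the .bam extension.

```python
from pathlib import Path


def find_bam_samples(bam_dir):
    """Map a sample name (file name without .bam) to each BAM path in bam_dir."""
    bam_dir = Path(bam_dir)
    if not bam_dir.is_dir():
        raise NotADirectoryError(f"Bams directory not found: {bam_dir}")
    # Only files directly inside the directory are considered; index files
    # such as sample.bam.bai are skipped because they do not end in .bam.
    return {
        path.stem: path
        for path in sorted(bam_dir.glob("*.bam"))
        if path.is_file()
    }
```

For example, find_bam_samples("Bams/") returns a dictionary keyed by sample name, which is empty if the directory holds no .bam files.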

snakemake -c 16 -s wgs_runner.smk --use-conda --config Output=my_output_dir/

For offline-only use (e.g. Adelaide Uni Phoenix HPC), the conda envs need to be installed first, on the login node and in the pipeline directory, before running the main pipeline (the wgs_runner.smk command above).

snakemake -c 1 -s wgs_runner.smk --use-conda --config Output=my_output_dir/ --conda-create-envs-only --conda-frontend conda
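
If you prefer to drive these steps from a script, the commands can be assembled as argument lists. The helper below is a hypothetical sketch, not something shipped with the pipeline; it only uses the snakemake flags already shown in this README, and the core counts and output directory are the example values from above.

```python
def snakemake_command(snakefile, cores, config=None, use_conda=True, extra_args=()):
    """Build an argument list like the snakemake commands in this README."""
    cmd = ["snakemake", "-c", str(cores), "-s", snakefile]
    if use_conda:
        cmd.append("--use-conda")
    if config:
        # --config takes space-separated key=value pairs.
        cmd.append("--config")
        cmd.extend(f"{key}={value}" for key, value in config.items())
    cmd.extend(extra_args)
    return cmd


# Offline preparation on the login node (creates the conda envs only).
create_envs = snakemake_command(
    "wgs_runner.smk",
    1,
    {"Output": "my_output_dir/"},
    extra_args=["--conda-create-envs-only", "--conda-frontend", "conda"],
)

# The main run, once the envs exist.
main_run = snakemake_command("wgs_runner.smk", 16, {"Output": "my_output_dir/"})
```

Either list can then be passed to subprocess.run(cmd, check=True) from the pipeline directory.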