InSilico PCR

This Analysis aims at extracting genomic sequences mimicking 16S amplicon reads from a larger context long read set (gDNA reads or larger amplicons).

The motivation behind this analysis was to simulate 16S long amplicon sequencing on ONT platform using data obtained from the https://github.com/LomanLab/mockcommunity site as a large Promethion fastq archive but any other ONT or Pacbio fastq data can be used with this code (please acknowledge them when you would use this data too).

Data Availability

(reproduced from https://github.com/LomanLab/mockcommunity/edit/master/README.md)

Name	Reads (M)	Yield (G)	FASTQ	Run Folder	Restarts	FAST5
Zymo-PromethION-LOG-BB-SN	35.1	148	fastq.gz	64h run	restarts	download.sh, restarts.tar
Zymo-PromethION-EVEN-BB-SN	36.5	146	fastq.gz	64h run	restarts	download.sh, restarts.tar
Zymo-GridION-LOG-BB-SN	3.7	16	fastq.gz	48h run	n/a	signal.tar
Zymo-GridION-EVEN-BB-SN	3.5	14	fastq.gz	48h run	n/a	signal.tar

Method

The method used to extract sequences between primers was developed by Brian Bushnell based on his BBMap tools and is explained here.

The workflow is as follows:

split the data in smaller fastq bins for speed-up using parallel
search forward primer in all bins using BBMap msa.sh
search reverse primer in all bins using BBMap msa.sh
extract 'matching' regions using BBMAp cutprimers.sh
merge all results and keep only amplicons in a chosen size range (as set in the script)

Usage

REM: Snakemake does not always install well with bioconda and can be removed from the file environment.yaml if you encounter issues. It is not used in the current version of this code anyway and was added here for forward compatibility.

Install the required software using conca or manually based on the list provided in environment.yaml.
Set variables, names, and numeric limits in the top of the InSilico_PCR.sh script (adjust the number of threads to the available cores in your own machine)
Run the script with
- demo data: ./InSilico_PCR.sh -d
- your own read file present in the current folder: ./InSilico_PCR.sh <file_name>.fq.gz

Results obtained with the demo data

the results of a 16S In-Silico PCR experiment are presented in ZymoPromethion_even_Results.
the results of using other databases for the MetONTIIME classification are reported here.

Future plans

This code should and will be changed to a snakemake pipeline in order to be more portable. The config.yaml file is the first step towards this transition.

Please send comments and feedback to nucleomics.bioinformatics@vib.be

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Nucleomics-VIB / InSilico_PCR