rothlab / tileseq_package

DMS Tileseq sequence analysis pipeline

TileSEQ Analysis Package v1.5

Song Sun, Roth Lab 2017

Description, Prerequisites and Contents

These scripts process fastq.gz files generated by TileSEQ (e.g., 1_S1_L001_R1_001.fastq.gz / 1_S1_L001_R2_001.fastq.gz) by submitting jobs to the Sun Grid Engine (SGE) queuing system. The package contains three Perl scripts (TileSEQ_FASTQ2Function.pl, tileseq_mut2func.pl, tileseq_sam2mut.pl) and two example input files (CBS_seq.txt, mut2func_info.csv).
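Before submitting jobs, it can help to confirm that every R1 file has a matching R2 file. The following is a minimal Python sketch, not part of the package. The file name pattern is an assumption inferred from the two example names above, and `pair_fastq_files` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed naming pattern, inferred from the example file names above:
# <sample>_S<n>_L<lane>_R<1|2>_001.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<prefix>.+_L\d{3})_R(?P<read>[12])_001\.fastq\.gz$")


def pair_fastq_files(filenames):
    """Group file names into (R1, R2) pairs keyed by their shared prefix.

    Returns (pairs, unpaired): a dict of prefix -> (R1 name, R2 name) and a
    list of fastq.gz names whose mate is missing. Names that do not match
    the assumed pattern are ignored.
    """
    reads = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match:
            reads[match.group("prefix")][match.group("read")] = name

    pairs = {}
    unpaired = []
    for prefix, found in sorted(reads.items()):
        if "1" in found and "2" in found:
            pairs[prefix] = (found["1"], found["2"])
        else:
            unpaired.extend(found.values())
    return pairs, unpaired
```

Passing the output of `os.listdir(".")` to `pair_fastq_files` in the directory holding the fastq.gz files would list any sample with a missing mate.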

Sun Grid Engine, Perl and Bowtie2 must be installed. Because the pipeline relies on high-performance cluster computing, it should only be run on a suitable computing cluster with multiple nodes.

Usage

Copy the three perl scripts into the directory where the fastq.gz files are.

Create two input files for your own gene and experiments:

A file called GENE_seq.txt, where GENE is a placeholder for your gene name of choice (e.g. UBE2I, giving UBE2I_seq.txt). The file should contain the template sequence for the alignment and the coding sequence including a stop codon. The gene name should be spelled in capital letters. The file requires a specific format consisting of two lines. The first line is GENE_template, followed by a space or tab and then the template sequence, which includes the coding sequence and the upstream/downstream sequences. The second line is GENE_coding, followed by a space or tab and then the coding sequence.
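The two-line format described above can be checked before a run. The following is a minimal Python sketch, not part of the package, and `parse_seq_file` is a hypothetical helper. Two of its checks are assumptions rather than documented requirements: that the coding sequence length is a multiple of three, and that the stop codon is its last three bases.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def parse_seq_file(text, gene):
    """Parse the content of a two-line sequence file and sanity-check it.

    Returns (template, coding) as upper-case strings, or raises ValueError.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected 2 lines, found {len(lines)}")

    records = {}
    for line in lines:
        parts = line.split()  # splits on spaces or tabs
        if len(parts) != 2:
            raise ValueError(f"expected 'label sequence', got: {line[:30]!r}")
        records[parts[0]] = parts[1].upper()

    template_key = f"{gene}_template"
    coding_key = f"{gene}_coding"
    if list(records) != [template_key, coding_key]:
        raise ValueError(f"expected labels {template_key} then {coding_key}")

    template, coding = records[template_key], records[coding_key]
    if coding not in template:
        raise ValueError("coding sequence is not contained in the template")
    # Assumption: a complete reading frame that ends with the stop codon.
    if len(coding) % 3 != 0 or coding[-3:] not in STOP_CODONS:
        raise ValueError("coding sequence does not end with an in-frame stop codon")
    return template, coding
```

For example, with a toy sequence (not a real gene), `parse_seq_file("UBE2I_template\tGGGATGAAATAGCCC\nUBE2I_coding ATGAAATAG\n", "UBE2I")` returns the template and coding strings, while a file with swapped lines raises `ValueError`.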

A file called mut2func_info.csv, which contains seven lines. The first line contains the corresponding amino acid positions for each tile. The remaining six lines contain the sequencing sample names for the multiplexed libraries, one line for each of the six experiments: nonselect1, nonselect2, select1, select2, wt1, wt2. Note: for a quality check (QC), mut2func_info.csv should contain only three lines: the first line with the tile information, and two lines with the sequencing sample names for nonselect1 and nonselect2 (in QC, nonselect1 and nonselect2 are the same).

Run the control Perl script TileSEQ_FASTQ2Function.pl after reading the documentation at the beginning of the script. Three arguments are required for this script.

  • The gene name, e.g. UBE2I
  • A Qscore cutoff, e.g. 20
  • The full path for the location of the bowtie2 executable, e.g. /home/rothlab/jweile/bin
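As a pre-flight step, the line count of mut2func_info.csv can be checked and the command line assembled in one place. The following is a minimal Python sketch, not part of the package. Both helper names are hypothetical, and the argument order (gene name, Qscore cutoff, bowtie2 path) is an assumption taken from the order of the list above, so confirm it against the documentation at the top of the script.

```python
def info_file_mode(text):
    """Classify mut2func_info.csv content by its number of non-empty lines.

    Seven lines mean a full run, three lines mean a quality-check run.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) == 7:
        return "full"
    if len(lines) == 3:
        return "qc"
    raise ValueError(f"expected 7 lines (or 3 for QC), found {len(lines)}")


def build_command(gene, qscore_cutoff, bowtie2_path,
                  script="TileSEQ_FASTQ2Function.pl"):
    """Build the argument list for the control script without running it."""
    if gene != gene.upper():
        raise ValueError("gene name should be spelled in capital letters")
    # Assumed order: gene name, Qscore cutoff, bowtie2 location.
    return ["perl", script, gene, str(qscore_cutoff), bowtie2_path]
```

The returned list can be passed to `subprocess.run` from the directory that holds the scripts and the fastq.gz files. Building the command as a list rather than a single string avoids shell quoting problems with paths.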

Six new directories will be created:

  • bowtieIndex: Bowtie2 index files for your gene
  • shfile: all the .sh files for the job submission
  • fastqfile: all the unzipped fastq files
  • samfile: all the alignment files in sam format
  • mutationCallfile: six files will be generated for each library (e.g., 1report.txt, 1AAchange.txt, 1MultipleMut.txt, 1deletion.txt, 1insertion.txt, 1noncoding.txt)
  • resultfile: five result files will be generated (correlation_control.txt, correlation_nonselect.txt, correlation_select.txt, foldchange.txt, rawData.txt)
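Once the jobs have finished, the presence of the directories and result files listed above can be verified. The following is a minimal Python sketch, not part of the package. `missing_outputs` is a hypothetical helper, and it assumes the six directories are created inside the directory the script was run from. A QC run may not produce all five result files.

```python
from pathlib import Path

# Directory and file names as listed in this README.
EXPECTED_DIRS = [
    "bowtieIndex", "shfile", "fastqfile",
    "samfile", "mutationCallfile", "resultfile",
]
EXPECTED_RESULTS = [
    "correlation_control.txt", "correlation_nonselect.txt",
    "correlation_select.txt", "foldchange.txt", "rawData.txt",
]


def missing_outputs(run_dir):
    """Return the expected directories and result files absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [d for d in EXPECTED_DIRS if not (run_dir / d).is_dir()]
    missing += [
        f"resultfile/{name}"
        for name in EXPECTED_RESULTS
        if not (run_dir / "resultfile" / name).is_file()
    ]
    return missing
```

An empty list from `missing_outputs(".")` indicates that every expected directory and result file is in place. It does not say anything about whether the contents are correct.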
