bsipos / paper-pasp

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

A pipeline to analyse PASP data

Reference

This repository contains the pipeline and raw results described in the manuscript:

  • Botond Sipos, Adrian M. Stütz, Greg Slodkowicz, Tim Massingham, Jan Korbel, Nick Goldman: PASP - a whole-transcriptome poly(A) tail length determination assay for the Illumina platform.

Click here for more details on the wetlab experiment.

Using the pipeline

The pipeline can be used by invoking the following make targets:

  • Fetch raw data from ENA: make fetch
  • Generate transcriptome from SGD annotation: make transcriptome
  • Align and parse reads: make parse
  • Test for differential polyadenylation: make test or make lsf_test
  • Parse spike-in reads (make parse_spikeins) and build "error model" (make error_model)
  • Filter test results by G-tail coverage: make gtail_cov_filter
  • Plot correlation between technical replicates: make gtail_tech_corr
  • Plot and cluster tail length distributions: make classify_tail_dists
  • Correlate thresholded tail lengths with PASTA and PAL-seq: make corr_with_studies

Index of selected raw results

Illumina reads

Sequencing data are available in the ArrayExpress database under accession number E-MTAB-2456.

Alignment

Sample Alignment log Alignment report
WT1A http://bit.ly/PAT-seq-WT1A_aln_log http://bit.ly/PAT-seq-WT1A_align_pdf
WT1B http://bit.ly/PAT-seq-WT1B_aln_log http://bit.ly/PAT-seq-WT1B_align_pdf
WT1C http://bit.ly/PAT-seq-WT1C_aln_log http://bit.ly/PAT-seq-WT1C_align_pdf`
WT1D http://bit.ly/PAT-seq-WT1D_aln_log http://bit.ly/PAT-seq-WT1D_align_pdf
WT2A http://bit.ly/PAT-seq-WT2A_aln_log http://bit.ly/PAT-seq-WT2A_align_pdf
WT2B http://bit.ly/PAT-seq-WT2B_aln_log http://bit.ly/PAT-seq-WT2B_align_pdf
WT2C http://bit.ly/PAT-seq-WT2C_aln_log http://bit.ly/PAT-seq-WT2C_align_pdf
WT2D http://bit.ly/PAT-seq-WT2D_aln_log http://bit.ly/PAT-seq-WT2D_align_pdf
MUT1A http://bit.ly/PAT-seq-MUT1A_aln_log http://bit.ly/PAT-seq-MUT1A_aln_pdf
MUT1B http://bit.ly/PAT-seq-MUT1B_aln_log http://bit.ly/PAT-seq-MUT1B_aln_pdf
MUT1C http://bit.ly/PAT-seq-MUT1C_aln_log http://bit.ly/PAT-seq-MUT1C_aln_pdf`
MUT1D http://bit.ly/PAT-seq-MUT1D_aln_log http://bit.ly/PAT-seq-MUT1D_aln_pdf
MUT2A http://bit.ly/PAT-seq-MUT2A_aln_log http://bit.ly/PAT-seq-MUT2A_aln_pdf
MUT2B http://bit.ly/PAT-seq-MUT2B_aln_log http://bit.ly/PAT-seq-MUT2B_aln_pdf
MUT2C http://bit.ly/PAT-seq-MUT2C_aln_log http://bit.ly/PAT-seq-MUT2C_aln_pdf
MUT2D http://bit.ly/PAT-seq-MUT2D_aln_log http://bit.ly/PAT-seq-MUT2D_aln_pdf

Parsing alignments

Sample Parse log Parse report
WT1A http://bit.ly/PAT-seq-WT1A_parse_log http://bit.ly/PAT-seq-WT1A_parse_pdf
WT1B http://bit.ly/PAT-seq-WT1B_parse_log http://bit.ly/PAT-seq-WT1B_parse_pdf
WT1C http://bit.ly/PAT-seq-WT1C_parse_log http://bit.ly/PAT-seq-WT1C_parse_pdf
WT1D http://bit.ly/PAT-seq-WT1D_parse_log http://bit.ly/PAT-seq-WT1D_parse_pdf
WT2A http://bit.ly/PAT-seq-WT2A_parse_log http://bit.ly/PAT-seq-WT2A_parse_pdf
WT2B http://bit.ly/PAT-seq-WT2B_parse_log http://bit.ly/PAT-seq-WT2B_parse_pdf
WT2C http://bit.ly/PAT-seq-WT2C_parse_log http://bit.ly/PAT-seq-WT2C_parse_pdf
WT2D http://bit.ly/PAT-seq-WT2D_parse_log http://bit.ly/PAT-seq-WT2D_parse_pdf
MUT1A http://bit.ly/PAT-seq-MUT1A_parse_log http://bit.ly/PAT-seq-MUT1A_parse_pdf
MUT1B http://bit.ly/PAT-seq-MUT1B_parse_log http://bit.ly/PAT-seq-MUT1B_parse_pdf
MUT1C http://bit.ly/PAT-seq-MUT1C_parse_log http://bit.ly/PAT-seq-MUT1C_parse_pdf
MUT1D http://bit.ly/PAT-seq-MUT1D_parse_log http://bit.ly/PAT-seq-MUT1D_parse_pdf
MUT2A http://bit.ly/PAT-seq-MUT2A_parse_log http://bit.ly/PAT-seq-MUT2A_parse_pdf
MUT2B http://bit.ly/PAT-seq-MUT2B_parse_log http://bit.ly/PAT-seq-MUT2B_parse_pdf
MUT2C http://bit.ly/PAT-seq-MUT2C_parse_log http://bit.ly/PAT-seq-MUT2C_parse_pdf
MUT2D http://bit.ly/PAT-seq-MUT2D_parse_log http://bit.ly/PAT-seq-MUT2D_parse_pdf

Quantifying tail length slippage using spike-in standards

  • Tail run lengths until the first 1-5 non-A bases in reads mapped to spike-in poly(A) tracts PDF

Testing differences between wild type and mutant tail runs

Comparison Test log Test report Results
WT1 vs. MUT1 TXT PDF CSV
WT2 vs. MUT2 TXT PDF CSV

Tail run distributions from all transcripts with G-tail coverage > 1000

Cross-study correlation

| WT1 vs. MUT1 | http://bit.ly/PAT-seq-TEST_WT1_vs_MUT1_log | http://bit.ly/PAT-seq-TEST_WT1_vs_MUT1_pdf | http://bit.ly/PAT-seq-TEST_WT1_vs_MUT1_trs_tab | | WT2 vs. MUT2 | http://bit.ly/PAT-seq-TEST_WT2_vs_MUT2_log | http://bit.ly/PAT-seq-TEST_WT2_vs_MUT2_pdf | http://bit.ly/PAT-seq-TEST_WT2_vs_MUT2_trs_tab |

Tail run distributions from all transcripts with G-tail coverage > 1000

Cross-study correlation

Dependencies

Using the analysis tools

The analysis tool can be found under patsy/:

patsy-align - classify read pairs and align them using Bowtie2

usage: patsy-align [-h] -1 fq1 -2 fq2 -f ref [-o outdir] [-s stats_pickle]
                   [-l gtail_sig] [-G gtag_min] [-N max_N] [-I min_fsize]
                   [-X max_fsize] [-p nr_threads] [-r report]

Align PASP reads using Bowtie2 (1.1).

optional arguments:
  -h, --help       show this help message and exit
  -1 fq1           First FASTQ file.
  -2 fq2           Second FASTQ file.
  -f ref           Reference fasta.
  -o outdir        Output directory.
  -s stats_pickle  Stats pickle file.
  -l gtail_sig     Portion of read start/end used for G-tail classification
                   (14).
  -G gtag_min      Minimum G-tag length(3).
  -N max_N         Maximum number of Ns in the first -l bases (6).
  -I min_fsize     Minimum fragment size (0).
  -X max_fsize     Maximum fragment size (500).
  -p nr_threads    Number of threads to use (1).
  -r report        Report PDF.

patsy-parse - parse classified and aligned PASP read pairs

usage: patsy-parse [-h] -g gtail_sam -n nvtr_sam -d dataset_id -f ref
                   [-l gtail_sig] [-G gtag_min] [-N max_N] [-e err_tol]
                   [-o out_pickle] [-i tr_list] [-q min_q] [-r report] [-t]

Parse classified and aligned PASP read pairs (1.1).

optional arguments:
  -h, --help     show this help message and exit
  -g gtail_sam   SAM file containing G-tail alignments.
  -n nvtr_sam    SAM file containing NVTR alignments.
  -d dataset_id  Dataset identifier.
  -f ref         Reference fasta.
  -l gtail_sig   Portion of read start/end used for G-tail classification.
  -G gtag_min    Minimum G-tag length(3).
  -N max_N       Maximum number of Ns in the first -l bases (6).
  -e err_tol     Number of errors tolerated in the tail.
  -o out_pickle  Output pickle file.
  -i tr_list     List of transcripts considered.
  -q min_q       Mapping quality treshold (30).
  -r report      Report PDF.
  -t             Plot per-transcript coverage reports.

patsy-test - test for differential polyadenylation in PASP data

usage: patsy-test [-h] -a [a_pickles [a_pickles ...]] -na a_name -b
                  [b_pickles [b_pickles ...]] -nb b_name [-i lrt_list]
                  [-P lik_penalty] [-M min_size_U] [-s sig_level]
                  [-op out_pickle] [-ot out_trs] [-og out_glob]
                  [-otr out_runs_prefix] [-orr out_rep_prefix] [-r report]
                  [-t]

Test for differential polyadenylation in PASP data (1.1).

optional arguments:
  -h, --help            show this help message and exit
  -a [a_pickles [a_pickles ...]]
                        Parsed read pickles - group A.
  -na a_name            Name of data group A.
  -b [b_pickles [b_pickles ...]]
                        Parsed read pickles - group B.
  -nb b_name            Name of data group B.
  -i lrt_list           Transcripts to be tested with anchors LRT.
  -P lik_penalty        Log-likelihood penalty for data points outside valid
                        range.
  -M min_size_U         Minimum sample size when performing Mann-Whitney U
                        test (30).
  -s sig_level          Significance level.
  -op out_pickle        Output pickle file.
  -ot out_trs           Output tabular file: transcript properties.
  -og out_glob          Output tabular file: global results.
  -otr out_runs_prefix  Output tabular file: tail runs prefix.
  -orr out_rep_prefix   Output tabular file: tail means per replicate.
  -r report             Report PDF.
  -t                    Plot reports for all transcripts.

patsy-spike - estimate the number of sequencing errors in runs of bases.

usage: patsy-spike [-h] -n spike_sam -f ref [-w window] [-m max_errors_plot]
                   [-o out_pickle] [-q min_q] [-r report] [-l read_len]
                   [-pk pickle]

Estimate the number of sequencing errors in runs of bases (1.0).

optional arguments:
  -h, --help          show this help message and exit
  -n spike_sam        SAM file containing NVTR alignments.
  -f ref              Reference fasta with the spike-in sequences.
  -w window           Size of the flanking sequence around the run of As.
  -m max_errors_plot  Maximum number of errors for which to plot the length
                      distribution.
  -o out_pickle       Output pickle file.
  -q min_q            Mapping quality treshold (30).
  -r report           Report PDF.
  -l read_len         Read length.
  -pk pickle          Result pickle file.

About

License:Other


Languages

Language:Python 81.5%Language:Makefile 11.7%Language:R 6.9%