andrewGhazi / mpradesigntools

A tool for generating barcoded Massively Parallel Reporter Assay sequences

mpradesigntools

An R package for generating barcoded Massively Parallel Reporter Assay sequences

Publication

If you make use of this software, please cite the following publication:

Andrew R Ghazi, Edward S Chen, David M Henke, Namrata Madan, Leonard C Edelstein, Chad A Shaw; Design tools for MPRA experiments, Bioinformatics, Volume 34, Issue 15, 1 August 2018, Pages 2682–2683, https://doi.org/10.1093/bioinformatics/bty150

Installation

Dependencies

MPRA Design Tools depends on the Biostrings and BSgenome.Hsapiens.UCSC.hg38 packages from Bioconductor. First install these in R with the following commands:

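# biocLite() has been retired by Bioconductor; BiocManager is its replacement.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("Biostrings", "BSgenome.Hsapiens.UCSC.hg38"))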

The package also makes use of some tidyverse packages which can be installed with the following commands:

install.packages(c('dplyr', 'magrittr', 'purrr', 'readr', 'stringr', 'tibble', 'tidyr', 'purrrlyr'))

Package Installation

If you don't have the devtools package installed, install it like so:

install.packages("devtools")

After that you can install and load MPRA Design Tools with these commands:

devtools::install_github('andrewGhazi/mpradesigntools')
library(mpradesigntools)

Use

This is the companion package to the MPRA Design Tools Shiny application available here: https://andrewghazi.shinyapps.io/designmpra/

The Shiny app allows users to interact with MPRA parameters (such as the number of barcodes per allele) and see the effect of changing those parameters on the assay's power. Researchers can use this to decide which parameters best meet their experimental goals.

Currently, the main function of the MPRA Design Tools package is to design a set of barcoded sequences for MPRA experiments (without overloading our Shiny server!). This is done with the processVCF function. It takes roughly 5 seconds plus 10 ms per barcoded sequence on a relatively modern CPU, so you can estimate the expected job time in seconds as

5 + .01 * Number of barcodes per allele * Number of SNPs in VCF * 2 (for ref/alt alleles)
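
The estimate above can be wrapped in a small helper. This is only a sketch: estimate_job_seconds is a hypothetical name that is not part of the package, and the 5 second and 10 ms constants are the rough figures quoted above, so actual run times will vary by machine.

```r
# Hypothetical helper (not part of mpradesigntools): rough processVCF run time.
# Assumes one alternate allele per SNP, i.e. two alleles (ref/alt) per variant.
estimate_job_seconds <- function(n_snps, nper, overhead = 5, per_sequence = 0.01) {
  n_sequences <- nper * n_snps * 2  # barcodes per allele * SNPs * ref/alt
  overhead + per_sequence * n_sequences
}

# Example: 1000 SNPs with 14 barcodes per allele gives 28000 sequences,
# so the estimate is about 285 seconds.
estimate_job_seconds(n_snps = 1000, nper = 14)
```

VCF entries with several alternate alleles produce more than two alleles per SNP, so treat the result as a lower bound in that case.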

VCF Input constraints

Only the CHROM, POS, REF, and ALT columns are used to build the sequences. The INFO column is read only to detect reverse-strand constructs.

Current input constraints are:

  • Insertions and deletions must encode the reference and alternate alleles (respectively) as a dash character '-'.
  • Multiple alternate alleles should be separated in the ALT field by a comma with no spaces.
  • By default, the program pulls the sequence context from the forward (+) strand of the reference genome. If you wish to generate sequences for SNPs in genes that are normally read from the reverse strand, add a string containing "MPRAREV" to the INFO field of the VCF. This ensures that the genomic context is inserted with the correct orientation relative to the minimal promoter and barcode in the reporter plasmid.
  • Alleles should be specified by the alleles present on the forward (+) strand. A small fraction of entries in official dbSNP VCFs are specified by their reverse strand alleles, which is denoted by the RV tag in the INFO field. These need to be flipped manually at the moment; automated handling is planned for a future release.

VCFs generated by batch querying rsIDs on dbSNP should meet most of the formatting requirements. However, the MPRAREV tag will need to be added by the user (where appropriate) because the VCFs do not always specify which strand the relevant gene is on.
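
The constraints above can be checked before calling processVCF. The following is a minimal base R sketch, not part of the package: check_vcf and add_mprarev are hypothetical helpers, the toy rows are made up for illustration, and the "anchored indel" test (REF and ALT of different lengths with neither being '-') is an assumption about what a violation of the dash rule looks like.

```r
# Made-up example rows; assumes the VCF body is loaded as character columns.
vcf <- data.frame(
  CHROM = c("chr1", "chr2", "chr3", "chr4", "chr5"),
  POS   = c(1000L, 2000L, 3000L, 4000L, 5000L),
  REF   = c("A", "-", "G", "C", "A"),         # "-" as REF encodes an insertion
  ALT   = c("G", "T", "A, C", "T", "AT"),     # "A, C" has a space; "A"/"AT" is anchored
  INFO  = c(".", ".", ".", "RV", "."),
  stringsAsFactors = FALSE
)

# Hypothetical pre-flight check: one logical flag per row and constraint.
check_vcf <- function(vcf) {
  info_tokens <- strsplit(vcf$INFO, ";", fixed = TRUE)
  anchored <- mapply(function(ref, alt) {
    alts <- trimws(strsplit(alt, ",", fixed = TRUE)[[1]])
    any(ref != "-" & alts != "-" & nchar(alts) != nchar(ref))
  }, vcf$REF, vcf$ALT, USE.NAMES = FALSE)
  data.frame(
    row            = seq_len(nrow(vcf)),
    alt_has_space  = grepl(" ", vcf$ALT, fixed = TRUE),       # comma-only separation
    anchored_indel = anchored,                                # indel without "-"
    rv_flagged     = vapply(info_tokens, function(x) "RV" %in% x, logical(1)),
    mprarev        = grepl("MPRAREV", vcf$INFO, fixed = TRUE)
  )
}

# Hypothetical helper: append the MPRAREV tag to an INFO string.
add_mprarev <- function(info) {
  ifelse(info %in% c(".", ""), "MPRAREV", paste0(info, ";MPRAREV"))
}

check_vcf(vcf)
```

Rows flagged as rv_flagged are the ones whose alleles need to be flipped by hand; add_mprarev can be applied to the INFO values of SNPs in reverse-strand genes.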

Indel-correcting barcodes

9/17/18 - Feature under development

Alternative barcode sets may be used by setting the barcode_set argument of processVCF to one of the following values. The first number indicates the length of the barcodes in base pairs; the second indicates the number of errors that can be corrected while still identifying the original barcode. Note that these barcodes CAN include miR seed sequences. If you want to avoid miR interference, identify the main miRs by abundance in your cell type of interest, then include their seed sequences in the filterPatterns argument. These barcodes are provided by the freebarcodes package, described in the publication below and available from the GitHub repository listed after it.

The original barcode set provided with mpradesigntools is available as the twelvemers barcode set.

barcode_set n_barcodes
barcodes10-1 1902
barcodes10-2 30
barcodes11-1 6160
barcodes11-2 74
barcodes12-1 17213
barcodes12-2 178
barcodes13-1 56735
barcodes13-2 467
barcodes14-1 157196
barcodes14-2 1155
barcodes15-1 518508
barcodes15-2 3182
barcodes16-1 1636417
barcodes16-2 8776
barcodes17-2 23024
barcodes3-1 1
barcodes4-1 2
barcodes5-1 9
barcodes5-2 1
barcodes6-1 26
barcodes6-2 1
barcodes7-1 66
barcodes7-2 3
barcodes8-1 212
barcodes8-2 6
barcodes9-1 553
barcodes9-2 11
twelvemers 1140292

Indel-correcting DNA barcodes for high-throughput sequencing, John A. Hawkins, Stephen K. Jones, Ilya J. Finkelstein, William H. Press, Proceedings of the National Academy of Sciences Jul 2018, 115 (27) E6217-E6226; DOI: 10.1073/pnas.1802640115

https://github.com/finkelsteinlab/freebarcodes
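
When picking a barcode set, it helps to compare the number of barcodes a design needs against the counts in the table above. The sketch below assumes that every barcoded sequence consumes one unique barcode and that each SNP has two alleles. barcodes_needed and smallest_sufficient_set are hypothetical helpers, not package functions, and only a few of the single-error-correcting sets are copied in.

```r
# Counts copied from the table above for a few single-error-correcting sets.
n_barcodes <- c("barcodes12-1" = 17213, "barcodes13-1" = 56735,
                "barcodes14-1" = 157196, "barcodes15-1" = 518508,
                "barcodes16-1" = 1636417)

# Hypothetical helper: barcodes per allele * SNPs * alleles per SNP.
barcodes_needed <- function(n_snps, nper, alleles_per_snp = 2) {
  nper * n_snps * alleles_per_snp
}

# Hypothetical helper: smallest listed set with enough barcodes, NA if none.
smallest_sufficient_set <- function(needed, sets = n_barcodes) {
  ok <- sets[sets >= needed]
  if (length(ok) == 0) return(NA_character_)
  names(ok)[which.min(ok)]
}

needed <- barcodes_needed(n_snps = 1000, nper = 14)  # 28000
smallest_sufficient_set(needed)                      # "barcodes13-1"
```

Filtering options such as filterPatterns may remove barcodes from a set, so it is safer to leave some headroom rather than choose a set that only just covers the design.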

Example

processVCF(vcf = '/path/to/the.vcf',
           nper = 14,
           upstreamContextRange = 55,
           downstreamContextRange = 55,
           outPath = '/path/to/the/output.tsv',
           fwprimer = 'ACTGGCCGCTTCACTG',
           revprimer = 'AGATCGGAAGAGCGTCG',
           alter_aberrant = TRUE,
           extra_elements = FALSE,
           max_construct_size = 170,
           barcode_set = 'barcodes14-1',
           ensure_all_4_nuc = TRUE)

Downstream analysis

Once you've performed your MPRA and have your sequencing results, check out malacoda for QC and statistical analysis of your results!

Planned Features

  • mm10 genomic context
  • parallelization
  • randomized alterations to aberrant digestion sites
  • BED file to Sharpr-MPRA library oligo design
  • automated handling of RV SNPs
  • optimized barcode pools

If you are interested in a subset of these features or have other feature requests, please let us know to inform our implementation prioritization. You can do so by opening an issue on this repository or contacting the first and corresponding authors of the publication, listed above.
