morrislab / rnascan

Python package for scanning RNA sequences with sequence and structure PFMs

Home Page: http://hugheslab.ccbr.utoronto.ca/supplementary-data/RNAcompete-S/index.html


rnascan

rnascan is a (mostly) Python suite for scanning RNA sequences and secondary structures with sequence and secondary structure PFMs. A secondary structure PFM holds a weight for each secondary structure context at each position, just as a sequence PFM holds a weight for each nucleotide or amino acid. Secondary structures can therefore be scanned the same way nucleotide sequences are, and the weights allow for some flexibility in the structure, as you might find in the Boltzmann distribution of secondary structures.

The secondary structure alphabet is as follows:

  • B - bulge loop
  • E - external (unpaired) RNA
  • H - hairpin loop
  • L - left paired RNA (i.e., a '(' in dot-bracket format)
  • M - multiloop
  • R - right paired RNA (i.e., a ')' in dot-bracket format)
  • T - internal loop
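To make the analogy with nucleotide PFMs concrete, here is a minimal sketch of a structure PFM over this seven-letter alphabet being compared against a structure. The matrix values, the helper names `score_window` and `one_hot`, and the product-of-expected-weights scoring are illustrative assumptions, not rnascan's internal implementation.

```python
import numpy as np

# Structural context alphabet from the list above, in alphabetical order.
STRUCT_ALPHABET = "BEHLMRT"

def one_hot(structure):
    """Encode a string over STRUCT_ALPHABET as a (len, 7) one-hot array."""
    out = np.zeros((len(structure), len(STRUCT_ALPHABET)))
    for i, ch in enumerate(structure):
        out[i, STRUCT_ALPHABET.index(ch)] = 1.0
    return out

def score_window(pfm, profile):
    """Score a window of structural contexts against a structure PFM.

    pfm:     (width, 7) array, one row of context weights per motif position.
    profile: (width, 7) array of per-position context probabilities
             (one-hot rows describe a single fixed structure).
    """
    pfm = np.asarray(pfm, dtype=float)
    profile = np.asarray(profile, dtype=float)
    if pfm.shape != profile.shape:
        raise ValueError("PFM and profile window must have the same shape")
    # Expected PFM weight at each position, multiplied across positions
    # (positions are treated as independent in this sketch).
    return float(np.prod((pfm * profile).sum(axis=1)))

# Hypothetical 3-position motif that prefers a hairpin loop (H).
pfm = np.array([
    [0.05, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05],
])
hairpin_score = score_window(pfm, one_hot("HHH"))
paired_score = score_window(pfm, one_hot("LLL"))
```

Because `profile` can hold probabilities rather than one-hot rows, the same function also accepts an averaged structure profile, which is where the flexibility mentioned above comes from.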

The rnascan suite consists of two tools:

  1. run_folding: Calculate an average structural context profile of an RNA sequence by folding overlapping 100 nt subsequences and averaging the resulting profiles at each position.
  2. rnascan: Scan RNA sequences and secondary structures with sequence and secondary structure PFMs.
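The averaging step that run_folding performs can be sketched as follows. This is only an illustration of averaging per-position profiles over overlapping windows: the folding itself is not performed here, the 50 nt step size is an assumption (only the 100 nt window length is stated above), and the function names are hypothetical.

```python
import numpy as np

def window_starts(seq_len, window=100, step=50):
    """Start positions of overlapping windows covering a sequence."""
    if seq_len <= window:
        return [0]
    starts = list(range(0, seq_len - window + 1, step))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)  # make sure the 3' end is covered
    return starts

def average_profiles(seq_len, window_profiles):
    """Average per-position context profiles from overlapping windows.

    window_profiles: iterable of (start, profile) pairs, where profile is a
    (window_len, n_contexts) array covering positions start..start+window_len.
    """
    window_profiles = list(window_profiles)
    n_contexts = window_profiles[0][1].shape[1]
    total = np.zeros((seq_len, n_contexts))
    counts = np.zeros(seq_len)
    for start, profile in window_profiles:
        end = start + profile.shape[0]
        total[start:end] += profile
        counts[start:end] += 1
    # Divide each covered position by the number of windows that cover it;
    # positions covered by no window keep an all-zero row.
    covered = counts > 0
    total[covered] /= counts[covered][:, None]
    return total
```

In a real run, each window's profile would come from folding the corresponding subsequence; here any array of the right shape can be passed in.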

Installation

Follow the steps below to install rnascan. If you do not plan to use the run_folding tool to fold sequences, you may skip the steps marked with an asterisk (*).

1. Install ViennaRNA*

To predict secondary structures, the program RNAfold from the ViennaRNA package is used. Please follow the installation instructions on their website.

2. Download rnascan source

git clone git@github.com:morrislab/rnascan.git
cd rnascan

3. Compile secondary structure parser C++ script*

The compiled binary must be saved in a directory that is listed in your PATH environment variable so that it can be executed. Here, we use the user's local bin:

g++ -o ~/bin/parse_secondary_structure scripts/parse_secondary_structure.cpp
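As a quick sanity check after steps 1 and 3, the following sketch reports whether the two executables can be found on PATH. The helper name `find_tools` is hypothetical; the tool names are the ones mentioned above.

```python
import shutil

def find_tools(names=("RNAfold", "parse_secondary_structure")):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}

if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```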

4. Install rnascan Python components

This package requires Python 2.7+ or Python 3.5+. To install the package, run the following:

python setup.py install

# alternatively, for user-specific installation:
python setup.py install --user

Dependencies (pandas, numpy, and biopython) will be automatically downloaded and installed, if necessary.

Usage

For full documentation of options, refer to the help messages using the -h option for each command.

run_folding

run_folding sequences.fasta /path/to/output_dir

The second argument, /path/to/output_dir, is the directory where the average structure profiles will be saved. One file is written per FASTA record.

rnascan

Scanning can be performed in four modes:

  1. Sequence only (using -p to specify the sequence PFM)
  2. Structure only (using -q to specify the structure PFM)
  3. Sequence and structure (-p and -q)
  4. Sequence and averaged structure (-p and -q, with a directory of averaged structure profiles given in place of the structure file)

Here are some example commands using minimal options:

# To run a test sequence
rnascan -p pfm_seq.txt -t AGTTCCGGTCCGGCAGAGATCGCG > hits.tab

# Sequence-only (use -p)
rnascan -p pfm_seq.txt sequences.fasta > hits.tab

# Structure-only (use -q)
rnascan -q pfm_struct.txt structures.fasta > hits.tab

# Sequence and structure
rnascan -p pfm_seq.txt -q pfm_struct.txt sequences.fasta structures.fasta > hits.tab

# Sequence and averaged structure
rnascan -p pfm_seq.txt -q pfm_struct.txt sequences.fasta averaged_structures/ > hits.tab

Note that in the last example, the second positional argument is the path to a directory containing the average structure profiles generated by run_folding. rnascan will look inside the directory and automatically search for files that look like structure.<sequence_id>.txt.
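The file naming convention described above can be illustrated with a short sketch that maps sequence IDs to profile files in a directory. This is not rnascan's own lookup code; the function name `find_profiles` is hypothetical, and it assumes the sequence ID is everything between the `structure.` prefix and the `.txt` suffix.

```python
from pathlib import Path

def find_profiles(directory):
    """Map sequence IDs to files named structure.<sequence_id>.txt."""
    prefix, suffix = "structure.", ".txt"
    profiles = {}
    for path in sorted(Path(directory).glob(prefix + "*" + suffix)):
        # Strip the fixed prefix and suffix to recover the sequence ID.
        seq_id = path.name[len(prefix):-len(suffix)]
        profiles[seq_id] = path
    return profiles
```

Calling `find_profiles("averaged_structures/")` returns a dictionary keyed by sequence ID, which makes it easy to check that every FASTA record has a matching profile before starting a scan.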

To print the score at every position, set the score threshold to -inf with the -m option. To change the number of processing cores, use -c:

rnascan -p pfm_seq.txt -q pfm_struct.txt -m ' -inf' -c 8 sequences.fasta averaged_structures/ > hits.tab
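The effect of the threshold can be illustrated with a small log-odds scan. This is a sketch under stated assumptions rather than rnascan's scoring code: the log2-odds score, the default `min_score` of 0.0, the ACGU alphabet, and the `scan` function are all chosen for illustration, and every PFM entry is assumed to be nonzero.

```python
import math

def scan(sequence, pfm, background, min_score=0.0):
    """Return (start, score) for every window scoring at least min_score.

    pfm:        list of dicts, one per motif position, base -> probability.
    background: dict, base -> background probability.
    """
    width = len(pfm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = 0.0
        for offset, column in enumerate(pfm):
            base = sequence[start + offset]
            # Log-odds of the motif model against the background model.
            score += math.log2(column[base] / background[base])
        if score >= min_score:
            hits.append((start, score))
    return hits

# Hypothetical 2-position motif "AC" and a uniform background.
pfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "U": 0.1},
]
uniform = {base: 0.25 for base in "ACGU"}

thresholded = scan("GACU", pfm, uniform)
every_position = scan("GACU", pfm, uniform, min_score=float("-inf"))
```

With the finite threshold only the window starting at the "AC" match is kept, while a threshold of negative infinity keeps one score per window.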

Computing background probabilities

By default, rnascan computes the background probabilities from the input sequences at the beginning of the run. To apply a uniform background, use the option -u:

rnascan -p pfm_seq.txt -u sequences.fasta > hits.tab

To compute the background probabilities of a set of input sequences and save them for future use, use the option --bgonly:

rnascan -p pfm_seq.txt --bgonly sequences.fasta > background.txt

rnascan -q pfm_struct.txt --bgonly structures.fasta > background.txt

In this mode, rnascan computes the background probabilities, writes them to standard output in the form of a Python dictionary, and exits without scanning. To reuse this background later, pass the background file to the option --bg_seq (for sequence backgrounds) or --bg_struct (for structure backgrounds):

rnascan -p pfm_seq.txt --bg_seq background.txt sequences.fasta > hits.tab
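For intuition, the sketch below computes background probabilities as simple symbol frequencies and reads a dictionary literal back from text. The exact keys and layout that rnascan writes are not documented here, so the ACGU alphabet, the function names, and the use of `ast.literal_eval` to parse the saved file are assumptions.

```python
import ast
from collections import Counter

def background_from_sequences(sequences, alphabet="ACGU"):
    """Relative frequency of each alphabet symbol across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no symbols from the alphabet were found")
    return {ch: counts[ch] / total for ch in alphabet}

def load_background(text):
    """Parse a background saved as a Python dictionary literal."""
    background = ast.literal_eval(text.strip())
    if not isinstance(background, dict):
        raise ValueError("expected a dictionary literal")
    return background

bg = background_from_sequences(["AACG", "UU"])
restored = load_background(repr(bg))
```

Passing a different `alphabet`, such as the structure alphabet "BEHLMRT", gives the analogous structure background.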

Citation

Cook, K.B., Vembu, S., Ha, K.C.H., Zheng, H., Laverty, K.U., Hughes, T.R., Ray, D., Morris, Q.D., 2017. RNAcompete-S: Combined RNA sequence/structure preferences for RNA binding proteins derived from a single-step in vitro selection. Methods 126, 18–28. http://www.sciencedirect.com/science/article/pii/S1046202317300312

License: GNU Affero General Public License v3.0