martindemko / tandem-repeats-merger

Tandem Repeats Merger

A set of scripts for post-processing the output of Tandem Repeats Finder (TRF).

Finds candidate telomeric sequences in TRF's NGS output data.

Tested on Ubuntu 16.04 with Python 2.7.

You can either run TRM together with TRF, starting from the .fasta files, or, if you already have NGS output data from TRF, run TRM alone.

This version is primarily used for Galaxy's toolshed repository definition, but it can be used on the command line as well; just follow the README.

How to run together with TRF

  1. Place your data in .fasta format into one folder (e.g. ./data/)

  2. Download Tandem Repeats Finder from https://tandem.bu.edu/trf/ and place it into this folder. If your binary is not named trf407b.linux64, or you want to use a path other than $PWD, modify iterateTRF.sh.

A better solution is to use conda with the env.yaml configuration file. Run conda env create -f ./env.yaml -n trm-env and, once the installation finishes, run source activate trm-env (or conda activate trm-env on newer conda versions).

  3. Change the variable dataDir inside ./scripts/runAllWithTRF.sh to point to the directory with your input data. You may also want to change the default name of the output data (variable shortName). The same script shows the default settings of the other input parameters. They can be changed inside the script or passed on the command line as follows: ./runAllWithTRF.sh 3 4 2 7 7 80 10 50 15 2 90 0 -h. The script creates a specific folder structure.
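Before launching the script, it can help to check that the pieces from the steps above are in place. The following is a minimal Python sketch, not part of the repository: the function name preflight and its parameters are hypothetical, and it assumes the TRF binary sits in the directory you pass as trf_dir (the current directory by default).

```python
import os

# Default binary name mentioned in the README; adjust if yours differs.
TRF_BINARY = "trf407b.linux64"


def preflight(data_dir, trf_dir=".", trf_binary=TRF_BINARY):
    """Return a list of problems found; an empty list means the setup looks usable."""
    problems = []

    # Step 1: the data directory must exist and hold at least one .fasta file.
    if not os.path.isdir(data_dir):
        problems.append("data directory not found: " + data_dir)
    else:
        fastas = [f for f in os.listdir(data_dir) if f.endswith(".fasta")]
        if not fastas:
            problems.append("no .fasta files in " + data_dir)

    # Step 2: the TRF binary must exist and be executable.
    binary = os.path.join(trf_dir, trf_binary)
    if not os.path.isfile(binary):
        problems.append("TRF binary not found: " + binary)
    elif not os.access(binary, os.X_OK):
        problems.append("TRF binary is not executable: " + binary)

    return problems
```

For example, preflight("./data") returns an empty list when ./data contains .fasta files and ./trf407b.linux64 is executable; otherwise each entry names one thing to fix.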

How to run without TRF

  1. Assuming you already have TRF's NGS output data, place the files, with the .dat extension, into the ./scripts/res/TRF_res directory.

  2. Alternatively, change the variable myDir inside the ./scripts/runAllNoTRF.sh script and place your input data into the ${myDir}/TRF_res directory.

  3. This script has far fewer input parameters to set. They can be changed inside the script or passed on the command line as follows: ./runAllNoTRF.sh 3 4 90 0. It creates a specific folder structure.
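Steps 1 and 2 amount to copying existing .dat files into the TRF_res directory under myDir. A small Python sketch of that staging step follows; the helper stage_dat_files is hypothetical and not part of the repository, and the default my_dir simply mirrors the ./scripts/res path mentioned above.

```python
import glob
import os
import shutil


def stage_dat_files(source_dir, my_dir="./scripts/res"):
    """Copy every *.dat file from source_dir into <my_dir>/TRF_res.

    Returns the sorted list of file names that were copied.
    """
    target = os.path.join(my_dir, "TRF_res")
    if not os.path.isdir(target):
        os.makedirs(target)

    copied = []
    for path in sorted(glob.glob(os.path.join(source_dir, "*.dat"))):
        shutil.copy(path, target)
        copied.append(os.path.basename(path))
    return copied
```

If you changed myDir in runAllNoTRF.sh, pass the same value as my_dir so the files land where the script expects them.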

Input parameters

All input parameters appear together in the runAllWithTRF.sh script, so the explanations below are taken from there. For now the parameters are positional: they must be passed in the order listed.

  • minNumberOfRepeats="3" ... min number of repeats
  • minLengthOfPattern="4" ... min length of repeating pattern
  • trf_match="2" ... TRF's matching weight
  • trf_mism="7" ... TRF's mismatching penalty
  • trf_delta="7" ... TRF's indel penalty
  • trf_pm="80" ... TRF's match probability (whole number)
  • trf_pi="10" ... TRF's indel probability (whole number)
  • trf_min="50" ... TRF's minimum alignment score to report
  • trf_max="15" ... TRF's maximum period size to report
  • trf_longest="2" ... TRF's maximum TR length expected (in millions)
  • readLength="90" ... read length (used by restrZeros.py)
  • relOccur="0" ... set to 1 to enable; otherwise it defaults to 0
  • trf_html="" ... TRF's html output; to suppress it, change the value to '-h'
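Because the parameters are positional, it is easy to misplace a value. The Python sketch below keeps the names and defaults from the list above in order and turns named overrides into the argument list for runAllWithTRF.sh. The names DEFAULTS and build_args are hypothetical, and dropping an empty trailing trf_html instead of passing an empty argument is an assumption of this sketch, not something confirmed by the script.

```python
from collections import OrderedDict

# Names, order and defaults as listed in this README for runAllWithTRF.sh.
DEFAULTS = OrderedDict([
    ("minNumberOfRepeats", "3"),
    ("minLengthOfPattern", "4"),
    ("trf_match", "2"),
    ("trf_mism", "7"),
    ("trf_delta", "7"),
    ("trf_pm", "80"),
    ("trf_pi", "10"),
    ("trf_min", "50"),
    ("trf_max", "15"),
    ("trf_longest", "2"),
    ("readLength", "90"),
    ("relOccur", "0"),
    ("trf_html", ""),
])


def build_args(**overrides):
    """Return the positional arguments in the documented order."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(sorted(unknown)))

    params = OrderedDict(DEFAULTS)
    params.update((name, str(value)) for name, value in overrides.items())

    args = list(params.values())
    # Assumption: an empty trf_html (the last parameter) is simply omitted.
    if args[-1] == "":
        args.pop()
    return args
```

With this, " ".join(["./runAllWithTRF.sh"] + build_args(trf_html="-h")) reproduces the example command shown earlier, and build_args(readLength=100) changes only the eleventh value.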

Explain specific output folder structure

  • res ... predefined output directory name (can be changed via the variable myDir in the scripts runAllWithTRF.sh and runAllNoTRF.sh)
    • parsed
      • dataset_6484_ppr.txt ... intermediate file
      • dataset_6485_ppr.txt ... intermediate file
      • dataset_6486_ppr.txt ... intermediate file
      • res
        • dataset_6484_ppr_sorted.txt ... intermediate file
        • dataset_6485_ppr_sorted.txt ... intermediate file
        • dataset_6486_ppr_sorted.txt ... intermediate file
        • joined_fixed_pairedReverseComplement_merged_sorted_FINAL.txt ... FINAL output file with reverse-complement-paired sequences of tandem repeats and their number of occurrences in the input datasets
        • joined_fixed_pairedReverseComplement_merged_sorted.txt ... intermediate file
        • joined_fixed_pairedReverseComplement_merged.txt ... intermediate file
        • joined_fixed_pairedReverseComplement.txt ... intermediate file
        • joined_fixed.txt ... intermediate file
        • joined_fixed_without_pairedReverseComplement_sorted_FINAL.txt ... FINAL output file sorted according to the number of occurrences of tandem repeats in the input datasets
        • joined_fixed_without_pairedReverseComplement_sorted.txt ... intermediate file
        • joined_fixed_without_pairedReverseComplement.txt ... intermediate file
        • joined.txt ... intermediate file
    • TRF_res ... directory containing all TRF outputs (it is either filled automatically, when using runAllWithTRF.sh, or you must copy your input here, when using runAllNoTRF.sh)
      • dataset_6484.dat ... NGS data from TRF
      • dataset_6485.dat ... NGS data from TRF
      • dataset_6486.dat ... NGS data from TRF
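Most files in this tree are intermediate; only the two marked FINAL are results. A short Python sketch for picking them out after a run is shown here. The helper final_outputs is hypothetical, and it relies only on the layout above, where FINAL files live under parsed/res inside the output directory and end in _FINAL.txt.

```python
import glob
import os


def final_outputs(my_dir="res"):
    """Return the sorted paths of the *_FINAL.txt files under <my_dir>/parsed/res."""
    pattern = os.path.join(my_dir, "parsed", "res", "*_FINAL.txt")
    return sorted(glob.glob(pattern))
```

If you changed myDir in the run scripts, pass the same directory as my_dir.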

License: MIT License

