merlyescalona / simphycompress

Compresses SimPhy datasets into a single gzipped file for all the gene trees and gzipped multiple sequence alignments for all the loci.

© 2018 Merly Escalona (merlyescalona@uvigo.es)

Phylogenomics Lab, University of Vigo, Spain,

SimPhy compress dataset

This program reduces the number of files and the total file size of a SimPhy run. It concatenates all loci sequences into a single multiple sequence alignment file for each of the FASTA outputs (sequences with gaps, and sequences without gaps (*_TRUE.fasta)). Loci are joined with a run of N characters whose length is set with the -n/--nsize parameter. Gene tree files are packed into a single gzipped tab-separated file with 2 columns:

filename           tree

Where filename is the basename of the gene tree file without its extension (e.g. g_trees00001.tree -> g_trees00001) and tree is the content of that file.
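The two-column layout described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function names `pack_gene_trees` and `read_gene_trees` are hypothetical, and the sketch assumes each gene tree file holds a single Newick tree on one line.

```python
import gzip
import os


def pack_gene_trees(tree_paths, out_path):
    """Write one 'filename<TAB>tree' row per gene tree file into a gzipped TSV."""
    with gzip.open(out_path, "wt") as out:
        for path in sorted(tree_paths):
            # Basename without extension: g_trees00001.tree -> g_trees00001
            name = os.path.splitext(os.path.basename(path))[0]
            with open(path) as handle:
                # Assumption: the file contains a single one-line Newick tree.
                tree = handle.read().strip()
            out.write(f"{name}\t{tree}\n")


def read_gene_trees(packed_path):
    """Read the packed file back into a {filename: tree} dictionary."""
    trees = {}
    with gzip.open(packed_path, "rt") as handle:
        for line in handle:
            name, tree = line.rstrip("\n").split("\t", 1)
            trees[name] = tree
    return trees
```

Reading the file back with `read_gene_trees` gives direct access to any tree by its original basename, without unpacking thousands of small files.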

Assumptions

  • We are working under a SimPhy simulation, following its hierarchical folder structure and sequence labeling.

To learn more about the simulation pipeline scenario, go to SimPhy's repository, and/or check:

Input

  • SimPhy folder path
  • prefix of the existing FASTA files
  • (optional) length of the N sequence that will be used to separate the sequences when concatenated
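As a rough illustration of how the N separator works, the sketch below joins per-locus sequences with a run of N characters. It is a simplified model, not the tool's implementation: the function name `concatenate_loci` is hypothetical, and it assumes that every locus uses the same sequence labels, so sequences can be matched across loci by label.

```python
def concatenate_loci(loci, nsize):
    """Join per-locus sequences label by label, separated by `nsize` N characters.

    `loci` is a list of {label: sequence} dictionaries, one per locus,
    given in the order in which the loci should be concatenated.
    """
    spacer = "N" * nsize
    per_label = {}
    for records in loci:
        for label, seq in records.items():
            # Assumption: the same label identifies the same individual in every locus.
            per_label.setdefault(label, []).append(seq)
    return {label: spacer.join(seqs) for label, seqs in per_label.items()}


locus_1 = {"1_0_0": "ACGT", "2_0_0": "AC-T"}
locus_2 = {"1_0_0": "GG", "2_0_0": "GA"}
merged = concatenate_loci([locus_1, locus_2], nsize=3)
print(merged["1_0_0"])  # ACGTNNNGG
```

The N run keeps locus boundaries visible in the concatenated sequence, so a longer spacer makes it easier to tell adjacent loci apart.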

Output

  • Modifications are made in place: files are concatenated and gzipped inside the same SimPhy folder, and the original files are removed.
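Because the originals are removed, it may be worth working on a copy of the SimPhy folder. The in-place behaviour can be pictured with the sketch below, which compresses a file next to itself and deletes the uncompressed version only after the compressed copy has been written. The helper name `gzip_in_place` is hypothetical and is not taken from the project.

```python
import gzip
import os
import shutil


def gzip_in_place(path):
    """Compress `path` to `path + '.gz'` in the same folder, then delete the original."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # The uncompressed file is removed only after the compressed copy is complete.
    os.remove(path)
    return gz_path
```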

Install

  • Clone this repository
git clone git@github.com:merlyescalona/simphycompress.git
  • Change your current directory to the downloaded folder:
cd simphycompress
  • Install:
python setup.py install --user

Usage

Required arguments:

  • -s <path>, --simphy-path <path>: Path of the SimPhy folder.
  • -ip <input_prefix>, --input-prefix <input_prefix>: Prefix of the FASTA filenames.

Optional arguments:

  • -n <N_seq_size>, --nsize <N_seq_size>: Number of N's inserted to separate the selected sequences. If the parameter is not set, the output file per replicate is a multiple sequence alignment file; otherwise, the output is a single sequence file per replicate, consisting of the selected reference sequences concatenated and separated by as many N's as set for this parameter.
  • -l <log_level>, --log <log_level>: Log level shown on standard output. The entire log is stored in a separate file.
    • Values: ['DEBUG', 'INFO', 'WARNING', 'ERROR'].
    • Default: 'INFO'.

Information arguments:

  • -v, --version: Show program's version number and exit
  • -h, --help: Show this help message and exit
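To make the option list concrete, here is a minimal argparse sketch that mirrors the documented flags. It is a reconstruction from the list above, not the project's actual parser: the program name `simphycompress`, the function name `build_parser`, the integer type for -n and the version string are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="simphycompress",  # assumption: the executable name is not stated above
        description="Compress the output of a SimPhy run in place.",
    )
    parser.add_argument("-s", "--simphy-path", required=True, metavar="<path>",
                        help="Path of the SimPhy folder.")
    parser.add_argument("-ip", "--input-prefix", required=True, metavar="<input_prefix>",
                        help="Prefix of the FASTA filenames.")
    parser.add_argument("-n", "--nsize", type=int, default=None, metavar="<N_seq_size>",
                        help="Number of N's used to separate the sequences.")
    parser.add_argument("-l", "--log", default="INFO", metavar="<log_level>",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                        help="Log level shown on standard output.")
    # Placeholder version text: the real version number is not given in this document.
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s (version not specified here)")
    return parser


args = build_parser().parse_args(["-s", "/path/to/simphy_run", "-ip", "data", "-n", "100"])
print(args.simphy_path, args.input_prefix, args.nsize, args.log)
```

In this sketch, leaving out -n gives `args.nsize` the value `None`, which corresponds to the "parameter is not set" case described above.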

License: GNU General Public License v3.0

