mgperry / treeCl

Clustering phylogenetic trees with python

TreeCl - Phylogenetic Tree Clustering

TreeCl is a python package for clustering gene families by phylogenetic similarity. It takes a collection of alignments, infers their phylogenetic trees, and clusters them based on a matrix of between-tree distances. Finally, it calculates a single representative tree for each cluster.

The purpose is to establish whether the data have any underlying structure, that is, whether the gene families fall into groups with distinct phylogenetic histories.
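As a rough illustration of the clustering step described above, here is a minimal pure-Python sketch of single-linkage agglomerative clustering on a precomputed distance matrix. It is not treeCl's implementation: the function name single_linkage and the toy matrix toy_dm are hypothetical, and the distances are invented for illustration rather than computed from real trees.

```python
def single_linkage(dm, k):
    """Merge the two closest clusters until k clusters remain.

    dm is a symmetric matrix (list of lists) of pairwise distances.
    Returns a tuple of 1-based cluster labels, one per item.
    """
    n = len(dm)
    clusters = [[i] for i in range(n)]  # start with one cluster per item
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members
                d = min(dm[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    # Number the clusters by their lowest member so labels are deterministic
    labels = [0] * n
    for label, members in enumerate(sorted(clusters, key=min), start=1):
        for i in members:
            labels[i] = label
    return tuple(labels)


# Invented distances between four hypothetical trees:
# trees 0 and 1 are close to each other, as are trees 2 and 3.
toy_dm = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.95],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.95, 0.20, 0.00],
]

labels = single_linkage(toy_dm, 2)
print(labels)  # (1, 1, 2, 2)
```

Single linkage is used here only because it is the simplest linkage rule to write down. It joins clusters on their closest members and so tends to chain loosely related groups together.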

Installation

Clone the repository with its submodules using git clone --recursive git@github.com:kgori/treeCl.git, then add the cloned directory to your $PYTHONPATH.

Dependencies

Python:

The easiest way to install the Python dependencies is with pip. If you don't have pip, it can be installed by typing easy_install pip in a shell. The required packages can then be installed by running this command:

pip install numpy scipy dendropy scikit-learn

External:

Other:

Example Analysis

from treeCl.collection import Collection, Scorer
from treeCl.clustering import Clustering, Partition

# Load the alignments; add compression='gz' or 'bz2' if they are compressed (zip not supported yet).
c = Collection(input_dir='input_dir', file_format='phylip', datatype='protein')
# Infer a neighbour-joining (NJ) tree for each alignment; add verbosity=1 or higher to get progress messages.
c.calc_NJ_trees()
# Matrix of between-tree distances.
dm = c.distance_matrix('euc')
cl = Clustering(dm)
# Single-linkage hierarchical clustering into 4 groups; should give a fairly inaccurate clustering.
p = cl.hierarchical(4, 'single')
# Reference partition: four groups of 15 (defined here but not used below).
true = Partition(tuple([1]*15+[2]*15+[3]*15+[4]*15))
# Score the inferred partition.
sc = Scorer(c.records)
score = sc.score(p)
print(score)
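The example builds a reference partition (true) but never compares it with the inferred one. One common way to measure agreement between two flat partitions is the Rand index, sketched below in plain Python. This is an assumption about how such a comparison could be done, not part of treeCl's API: the function rand_index and the label sequences are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    A pair agrees when both partitions put the two items in the same
    cluster, or both put them in different clusters. Label values are
    arbitrary: only the grouping matters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same number of items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items to compare partitions")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / float(len(pairs))


# Reference partition with the same shape as in the example: four groups of 15
reference = tuple([1] * 15 + [2] * 15 + [3] * 15 + [4] * 15)

# The same grouping under different label names scores a perfect 1.0
relabelled = tuple([4] * 15 + [3] * 15 + [2] * 15 + [1] * 15)
print(rand_index(reference, relabelled))  # 1.0

# Merging all items into a single cluster agrees on far fewer pairs
merged = tuple([1] * 60)
print(rand_index(reference, merged))
```

A score of 1.0 means the two partitions group the items identically. Lower values mean more pairs are grouped differently.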
