TreeCl is a Python package for clustering gene families by phylogenetic similarity. It takes a collection of alignments, infers their phylogenetic trees, and clusters them based on a matrix of between-tree distances. Finally, it calculates a single representative tree for each cluster.
The aim is to establish whether the data have any underlying structure.
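To illustrate the clustering step described above, here is a minimal, standalone sketch of single-linkage agglomerative clustering on a precomputed distance matrix. It is not treeCl's implementation: the function name `single_linkage` and the toy matrix `toy_dm` are hypothetical, and the distances are made up for illustration rather than computed from trees.

```python
def single_linkage(dm, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    dm is a symmetric square matrix (list of lists) of pairwise distances.
    Returns a list of cluster labels (1-based), one per item.
    """
    if not 1 <= n_clusters <= len(dm):
        raise ValueError("n_clusters must be between 1 and the number of items")
    clusters = [[i] for i in range(len(dm))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members
                d = min(dm[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(dm)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels


# Toy distance matrix (made-up values): items 0 and 1 are close,
# and items 2 and 3 are close.
toy_dm = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
labels = single_linkage(toy_dm, 2)
print(labels)  # [1, 1, 2, 2]
```

In treeCl the items being clustered are trees and the matrix holds between-tree distances, but the grouping logic is the same in spirit.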
Clone the repo with submodules using
git clone --recursive git@github.com:kgori/treeCl.git
and add the cloned directory to your $PYTHONPATH.
- numpy (v1.6.2)
- scipy (v0.11.0)
- dendropy (v3.12.0)
- scikit-learn (v0.12.1)
- biopython (v1.60), optional: needed for k-medoids clustering only
The easiest way to install the dependencies is with pip. If you don't have pip, install it by running
easy_install pip
in a shell.
Then install the required packages above (the optional biopython is not included) by running this command:
pip install numpy scipy dendropy scikit-learn
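After installing, it can be useful to confirm that the packages are importable. The following is a small sketch, not part of treeCl: the helper name `check_modules` is hypothetical. Note that two packages are imported under names that differ from their pip names (scikit-learn as `sklearn`, biopython as `Bio`).

```python
import importlib

# Import names for the dependencies listed above.
# scikit-learn is imported as "sklearn" and biopython as "Bio".
MODULES = ["numpy", "scipy", "dendropy", "sklearn", "Bio"]


def check_modules(names):
    """Return {module name: version string, or None if not importable}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown"
            found[name] = getattr(module, "__version__", "unknown")
    return found


for name, version in check_modules(MODULES).items():
    print(name, version if version is not None else "MISSING")
```

A missing `Bio` entry is only a problem if you intend to use k-medoids clustering.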
- GTP: a Java program for calculating geodesic distances (see "A Fast Algorithm for Computing Geodesic Distances in Tree Space")
from treeCl.collection import Collection, Scorer
from treeCl.clustering import Clustering, Partition

# Load the alignments; add compression='gz' or 'bz2' if the sequence
# alignments are compressed (zip is not supported yet)
c = Collection(input_dir='input_dir', file_format='phylip', datatype='protein')
c.calc_NJ_trees()  # add verbosity=1 or higher to get progress messages
dm = c.distance_matrix('euc')  # matrix of between-tree distances
cl = Clustering(dm)
p = cl.hierarchical(4, 'single')  # should give fairly inaccurate clustering
# Reference partition of four groups of 15 (not used by the calls below)
true = Partition(tuple([1]*15+[2]*15+[3]*15+[4]*15))
sc = Scorer(c.records)
score = sc.score(p)
print(score)
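The example above builds a reference partition of four groups of 15 but does not show how to compare it with the inferred clustering. How treeCl compares Partition objects is not documented here, so the following is a standalone sketch using the Rand index on plain label lists. The function `rand_index` and the `merged` labelling are hypothetical and not part of treeCl.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labellings agree.

    A pair agrees if both labellings put the two items in the same cluster,
    or both put them in different clusters. Label values themselves do not
    matter, only which items are grouped together.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labellings must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / float(len(pairs))


# Same layout as the reference partition in the example: 4 groups of 15.
true_labels = [1] * 15 + [2] * 15 + [3] * 15 + [4] * 15
# A hypothetical clustering that wrongly merges the first two groups.
merged = [1] * 30 + [3] * 15 + [4] * 15

print(rand_index(true_labels, true_labels))  # 1.0
print(rand_index(true_labels, merged))
```

A value of 1.0 means the two labellings group the items identically; lower values indicate more pairs on which they disagree.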