niu-lab / gclust

genome sized sequences clustering

No output with the actual genome/contigs clusters sequence

VadimDu opened this issue

Dear developers,

Thank you for this useful tool. I followed your instruction manual, but running gclust exactly as you did produces no output file containing the actual nucleotide sequences of the clusters, only a list of the clusters with the genomes/contigs in each.
The tool would be much easier to use if it produced output similar to cd-hit's, with a FASTA file of the cluster representatives.

Thank you and best regards
Vadim

Hi!
Thank you for the suggestion. Because large input genomes affect the software's I/O performance, we wrote a separate shell script to run when the sequences are needed. For a specific example, please refer to Example step 3 in the README.md file.

Best regards
Ruilin

I had the same question before.
I wrote a script similar to Example step 3 to solve this, named "gclust2fa.py".

# get the clusters' representative sequences from the FASTA file and the gclust cluster output
# usage: python3 gclust2fa.py raw.fa gclust.out cluster.fa
# the cluster input looks like:
'''
>Cluster 0
0       5888230nt, >seq1... *
>Cluster 1
0       4800869nt, >seq2... *
>Cluster 2
0       3906592nt, >seq3... *
1       20nt, >seq4... at -/100.00%

'''

import sys

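# input: the original FASTA file
fasta_file = sys.argv[1]
# input: gclust cluster output (gclust.out)
clust_file = sys.argv[2]
# output: FASTA file of representative sequences
outfa = sys.argv[3]
# refuse to overwrite the input FASTA
if fasta_file == outfa:
    sys.exit("error: output file must differ from the input FASTA file")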

# header names (with the leading '>') of the representative sequences
representative_ctgs = set()

i = 0  # number of clusters
with open(clust_file) as f:
    for line in f:
        if line.startswith('>'):
            i += 1
        else:
            temp = line.split()
            # skip blank lines; the representative line ends with '*'
            if len(temp) >= 3 and temp[-1] == '*':
                # e.g. '>seq1...' -> '>seq1'
                ctgname = temp[2].rstrip('.')
                representative_ctgs.add(ctgname)

print("Representative number: " + str(i))

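outFlag = 0
with open(outfa, 'w') as fout:
    with open(fasta_file) as f:
        for line in f:
            if line.startswith('>'):
                # compare only the sequence ID (first token), so headers
                # that carry a description after the ID still match
                if line.split()[0] in representative_ctgs:
                    outFlag = 1
                else:
                    outFlag = 0
            if outFlag == 1:
                fout.write(line)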
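
If you also want to see which members belong to each cluster, the same cluster file can be parsed into a small dictionary. This is only a rough sketch: `parse_clusters` and `MEMBER_RE` are names I made up (they are not part of gclust), and the regular expression assumes the line layout shown in the sample at the top of the script, so adjust it if your output looks different.

```python
import re

# Hypothetical helper, not part of gclust. Assumes member lines look like
# "0       5888230nt, >seq1... *" as in the sample quoted in the script above.
MEMBER_RE = re.compile(r"^\d+\s+(\d+)nt,\s+>(\S+?)\.\.\.\s+(.*)$")


def parse_clusters(lines):
    """Return {cluster_name: {"rep": name or None, "members": [(name, length), ...]}}."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # cluster header, e.g. ">Cluster 0"
            current = line[1:]
            clusters[current] = {"rep": None, "members": []}
            continue
        m = MEMBER_RE.match(line)
        if m is None or current is None:
            continue  # skip lines that do not fit the assumed layout
        length, name, status = int(m.group(1)), m.group(2), m.group(3)
        clusters[current]["members"].append((name, length))
        if status == "*":
            # the representative is the member marked with '*'
            clusters[current]["rep"] = name
    return clusters


sample = """\
>Cluster 0
0       5888230nt, >seq1... *
>Cluster 1
0       4800869nt, >seq2... *
>Cluster 2
0       3906592nt, >seq3... *
1       20nt, >seq4... at -/100.00%
"""

clusters = parse_clusters(sample.splitlines())
for cluster_name, info in clusters.items():
    print(cluster_name, info["rep"], len(info["members"]))
```

For a real run you would pass an open file handle instead of `sample.splitlines()`, for example `parse_clusters(open("gclust.out"))`.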