No output with the actual genome/contigs clusters sequence

Question

VadimDu opened this issue 4 years ago · comments

Vadim (Dani) Dubinsky commented 4 years ago

Dear developers,

Thank you for this useful tool. I followed your instruction manual; however, when I run gclust exactly as you did, no output file with the actual nucleotide sequences of the clusters is produced, only a list of the clusters with the genomes/contigs in each.
The tool would be much better and easier to use if it produced output similar to cd-hit's, with a FASTA file of the representative cluster sequences.

Thank you and best regards
Vadim

sclirl · Answer 1 · Tue Dec 29 2020 18:01:27 GMT+0800 (China Standard Time)

Hi!
Thank you for your great suggestion. Because writing out the sequences of large input genomes would affect the I/O performance of the software, we wrote a shell script that you can run when you need the sequences. For a specific example, please refer to Example step 3 in the README.md file.

Best regards
Ruilin

Hongzhang Xue · Answer 2 · Mon Apr 12 2021 22:05:28 GMT+0800 (China Standard Time)

I had the same question before.
I wrote a script similar to Example step 3 to solve this, named "gclust2fa.py".

# get the clusters' representative sequences from the FASTA file and the gclust cluster output
# usage: python3 gclust2fa.py raw.fa gclust.out cluster.fa
# the gclust cluster output looks like:
'''
>Cluster 0
0       5888230nt, >seq1... *
>Cluster 1
0       4800869nt, >seq2... *
>Cluster 2
0       3906592nt, >seq3... *
1       20nt, >seq4... at -/100.00%

'''

import sys

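# input: original FASTA file
fasta_file = sys.argv[1]
# gclust.out: cluster list produced by gclust
clust_file = sys.argv[2]
# output: FASTA file of representative sequences
outfa = sys.argv[3]
if fasta_file == outfa:
    # refuse to overwrite the input FASTA file
    sys.exit("Error: the output file must differ from the input FASTA file")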

# names of the representative sequences, stored with the leading '>'
representative_ctgs = set()

with open(clust_file) as f:
    for line in f:
        if line.startswith('>'):
            # cluster header line, e.g. ">Cluster 0"
            continue
        temp = line.split()
        # skip blank or malformed lines; a representative line ends with '*'
        if len(temp) >= 3 and temp[-1] == '*':
            # ">seq1..." -> ">seq1"
            representative_ctgs.add(temp[2].rstrip('.'))

print("Representative number: " + str(len(representative_ctgs)))

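outFlag = 0
with open(outfa, 'w') as fout, open(fasta_file) as f:
    for line in f:
        if line.startswith('>'):
            # Compare only the first header token (e.g. ">seq1"), so headers
            # that carry a description after the name still match. This assumes
            # the cluster file lists names up to the first whitespace.
            if line.split()[0] in representative_ctgs:
                outFlag = 1
            else:
                outFlag = 0
        if outFlag == 1:
            fout.write(line)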
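
To check the logic without running gclust, the same two steps can be tried on small in-memory strings. This is only a sketch: the function names `parse_representatives` and `filter_fasta` and the toy data are made up for illustration, and it assumes the cluster list has the format shown in the example input above (representative lines end with `*`, names are followed by `...`).

```python
import io


def parse_representatives(clust_text):
    """Collect representative sequence names (without '>') from cluster text."""
    reps = set()
    for line in io.StringIO(clust_text):
        fields = line.split()
        # Cluster header lines (">Cluster N") have only two fields and are skipped.
        # Member lines ending with '*' mark the cluster representative.
        if len(fields) >= 3 and fields[-1] == '*':
            name = fields[2]
            if name.endswith('...'):
                name = name[:-3]  # drop the trailing "..." added in the listing
            reps.add(name.lstrip('>'))
    return reps


def filter_fasta(fasta_text, reps):
    """Keep only the records whose first header token is in reps."""
    kept = []
    keep = False
    for line in io.StringIO(fasta_text):
        if line.startswith('>'):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in reps
        if keep:
            kept.append(line)
    return ''.join(kept)


# Toy data, not real gclust output.
CLUSTERS = """>Cluster 0
0       5888230nt, >seq1... *
>Cluster 1
0       3906592nt, >seq3... *
1       20nt, >seq4... at -/100.00%
"""

FASTA = """>seq1 first genome
ACGT
ACGT
>seq3
GGCC
>seq4
TTAA
"""

reps = parse_representatives(CLUSTERS)
result = filter_fasta(FASTA, reps)
print(result, end='')
```

Here seq4 is dropped because it is a non-representative member of Cluster 1, while seq1 is kept even though its FASTA header carries a description after the name.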