No output with the actual genome/contigs clusters sequence
VadimDu opened this issue
Dear developers,
Thank you for this useful tool. I followed the instruction manual, but when I run gclust exactly as you did, no output file with the actual cluster nucleotide sequences is produced, only a list of the clusters with the genomes/contigs in each.
The tool would be much better and easier to use if it produced output similar to cd-hit, with a FASTA file of the cluster representatives.
Thank you and best regards
Vadim
Hi!
Thank you for the great suggestion. Because writing out the sequences for large input genomes would hurt the software's I/O performance, we provide a shell script that can be run when needed. For a specific example, please refer to Example step 3 in the README.md file.
Best regards
Ruilin
I had the same question before.
To solve this, I wrote a script similar to Example step 3, named "gclust2fa.py".
# Get the clusters' representative sequences from a FASTA file and the gclust cluster output.
# Usage: python3 gclust2fa.py raw.fa gclust.out cluster.fa
# The gclust output is expected to look like:
'''
>Cluster 0
0 5888230nt, >seq1... *
>Cluster 1
0 4800869nt, >seq2... *
>Cluster 2
0 3906592nt, >seq3... *
1 20nt, >seq4... at -/100.00%
'''
import sys

# Input: the FASTA file that was given to gclust
fasta_file = sys.argv[1]
# Input: the gclust cluster listing (gclust.out)
clust_file = sys.argv[2]
# Output: FASTA file containing only the representative sequences
outfa = sys.argv[3]
if fasta_file == outfa:
    sys.exit("The output file must differ from the input FASTA file")

# Collect the names of the representatives (member lines ending in '*')
representative_ctgs = set()
with open(clust_file) as f:
    for line in f:
        if line.startswith('>'):
            continue  # cluster header, e.g. ">Cluster 0"
        temp = line.split()
        if len(temp) > 2 and temp[-1] == '*':
            # temp[2] looks like ">seq1..."; drop the trailing dots
            representative_ctgs.add(temp[2].rstrip('.'))
print("Representative number: " + str(len(representative_ctgs)))

# Copy each representative record (header plus sequence lines) to the output
outFlag = False
with open(outfa, 'w') as fout, open(fasta_file) as f:
    for line in f:
        if line.startswith('>'):
            # Compare only the ID (first word), so headers that carry a
            # description after the ID still match
            outFlag = line.split()[0] in representative_ctgs
        if outFlag:
            fout.write(line)
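
For anyone who wants to check the parsing logic without running gclust first, here is a minimal sketch of the same two steps on in-memory data. The cluster listing is the sample from the script's comment. The FASTA records, the regular expression, and the function names `representative_names` and `filter_fasta` are made up for illustration and are not part of gclust. It assumes every member line has the form `index lengthnt, >name... flag` and that the listed name equals the first word of the FASTA header.

```python
import io
import re

# Matches member lines such as "0 5888230nt, >seq1... *".
# Assumption: the name is everything between '>' and the trailing '...'.
MEMBER_RE = re.compile(r"^\d+\s+\d+nt,\s+>(?P<name>.+?)\.\.\.\s+(?P<rest>.*)$")


def representative_names(clust_lines):
    """Return the set of sequence names flagged with '*' in a cluster listing."""
    names = set()
    for line in clust_lines:
        m = MEMBER_RE.match(line.strip())
        if m and m.group("rest").strip() == "*":
            names.add(m.group("name"))
    return names


def filter_fasta(fasta_lines, names):
    """Yield the lines of FASTA records whose ID (first header word) is in names."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in names
        if keep:
            yield line


# Sample cluster listing, copied from the comment in the script above
clusters = """>Cluster 0
0 5888230nt, >seq1... *
>Cluster 1
0 4800869nt, >seq2... *
>Cluster 2
0 3906592nt, >seq3... *
1 20nt, >seq4... at -/100.00%
"""

# Hypothetical FASTA input, invented for this example
fasta = """>seq1 first genome
ACGT
>seq4
TTTT
>seq3
GGCC
AATT
"""

reps = representative_names(io.StringIO(clusters))
kept = "".join(filter_fasta(io.StringIO(fasta), reps))
print(sorted(reps))
print(kept, end="")
```

With this input, seq4 is dropped because its line ends in `at -/100.00%` rather than `*`, while seq1 is kept even though its header carries a description after the ID.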