neherlab / pan-genome-analysis

Processing pipeline for pan-genome visualization and exploration

Home Page: http://pangenome.de

Error in step06: align genes in geneCluster by mafft and build gene trees

johanneswerner opened this issue

I tried to compare Sulfurimonas genomes but the workflow didn't finish successfully.

./panX.py -fn data/BS_Sulfurimonas -sl Sulfurimonas -t 28

(...)

======  starting step06: align genes in geneCluster by mafft and build gene trees
Traceback (most recent call last):
  File "./panX.py", line 287, in <module>
    myPangenome.process_clusters()
  File "/data/tools/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
    myClusterCollector.estimate_raw_core_diversity()
  File "/data/tools/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
    self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
  File "/data/tools/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
    calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
  File "/data/tools/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
    with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: '/data/tools/pan-genome-analysis/data/BS_Sulfurimonas/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

Do you have any idea where the problem might originate and how I could solve it? If there is more information I can provide, please let me know.

I am including many draft genomes in my analysis. Could this be part of the problem?

Draft genomes per se are not a problem, but incomplete or very diverged genomes are. Try rerunning with -cg 0.7 so that all genes present in >70% of genomes are treated as core genes.
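
To illustrate what such a threshold does, here is a minimal Python sketch. It is not panX's actual implementation: the function name `core_clusters`, the input layout (cluster name mapped to the set of strains carrying it), and the inclusive `>=` comparison are assumptions made for illustration, and the pipeline's own boundary handling may differ.

```python
def core_clusters(cluster_to_strains, n_strains, threshold=1.0):
    """Return the sorted names of clusters found in at least
    `threshold` (a fraction between 0 and 1) of all strains.

    Hypothetical helper for illustration only.
    """
    core = []
    for name, strains in sorted(cluster_to_strains.items()):
        # Fraction of genomes in which this gene cluster is present.
        fraction = len(set(strains)) / float(n_strains)
        if fraction >= threshold:
            core.append(name)
    return core


if __name__ == "__main__":
    # Toy data: 10 strains, three gene clusters with different coverage.
    strains = ["s%d" % i for i in range(10)]
    clusters = {
        "GC_a": set(strains),      # present in 10/10 genomes
        "GC_b": set(strains[:8]),  # present in 8/10 genomes
        "GC_c": set(strains[:3]),  # present in 3/10 genomes
    }
    print(core_clusters(clusters, len(strains), 1.0))
    print(core_clusters(clusters, len(strains), 0.7))
```

In the toy data, a strict threshold of 1.0 keeps only the cluster found in every genome, while 0.7 also keeps the one found in 8 of 10. With incomplete genomes in the set, a relaxed threshold can therefore leave considerably more core genes to work with.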

Thank you for the information. I re-ran the code with the additional -cg 0.7 parameter, but now the workflow breaks here:

======  starting step08: run fasttree and raxml for tree construction
 fasttree time-cost:  4.51 minutes (270.51 seconds)
RAxML tree optimization within the timelimit of 30 minutes
RAxML branch length optimization and rooting
Traceback (most recent call last):
  File "./panX.py", line 303, in <module>
    myPangenome.build_core_tree()
  File "/data/tools/pan-genome-analysis/scripts/pangenome_computation.py", line 200, in build_core_tree
    aln_to_Newick(self.path, self.folders_dict, self.raxml_max_time, self.raxml_path, self.threads)
  File "/data/tools/pan-genome-analysis/scripts/sf_core_tree_build.py", line 75, in aln_to_Newick
    shutil.copy('RAxML_result.branches', out_fname)
  File "/data/miniconda3/envs/panX/lib/python2.7/shutil.py", line 119, in copy
    copyfile(src, dst)
  File "/data/miniconda3/envs/panX/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'RAxML_result.branches'

Do you have any ideas?

Please check the raxml.log.
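
For long logs, a small filter can surface the relevant lines. This is a minimal sketch that assumes fatal RAxML messages are prefixed with `ERROR:`, which not every message necessarily is; the function name `raxml_errors` and the log path in the usage example are hypothetical.

```python
import os


def raxml_errors(log_text):
    """Return the stripped lines of a log that start with 'ERROR:'."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.strip().startswith("ERROR:")
    ]


if __name__ == "__main__":
    log_path = "raxml.log"  # assumed location; adjust to your run directory
    if os.path.exists(log_path):
        with open(log_path) as handle:
            for message in raxml_errors(handle.read()):
                print(message)
    else:
        print("no log found at %s" % log_path)
```

If the filter returns nothing, reading the whole log is still worthwhile, since warnings without that prefix may also explain a missing output file.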

Thank you for the information; the content of raxml.log is below. Removing the two sequences it reports solved the problem.

Option -T does not have any effect with the sequential or parallel MPI version.
It is used to specify the number of threads for the Pthreads-based parallelization

RAxML can't, parse the alignment file as phylip file 
it will now try to parse it as FASTA file

ERROR: Sequence GCA_002742735.1_UBA10385_genomic consists entirely of undetermined values which will be treated as missing data
ERROR: Sequence GCA_002742775.1_UBA12504_genomic consists entirely of undetermined values which will be treated as missing data
ERROR: Found 2 sequences that consist entirely of undetermined values, exiting...
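
To spot such sequences before RAxML is run, an alignment can be screened up front. The following is a minimal sketch, not part of panX: the helper names are hypothetical, the input is assumed to be a FASTA alignment, and the set of characters counted as undetermined (`-`, `?`, `N`, `n`) is an assumption that may need extending for your data.

```python
def read_fasta(lines):
    """Parse FASTA lines into an ordered list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def split_undetermined(records, undetermined="-?Nn"):
    """Split records into (kept, dropped).

    A record is dropped when every character is in `undetermined`
    (empty sequences are dropped as well).
    """
    kept, dropped = [], []
    for name, seq in records:
        if all(ch in undetermined for ch in seq):
            dropped.append((name, seq))
        else:
            kept.append((name, seq))
    return kept, dropped


def to_fasta(records):
    """Serialize (name, sequence) pairs back to FASTA text."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)


if __name__ == "__main__":
    # Toy alignment; names and sequences are made up for illustration.
    toy = [">strain_A", "ACGT-ACG", ">strain_B", "--------", ">strain_C", "NNNN??A-"]
    kept, dropped = split_undetermined(read_fasta(toy))
    print("dropped:", [name for name, _ in dropped])
    print(to_fasta(kept))
```

The sketch only identifies the affected entries. In this thread the fix was to remove the corresponding genomes and rerun, so the dropped names are best read as a list of genomes to exclude from the input set.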