Roary output for downstream panX pipeline

Question

wshropshire opened this issue 6 years ago · comments

William Shropshire commented 6 years ago

Hello,

I have had an issue using my Roary output with your tool. I have tried both a 95-sample set and a smaller set (n=10, annotated with Prokka-1.13.3 using its GenBank-compliant settings) and hit similar problems when running steps 1 - 11 with step 2 omitted. There appears to be a hardcoding issue related to Diamond, at least as I interpret the error message; as a novice coder, I have not been able to figure it out. It seems that before the MAFFT step, the pipeline expects a temporary file from the Diamond output: tmp_core_diversity.txt. Here is the content of my error file, which is the same with either Roary clustered_proteins file:

Traceback (most recent call last):
File "./panX.py", line 287, in <module>
myPangenome.process_clusters()
File "/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/scripts/pangenome_computation.py", line 180, in process_clusters
myClusterCollector.estimate_raw_core_diversity()
File "/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
File "/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
File "/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: '/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/data/HTX_Kpn_Roary/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

Any feedback would be appreciated on how to resolve this issue! Thanks!

Richard Neher · Answer 1 · Mon Nov 12 2018 23:25:00 GMT+0800 (China Standard Time)

Hi William,

thanks for your note. I discussed this with Wei. Possibly the issue is not so much a missing file (even though the file path might make it look that way) but that no genes look like core genes. This happens frequently with Roary if the genomes are moderately diverged or incomplete. Could you check how much of the path exists? And maybe send the directory tree of your output dir?

thanks,
richard
ps: but we certainly need better error handling here....
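
One quick way to check how much of the path exists is to walk up from the missing file until an existing directory is found. The following is a minimal Python sketch, not part of panX; the helper name deepest_existing_ancestor is made up for illustration, and the path in the usage lines is the one from the traceback above.

```python
from pathlib import Path


def deepest_existing_ancestor(path):
    """Return the longest leading part of `path` that exists on disk."""
    p = Path(path)
    # Try the full path first, then each parent, from deepest to shallowest.
    for candidate in [p, *p.parents]:
        if candidate.exists():
            return candidate
    return None  # nothing along the path exists


if __name__ == "__main__":
    missing = (
        "/data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/data/"
        "HTX_Kpn_Roary/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt"
    )
    print("deepest existing part:", deepest_existing_ancestor(missing))
```

If the result is the tmp_core directory itself, the directories were created and only the final file was never written.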

William Shropshire · Answer 2 · Tue Nov 13 2018 08:13:37 GMT+0800 (China Standard Time)

Hello, thanks for your quick response.

The pipeline generates the amino acid FASTA files and creates the tmp_core directory, but it does not create a tmp_core_diversity.txt file. I have used this data set for phylogenetic analysis with RAxML and FastTree, and the genomes are fairly homologous, so I don't think it's a divergence or incompleteness issue. Furthermore, with the reduced set I still get a core genome of ~4000 genes. The smaller test set is what throws this error (I realized the initial issue was not having 'compliant' GenBank annotations, which Roary is agnostic to), but I still believe it might be a directory issue.

The listing of my output directory is below (if this is not what you're looking for, please correct me):

-rw-rw-r-- 1 wshropshire sbmmgadlp001-users 634K Nov 10 18:53 clustered_proteins
drwxrwsr-x 2 wshropshire sbmmgadlp001-users 4.0K Nov 10 18:54 geneCluster
-rw-rw-r-- 1 wshropshire sbmmgadlp001-users 3.6M Nov 10 18:59 geneID_to_description.cpk
-rw-rw-r-- 1 wshropshire sbmmgadlp001-users 3.6M Nov 10 18:59 geneID_to_geneSeqID.cpk
drwxrwsr-x 2 wshropshire sbmmgadlp001-users 4.0K Nov 10 18:54 input_GenBank
drwxrwsr-x 2 wshropshire sbmmgadlp001-users 4.0K Nov 10 18:54 log
-rw-rw-r-- 1 wshropshire sbmmgadlp001-users 515 Nov 10 18:59 metainfo.tsv
drwxrwsr-x 2 wshropshire sbmmgadlp001-users 4.0K Nov 10 18:59 nucleotide_fna
drwxrwsr-x 3 wshropshire sbmmgadlp001-users 4.0K Nov 10 18:59 protein_faa
drwxrwsr-x 2 wshropshire sbmmgadlp001-users 4.0K Nov 10 18:54 RNA_fna
-rw-rw-r-- 1 wshropshire sbmmgadlp001-users 88 Nov 10 18:59 strain_list.cpk
drwxrwsr-x 2 wshropshire sbmmgadlp001-users 4.0K Nov 10 18:54 tmp_core
drwxrwsr-x 3 wshropshire sbmmgadlp001-users 4.0K Nov 10 18:54 vis

Richard Neher · Answer 3 · Wed Nov 14 2018 19:51:48 GMT+0800 (China Standard Time)

Wei was kind enough to generate a Roary test set. If you check out the branch roary-test, you'll find a small number of genomes and the Roary clusters in the data folder. Could you please run run-Pm_roary.sh and check whether it produces the desired output?

William Shropshire · Answer 4 · Thu Nov 15 2018 01:29:11 GMT+0800 (China Standard Time)

Hello Richard,

Wei's test set ran beautifully and did not throw any errors (also, please thank him for writing that script!). It did give me insight into what may be the problem with my test set. It appears that Wei's test set was annotated with NCBI's Prokaryotic Genome Annotation Pipeline (PGAP), which is much cleaner than Prokka. When I ran the same command line that was used for his Roary test set on mine and looked at the log file, it appeared that, although I had used the "compliant" option for Prokka, the pipeline still does not recognize this format as GenBank compliant:

```
====== starting step05: cluster proteins
./data/HTX_Kpn_Roary/clustered_proteins /data/opt/programs/etc/panX/v1.6.0/pan-genome-analysis-1.6.0/d$
C234_01598 non-coding gene from Roary file (skipped)
...
...
...
C244_03336 non-coding gene from Roary file (skipped)
====== time for step05:
0.00 minutes (0.24 seconds)

====== starting step06: align genes in geneCluster by mafft and build gene trees
```

Here is an example of a GFF file produced by Prokka in "GenBank compliant" mode, so that y'all can check it out for yourselves:
C234.gff.zip

Anyways, I'm fairly certain it's a Prokka compatibility issue, since that was the first error I ran into. Thank you for spending time looking into this!

Richard Neher · Answer 5 · Thu Nov 15 2018 04:22:11 GMT+0800 (China Standard Time)

My guess is that your problem has something to do with truncation or special characters in locus tags. panX will try to link Roary clusters to protein-coding genes by looking up their names in a dictionary:

https://github.com/neherlab/pan-genome-analysis/blob/master/scripts/sf_cluster_protein.py#L150

The names are extracted by some split operation on the compound name. If this lookup somehow fails, it won't find any clusters down the line.
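
To illustrate this failure mode, here is a rough Python sketch. It is not panX's actual code. It assumes each line of the Roary clustered_proteins file has the form "cluster_name: id1<TAB>id2 ...", and that the pipeline holds a set of protein-coding gene IDs taken from the GenBank files. The function names and the S1_/S2_ IDs are hypothetical; the C234_/C244_ IDs are the ones from the log above.

```python
def parse_clustered_proteins(lines):
    """Parse 'cluster_name: id1<TAB>id2 ...' lines into {cluster_name: [ids]}."""
    clusters = {}
    for line in lines:
        name, sep, members = line.strip().partition(":")
        if not sep:
            continue  # no colon, so not a cluster line
        clusters[name.strip()] = members.split()
    return clusters


def link_to_coding_genes(clusters, coding_gene_ids):
    """Keep only gene IDs that are known coding genes; collect the rest."""
    kept, skipped = {}, []
    for name, gene_ids in clusters.items():
        found = [g for g in gene_ids if g in coding_gene_ids]
        skipped.extend(g for g in gene_ids if g not in coding_gene_ids)
        if found:  # a cluster with no recognized genes disappears entirely
            kept[name] = found
    return kept, skipped


if __name__ == "__main__":
    example = [
        "groupA: S1_00001\tS2_00001",      # hypothetical IDs
        "groupB: C234_01598\tC244_03336",  # IDs from the log above
    ]
    kept, skipped = link_to_coding_genes(
        parse_clustered_proteins(example),
        coding_gene_ids={"S1_00001", "S2_00001"},
    )
    print("kept clusters:", kept)
    print("skipped IDs:", skipped)
```

If the IDs in the Roary file never match the keys of the lookup, every gene is skipped and no cluster survives, which would be consistent with step05 finishing in a fraction of a second.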

William Shropshire · Answer 6 · Thu Nov 15 2018 10:08:34 GMT+0800 (China Standard Time)

Creating locus tags based on the sample name was what threw it off. It is working now, thank you (and Wei) so much!

Cameron Reid · Answer 7 · Mon Mar 16 2020 08:17:53 GMT+0800 (China Standard Time)

@wshropshire I've got the same problem as you. What did you change the locus tags to so that it worked? Does the same gene sequence in two samples need to have the same locus tag?

William Shropshire · Answer 8 · Mon Mar 16 2020 09:20:04 GMT+0800 (China Standard Time)

If you are using Prokka, just use the Prokka '--compliant' option; that should create standardized locus tags, which should fix the downstream problems. At this point, I forget exactly why this was creating errors in the pan-genome-analysis pipeline, but I believe it was a parsing error.

Cameron Reid · Answer 9 · Mon Mar 16 2020 10:01:14 GMT+0800 (China Standard Time)

I've just generated Prokka output with the --compliant option, and the locus tags are identical to those created without the option. I've tried replacing the Prokka-generated hash-based tags with PROKKA_, but panX didn't like that either. Could it simply be that it doesn't like underscores?

William Shropshire · Answer 10 · Wed Mar 18 2020 01:10:14 GMT+0800 (China Standard Time)

I'd look and see if there is a discrepancy in the locus tags and what appears in the gene presence/absence csv file. This is what the headers look like in my clustered_proteins file generated by panX:

My guess is that the prefix 'PROKKA' is the issue.
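
One way to look for such a discrepancy is to pull the locus tags out of a GenBank file and compare them, and their prefixes, with the IDs that appear in the Roary output. The sketch below is only an illustration: it assumes locus tags appear as /locus_tag="..." qualifiers in the GenBank text, the function names are made up, and the sample text and IDs are invented for the example.

```python
import re
from collections import Counter

# Matches qualifiers such as /locus_tag="C234_00001" in GenBank text.
LOCUS_TAG_RE = re.compile(r'/locus_tag="([^"]+)"')


def locus_tags(genbank_text):
    """Return the set of locus tags found in a GenBank file's text."""
    return set(LOCUS_TAG_RE.findall(genbank_text))


def prefix_counts(tags):
    """Count tags by the part before the last underscore ('' if none)."""
    return Counter(tag.rpartition("_")[0] for tag in tags)


if __name__ == "__main__":
    # Invented GenBank fragment; gene and CDS features share a locus tag.
    text = (
        '     gene            1..10\n'
        '                     /locus_tag="C234_00001"\n'
        '     CDS             1..10\n'
        '                     /locus_tag="C234_00001"\n'
    )
    genbank_ids = locus_tags(text)
    roary_ids = {"C234_00001", "PROKKA_00002"}  # invented IDs from a Roary file
    print("prefixes in GenBank:", dict(prefix_counts(genbank_ids)))
    print("in Roary but not in GenBank:", sorted(roary_ids - genbank_ids))
```

Any ID listed in the last line would be a candidate for the kind of mismatch discussed in this thread.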