neherlab / pan-genome-analysis

Processing pipeline for pan-genome visualization and exploration

Home Page: http://pangenome.de

File input

mdieser opened this issue

Hi - Thanks for making this pipeline available! Quick question, are .gbk files the only accepted input file format?

Thanks for your time and have a great weekend,

Markus

Yes, currently we only support gbk. It wouldn't be hard to relax this, but I don't have time right now.

richard

Thanks for your response. Just an FYI, I'm trying to use my own .faa and .fna files. Currently, I'm trying to adapt the sf_extract_sequences.py script to produce the correct .cpk output files for step 3. Also, if I want to make my own strain_list file, should the entries be in a single column or tab-delimited? Or are these attempts completely off track?

Markus

Hi Richard - is there a way to work with gbk files that link to contigs rather than listing the genes and CDSs? Example: https://www.ncbi.nlm.nih.gov/nuccore/AMWD00000000.1/

Not sure. You can always reannotate whatever sequences you have with Prokka or a similar tool.

There is a "-ngbk" parameter in the pipeline for that purpose.
With that option, one can use nucleotide/amino acid sequence files (fna/faa) as input when GenBank files are not available.
For example:
./panX.py -ngbk -fn ./data/TestSet -sl TestSet -t 64 > TestSet.log
Using the "-ngbk" option means that there will be no functional annotation of CDS (from the GenBank file) in the final results.
In addition, if one has contig sequences, Prokka can generate GenBank files from them.
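
For the Prokka route, a minimal sketch of batch-annotating contig files might look like the following. It is not part of panX. The directory names, the ".fna" extension for contig files, and the helper names build_prokka_command and annotate_all are assumptions for illustration. Only the Prokka options --outdir, --prefix and --cpus are used. Recent Prokka versions write the GenBank file as <prefix>.gbk in the output directory (older releases used .gbf), so check the output before copying it into the panX input folder.

```python
import shutil
import subprocess
from pathlib import Path


def build_prokka_command(contig_fasta, out_root, cpus=4):
    """Return the Prokka argument list for one contig FASTA file.

    The strain name is taken from the file name (an assumption), e.g.
    "strainA" from "strainA.fna", and used as both output folder and prefix.
    """
    contig_fasta = Path(contig_fasta)
    strain = contig_fasta.stem
    return [
        "prokka",
        "--outdir", str(Path(out_root) / strain),
        "--prefix", strain,
        "--cpus", str(cpus),
        str(contig_fasta),
    ]


def annotate_all(contig_dir, out_root, cpus=4, dry_run=False):
    """Build (and, unless dry_run is set, run) one Prokka command per *.fna file."""
    fasta_files = sorted(Path(contig_dir).glob("*.fna"))
    commands = [build_prokka_command(f, out_root, cpus) for f in fasta_files]
    if not dry_run:
        if shutil.which("prokka") is None:
            raise RuntimeError("prokka executable not found on PATH")
        for cmd in commands:
            # check=True stops at the first failed annotation
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical usage: preview the commands without running Prokka
# for cmd in annotate_all("./contigs", "./annotated", dry_run=True):
#     print(" ".join(cmd))
```

Running with dry_run=True first makes it easy to confirm the strain names before starting a long annotation job.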

Both suggestions worked. Deeply appreciate all the support! Thank you