UpalabdhaD / tidy-ncbi-genome-download

Collect and filter genomes downloaded using [ncbi-genome-download](https://github.com/kblin/ncbi-genome-download), change file names to safe names for downstream processing (phylogenetic tree making etc.)




This is a tool set for handling genomes downloaded from NCBI with ncbi-genome-download.

The purpose of these tools is to connect semi-automatically to downstream tools. For example, phylophlan and mashtree need a set of genome files in one folder, use the file names in their output, and raise an error if the file names contain unaccepted characters (e.g. ( or space; regexp = [ _:,();{}+*'\"[\]\/\t\n]+).
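As an illustration of what a "safe name" could look like, here is a minimal sketch that collapses every run of the characters in the regexp above into a single underscore. The function name `safe_name` and the choice of underscore as the replacement are assumptions for this example, not necessarily what the scripts in this repository do.

```python
import re

# Character class copied from the regexp quoted above ("[" is escaped for Python).
UNSAFE = re.compile(r"""[ _:,();{}+*'"\[\]/\t\n]+""")


def safe_name(name: str) -> str:
    """Replace each run of unsafe characters with one underscore (hypothetical helper)."""
    return UNSAFE.sub("_", name).strip("_")


print(safe_name("Streptomyces coelicolor A3(2) ICSSB 1010"))
# Streptomyces_coelicolor_A3_2_ICSSB_1010
```

Because the underscore itself is part of the character class, repeated separators never produce double underscores in the result.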

Download target genomes from NCBI.

Please refer to ncbi-genome-download and download genomes of interest with -m switch (--metadata-table). Example:

ncbi-genome-download -F genbank -g "Streptomyces,Kitasatospora" -H -p 10 -r 3 -m ncbiftp-Streptomyces-Kitasatospora-gbk.tsv -o ncbiftp-Streptomyces-Kitasatospora-gbk bacteria

Downloads made with -F genbank, -F fasta, or -F protein-fasta are supported.

The -m switch (--metadata-table) is required.

The -H switch is optional.


usage: gather_assemblies.py [-h] [--excludeList EXCLUDELIST] [--maxCtg MAXCTG] [--targetDir TARGETDIR] tsv dir

positional arguments:
  tsv                   Path to the .tsv file generated by `-m` switch
  dir                   Path to the directory generated by `-o` parameter

  -h, --help            show this help message and exit
  --excludeList EXCLUDELIST
                        Exclusion list file, one item per line
  --maxCtg MAXCTG       Maximum number of contigs that a genome will be kept.
  --targetDir TARGETDIR
                        Valid assemblies will be copied to this directory.

This script checks the information in the .tsv file, parses strain names from it, removes duplicated genomes of the same strain, and renames files to the species + strain name format (e.g. "Streptomyces_coelicolor_A3_2_ICSSB_1010.fna.gz"). If the --maxCtg option is set, it also counts the sequences in each downloaded genome and discards genomes with more than this number of contigs.
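The strain parsing and de-duplication step might look roughly like the sketch below. It is not the code of gather_assemblies.py: the column names `organism_name`, `infraspecific_name` and `assembly_accession` are assumed to be present in the metadata table, the accessions in the demo are placeholders, and keeping the first row per strain is an arbitrary choice made for the example.

```python
import csv
import io


def strain_name(row: dict) -> str:
    """Build 'species strain' from one metadata row (column names are assumed)."""
    organism = row.get("organism_name", "").strip()
    infra = row.get("infraspecific_name", "").strip()
    # The field is assumed to look like "strain=M1154"; keep only the value.
    strain = infra.split("=", 1)[-1]
    if strain and not organism.endswith(strain):
        return f"{organism} {strain}"
    return organism


def first_per_strain(rows):
    """Keep the first row seen for each strain name, dropping later duplicates."""
    kept = {}
    for row in rows:
        kept.setdefault(strain_name(row), row)
    return kept


# Tiny in-memory table standing in for the .tsv file (placeholder accessions).
demo = io.StringIO(
    "assembly_accession\torganism_name\tinfraspecific_name\n"
    "ACC_1\tStreptomyces coelicolor\tstrain=M1154\n"
    "ACC_2\tStreptomyces coelicolor\tstrain=M1154\n"
    "ACC_3\tStreptomyces sp. X1\tstrain=X1\n"
)
unique = first_per_strain(csv.DictReader(demo, delimiter="\t"))
for name, row in unique.items():
    print(name, row["assembly_accession"])
```

The resulting strain names could then be turned into file names with a sanitising step that replaces unsafe characters.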

Note you can NOT set --maxCtg when protein fasta files are downloaded (since each protein is a single sequence that is counted as one 'contig').
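To make the contig check concrete, the following sketch counts FASTA header lines in a plain or gzipped file and compares the count to a limit. It only covers FASTA input (GenBank files would need a different record marker), and the helper names `count_sequences` and `within_limit` are made up for this example.

```python
import gzip
import os
import tempfile


def count_sequences(path: str) -> int:
    """Count FASTA records (lines starting with '>') in a plain or gzipped file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def within_limit(path: str, max_ctg: int) -> bool:
    """True when the file has at most max_ctg sequences, i.e. it would be kept."""
    return count_sequences(path) <= max_ctg


# Demo on a throwaway two-contig file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fna.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(">contig1\nACGT\n>contig2\nGGCC\n")
    n_contigs = count_sequences(demo_path)
    keep = within_limit(demo_path, 1)

print(n_contigs, keep)  # 2 False
```

Run on a protein FASTA file, the same count would equal the number of proteins, which is why a contig limit is not meaningful there.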

An exclusion list can be set for known duplicates of strains. The exclusion list is a tab-delimited text file. The first column is the name of the strain, the second column is the accession to be excluded:

strain accession
Streptomyces coelicolor M1154
Streptomyces coelicolor A3(2) R4-mCherry

The file will look like:

Streptomyces coelicolor M1154
Streptomyces coelicolor A3(2) R4-mCherry

Note that the program will try to match both accession and strain name if both are set on the same line.
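One way to read the matching rule is sketched below: a line that sets only the strain name excludes every assembly of that strain, while a line that sets both columns excludes an assembly only when both match. This interpretation, the treatment of accession-only lines, and the function names are assumptions; the accessions in the demo are placeholders.

```python
def parse_exclusions(text: str):
    """Return (strain, accession) pairs; a missing column becomes an empty string."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        strain, _, accession = line.partition("\t")
        rules.append((strain.strip(), accession.strip()))
    return rules


def is_excluded(strain: str, accession: str, rules) -> bool:
    """A rule matches when every column it sets equals the candidate's value."""
    for rule_strain, rule_acc in rules:
        if rule_strain and rule_strain != strain:
            continue
        if rule_acc and rule_acc != accession:
            continue
        return True
    return False


rules = parse_exclusions(
    "Streptomyces coelicolor M1154\n"
    "Streptomyces sp. X1\tACC_2\n"
)
print(is_excluded("Streptomyces coelicolor M1154", "ACC_9", rules))  # True
print(is_excluded("Streptomyces sp. X1", "ACC_1", rules))            # False
print(is_excluded("Streptomyces sp. X1", "ACC_2", rules))            # True
```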

check_combine.py and combine_database.py

These two scripts check the validity of file names when you want to combine databases from other sources (combine one folder with one or more others):

usage: check_combine.py [-h] p [p ...]

positional arguments:
  p           pathes of databases (folders) you want to combine

  -h, --help   show this help message and exit

The script will first change the file names to "safe names" and then check whether there are duplicated files across all directories. Then it prints the result of the check.
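The duplicate check can be pictured with the sketch below, which reports file names that occur in more than one of the given folders. It leaves out the renaming step and is not the actual check_combine.py logic; `find_duplicates` is a hypothetical helper.

```python
import os
import tempfile
from collections import defaultdict


def find_duplicates(dirs):
    """Map each file name found in more than one directory to those directories."""
    seen = defaultdict(list)
    for d in dirs:
        for name in sorted(os.listdir(d)):
            if os.path.isfile(os.path.join(d, name)):
                seen[name].append(d)
    return {name: where for name, where in seen.items() if len(where) > 1}


# Demo with two throwaway folders sharing one file name.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for folder, names in ((a, ["x.fna.gz", "y.fna.gz"]), (b, ["x.fna.gz"])):
        for name in names:
            open(os.path.join(folder, name), "w").close()
    dup_names = sorted(find_duplicates([a, b]))

print(dup_names)  # ['x.fna.gz']
```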

After reviewing the check result, do the actual combining:

usage: combine_database.py [-h] [-t T] [--keep KEEP] p [p ...]

positional arguments:
  p            pathes of databases (folders) you want to combine

  -h, --help   show this help message and exit
  -t T         target dir to store combined files
  --keep KEEP  If duplicated file names found, keep "first" or "all"
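The effect of the two --keep values could be sketched as follows. How combine_database.py names the extra copies when "all" is chosen is not documented here, so the `dupN_` prefix below is purely a placeholder scheme, and `combine` is a hypothetical function rather than the script's API.

```python
import os
import shutil
import tempfile


def combine(dirs, target, keep="first"):
    """Copy files from dirs into target; on a name clash keep the first or all."""
    os.makedirs(target, exist_ok=True)
    copied = []
    for d in dirs:
        for name in sorted(os.listdir(d)):
            src = os.path.join(d, name)
            if not os.path.isfile(src):
                continue
            dest = os.path.join(target, name)
            if os.path.exists(dest):
                if keep == "first":
                    continue  # an earlier folder already supplied this name
                # keep == "all": assumed renaming scheme, prefix with dup1_, dup2_, ...
                n = 1
                while os.path.exists(dest):
                    dest = os.path.join(target, f"dup{n}_{name}")
                    n += 1
            shutil.copy2(src, dest)
            copied.append(os.path.basename(dest))
    return copied


# Demo: two folders that both contain x.fna.
with tempfile.TemporaryDirectory() as root:
    a, b = os.path.join(root, "a"), os.path.join(root, "b")
    for folder in (a, b):
        os.makedirs(folder)
        open(os.path.join(folder, "x.fna"), "w").close()
    first = combine([a, b], os.path.join(root, "out_first"), keep="first")
    every = combine([a, b], os.path.join(root, "out_all"), keep="all")

print(first, every)  # ['x.fna'] ['x.fna', 'dup1_x.fna']
```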



License:Apache License 2.0


Language:Python 100.0%