ATPs / PrecisionProDB

generate personalized reference protein sequences for proteome search

PrecisionProDB

PrecisionProDB (Precision protein database) is a tool for improving proteomics performance in precision medicine.

PrecisionProDB is a Python package for proteogenomics, which can generate a customized protein database for peptide search in mass spectrometry.

Description

The major goal of PrecisionProDB is to generate personalized protein sequences for protein identification in mass spectrometry (MS). Main features:

  • Supports multithreading, which improves the speed of the program. A typical customized human protein database can be generated in 15 to 20 minutes using 8 threads.
  • Optimized for several widely used human gene models, including:
    • GENCODE: PrecisionProDB can download the latest version from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human, as shown in https://www.gencodegenes.org/human/.
    • RefSeq: PrecisionProDB can download the latest version from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/current
    • Ensembl: PrecisionProDB can download the latest version from:
      • ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
      • ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/
      • ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/pep/
    • UniProt: PrecisionProDB can download the latest version from:
      • ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/
      • The files are UP000005640/UP000005640_9606.fasta.gz and UP000005640/UP000005640_9606_additional.fasta.gz, which may change in the future.
  • Non-standard codons and rare amino acids (e.g., selenocysteine (Sec or U)) in the human genome are properly incorporated.
  • Internal stops (*) in proteins are preserved.
  • Supports variant files in text or VCF format.
  • All input files can be in compressed gzip (.gz) format.
  • Supports user-generated gene models in GTF/GFF format. Species other than human are also supported.

The figure below shows how PrecisionProDB works:

Installation

Install required packages with conda

PrecisionProDB is tested under the base environment of Anaconda. It requires Python3, Biopython and Pandas. If Anaconda is installed, only Biopython needs to be installed:

conda install -c anaconda biopython

Otherwise, it is recommended to use conda to manage the packages and the virtual environment. Install the required packages:

conda install numpy
conda install pandas
conda install -c anaconda biopython

Install required packages with pip

If conda is not installed, pip (or pip3 as Python3 is required) can be used. pip is already installed if you are using Python3 >=3.4 downloaded from python.org.

pip3 install numpy
pip3 install pandas
pip3 install biopython

If the user has no root privilege on the system, the packages can be installed with the "--user" option. The option takes no argument and installs the packages into the current user's own site-packages directory:

pip3 install numpy --user
pip3 install pandas --user
pip3 install biopython --user
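
Whichever installation route is used, it can help to confirm that the interpreter you plan to run PrecisionProDB with actually sees the required packages. The snippet below is a minimal sketch, not part of PrecisionProDB; the helper name missing_packages is hypothetical, and it assumes the usual import names (Biopython is imported as "Bio").

```python
import importlib.util
import sys


def missing_packages(import_names):
    """Return the import names that this interpreter cannot find."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# Assumed import names for the required packages; Biopython is imported as "Bio".
REQUIRED = ["numpy", "pandas", "Bio"]

if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_packages(REQUIRED)
    print("missing packages:", ", ".join(missing) if missing else "none")
```

Run it with the same python (or python3) executable that will be used for PrecisionProDB, since each interpreter has its own set of installed packages.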

Install PrecisionProDB

To install the latest development version:

git clone https://github.com/ATPs/PrecisionProDB.git

To install other versions, download them directly from the release page.

Citing PrecisionProDB

Xiaolong Cao, Jinchuan Xing, PrecisionProDB: improving the proteomics performance for precision medicine, Bioinformatics, Volume 37, Issue 19, October 2021, Pages 3361–3363, https://doi.org/10.1093/bioinformatics/btab218

Usage Information

Note: python in the example scripts below refers to Python3. If you are unsure about the version of your python, use python --version to show it. On some systems you might need to use python3 to select Python3, or use the full path of the Python executable (e.g., /home/xcao/p/anaconda3/bin/python3.7) if multiple versions of Python exist on the system or Python is not in the system PATH.

Typical usage

The simplest case

We expect that in most cases users will have a variant file in VCF format. If there is only one sample in the VCF file, the simplest command is:

python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -D GENCODE -o Prefix_of_output
  • -m Name_of_variant_file defines the input variant file (include the full path if the input file is not in the current folder). If the variant file name ends with '.vcf' (case ignored), it will be treated as a VCF file. In all other cases the file will be treated as a TSV file. Files ending with '.gz' (e.g., '.vcf.gz' or '.tsv.gz') will be treated as gzip compressed files.
  • -D GENCODE defines the annotation reference to be used. In this example, personalized protein sequences based on the GENCODE annotation will be generated. PrecisionProDB will download the required GENCODE files automatically. To use gene models from another supported resource, change GENCODE to RefSeq, Ensembl or Uniprot.
  • -o Prefix_of_output defines the prefix of the output filenames.
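
The file-name rule described for -m can be sketched in a few lines of Python. This is an illustration of the rule as stated above, not the code PrecisionProDB actually runs; the function name classify_variant_file is hypothetical, and treating the '.gz' suffix as case-insensitive is an assumption.

```python
def classify_variant_file(filename):
    """Return (file_format, is_gzipped) following the naming rule described above."""
    name = filename.lower()            # the extension check ignores case
    is_gzipped = name.endswith(".gz")  # assumption: ".GZ" is treated like ".gz"
    if is_gzipped:
        name = name[:-len(".gz")]      # look at the extension underneath ".gz"
    file_format = "vcf" if name.endswith(".vcf") else "tsv"
    return file_format, is_gzipped
```

Under this rule, a file named variants.txt.gz is read as a gzip compressed TSV file, so a VCF file should keep its '.vcf' extension to be parsed as VCF.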

VCF with multiple samples

If there are multiple samples in the VCF file, the -s option should be used to specify the sample name to be used in the VCF file.

python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -D GENCODE -o Prefix_of_output -s Sample_name
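
To find out which sample names are available for -s, the header of the VCF file can be inspected. The following is a minimal sketch that relies only on the VCF convention that sample columns follow the FORMAT column on the #CHROM line; the function name vcf_sample_names is hypothetical and is not part of PrecisionProDB.

```python
import gzip


def vcf_sample_names(path):
    """Return the sample names from the #CHROM header line of a VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                fields = line.rstrip("\n").split("\t")
                # Columns 1 to 9 are CHROM through FORMAT; samples come after them.
                return fields[9:]
            if not line.startswith("#"):
                break  # reached data lines without seeing a #CHROM header
    return []
```

If the returned list has more than one entry, pass the chosen name with -s; otherwise the first sample is used, as stated in the help message.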

VCF with local gene annotation

If there is a local version of gene annotation files from Ensembl, the command will be:

python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -o Prefix_of_output -s Sample_name -g Ensembl_Genome -p Ensembl_protein -f Ensembl_gtf -a Ensembl_GTF

Ensembl_Genome, Ensembl_protein, and Ensembl_gtf are the locations of the Ensembl genome, protein, and GTF files, respectively. These files can be downloaded from the Ensembl website as mentioned previously, or with the downloadHuman module in the package.

python Path_of_PrecisionProDB/src/downloadHuman.py -d Ensembl -o Output_folder

Output_folder is the path of output folder to store the downloaded files.
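
When the same local annotation files are reused for many samples, it can be convenient to assemble the command from Python. The sketch below only builds the argument list from the options shown in this README (-m, -o, -g, -p, -f, -a, -s, -t); the function name build_command and its parameter names are hypothetical, and nothing is executed by the function itself.

```python
import sys


def build_command(script, variants, out_prefix, genome, protein, gtf,
                  datatype="Ensembl_GTF", sample=None, threads=None):
    """Assemble the argument list for PrecisionProDB.py with local annotation files."""
    cmd = [sys.executable, script,
           "-m", variants, "-o", out_prefix,
           "-g", genome, "-p", protein, "-f", gtf, "-a", datatype]
    if sample is not None:
        cmd += ["-s", sample]        # only needed for multi-sample VCF files
    if threads is not None:
        cmd += ["-t", str(threads)]  # otherwise all available CPUs are used
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Using an argument list rather than a shell string avoids quoting problems with paths that contain spaces.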

Variant file in text format

If the variant file is in the tab-separated values (TSV) format,

  • it needs to include a header row with at least four columns: chr, pos, ref, alt. The order of these columns does not matter, because pandas is used to parse the file.

  • additional columns are allowed, but will be ignored.

  • the chr, pos, ref and alt columns are coded as in the VCF format. This means that a deletion should be written as "chr1 10146 AC A", rather than "chr1 10147 C .". Also, pos is 1-based as in a VCF file, not 0-based as in a BED file.

  • A minimal text file looks like:

    chr pos ref alt
    1 10146 AC A
    1 15274 A G
    1 28563 A G
    1 49298 T C
    1 52238 T G

The command is the same as for a VCF file:

python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -D GENCODE -o Prefix_of_output
  • Name_of_variant_file is the name of the variant file. If the variant file name ends with '.vcf' (case ignored), it will be treated as a VCF file, as described above. In all other cases the file will be treated as a TSV file. Files ending with '.gz' (e.g., '.vcf.gz' or '.tsv.gz') will be treated as gzip compressed files.
  • For text-format input, the -s option is ignored, as there is only one sample.
  • Here, -D is set to GENCODE, so the GENCODE-related files will be downloaded.
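
Before a long run, the text-format requirements above (a header row, the four required columns in any order, 1-based integer positions) can be checked with a short script. This is a minimal sketch that uses the standard csv module rather than pandas, so it is not how PrecisionProDB itself parses the file; the function name read_variants is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("chr", "pos", "ref", "alt")


def read_variants(handle):
    """Yield (chr, pos, ref, alt) tuples from a tab-separated variant table."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(missing))
    for row in reader:
        # pos is 1-based, as in VCF; extra columns are simply ignored.
        yield row["chr"], int(row["pos"]), row["ref"], row["alt"]


example = io.StringIO(
    "pos\tchr\tnote\talt\tref\n"   # column order differs on purpose
    "10146\t1\tdeletion\tA\tAC\n"
    "15274\t1\tsnv\tG\tA\n"
)
variants = list(read_variants(example))
```

Because the columns are looked up by name, the shuffled header and the extra note column in the example do not affect the result.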

User-provided gene models

PrecisionProDB has been tested with GTF annotations generated by TransDecoder.
Run TransDecoder in its "starting from a genome-based transcript structure GTF file" mode, then pass the resulting files to PrecisionProDB:

python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -o Prefix_of_output -s Sample_name -g TransDecoder_Genome -p TransDecoder_protein -f TransDecoder_gtf -a gtf

Testing with example files

Variant file in VCF format

cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m celline.vcf.gz -g GENCODE.genome.fa.gz -p GENCODE.protein.fa.gz -f GENCODE.gtf.gz -o vcf_variant

Five files will be generated in the examples folder.

  • vcf_variant.pergeno.aa_mutations.csv: annotations of amino acid changes.
  • vcf_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
  • vcf_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.
  • vcf_variant.vcf2mutation_1.tsv: variant file in text format extracted from the VCF file, containing the first alternative alleles.
  • vcf_variant.vcf2mutation_2.tsv: variant file in text format extracted from the VCF file, containing the second alternative alleles.

Note:

  • For altered proteins, __1, __2 or __12 will be added to the ID of the protein.
    • __1 and __2 mean that the alleles of the protein are from the first and the second variant file, respectively.
    • __12 means that the altered protein sequence is the same for the first and the second alleles.
    • e.g., >ENSP00000308367.7|ENST00000312413.10|ENSG00000011021.23|OTTHUMG00000002299|-|CLCN6-201|CLCN6|847__12 changed, ENSP00000263934.6|ENST00000263934.10|ENSG00000054523.18|OTTHUMG00000001817|OTTHUMT00000005103.1|KIF1B-201|KIF1B|1770__2 changed, ENSP00000332771.4|ENST00000331433.5|ENSG00000186510.12|OTTHUMG00000009529|OTTHUMT00000026326.1|CLCNKA-201|CLCNKA|687__1 changed, ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.6|OTTHUMG00000001094|OTTHUMT00000003223.1|OR4F5-202|OR4F5|326 unchanged.
  • The variant file looks like
    chr     pos     ref     alt
    chr1    52238   T       G
    chr1    53138   TAA     T
    chr1    55249   C       CTATGG
    chr1    55299   C       T
    chr1    61442   A       G
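
The __1, __2 and __12 suffix convention described in the note above can be decoded when post-processing the output FASTA files. The sketch below assumes that the ID is the first whitespace-delimited word of the header and that the suffix sits at the very end of that ID, as in the examples shown; the function name allele_origin is hypothetical.

```python
import re

_SUFFIX = re.compile(r"__(12|1|2)$")
_ALLELES = {"1": "first", "2": "second", "12": "both"}


def allele_origin(header):
    """Return (original_id, origin) for a FASTA header line from the output files."""
    # The ID is everything before the first whitespace; drop a leading ">".
    protein_id = header.lstrip(">").split()[0]
    match = _SUFFIX.search(protein_id)
    if match is None:
        return protein_id, None  # no suffix: the sequence was not altered
    return protein_id[:match.start()], _ALLELES[match.group(1)]
```

Stripping the suffix recovers the ID used in the input protein file, which makes it possible to pair each altered sequence with its reference sequence.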
    

Variant file in text format

cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m gnomAD.variant.txt.gz -g GENCODE.genome.fa.gz -p GENCODE.protein.fa.gz -f GENCODE.gtf.gz -o text_variant

Three files will be generated in the examples folder.

  • text_variant.pergeno.aa_mutations.csv: amino acid change annotations
  • text_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
  • text_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.

Note:

  • Protein names and descriptions in the fasta file are the same as in the input protein file, with a Tab symbol (\t) followed by changed or unchanged appended to indicate whether the protein sequence is altered.
  • e.g., ENSP00000328207.6|ENST00000328596.10|ENSG00000186891.14|OTTHUMG00000001414|OTTHUMT00000004085.1|TNFRSF18-201|TNFRSF18|255 unchanged, ENSP00000424920.1|ENST00000502739.5|ENSG00000162458.13|OTTHUMG00000003079|OTTHUMT00000368044.1|FBLIM1-210|FBLIM1|144 changed.

Get help information for each module

There are several files in the src folder. Each of them is designed so that it can be run independently. To get help, run

python Path_of_PrecisionProDB/src/module_name.py -h

To get help for the main program, run

python Path_of_PrecisionProDB/src/PrecisionProDB.py -h

The following messages will be printed out.

usage: PrecisionProDB.py [-h] [-g GENOME] [-f GTF] -m MUTATIONS [-p PROTEIN] [-t THREADS] [-o OUT] [-a {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}]
                         [-k PROTEIN_KEYWORD] [-F] [-s SAMPLE] [-A] [-D {GENCODE,RefSeq,Ensembl,Uniprot,}] [-U UNIPROT] [--uniprot_min_len UNIPROT_MIN_LEN]

PrecisionProDB, a personal proteogenomic tool which outputs a new reference protein based on the variants data. 
A VCF or a tsv file can be used as the variant input. 
If the variant file is in tsv format, at least four columns are required in the
header row: chr, pos, ref, alt. Additional columns will be ignored. Convert the file to proper format if you have a bed file or other types of variant file.

optional arguments:
  -h, --help            show this help message and exit
  -g GENOME, --genome GENOME
                        the reference genome sequence in fasta format. It can be a gzip file
  -f GTF, --gtf GTF     gtf file with CDS and exon annotations. It can be a gzip file
  -m MUTATIONS, --mutations MUTATIONS
                        a file stores the variants. If the file ends with ".vcf" or ".vcf.gz", treat as vcf input. Otherwise, treat as TSV input
  -p PROTEIN, --protein PROTEIN
                        protein sequences in fasta format. It can be a gzip file. Only proteins in this file will be checked
  -t THREADS, --threads THREADS
                        number of threads/CPUs to run the program. default, use all CPUs available
  -o OUT, --out OUT     output prefix, folder path could be included. Three or five files will be saved depending on the variant file format. Outputs include the
                        annotation for mutated transcripts, the mutated or all protein sequences, two variant files from vcf. {out}.pergeno.aa_mutations.csv,
                        {out}.pergeno.protein_all.fa, {out}.protein_changed.fa, {out}.vcf2mutation_1/2.tsv. default "perGeno"
  -a {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}, --datatype {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}
                        input datatype, could be GENCODE_GTF, GENCODE_GFF3, RefSeq, Ensembl_GTF or gtf. default "gtf". Ensembl_GFF3 is not supported.
  -k PROTEIN_KEYWORD, --protein_keyword PROTEIN_KEYWORD
                        field name in attribute column of gtf file to determine ids for proteins. default "auto", determine the protein_keyword based on datatype.
                        "transcript_id" for GENCODE_GTF, "protein_id" for "RefSeq" and "Parent" for gtf and GENCODE_GFF3
  -F, --no_filter       default only keep variant with value "PASS" FILTER column of vcf file. if set, do not filter
  -s SAMPLE, --sample SAMPLE
                        sample name in the vcf to extract the variant information. default: None, extract the first sample
  -A, --all_chromosomes
                        default keep variant in chromosomes and ignore those in short fragments of the genome. if set, use all chromosomes including fragments when
                        parsing the vcf file
  -D {GENCODE,RefSeq,Ensembl,Uniprot,}, --download {GENCODE,RefSeq,Ensembl,Uniprot,}
                        download could be 'GENCODE','RefSeq','Ensembl','Uniprot'. If set, PrecisonProDB will try to download genome, gtf and protein files from the
                        Internet. Download will be skipped if "--genome, --gtf, --protein, (--uniprot)" were all set. Settings from "--genome, --gtf, --protein,
                        (--uniprot), --datatype" will not be used if the files were downloaded by PrecisonProDB. default "".
  -U UNIPROT, --uniprot UNIPROT
                        uniprot protein sequences. If more than one file, use "," to join the files. default "". For example, "UP000005640_9606.fasta.gz", or
                        "UP000005640_9606.fasta.gz,UP000005640_9606_additional.fasta"
  --uniprot_min_len UNIPROT_MIN_LEN
                        minimum length required when matching uniprot sequences to proteins annotated in the genome. default 20
  --PEFF                If set, PEFF format file(s) will be generated. Default: do not generate PEFF file(s).

Notes

  • -p PROTEIN, --protein PROTEIN is a file with the proteins that match the provided GTF file.
  • -k PROTEIN_KEYWORD is the keyword used to match entries in the GTF file to the protein sequences. If not provided, the program will try to determine the keyword based on the datatype. The program needs this information to locate each protein in the genome and to match codons, which allows non-standard codons to be handled.
  • -a {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}, --datatype {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf} should be set if your input is in one of the formats listed. For the "gtf" format, PROTEIN_KEYWORD and PROTEIN should match.

Outputs

For more information, visit the wiki page. https://github.com/ATPs/PrecisionProDB/wiki

Count number of changed proteins

The number of altered proteins is shown while PrecisionProDB is running. In "PREFIX.pergeno.protein_all.fa", each fasta header ends with the word "changed" or "unchanged", and users may count the number of changed proteins based on this annotation.
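
Counting from the fasta headers can be done with a few lines of Python. This is a minimal sketch that assumes the status is the last whitespace-delimited word of each header line, as described above; the function name count_changed is hypothetical. Note that a plain substring search for "changed" would also match "unchanged", so the last word is compared exactly.

```python
from collections import Counter


def count_changed(fasta_lines):
    """Count FASTA headers whose last word is "changed" or "unchanged"."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        words = line.split()
        status = words[-1] if words else ""
        if status in ("changed", "unchanged"):
            counts[status] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for "PREFIX.pergeno.protein_all.fa" can be passed directly. Proteins altered differently on the two alleles appear as two entries (__1 and __2), so the count refers to sequences rather than to unique input proteins.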

Count number of changed amino acids (AAs)

Generally, users can find annotations for variants in the "PREFIX.pergeno.aa_mutations.csv" file. It lists the effects of the different variants, including AA substitutions, insertions, deletions, stop-loss, stop-gain, and frame-changes.

Further comparison

Users may use tools like https://github.com/pwilmart/fasta_utilities to further compare the differences between trypsin-digested peptides.

Benchmark

Tested with a computing node with Intel Xeon CPU E5-2695 v4 @ 2.10GHz and 256GB memory, with GENCODE gene models and a variant file in text format from gnomAD 3.1 as input.

  • Depending on the available resources, 8 to 12 threads are recommended.
  • If the variant file is in text format, typical running time will be 15 to 20 minutes.
  • If the variant file is in VCF format, typical running time will be 30 to 40 minutes.

CPU/Memory consumption with 8 threads

Running time and required memory with different threads

PrecisionProDB_references

The Genome Aggregation Database (gnomAD) project provides variant allele frequencies in different populations based on the genomes and exomes of hundreds of thousands of individuals, and this information can be integrated into a protein database. We applied PrecisionProDB to alleles from different populations in the gnomAD 3.1 data. Results can be found at https://github.com/ATPs/PrecisionProDB_references.

Contact Information

Please leave comments on the issue tab.

License:GNU General Public License v3.0

